Installing NVIDIA Drivers with CUDA in a Warewulf Image
Introduction
Running CUDA in a Warewulf environment requires the proper setup of NVIDIA drivers and the CUDA toolkit. It is recommended to install the NVIDIA drivers directly within the Warewulf image while distributing the CUDA toolkit via NFS.
Problem
One challenge when installing CUDA directly into the Warewulf image is the increase in image size. Because the entire image is loaded into memory, a large image can cause boot problems, especially in single-stage boot environments, and it reduces the memory available to applications. Moving the CUDA toolkit to external storage can avoid both issues.
Resolution
Installing NVIDIA Drivers in the image
Follow this guide to set up an image with Rocky Linux 9 and NVIDIA drivers.
Setting up the CUDA NFS share
On your host node (or the server that will host the CUDA library), first download the appropriate CUDA version:
wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda_12.8.1_570.124.06_linux.run
chmod +x cuda_12.8.1_570.124.06_linux.run
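Next, run the installer to extract only the toolkit. Root privileges are needed to write to the default location under /usr/local:
sudo ./cuda_12.8.1_570.124.06_linux.run --toolkit --override --silent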
By default, the files are extracted to /usr/local/cuda-12.8 and a symbolic link is created at /usr/local/cuda, which is the location we will use. If you wish to extract the toolkit to a different location, append the option --toolkitpath=<path>.
Now, add the CUDA toolkit folder as an export with NFS. First, install the NFS utilities and start the NFS server:
sudo dnf -y install nfs-utils
sudo systemctl enable --now nfs-server
Then, add the export by appending the following line to your /etc/exports file. Replace WAREWULF_IPRANGE with the range or subnet of your Warewulf cluster network. Alternatively, you may use * to allow access from all IP addresses:
echo "/usr/local/cuda WAREWULF_IPRANGE(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
Finally, refresh the NFS exports:
sudo exportfs -ra
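Re-running the echo and tee command above appends a duplicate line each time. As a minimal sketch of one way to guard against that, the helper below only appends when the path is not already exported. The function name add_export and the subnet 10.0.0.0/24 are hypothetical and used for illustration only; the default options are the ones from this article.

```shell
# Append an NFS export line to an exports file unless the path is already listed.
# Usage: add_export <exports_file> <path> <clients> [options]
add_export() {
    local file="$1"
    local path="$2"
    local clients="$3"
    local opts="${4:-rw,sync,no_subtree_check}"

    # Exact match on the first field, so commented-out lines are ignored.
    if [ -f "$file" ] && awk -v p="$path" '$1 == p { found = 1 } END { exit !found }' "$file"; then
        echo "export for ${path} already present in ${file}" >&2
        return 0
    fi
    printf '%s %s(%s)\n' "$path" "$clients" "$opts" >> "$file"
}

# Example (run as root, subnet is a placeholder):
#   add_export /etc/exports /usr/local/cuda 10.0.0.0/24 && exportfs -ra
```

Because the function is defined in the current shell, it has to be run from a root shell or a script executed with sudo rather than prefixed with sudo directly.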
Creating a systemd mount unit for CUDA
With the NFS export set up, you can create a systemd unit file to mount the CUDA toolkit on your nodes. Begin by creating a new overlay for this purpose:
sudo wwctl overlay create cuda-nfs
Next, create the systemd unit file. systemd requires a mount unit to be named after its mount path, with the leading slash dropped and the remaining slashes replaced by dashes. In this case, the file name should be usr-local-cuda.mount for mounting at /usr/local/cuda.
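If you mount at a different location, the unit name has to change with it. The following is a minimal sketch of the naming rule for simple paths; the function name mount_unit_name is hypothetical, and it assumes the path contains only letters, digits, dots, underscores, and slashes.

```shell
# Derive a systemd mount unit name from a simple absolute mount path.
# Assumption: no dashes or other special characters in the path; systemd
# escapes those, and this helper does not.
mount_unit_name() {
    local p="$1"
    p="${p#/}"    # drop the leading slash
    p="${p%/}"    # drop a trailing slash, if any
    printf '%s.mount\n' "$(printf '%s' "$p" | tr '/' '-')"
}

mount_unit_name /usr/local/cuda    # prints usr-local-cuda.mount
```

For paths that contain dashes or other special characters, systemd ships the systemd-escape tool, and running it as systemd-escape -p --suffix=mount followed by the path produces the correctly escaped name.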
Open the file for editing:
sudo wwctl overlay edit cuda-nfs -p /etc/systemd/system/usr-local-cuda.mount
And add the following content, replacing your_nfs_server with your server's address:
[Unit]
Description=Mount NFS share for CUDA
After=network-online.target
Requires=network-online.target
[Mount]
# Change this to your server and export path
What=your_nfs_server:/usr/local/cuda
Where=/usr/local/cuda
Type=nfs
Options=defaults,noatime,nolock
[Install]
WantedBy=multi-user.target
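If you would rather generate the unit file from a script than type it into the editor, a minimal sketch could look like the following. The function name render_mount_unit and the server name nfs01 are hypothetical. The comment sits on its own line because systemd unit files do not support trailing comments after a value.

```shell
# Print a systemd mount unit for an NFS share.
# Usage: render_mount_unit <nfs_server> <export_path> <mount_point>
render_mount_unit() {
    local server="$1"
    local export_path="$2"
    local mount_point="$3"
    cat <<EOF
[Unit]
Description=Mount NFS share for CUDA
After=network-online.target
Requires=network-online.target

[Mount]
# NFS server and export path
What=${server}:${export_path}
Where=${mount_point}
Type=nfs
Options=defaults,noatime,nolock

[Install]
WantedBy=multi-user.target
EOF
}

# Example (nfs01 is a placeholder server name):
#   render_mount_unit nfs01 /usr/local/cuda /usr/local/cuda > usr-local-cuda.mount
```

The resulting file can then be pasted into the overlay editor, or copied into the overlay with whatever import mechanism your Warewulf version provides.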
Set appropriate permissions on the unit file:
sudo wwctl overlay chmod cuda-nfs /etc/systemd/system/usr-local-cuda.mount 0644
Next, create the symlink to ensure that the mount is activated at boot. In Warewulf 4.6.0 and above, we can create a symlink from within a template:
sudo wwctl overlay edit cuda-nfs -p /etc/systemd/system/multi-user.target.wants/usr-local-cuda.mount.ww
Add the following template line to the file:
{{ softlink "/etc/systemd/system/usr-local-cuda.mount" }}
Applying the overlay
Add the overlay as a system overlay to the default profile by appending it to the existing overlay list:
sudo wwctl profile set default -O $(sudo wwctl profile list default -a | grep SystemOverlay | awk '{ print $3 }'),cuda-nfs
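Running the command above more than once would list cuda-nfs twice. As a hedged sketch, the helper below appends a name to a comma-separated overlay list only when it is missing. The function name append_overlay is hypothetical, and the overlay names wwinit and wwclient are example values rather than a statement about your profile.

```shell
# Append an overlay name to a comma-separated list unless it is already there.
# Usage: append_overlay <current_list> <overlay_name>
append_overlay() {
    local list="$1"
    local new="$2"
    case ",${list}," in
        *",${new},"*) printf '%s\n' "$list" ;;           # already present
        ,,)           printf '%s\n' "$new" ;;            # list was empty
        *)            printf '%s\n' "${list},${new}" ;;  # append
    esac
}

append_overlay "wwinit,wwclient" cuda-nfs    # prints wwinit,wwclient,cuda-nfs

# Possible use with the command from this article:
#   current=$(sudo wwctl profile list default -a | grep SystemOverlay | awk '{ print $3 }')
#   sudo wwctl profile set default -O "$(append_overlay "$current" cuda-nfs)"
```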
After updating the profile, rebuild the overlays:
sudo wwctl overlay build
After the nodes boot with the new overlay, the CUDA toolkit should be mounted at /usr/local/cuda and available.
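To confirm the result on a compute node, a quick check along these lines may help. This is a sketch only: the function name check_cuda_toolkit is hypothetical, and it assumes that an executable bin/nvcc under the mount point is a good enough signal that the share is mounted.

```shell
# Report whether a CUDA toolkit is visible under the given prefix.
# Usage: check_cuda_toolkit [prefix]   (defaults to /usr/local/cuda)
check_cuda_toolkit() {
    local prefix="${1:-/usr/local/cuda}"
    if [ -x "${prefix}/bin/nvcc" ]; then
        echo "OK: ${prefix}/bin/nvcc found"
    else
        echo "MISSING: ${prefix}/bin/nvcc not found; is the NFS share mounted?" >&2
        return 1
    fi
}

# On a node, the mount itself can also be inspected with:
#   systemctl is-active usr-local-cuda.mount
#   findmnt /usr/local/cuda
```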
Notes
- Ensure that your firewall settings allow NFS traffic.
- Adjust paths and options as needed based on your specific environment and security requirements.
References & Related Articles
Install Nvidia Drivers in a Warewulf Container
CUDA Downloads
CUDA Installation Guide
Warewulf Overlay Documentation
Warewulf Profile Documentation
Warewulf Image Documentation
systemd Mount Documentation
Rocky Linux 9 Warewulf NVIDIA Containerfile
Rocky Linux NFS Guide