
Installing NVIDIA Drivers with CUDA in a Warewulf Image

Introduction

Running CUDA in a Warewulf environment requires the proper setup of NVIDIA drivers and the CUDA toolkit. It is recommended to install the NVIDIA drivers directly within the Warewulf image while distributing the CUDA toolkit via NFS.

Problem

One challenge when installing the CUDA toolkit directly into the Warewulf image, alongside the drivers, is the increase in image size. Because the entire image is loaded into memory, a larger image can cause boot problems, especially in single-stage boot environments, and it reduces the memory available to applications. Moving the CUDA toolkit to external storage can avoid both issues.

Resolution

Installing NVIDIA Drivers in the image

Follow this guide to set up an image with Rocky Linux 9 and NVIDIA drivers.

Setting up the CUDA NFS share

On your host node (or the server that will host the CUDA library), first download the appropriate CUDA version:

wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda_12.8.1_570.124.06_linux.run
chmod +x cuda_12.8.1_570.124.06_linux.run

Next, extract the toolkit:

sudo ./cuda_12.8.1_570.124.06_linux.run --toolkit --override --silent

By default, the files are extracted to /usr/local/cuda-12.8 and a symbolic link is created at /usr/local/cuda, which is the location we will use. If you wish to extract the toolkit to a different location, append the option --toolkitpath=<path>.
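Before exporting the directory, it can help to confirm that the extraction produced a usable toolkit. The following is a minimal sketch; the function name check_cuda_toolkit is hypothetical, and it only tests for an executable nvcc under the default path used in this article.

```shell
# Check that a CUDA toolkit directory looks complete before exporting it.
# check_cuda_toolkit is an illustrative helper, not part of CUDA or Warewulf.
check_cuda_toolkit() {
  cuda_dir="${1:-/usr/local/cuda}"
  if [ -x "$cuda_dir/bin/nvcc" ]; then
    echo "found nvcc in $cuda_dir/bin"
  else
    echo "nvcc not found under $cuda_dir" >&2
    return 1
  fi
}

# Check the default location; print a hint instead of failing hard.
check_cuda_toolkit /usr/local/cuda || echo "Install the toolkit before exporting it."
```

If the check passes, running /usr/local/cuda/bin/nvcc --version should report the installed release.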

Now, add the CUDA toolkit folder as an export with NFS. First, install the NFS utilities and start the NFS server:

sudo dnf -y install nfs-utils
sudo systemctl enable --now nfs-server

Then, add the export by appending the following line to your /etc/exports file. Replace WAREWULF_IPRANGE with the range or subnet of your Warewulf cluster network. Alternatively, you may use * to allow access from all IP addresses:

echo "/usr/local/cuda WAREWULF_IPRANGE(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
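Because tee -a appends unconditionally, rerunning the command above adds a duplicate entry. One possible idempotent variant is sketched below; the helper name add_export_line and the subnet 10.0.0.0/24 are placeholders for illustration and must be replaced with values from your own cluster.

```shell
# Append an export entry only if the exact line is not already present.
# add_export_line is an illustrative helper, not an NFS utility.
add_export_line() {
  exports_file="$1"
  export_line="$2"
  if grep -qxF -- "$export_line" "$exports_file" 2>/dev/null; then
    echo "already present: $export_line"
  else
    printf '%s\n' "$export_line" >> "$exports_file"
    echo "added: $export_line"
  fi
}

# 10.0.0.0/24 is a placeholder subnet, not a value from this article.
EXPORT_LINE="/usr/local/cuda 10.0.0.0/24(rw,sync,no_subtree_check)"

# Run as root against the real file:
#   add_export_line /etc/exports "$EXPORT_LINE"
```

Calling the function twice with the same arguments leaves a single matching line in the file.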

Finally, refresh the NFS exports:

sudo exportfs -ra

Creating a systemd mount unit for CUDA

With the NFS export set up, you can create a systemd unit file to mount the CUDA toolkit on your nodes. Begin by creating a new overlay for this purpose:

sudo wwctl overlay create cuda-nfs

Next, create the systemd unit file. The naming convention requires that the filename contains the full mount path with slashes replaced by dashes. In this case, the file name should be usr-local-cuda.mount for mounting at /usr/local/cuda.
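If you mount at a different path, the unit name can be derived rather than written by hand. The sketch below uses systemd-escape where it is available; the fallback branch is a simplification that only handles paths without dashes or other special characters.

```shell
# Derive the mount unit name from the mount point.
MOUNT_POINT=/usr/local/cuda

if command -v systemd-escape >/dev/null 2>&1; then
  # systemd's own tool applies the full escaping rules.
  UNIT_NAME=$(systemd-escape --path --suffix=mount "$MOUNT_POINT")
else
  # Simplified fallback: strip the leading slash, turn slashes into dashes.
  # Not valid for paths that contain dashes or special characters.
  UNIT_NAME="$(printf '%s' "${MOUNT_POINT#/}" | tr '/' '-').mount"
fi

echo "$UNIT_NAME"   # prints usr-local-cuda.mount
```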

Open the file for editing:

sudo wwctl overlay edit cuda-nfs -p /etc/systemd/system/usr-local-cuda.mount

And add the following content, replacing your_nfs_server with your server's address:

[Unit]
Description=Mount NFS share for CUDA
After=network-online.target
Requires=network-online.target

[Mount]
# Change this to your server and export path
What=your_nfs_server:/usr/local/cuda
Where=/usr/local/cuda
Type=nfs
Options=defaults,noatime,nolock

[Install]
WantedBy=multi-user.target

Set appropriate permissions on the unit file:

sudo wwctl overlay chmod cuda-nfs /etc/systemd/system/usr-local-cuda.mount 0644

Next, create the symlink to ensure that the mount is activated at boot. In Warewulf 4.6.0 and above, we can create a symlink from within a template:

sudo wwctl overlay edit cuda-nfs -p /etc/systemd/system/multi-user.target.wants/usr-local-cuda.mount.ww

Add the following template line to the file:

{{ softlink "/etc/systemd/system/usr-local-cuda.mount" }}

Applying the overlay

Add the overlay as a system overlay to the default profile by appending it to the existing overlay list:

sudo wwctl profile set default -O $(sudo wwctl profile list default -a | grep SystemOverlay | awk '{ print $3 }'),cuda-nfs

After updating the profile, rebuild the overlays:

sudo wwctl overlay build

Once the nodes have rebooted with the new overlay, the CUDA toolkit should be mounted at /usr/local/cuda and available.
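A quick check on a compute node might look like the following sketch. It assumes the default mount point from this article and that users want the toolkit on their search paths; the environment variable setup is a common convention, not something Warewulf configures for you.

```shell
# Run on a compute node after boot.
CUDA_HOME=/usr/local/cuda

# Report whether the NFS share is actually mounted.
if mountpoint -q "$CUDA_HOME" 2>/dev/null; then
  echo "$CUDA_HOME is mounted"
else
  echo "$CUDA_HOME is not a mount point; check: systemctl status usr-local-cuda.mount" >&2
fi

# Make the toolkit binaries and libraries visible in this shell.
export CUDA_HOME
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

With the share mounted and the paths set, nvcc --version should run from any directory on the node.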

Notes

  • Ensure that your firewall settings allow NFS traffic.
  • Adjust paths and options as needed based on your specific environment and security requirements.
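
For the firewall note, the sketch below assumes the NFS server runs firewalld, which the article does not state. It prints the commands by default and only applies them when APPLY=1 is set. The function name open_nfs_firewall is hypothetical, and the rpc-bind and mountd services are typically only needed for NFSv3 clients.

```shell
# Open NFS-related firewalld services; dry run unless APPLY=1.
# open_nfs_firewall is an illustrative helper, not a firewalld command.
open_nfs_firewall() {
  run="echo"
  if [ "${APPLY:-0}" = "1" ]; then
    run="sudo"
  fi
  for svc in nfs rpc-bind mountd; do
    $run firewall-cmd --permanent --add-service="$svc"
  done
  $run firewall-cmd --reload
}

# Dry run: print the commands that would be executed.
open_nfs_firewall
```

To apply the changes on the server, run APPLY=1 open_nfs_firewall after reviewing the printed commands.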

References & Related Articles

Install Nvidia Drivers in a Warewulf Container
CUDA Downloads
CUDA Installation Guide
Warewulf Overlay Documentation
Warewulf Profile Documentation
Warewulf Image Documentation
systemd Mount Documentation
Rocky Linux 9 Warewulf NVIDIA Containerfile
Rocky Linux NFS Guide