Install NVIDIA Drivers in a Warewulf Container

Introduction

There are multiple ways to create an image for Warewulf that includes the NVIDIA drivers. This article will explore how to work with a Containerfile or modify a regular image.

Prerequisites

This guide assumes that a Warewulf server has been installed and configured, and that at least one node has been deployed.

Instructions

Create an image using a Containerfile

Warewulf supports creating images directly from containers, enabling you to build a custom container that includes the required NVIDIA drivers. The Warewulf-images GitHub repository has an example Containerfile that we can work from. Start by installing Podman on the server you want to build the image on. To keep things simple, we will install this on our Warewulf host server to streamline the import process:

sudo dnf install -y podman

Next, we'll create a folder to add our Containerfile to:

mkdir rockylinux-9-nvidia
cd rockylinux-9-nvidia

Now create a new file called Containerfile and add the following to it:

FROM ghcr.io/warewulf/warewulf-rockylinux:9

RUN dnf -y install dnf-plugins-core epel-release kernel-headers \
    && dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(arch)/cuda-rhel9.repo \
    && dnf -y module install nvidia-driver:latest-dkms \
    && dnf -y install datacenter-gpu-manager \
    && dnf clean all \
    && for dir in /usr/src/kernels/*; do dkms autoinstall --kernelver $(basename $dir); done \
    && dkms status
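
The closing loop in the Containerfile above runs dkms autoinstall once per directory under /usr/src/kernels. A likely reason (an assumption, not something the example states) is that a container build shares the build host's running kernel, so the target kernel version has to come from the kernel sources installed in the image rather than from uname -r. The minimal sketch below only lists the versions such a loop would iterate over; list_image_kernels and KERNEL_SRC_DIR are hypothetical names used for illustration.

```shell
#!/bin/sh
# Sketch: list the kernel versions a DKMS loop like the one above would target.
# KERNEL_SRC_DIR is an illustrative variable; the Containerfile hard-codes /usr/src/kernels.
KERNEL_SRC_DIR="${KERNEL_SRC_DIR:-/usr/src/kernels}"

list_image_kernels() {
    # $1 = directory that holds one subdirectory per installed kernel source tree
    for dir in "$1"/*; do
        # Skip the unexpanded glob when the directory is empty or missing.
        [ -d "$dir" ] || continue
        basename "$dir"
    done
}

list_image_kernels "$KERNEL_SRC_DIR"
```

Running this inside the image before the dkms step is a quick way to see which kernels the driver will be built for; if it prints nothing, the loop in the Containerfile would not build any modules.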

Feel free to modify this as needed to suit your environment. Next, we can build this image:

podman build -t rockylinux-9-nvidia:v1 .

Replace v1 with whatever tag or versioning scheme fits your preferences or corporate policy. If the build fails with the error "cannot apply additional memory protection after relocation: Permission denied", SELinux is likely blocking the build. In that case, rerun the build with SELinux labeling disabled for the build container:

podman build --security-opt label=disable -t rockylinux-9-nvidia:v1 .

Once the build finishes, we can save the image to a tar file and then import it into Warewulf:

podman save -o rockylinux-9-nvidia-v1.tar localhost/rockylinux-9-nvidia:v1
wwctl image import file://rockylinux-9-nvidia-v1.tar rockylinux-9-nvidia-v1

Once the import finishes, you can assign this image to a profile or node as you would any other image; the next section of this guide shows an example. Another benefit of Containerfiles is that they can be integrated into a CI/CD pipeline for version control and automated builds.
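
As a rough illustration of how the build, save, and import steps could be scripted for a pipeline, the sketch below wraps the three commands from this section in one function that takes the tag as an argument. The script layout, the run helper, and the DRY_RUN variable are assumptions made for this example, not part of Warewulf or Podman. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: build, save, and import a versioned NVIDIA image in one step.
# IMAGE_NAME, the tag argument, and DRY_RUN are illustrative assumptions.
set -eu

IMAGE_NAME="rockylinux-9-nvidia"

# Print the command in dry-run mode (the default); otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

build_and_import() {
    tag="$1"
    archive="${IMAGE_NAME}-${tag}.tar"
    run podman build -t "${IMAGE_NAME}:${tag}" .
    run podman save -o "${archive}" "localhost/${IMAGE_NAME}:${tag}"
    run wwctl image import "file://${archive}" "${IMAGE_NAME}-${tag}"
}

build_and_import "${1:-v1}"
```

Saved as, say, build-image.sh in the same directory as the Containerfile, it could be invoked as DRY_RUN=0 sh build-image.sh v2 to actually run the commands with the v2 tag.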

Modifying a regular image

Warewulf lets you open a shell inside an image, which makes it quick and easy to modify the container image and install components like NVIDIA drivers. Start by either copying an existing container or downloading a new one. You can use wwctl container list to view your current images if you would like to build on an existing container. You can, of course, edit any existing container directly; however, we always recommend working from a copy or creating a backup first. Note that the commands in this section use the wwctl container subcommand, while the previous section used wwctl image; recent Warewulf releases renamed containers to images, so use whichever form your installed version provides.
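
If you prefer to work from a copy of an existing image, one option is to give the copy a dated name so it is easy to tell apart later. The sketch below only derives such a name and prints a wwctl container copy command for review rather than running it; SOURCE_IMAGE, dated_copy_name, and the naming pattern are hypothetical choices for this example.

```shell
#!/bin/sh
# Sketch: derive a dated name for a working copy of an existing image.
# SOURCE_IMAGE is a hypothetical existing image name; adjust to your environment.
SOURCE_IMAGE="${SOURCE_IMAGE:-rockylinux-9}"

dated_copy_name() {
    # $1 = source image name, $2 = date stamp (defaults to today as YYYYMMDD)
    printf '%s-nvidia-%s\n' "$1" "${2:-$(date +%Y%m%d)}"
}

COPY_NAME="$(dated_copy_name "$SOURCE_IMAGE")"

# Printed rather than executed so the sketch is safe to run anywhere.
echo "wwctl container copy $SOURCE_IMAGE $COPY_NAME"
```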

In this example, we are going to start by downloading a fresh copy of Rocky Linux 9. We can do so by downloading an image from the Warewulf repository:

wwctl container import docker://ghcr.io/warewulf/warewulf-rockylinux:9 rockylinux-nvidia-9

In this example, the name of our container will be rockylinux-nvidia-9. Feel free to change this to match your environment or to add additional information such as the date or version. Once the container is finished downloading, we can enter into this container's shell and install the NVIDIA drivers:

wwctl container shell rockylinux-nvidia-9

NVIDIA's official driver installation guide covers the process in more detail. The following are example steps for Rocky Linux 9:

# Install necessary packages and add the NVIDIA repository
dnf -y install dnf-plugins-core epel-release kernel-headers
dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(arch)/cuda-rhel9.repo
# Install the latest NVIDIA driver
dnf -y module install nvidia-driver:latest-dkms

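We can verify the installation by checking that DKMS (Dynamic Kernel Module Support) has built the nvidia kernel module. The first line below is the command and the second is example output: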

dkms status | grep nvidia
nvidia/555.42.02, 5.14.0-427.20.1.el9_4.x86_64, x86_64: installed
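
For scripted checks, the same verification can be turned into a pass/fail test. The sketch below reads dkms status output on standard input and succeeds only if at least one nvidia entry is present and every nvidia entry reports installed. It assumes the output format shown above, where the state follows a colon and a space; nvidia_dkms_ok is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: succeed only if every nvidia entry in `dkms status` output is "installed".
# nvidia_dkms_ok is a hypothetical helper; it reads dkms output on stdin.
nvidia_dkms_ok() {
    awk -F': ' '
        /^nvidia/ { seen = 1; if ($2 !~ /^installed/) bad = 1 }
        END       { exit !(seen && !bad) }
    '
}

# Demonstration using the sample output line from this article.
sample='nvidia/555.42.02, 5.14.0-427.20.1.el9_4.x86_64, x86_64: installed'
if printf '%s\n' "$sample" | nvidia_dkms_ok; then
    echo "nvidia module is built for every listed kernel"
fi
```

Inside the container shell, the helper would be used as dkms status | nvidia_dkms_ok, with a non-zero exit status indicating a missing or unbuilt module.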

Once you have finished installing the driver and any other applications you need, type exit to leave the container. Warewulf should rebuild the container automatically when you exit; however, you can run the following command manually to make sure the container was built:

wwctl container build

This command prints nothing if the image has already been built and is up to date. If applicable, we can also sync local users into the container; the Warewulf documentation describes the syncuser subcommand in more detail.

wwctl container syncuser --write rockylinux-nvidia-9

Once the syncuser command completes, we can assign this new image to a node. Assigning it directly to a node allows us to boot and test our newly created container before pushing it out to a profile. You can skip directly to assigning this to a profile if you desire.

wwctl node set node1 --container rockylinux-nvidia-9

Once you have tested your container image, you can assign the container to the necessary profile. In our example, we are sticking with the default profile:

wwctl profile set default --container rockylinux-nvidia-9

Finally, we can build the overlay:

wwctl overlay build
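
The test-then-roll-out sequence above can be captured in a small script with two stages. This is only a sketch: the stage names, the run helper, and the DRY_RUN variable are assumptions for illustration, and wwctl may still ask for confirmation interactively when run for real. By default the script just prints the commands for the selected stage.

```shell
#!/bin/sh
# Sketch: staged rollout of a new image, first to one node, then to a profile.
# TEST_NODE, PROFILE, the stage names, and DRY_RUN are illustrative assumptions.
set -eu

IMAGE="rockylinux-nvidia-9"
TEST_NODE="node1"
PROFILE="default"

# Print the command in dry-run mode (the default); otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stage 1: assign the image to a single node for testing.
stage_test() {
    run wwctl node set "$TEST_NODE" --container "$IMAGE"
}

# Stage 2: assign the image to the profile and rebuild the overlays.
stage_rollout() {
    run wwctl profile set "$PROFILE" --container "$IMAGE"
    run wwctl overlay build
}

case "${1:-test}" in
    test)    stage_test ;;
    rollout) stage_rollout ;;
    *)       echo "usage: $0 [test|rollout]" >&2; exit 2 ;;
esac
```

Keeping the two stages separate mirrors the advice above: boot and check one node first, and only then push the image to the profile.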

After a reboot, your nodes will boot with the newly created container image. You can find more information about Warewulf containers, as well as more advanced configuration examples and instructions, in the Warewulf documentation.

References & related articles

Warewulf Documentation
Warewulf Rocky Linux 9 NVIDIA Container Example
Hands on Warewulf: Solving Cluster Provisioning & Management
Warewulf: Deep Dive, Use Cases, and Examples