Enabling NVIDIA GPU ECC on Warewulf-Managed Nodes
Introduction
NVIDIA GPU Error Correcting Code (ECC) memory detects and corrects single-bit memory errors and detects double-bit errors, improving reliability for compute workloads. ECC is toggled with nvidia-smi -e 1 (enable) or nvidia-smi -e 0 (disable), and the change takes effect after a reboot.
On Warewulf-managed compute nodes, the ECC setting needs to be applied automatically at boot. This article covers how to create a Warewulf overlay that enables ECC via a systemd one-shot service, and how to work around a known issue on certain Ampere workstation GPU models where ECC remains stuck in a "Pending" state after reboot.
This guide assumes that NVIDIA drivers are already installed in the Warewulf image and that nvidia-smi is available on the compute nodes. See Install NVIDIA Drivers in a Warewulf Image for setup instructions.
Problem
Running nvidia-smi -e 1 sets the pending ECC mode to Enabled, but after the node reboots the current mode is still Disabled and the pending mode is still Enabled. The change never takes effect.
This has been observed on Ampere-generation workstation GPU models such as the RTX A4000 and RTX A5000. The nvidia-smi -q -d ECC output shows the following pattern across all attached GPU devices:
ECC Mode
Current : Disabled
Pending : Enabled
The root cause is that the nvidia_drm and nvidia_modeset kernel modules hold the GPU device open during boot, which prevents the NVIDIA driver from applying the queued ECC configuration change during initialization.
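To spot this state without reading the full report, the per-GPU modes can be queried in CSV form and classified. The following is a minimal sketch rather than a supported tool: the function name classify_ecc is made up for illustration, and it assumes the ecc.mode.current and ecc.mode.pending query fields print Enabled or Disabled, one line per GPU. Any other value is reported as unknown.

```shell
#!/bin/sh
# classify_ecc (illustrative name): read "current, pending" ECC mode pairs on
# stdin, one GPU per line, and print one status word per GPU.
classify_ecc() {
    while IFS=, read -r ecc_current ecc_pending || [ -n "$ecc_current" ]; do
        # Strip the spaces that surround CSV fields.
        ecc_current=$(printf '%s' "$ecc_current" | tr -d '[:space:]')
        ecc_pending=$(printf '%s' "$ecc_pending" | tr -d '[:space:]')
        [ -n "$ecc_current" ] || continue
        case "$ecc_current" in
            Enabled|Disabled) ;;
            *) echo "unknown:$ecc_current"; continue ;;
        esac
        if [ "$ecc_current" = "$ecc_pending" ]; then
            echo "ok:$ecc_current"
        else
            # A mismatch right after nvidia-smi -e 1 is expected; a mismatch
            # that survives a reboot is the problem described above.
            echo "pending:$ecc_current->$ecc_pending"
        fi
    done
}

# Usage on a node:
#   nvidia-smi --query-gpu=ecc.mode.current,ecc.mode.pending \
#       --format=csv,noheader | classify_ecc
```

A line beginning with pending: that persists across a reboot indicates that the queued change is not being applied.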
Resolution
The fix involves two parts: a systemd one-shot service to run nvidia-smi -e 1 at boot, and a modprobe blacklist to prevent the modules that block the ECC change from loading.
Create the Warewulf overlay
Create a dedicated overlay for the ECC configuration:
wwctl overlay create nvidia-ecc
Add the systemd service
Create the one-shot service unit file in the overlay:
wwctl overlay edit -p nvidia-ecc /etc/systemd/system/ww-nvidia-ecc.service
Add the following contents:
[Unit]
Description=Enable ECC on NVIDIA GPUs via Warewulf
ConditionPathExists=/usr/bin/nvidia-smi
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/nvidia-smi -e 1
[Install]
WantedBy=multi-user.target
Set the correct ownership and permissions:
wwctl overlay chmod nvidia-ecc /etc/systemd/system/ww-nvidia-ecc.service 0644
wwctl overlay chown nvidia-ecc /etc/systemd/system/ww-nvidia-ecc.service 0 0
Enable the service at boot
To enable the service, create a symlink in multi-user.target.wants using the Warewulf softlink template function. Create a .ww template file:
wwctl overlay edit -p nvidia-ecc /etc/systemd/system/multi-user.target.wants/ww-nvidia-ecc.service.ww
Add this single line as the contents:
{{ softlink "/etc/systemd/system/ww-nvidia-ecc.service" }}
When Warewulf builds the overlay, this template renders into a real symlink on the node, which is what enables the unit at boot.
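Once a node has booted with the overlay, systemctl is-enabled ww-nvidia-ecc.service is the quickest check. To inspect the rendered symlink directly, something like the sketch below may help. The function name unit_link_ok and its optional root-directory argument are hypothetical; the argument exists only so the check can be pointed at an unpacked copy of the overlay instead of the live filesystem.

```shell
#!/bin/sh
# unit_link_ok (illustrative name): succeed only if the wants/ entry is a
# symlink that points at the unit file path used in this article.
# $1 is an optional root directory prefix (empty means the live system).
unit_link_ok() {
    ull_root=${1:-}
    ull_link="$ull_root/etc/systemd/system/multi-user.target.wants/ww-nvidia-ecc.service"
    [ -L "$ull_link" ] &&
        [ "$(readlink "$ull_link")" = "/etc/systemd/system/ww-nvidia-ecc.service" ]
}

# Usage on a node:
#   unit_link_ok && echo "symlink present" || echo "symlink missing or wrong"
```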
Blacklist modules that may block ECC activation
On some Ampere workstation GPU models, the nvidia_drm and nvidia_modeset modules can prevent the driver from applying the pending ECC change during initialization. If ECC remains in a "Pending" state after reboot, blacklist these modules by adding a modprobe configuration file to the same overlay:
wwctl overlay edit -p nvidia-ecc /etc/modprobe.d/nvidia-ecc.conf
Add the following contents:
blacklist nvidia_drm
blacklist nvidia_modeset
Set the correct ownership and permissions on this file as well:
wwctl overlay chmod nvidia-ecc /etc/modprobe.d/nvidia-ecc.conf 0644
wwctl overlay chown nvidia-ecc /etc/modprobe.d/nvidia-ecc.conf 0 0
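A blacklist entry only stops automatic, alias-based loading. A module can still be loaded explicitly or pulled in as a dependency of another module, so it is worth confirming on a booted node that neither module is present. The sketch below assumes the standard /proc/modules format (module name in the first column). The function name blocking_modules and its optional file argument are illustrative, the latter so the check can be tried against a saved listing.

```shell
#!/bin/sh
# blocking_modules (illustrative name): print nvidia_drm and/or nvidia_modeset
# if they appear in a /proc/modules-style listing. Prints nothing when
# neither is loaded. $1 optionally names a file to read instead of the live
# /proc/modules.
blocking_modules() {
    awk '$1 == "nvidia_drm" || $1 == "nvidia_modeset" { print $1 }' \
        "${1:-/proc/modules}"
}

# Usage on a node:
#   loaded=$(blocking_modules)
#   [ -z "$loaded" ] || echo "still loaded: $loaded"
```

If either module is still listed, something other than alias-based autoloading is loading it, and the blacklist alone will not be enough.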
Add the overlay and rebuild
Add the nvidia-ecc overlay to the relevant node or profile, then rebuild the overlays:
wwctl overlay build
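Reboot the node so that it picks up the new overlay. The service only queues the change (Pending: Enabled) the first time it runs, so if ECC was not already pending, reboot once more for the change to take effect. Then confirm that ECC is enabled: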
nvidia-smi -q -d ECC
The output should show:
ECC Mode
Current : Enabled
Pending : Enabled
The overlay can remain in place on subsequent boots since nvidia-smi -e 1 is a no-op when ECC is already enabled.
Notes
The module blacklist for nvidia_drm and nvidia_modeset is only needed when the ECC change is stuck in a "Pending" state. This has been observed on Ampere workstation GPU models such as the RTX A4000 and RTX A5000. Data center GPU models such as the A100 and H100 typically do not require this workaround.
If the node requires nvidia_drm or nvidia_modeset for display output, blacklisting these modules disables GPU-accelerated display. For headless compute nodes, which are the typical Warewulf use case, this has no functional impact.
References & related articles
Install NVIDIA Drivers in a Warewulf Image
NVIDIA SMI Documentation
NVIDIA Developer Forum: How to enable ECC on RTX A4000