Diagnosing and Mitigating Soft Lockup Issues on Rocky Linux
Introduction
This article addresses soft lockup issues that can occur during application execution on Rocky Linux.
The recommendations provided here are based on a Slurm environment with high CPU utilization from a Python application, causing CPU lockups.
This guide focuses on Rocky Linux 8.x and 9.x.
Problem
Soft lockup messages are observed from an application with high CPU utilization.
Symptoms
System logs from dmesg display lockup messages such as BUG: soft lockup - CPU#18 stuck for 20s! [python3:104938] during program execution.
dmesg also includes raw_spin_unlock_irqrestore messages, indicating that a process may be holding a lock for an extended period and causing contention of system resources.
The timing of the soft lockup messages is closely tied to the watchdog threshold setting, which can be checked under /proc/sys/kernel/watchdog_thresh.
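To make the symptoms above easier to triage, the following minimal sketch condenses soft lockup lines into one summary line per event. The function name parse_soft_lockups is hypothetical, and the pattern assumes the message format quoted above; other kernel versions may word the message slightly differently.

```shell
# Hypothetical helper: summarize soft lockup lines read from stdin.
# Prints one line per event: CPU number, stuck duration, command name, PID.
# Assumes the message format "BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]".
parse_soft_lockups() {
    sed -n 's/.*BUG: soft lockup - CPU#\([0-9]*\) stuck for \([0-9]*\)s! \[\(.*\):\([0-9]*\)].*/cpu=\1 stuck=\2s comm=\3 pid=\4/p'
}

# Typical use on a live system (reading the kernel log may require root):
#   dmesg | parse_soft_lockups
#   cat /proc/sys/kernel/watchdog_thresh
```

If the same command name or CPU number keeps appearing in the summary, that is a reasonable place to start the investigation.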
Resolution
Improperly configured /etc/sysconfig/grub file
- Verify that the /etc/sysconfig/grub file is configured correctly. If it sets serial console options such as console=ttyS1 or console=ttyS1,9600, consider increasing the baud rate to 115200 or disabling serial console logging, because a slow serial console can hold up a CPU while kernel messages are written out.
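As a quick way to see which console options are configured, the sketch below prints every console= argument found in a GRUB defaults file. The function name list_console_args is hypothetical, and the example assumes the arguments sit on a GRUB_CMDLINE_LINUX line in the usual space-separated form.

```shell
# Hypothetical helper: print each console= kernel argument found in a GRUB
# defaults file. $1 = file to inspect (defaults to /etc/sysconfig/grub).
list_console_args() {
    grep -o 'console=[^ "]*' "${1:-/etc/sysconfig/grub}"
}

# Typical use:
#   list_console_args                 # what is configured for future boots
#   grep -o 'console=[^ ]*' /proc/cmdline   # what the running kernel booted with
```

An entry such as console=ttyS1 with no explicit speed, or console=ttyS1,9600, is the case the bullet above suggests revisiting.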
High load average
- High load averages, often exceeding the number of available logical CPU cores, can cause further issues.
- It is worth installing the sysstat package with sudo dnf install -y sysstat and checking the generated /var/log/sa files.
- If the load average is higher than the total number of logical CPU cores available, that can also cause the soft lockup messages.
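The comparison between load average and core count described above can be sketched as follows. The function name load_exceeds_cores is hypothetical; the live check assumes a Linux system with /proc/loadavg and the nproc utility, and note that nproc reports the CPUs available to the current process, which may be fewer than the machine total under CPU affinity or Slurm cgroups.

```shell
# Hypothetical helper: exit 0 when the load average exceeds the CPU count.
# $1 = load average (for example 42.17), $2 = number of logical CPUs.
load_exceeds_cores() {
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load > cores) }'
}

# Live check, only attempted where /proc/loadavg and nproc are present.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    load1=$(cut -d' ' -f1 /proc/loadavg)   # first field is the 1-minute load average
    cores=$(nproc)
    if load_exceeds_cores "$load1" "$cores"; then
        echo "1-minute load $load1 exceeds $cores logical CPUs"
    else
        echo "1-minute load $load1 is within $cores logical CPUs"
    fi
fi
```

For history rather than a point-in-time reading, sar -q from the sysstat package reports the load averages recorded in the /var/log/sa files.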
The watchdog_thresh kernel parameter configuration
- The value set in watchdog_thresh determines the threshold interval the kernel uses to decide when a CPU is “stuck.”
- By default, the watchdog_thresh value is set to 10 seconds.
- The soft lockup threshold is twice the value of the watchdog_thresh parameter, which is 20 seconds by default.
- The general advice is not to change the default watchdog_thresh value, as this can mask hard lockups that occur.
- Verify the current watchdog_thresh value with the following command:
cat /proc/sys/kernel/watchdog_thresh
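The relationship between watchdog_thresh and the soft lockup threshold is simple arithmetic, sketched here for convenience. The function name soft_lockup_threshold is hypothetical; the live read is skipped on systems where the proc file is not present.

```shell
# Hypothetical helper: soft lockup threshold in seconds for a given
# watchdog_thresh value (the soft lockup threshold is twice watchdog_thresh).
soft_lockup_threshold() {
    echo $(( 2 * $1 ))
}

# Report the value in effect on this system, if the file is readable.
if [ -r /proc/sys/kernel/watchdog_thresh ]; then
    thresh=$(cat /proc/sys/kernel/watchdog_thresh)
    echo "watchdog_thresh=${thresh}s, soft lockup threshold=$(soft_lockup_threshold "$thresh")s"
fi
```

With the default of 10, the result is 20, which lines up with the "stuck for 20s" text in the example log message.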
Application tuning
- Investigate the behavior of the application to identify if it is monopolizing CPU resources, and tweak it to reduce CPU load.
Tuned profile selection
- Apply the tuned profile optimized for high-performance computing workloads by running:
sudo dnf install -y tuned
sudo systemctl enable --now tuned
sudo tuned-adm profile hpc-compute
sudo tuned-adm active
CPU time limitations
- Limit CPU time for user processes by updating /etc/security/limits.conf. An example is below:
username soft cpu 60
username hard cpu 120
- In the above example, the soft limit (this can be moved up or down by the user within the hard limit) is 60 minutes, and the hard limit, which is enforced by the kernel, is 120 minutes. The limits.conf man page defines the cpu item in minutes of CPU time, not seconds. The hard limit cannot be increased by the user.
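To confirm that a limit has taken effect after the user logs in again, the sketch below prints the limits seen by the current shell. One detail worth keeping in mind: limits.conf(5) documents the cpu item in minutes, while ulimit -t reports seconds, so the hypothetical helper cpu_minutes_to_seconds is included to convert between the two.

```shell
# Hypothetical helper: convert a limits.conf cpu value (minutes) into the
# number of seconds that `ulimit -t` would report for it.
cpu_minutes_to_seconds() {
    echo $(( $1 * 60 ))
}

# Show the CPU time limits of the current shell ("unlimited" if none is set).
echo "soft cpu limit: $(ulimit -S -t)"
echo "hard cpu limit: $(ulimit -H -t)"

# Values to expect for a limits.conf entry of 60 (soft) and 120 (hard):
echo "expected soft: $(cpu_minutes_to_seconds 60)s, expected hard: $(cpu_minutes_to_seconds 120)s"
```

Limits set through limits.conf are applied by PAM at session start, so an already running session will keep showing its old values.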
Vmcore dump analysis
- While configuring kdump is outside the scope of this article, once configured, trigger a crash as root using the following commands:
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger
- By default, the vmcore dump will be available under /var/crash.
- Going through the vmcore dump using the crash utility will also help you understand what is causing the soft lockup issues.
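As a starting point for the analysis, this sketch locates the most recently written vmcore file. The function name latest_vmcore is hypothetical, and it relies on GNU find, which is what Rocky Linux ships. The crash invocation in the comments assumes that a debug-symbol vmlinux matching the crashed kernel is installed at the usual kernel-debuginfo location.

```shell
# Hypothetical helper: print the path of the newest file named vmcore below
# a crash directory. $1 = directory to search (defaults to /var/crash).
latest_vmcore() {
    find "${1:-/var/crash}" -type f -name vmcore -printf '%T@ %p\n' 2>/dev/null \
        | LC_ALL=C sort -n | tail -n 1 | cut -d' ' -f2-
}

# Typical use, assuming the dump came from the currently running kernel version:
#   crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux "$(latest_vmcore)"
# Inside crash, commands such as log, bt, and ps show the kernel log,
# the backtrace of the current task, and the process list.
```

The function prints nothing when no dump is found, so an empty result usually means kdump did not capture the crash.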
Root cause
Soft lockups occur when a CPU runs in kernel mode for longer than the soft lockup threshold without giving the scheduler a chance to run other tasks.
A high system workload, particularly from a misbehaving application, can saturate CPU resources and prevent task switching from occurring.
System misconfigurations, such as improper serial console settings or not setting the correct tuned profile, can also cause lockups to occur.
References & related articles
limits.conf man page
Linux limits.conf man page
Kernel Crash Dump documentation
watchdog_thresh documentation