Diagnosing and Mitigating Soft Lockup Issues on Rocky Linux
Introduction
This article addresses soft lockup issues that can occur during application execution on Rocky Linux.
The recommendations provided here are based on a Slurm environment in which a Python application with high CPU utilization triggered CPU soft lockups.
This guide focuses on Rocky Linux 8.x and 9.x.
Problem
Soft lockup messages are observed from an application with high CPU utilization.
Symptoms
System logs from dmesg display lockup messages such as BUG: soft lockup - CPU#18 stuck for 20s! [python3:104938] during program execution.
dmesg also includes raw_spin_unlock_irqrestore messages, indicating that a process may be holding a lock for an extended period and causing contention for system resources.
The timing of the soft lockup messages is closely tied to the watchdog threshold setting, which can be checked under /proc/sys/kernel/watchdog_thresh.
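To get a quick picture of which CPUs are affected, the lockup messages can be tallied per CPU. The following is a minimal sketch; the helper name count_soft_lockups is hypothetical, and it assumes the message format shown above.

```shell
# Count soft lockup reports per CPU from kernel log text read on stdin.
# Prints one "CPU#<n> <count>" line per CPU, most frequent first.
count_soft_lockups() {
    grep -o 'soft lockup - CPU#[0-9]*' \
        | awk '{ n[$NF]++ } END { for (c in n) print c, n[c] }' \
        | sort -k2,2nr
}

# Usage (reading the kernel log typically requires root):
#   sudo dmesg | count_soft_lockups
```

A single CPU dominating the list may point at one pinned task, while an even spread suggests system-wide pressure.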
Resolution
Improperly configured /etc/sysconfig/grub file
- Verify that the /etc/sysconfig/grub file is configured correctly with options such as console=ttyS1 or console=ttyS1,9600, and consider increasing the baud rate to 115200 or disabling serial console logging.
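To see which console settings the running kernel actually booted with, the console= arguments can be pulled out of the kernel command line. This is a minimal sketch; the helper name list_consoles is hypothetical.

```shell
# Print each console= argument found in a kernel command line string.
# $1: the command line text, for example the contents of /proc/cmdline
list_consoles() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep '^console='
}

# Usage:
#   list_consoles "$(cat /proc/cmdline)"
```

/proc/cmdline reflects the kernel that is currently running, so edits to the GRUB configuration only show up here after the next boot.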
High load average
- High load averages, often exceeding the number of available logical CPU cores, can cause further issues.
- It is worth installing the sysstat package with sudo dnf install -y sysstat and checking the generated /var/log/sa files.
- If the load average is higher than the total number of logical CPU cores available, that can also cause the soft lockup messages.
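The comparison between load average and logical CPU count can be scripted. The sketch below uses a hypothetical helper, load_exceeds_cores, and relies on awk because the load average is a decimal value.

```shell
# Succeed (exit status 0) when the load average exceeds the CPU count.
# $1: load average (for example 42.17)
# $2: number of logical CPUs
load_exceeds_cores() {
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load > cores) }'
}

# Usage with the live 1-minute load average:
#   load_exceeds_cores "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)" \
#       && echo "load average exceeds logical CPU count"
# Historical load averages collected by sysstat can be reviewed with:
#   sar -q
```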
The watchdog_thresh kernel parameter configuration
- The value set in watchdog_thresh determines the threshold interval the kernel uses to decide when a CPU is “stuck.”
- By default, the watchdog_thresh value is set to 10 seconds.
- The soft lockup threshold is twice the value of the watchdog_thresh parameter.
- The general advice is not to change the default watchdog_thresh value, as this can mask hard lockups that occur.
- Verify the current watchdog_thresh value with the following command:
cat /proc/sys/kernel/watchdog_thresh
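Because the soft lockup threshold is twice watchdog_thresh, the effective value can be derived directly. A minimal sketch, with soft_lockup_threshold as a hypothetical helper name:

```shell
# Print the soft lockup threshold in seconds for a given watchdog_thresh.
# $1: watchdog_thresh value in seconds
soft_lockup_threshold() {
    echo $(( 2 * $1 ))
}

# Usage:
#   soft_lockup_threshold "$(cat /proc/sys/kernel/watchdog_thresh)"
```

With the default value of 10, this gives 20 seconds, which matches the "stuck for 20s" message shown in the symptoms.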
Application tuning
- Investigate the behavior of the application to identify if it is monopolizing CPU resources, and tweak it to reduce CPU load.
Tuned profile selection
- Apply the tuned profile optimized for high-performance computing workloads by running:
sudo dnf install -y tuned
sudo systemctl enable --now tuned
sudo tuned-adm profile hpc-compute
sudo tuned-adm active
CPU time limitations
- Limit CPU time for user processes by updating /etc/security/limits.conf. An example is below:
username soft cpu 60
username hard cpu 120
- In the above example, the soft limit is 60 and the hard limit is 120. limits.conf expresses the cpu item in minutes, so these correspond to 60 and 120 minutes of CPU time. The user can move the soft limit up or down within the hard limit; the hard limit is enforced by the kernel and cannot be increased by the user.
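To confirm which CPU time limits a session actually received, the shell's ulimit builtin can be queried. This is a minimal sketch with a hypothetical helper name, show_cpu_limits; note that ulimit -t reports its value in seconds, or "unlimited" when no limit is set.

```shell
# Print the soft and hard CPU time limits of the current shell.
# Values are in seconds, or "unlimited" when no limit applies.
show_cpu_limits() {
    echo "soft=$(ulimit -St) hard=$(ulimit -Ht)"
}

# Usage, from a fresh login session of the affected user:
#   show_cpu_limits
```

Limits from limits.conf are applied when a session is set up, so an already open shell is not expected to reflect a new entry.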
Vmcore dump analysis
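- While configuring kdump is outside the scope of this article, once configured, trigger a crash by running the following commands as root. This panics the system immediately, so use it only on a node that can be taken offline: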
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger
- By default, the vmcore dump will be available under /var/crash.
- Going through the vmcore dump using the crash utility will also help you understand what is causing the soft lockup issues.
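As a starting point for the analysis, the most recent dump can be located with a small helper. The sketch below assumes GNU find (for -printf) and uses a hypothetical function name, latest_vmcore; the crash invocation in the comments additionally assumes that the kernel debuginfo package matching the crashed kernel is installed.

```shell
# Print the path of the most recently modified vmcore file.
# $1: directory to search (defaults to /var/crash)
latest_vmcore() {
    find "${1:-/var/crash}" -type f -name vmcore -printf '%T@ %p\n' 2>/dev/null \
        | sort -rn | head -n1 | cut -d' ' -f2-
}

# Usage (assumes the crashed kernel is the one currently running):
#   crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux "$(latest_vmcore)"
# Commands that are often useful at the crash prompt:
#   log     kernel message buffer, including the soft lockup reports
#   bt -a   stack traces of the active task on each CPU
#   ps      process list at the time of the crash
```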
Root cause
Soft lockups occur when the kernel scheduler is starved and cannot perform its duties within the allotted watchdog threshold.
A high system workload, particularly from a misbehaving application, can saturate CPU resources and prevent task switching from occurring.
System misconfigurations, such as improper serial console settings or not setting the correct tuned profile, can also cause lockups to occur.
References & related articles
limits.conf man page
Linux limits.conf man page
Kernel Crash Dump documentation
watchdog_thresh documentation