Diagnosing and Mitigating Soft Lockup Issues on Rocky Linux

Introduction

This article addresses soft lockup issues that can occur during application execution on Rocky Linux.

The recommendations here are based on a Slurm environment in which a Python application with high CPU utilization caused CPU soft lockups.

This guide focuses on Rocky Linux 8.x and 9.x.

Problem

Soft lockup messages are logged while an application with high CPU utilization is running.

Symptoms

System logs from dmesg display lockup messages such as BUG: soft lockup - CPU#18 stuck for 20s! [python3:104938] during program execution.

The dmesg output also references raw_spin_unlock_irqrestore, indicating that a process may be holding a lock for an extended period and causing contention for system resources.

The timing of the soft lockup messages is closely tied to the watchdog threshold setting, which can be checked in /proc/sys/kernel/watchdog_thresh.
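
To see which CPUs and processes are affected most often, the soft lockup lines can be grouped and counted. The following is a minimal sketch, assuming the messages follow the format shown above; the function name summarize_soft_lockups is hypothetical and not part of any package.

```shell
# Summarize soft lockup lines read from standard input.
# Prints one line per CPU/process pair, prefixed with the number of occurrences.
summarize_soft_lockups() {
    grep -F 'soft lockup' \
        | sed -E 's/.*CPU#([0-9]+) stuck for [0-9]+s! \[(.*):[0-9]+\].*/cpu=\1 comm=\2/' \
        | sort | uniq -c | sort -rn
}

# On a live system (root may be required if kernel.dmesg_restrict is set):
#   dmesg | summarize_soft_lockups
```

A CPU or process name that dominates the output is a reasonable starting point for the checks in the Resolution section.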

Resolution

Improperly configured /etc/sysconfig/grub file

  • Check the kernel console options in /etc/sysconfig/grub, such as console=ttyS1 or console=ttyS1,9600. A slow serial console can stall a CPU while kernel messages are written to it, so consider increasing the baud rate to 115200 or disabling serial console logging.
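
The console options that the running kernel actually booted with can be checked against /proc/cmdline. The sketch below lists each serial console and its baud rate; the function name list_serial_consoles is hypothetical, and it assumes the usual console=ttySN,BAUD syntax.

```shell
# Print "<device> <baud>" for every serial console in a kernel command line.
# $1 = kernel command line string
list_serial_consoles() {
    printf '%s\n' "$1" | tr ' ' '\n' | awk -F'[=,]' '
        $1 == "console" && $2 ~ /^ttyS[0-9]+$/ {
            baud = $3
            sub(/[^0-9].*$/, "", baud)   # strip parity/bits suffix such as "n8"
            print $2, (baud == "" ? "unspecified" : baud)
        }'
}

# On a live system:
#   list_serial_consoles "$(cat /proc/cmdline)"
```

A baud rate well below 115200 in the output is worth reviewing.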

High load average

  • A load average higher than the total number of available logical CPU cores indicates an oversubscribed system, which can cause soft lockup messages and further issues.

  • Install the sysstat package with sudo dnf install -y sysstat and review the historical data it records under /var/log/sa (for example, with sar -q).
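
The comparison between load average and CPU count can be scripted. This is a minimal sketch with a hypothetical helper name, load_exceeds_cpus; it only compares two numbers and does not interpret why the load is high.

```shell
# Succeed (exit 0) when the load average is greater than the CPU count.
# $1 = load average (for example 42.17), $2 = number of logical CPUs
load_exceeds_cpus() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { exit !(load > cpus) }'
}

# On a live system, using the 1-minute load average:
#   if load_exceeds_cpus "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"; then
#       echo "load average exceeds the number of logical CPUs"
#   fi
```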

The watchdog_thresh kernel parameter configuration

  • The value set in watchdog_thresh determines the threshold interval the kernel uses to decide when a CPU is “stuck.”

  • By default, the watchdog_thresh value is set to 10 seconds.

  • The soft lockup threshold is twice the value of the watchdog_thresh parameter.

  • The general advice is not to change the default watchdog_thresh value, as raising it delays detection and can mask hard lockups that occur.

  • Verify the current watchdog_thresh value with the following command:

cat /proc/sys/kernel/watchdog_thresh
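
Since the soft lockup threshold is twice watchdog_thresh, the effective value can be derived directly. A small sketch, with the hypothetical helper name soft_lockup_threshold:

```shell
# Compute the soft lockup threshold in seconds from a watchdog_thresh value.
# $1 = watchdog_thresh in seconds
soft_lockup_threshold() {
    echo $(( 2 * $1 ))
}

# On a live system:
#   soft_lockup_threshold "$(cat /proc/sys/kernel/watchdog_thresh)"
```

With the default of 10, this gives 20 seconds, which matches the "stuck for 20s" message shown in the Symptoms section.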

Application tuning

  • Investigate the behavior of the application to identify if it is monopolizing CPU resources, and tweak it to reduce CPU load.

Tuned profile selection

  • Apply the tuned profile optimized for high-performance computing workloads by running:
sudo dnf install -y tuned
sudo systemctl enable --now tuned
sudo tuned-adm profile hpc-compute
sudo tuned-adm active

CPU time limitations

  • Limit CPU time for user processes by updating /etc/security/limits.conf. The cpu item is specified in minutes. An example is below:
username soft cpu 60
username hard cpu 120
  • In the above example, the soft limit is 60 minutes and the hard limit is 120 minutes. The user can raise or lower the soft limit up to the hard limit. The hard limit is enforced by the kernel and cannot be increased by the user.
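
To confirm that a limit took effect, note that the limits.conf man page documents the cpu item in minutes, while ulimit -t reports the limit in seconds. The sketch below converts between the two; the helper name cpu_limit_seconds is hypothetical.

```shell
# Convert a limits.conf cpu value (minutes) to the value ulimit -t reports (seconds).
# $1 = cpu value from limits.conf
cpu_limit_seconds() {
    echo $(( $1 * 60 ))
}

# In a new login session for the affected user:
#   ulimit -St   # soft CPU limit in seconds, or "unlimited"
#   ulimit -Ht   # hard CPU limit in seconds, or "unlimited"
```

For the example entries, cpu_limit_seconds 60 and cpu_limit_seconds 120 give the values that ulimit -St and ulimit -Ht should report.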

Vmcore dump analysis

  • While configuring kdump is outside the scope of this article, once it is configured, trigger a crash by running the following commands as root. This crashes the system immediately:
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger
  • By default, the vmcore dump will be available under /var/crash.

  • Analyzing the vmcore dump with the crash utility will also help you understand what is causing the soft lockup issues.
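
Locating the most recent dump before opening it can be scripted. This is a minimal sketch, assuming each dump is stored as a file named vmcore in its own subdirectory; the function name latest_vmcore is hypothetical.

```shell
# Print the path of the most recently modified vmcore file under a directory.
# $1 = directory to search (defaults to /var/crash)
latest_vmcore() {
    find "${1:-/var/crash}" -type f -name vmcore -exec ls -t {} + 2>/dev/null | head -n 1
}

# Opening the dump (assumes the crash utility and the matching kernel
# debuginfo package are installed; the vmlinux path below is an assumption):
#   crash /usr/lib/debug/lib/modules/"$(uname -r)"/vmlinux "$(latest_vmcore)"
```

Inside crash, commands such as bt, log, and ps are common starting points for seeing what the stuck CPU was running.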

Root cause

Soft lockups occur when the kernel scheduler is starved and cannot perform its duties within the allotted watchdog threshold.

A high system workload, particularly from a misbehaving application, can saturate CPU resources and prevent task switching from occurring.

System misconfigurations, such as improper serial console settings or not setting the correct tuned profile, can also cause lockups to occur.

References & related articles

limits.conf man page
Linux limits.conf man page
Kernel Crash Dump documentation
watchdog_thresh documentation