
Tips and Tricks for Analyzing a vmcore dump with the Crash Utility

Introduction

At CIQ, customers will often create tickets due to a kernel crash they experienced. As part of the investigation process, the Support Team asks for a vmcore dump to be uploaded to CIQ's Secure File Exchange (SFE) service for analysis.

For more information on how to upload a vmcore dump or other files to the SFE service, please consult the How to Upload Files to the CIQ Secure File Exchange (SFE) Service KB article.

This article will detail some of the steps that can be taken with the Crash Utility for vmcore dump analysis.

For further reading, the author highly recommends the following resources:

Example issue

A customer created a ticket with CIQ after their CentOS 7.9 node running CIQ Bridge had a kernel panic and rebooted. What follows is how to set up the Crash Utility and the steps taken to debug the issue.

Crash Utility environment setup

  • Set up a Rocky Linux 9.x installation on a node size of your choice.

  • Update all packages:

sudo dnf update -y
  • Install the kexec-tools package (which provides makedumpfile) along with the crash package, which provides the Crash Utility itself:
sudo dnf install -y kexec-tools crash
  • From the node that went into a kernel panic state, find the kernel version:
uname -r
  • With the kernel version identified, search and download the equivalent version of the kernel-debuginfo-common package. In the CentOS 7.9 example listed above, the kernel version was 3.10.0-1160.119.1.el7.x86_64, thus the required package was kernel-debuginfo-common-x86_64-3.10.0-1160.119.1.el7.x86_64.rpm.

  • Similarly, the same version of the kernel-debuginfo package is also needed. Again referencing the CentOS 7.9 example, downloading the kernel-debuginfo-3.10.0-1160.119.1.el7.x86_64.rpm package is required.
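
For the CentOS 7.9 example, both debuginfo packages can often be fetched directly from the CentOS debuginfo repository. The URL layout below is an assumption and should be verified against the repository for your distribution and kernel version:

wget http://debuginfo.centos.org/7/x86_64/kernel-debuginfo-common-x86_64-3.10.0-1160.119.1.el7.x86_64.rpm
wget http://debuginfo.centos.org/7/x86_64/kernel-debuginfo-3.10.0-1160.119.1.el7.x86_64.rpm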

  • Note for Rocky Linux users - to find the appropriate kernel-debuginfo-common and kernel-debuginfo packages, go to The Rocky Linux Vault. Using Rocky Linux 9.5 with the x86_64 architecture as an example, the kernel-debuginfo-common package is found in the devel repository and kernel-debuginfo is in the devel repository under debug.

  • Install both the kernel-debuginfo-common and kernel-debuginfo packages using the localinstall command. The order matters: kernel-debuginfo-common must be installed first, as shown below:

dnf localinstall -y <path_to_kernel-debuginfo-common_rpm>
dnf localinstall -y <path_to_kernel-debuginfo_rpm>
  • Note for CentOS users - the vmcore dump will be generated as a vmcore.flat file. This is not natively readable by the Crash Utility and therefore has to be converted into a regular vmcore dump file. Run the following command to convert it:
/usr/sbin/makedumpfile -R ./vmcore < ./vmcore.flat
  • Finally, start the Crash Utility by passing it the vmlinux file from your /usr/lib/debug/lib/modules/<kernel_version> directory and the path to the vmcore dump file (a consolidated sketch of these setup steps follows the command below):
crash /usr/lib/debug/lib/modules/<kernel_version>/vmlinux <path_to_vmcore_dump>
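
Putting these steps together, a minimal end-to-end sketch of the setup (using placeholder paths and the kernel version from the CentOS 7.9 example; adjust for your environment) might look like the following:

#!/bin/bash
# Minimal sketch of the analysis-host setup described above. The kernel
# version and the RPM/vmcore paths are placeholders - substitute your own.
KVER="3.10.0-1160.119.1.el7.x86_64"

sudo dnf update -y
sudo dnf install -y kexec-tools crash

# Install the debuginfo packages, kernel-debuginfo-common first
sudo dnf localinstall -y ./kernel-debuginfo-common-x86_64-${KVER}.rpm
sudo dnf localinstall -y ./kernel-debuginfo-${KVER}.rpm

# CentOS only: reassemble the flat dump into a vmcore readable by crash
/usr/sbin/makedumpfile -R ./vmcore < ./vmcore.flat

# Start the Crash Utility against the debug vmlinux and the vmcore
crash /usr/lib/debug/lib/modules/${KVER}/vmlinux ./vmcore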

Example crash dump analysis

  • Starting the Crash Utility will bring up an overview of the vmcore dump, including the date the dump was taken, node name, memory availability, kernel version, and the kernel panic reason.

  • An example kernel panic that CIQ often sees from customers running Kubernetes is blocked tasks:

"Kernel panic - not syncing: hung_task: blocked tasks"
  • From the initial overview, plenty of memory was available at the time of the kernel panic. However, as you will see later in the article, plentiful memory does not rule out a memory-related issue:
MEMORY: 1011.7 GB
  • For deeper analysis of memory usage and availability, the kmem -i command can be run, producing output similar to the example below:
crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  261010406     995.7 GB         ----
         FREE  228447706     871.5 GB   87% of TOTAL MEM
         USED  32562700     124.2 GB   12% of TOTAL MEM
       SHARED  1626752       6.2 GB    0% of TOTAL MEM
      BUFFERS     3277      12.8 MB    0% of TOTAL MEM
       CACHED  20211378      77.1 GB    7% of TOTAL MEM
         SLAB   765701       2.9 GB    0% of TOTAL MEM

   TOTAL HUGE        0            0         ----
    HUGE FREE        0            0    0% of TOTAL HUGE

   TOTAL SWAP  1048575         4 GB         ----
    SWAP USED        0            0    0% of TOTAL SWAP
    SWAP FREE  1048575         4 GB  100% of TOTAL SWAP

 COMMIT LIMIT  131553778     501.8 GB         ----
    COMMITTED  26620006     101.5 GB   20% of TOTAL LIMIT
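
To see which tasks were using the most memory at the time of the crash, the output of crash's ps command can be piped through standard shell tools. In this sketch, field 6 corresponds to the %MEM column of the ps output shown later in this article (note that the row for the current task is prefixed with ">", which shifts its fields by one):

crash> ps | sort -nrk 6 | head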
  • Running the sys -i command shows more details about the hardware, including BIOS information and product tags. Sensitive information such as serial numbers has been redacted in the following example:
crash> sys -i
        DMI_BIOS_VENDOR: HPE
       DMI_BIOS_VERSION: A43
          DMI_BIOS_DATE: 12/03/2021
         DMI_SYS_VENDOR: HPE
       DMI_PRODUCT_NAME: ProLiant DL325 Gen10 Plus
    DMI_PRODUCT_VERSION:
     DMI_PRODUCT_SERIAL: -
       DMI_PRODUCT_UUID: -
       DMI_BOARD_VENDOR: HPE
         DMI_BOARD_NAME: ProLiant DL325 Gen10 Plus
      DMI_BOARD_VERSION:
       DMI_BOARD_SERIAL: -
    DMI_BOARD_ASSET_TAG: -
     DMI_CHASSIS_VENDOR: HPE
       DMI_CHASSIS_TYPE: 23
    DMI_CHASSIS_VERSION:
     DMI_CHASSIS_SERIAL: -
  DMI_CHASSIS_ASSET_TAG: -
  • The bt command produces a backtrace for the task that triggered the kernel panic. The following example shows that PID 650, running the khungtaskd command, initiated the panic:
crash> bt
PID: 650      TASK: ffff9b2ffbb59080  CPU: 47   COMMAND: "khungtaskd"
 #0 [ffff9b2ffb663cb0] machine_kexec at ffffffffb2c69854
 #1 [ffff9b2ffb663d10] __crash_kexec at ffffffffb2d29f12
 #2 [ffff9b2ffb663de0] panic at ffffffffb33ab713
 #3 [ffff9b2ffb663e60] watchdog at ffffffffb2d56bfe
 #4 [ffff9b2ffb663ec8] kthread at ffffffffb2ccb621
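
The panicking task is only part of the picture. The bt command also accepts a PID to produce a backtrace for any other task, and bt -a prints the backtrace of the active task on every CPU, which helps show what the rest of the system was doing at the time of the panic:

crash> bt -a
crash> bt <pid>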
  • More information on khungtaskd can be found in this article; a summary follows:

The kernel hung-task detector is based on a single kernel thread named khungtaskd, which monitors processes in the TASK_UNINTERRUPTIBLE state. If a process remains stuck in the D state for longer than the period specified by kernel.hung_task_timeout_secs (120 seconds by default), the stack information of the hung process is printed.

  • To verify that the kernel is configured to panic when a hung task is detected, use the p command (which prints the value of an expression) with the sysctl_hung_task_panic symbol:
crash> p sysctl_hung_task_panic
sysctl_hung_task_panic = $1 = 1
  • Similarly, the timeout can be checked with the p sysctl_hung_task_timeout_secs command:
crash> p sysctl_hung_task_timeout_secs
sysctl_hung_task_timeout_secs = $2 = 360
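
For reference, the same hung-task settings can be inspected and tuned on a live node with sysctl; this is a sketch of the equivalent commands run on a running system, not inside the Crash Utility:

# Inspect the current hung-task behaviour
sysctl kernel.hung_task_panic kernel.hung_task_timeout_secs

# Example of setting the values at runtime; persist them via /etc/sysctl.d/ if desired
sudo sysctl -w kernel.hung_task_panic=1
sudo sysctl -w kernel.hung_task_timeout_secs=360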
  • To view more information about a particular process, use the ps command with the PID as an argument:
crash> ps 650
      PID    PPID  CPU       TASK        ST  %MEM      VSZ      RSS  COMM
>     650       2  47  ffff9b2ffbb59080  RU   0.0        0        0  [khungtaskd]
  • More information about which tasks were blocked can be found with the log command and a grep for the word INFO. In the following example, systemd is seen to be blocked:
crash> log | grep INFO
[3970064.996286] INFO: task systemd:1 blocked for more than 360 seconds.
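
To count how many hung-task reports were logged, or to see a few lines of context around each one, the log output can be piped through additional grep filters; the search strings in this sketch simply match the message format shown above:

crash> log | grep -c "blocked for more than"
crash> log | grep -A 5 "INFO: task"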
  • To show the stack traces of all blocked tasks in the TASK_UNINTERRUPTIBLE state, use the foreach UN bt command:
foreach UN bt
  • Further filtering of the blocked tasks is possible with the following command string. It picks stack frames #2, #3, and #4 from each backtrace, places them into three columns, counts the occurrences of each unique combination, and sorts them from highest to lowest:
foreach UN bt | awk '/#[2-4] /{print $3,$5}' | paste - - - | sort | uniq -c | sort -nr
  • An example is here:
crash> foreach UN bt | awk '/#[2-4] /{print $3,$5}' | paste - - - | sort | uniq -c | sort -nr
     46 __mutex_lock_slowpath ffffffffb33b6ca7  mutex_lock ffffffffb33b602f     proc_cgroup_show ffffffffb2d35146
     33 __mutex_lock_slowpath ffffffffb33b6ca7  mutex_lock ffffffffb33b602f     cgroup_lock_live_group ffffffffb2d2ed49
      2 __mutex_lock_killable_slowpath ffffffffb33b6520 mutex_lock_killable ffffffffb33b65fe    iterate_dir ffffffffb2e71d00
      1 schedule_timeout ffffffffb33b5831       wait_for_completion ffffffffb33b805d    wait_rcu_gp ffffffffb2cc828e
      1 __mutex_lock_slowpath ffffffffb33b6ca7  mutex_lock ffffffffb33b602f     proc_cgroupstats_show ffffffffb2d2f2be
      1 __mutex_lock_slowpath ffffffffb33b6ca7  mutex_lock ffffffffb33b602f     cgroup_release_agent ffffffffb2d2ffe5
      1 __mutex_lock_slowpath ffffffffb33b6ca7  mutex_lock ffffffffb33b602f     cgroup_free_fn ffffffffb2d34308
  • The above output shows a high volume of __mutex_lock_slowpath frames. Researching this pattern attributes it to either a hardware issue or low resource availability (CPU, memory, or I/O).

  • The full list of unique blocked tasks can be found by running foreach UN bt | grep COMMAND | awk '{print $NF}' | sort -u:

crash> foreach UN bt | grep COMMAND | awk '{print $NF}' | sort -u
"cadvisor"
"dockerd"
"filebeat"
"kubelet"
"kworker/113:0"
"kworker/31:1"
"process-exporte"
"runc"
"systemd"
"systemd-journal"
  • To dig further into a blocked task, use the set command with the PID as the argument. The following example sets the context to the blocked systemd task (PID 1):
crash> set 1
    PID: 1
COMMAND: "systemd"
   TASK: ffff9a32db050000  [THREAD_INFO: ffff9a32dbb9c000]
    CPU: 42
  STATE: TASK_UNINTERRUPTIBLE
  • Then run bt -l for a more verbose backtrace. In the __mutex_lock_slowpath example above, the function that called mutex_lock was proc_cgroup_show:
crash> bt -l
#2 [ffff9a32dbb9fda8] __mutex_lock_slowpath at ffffffffb33b6ca7
    /usr/src/debug/kernel-3.10.0-1160.119.1.el7/linux-3.10.0-1160.119.1.el7.x86_64/include/linux/spinlock.h: 337
#3 [ffff9a32dbb9fe08] mutex_lock at ffffffffb33b602f
    /usr/src/debug/kernel-3.10.0-1160.119.1.el7/linux-3.10.0-1160.119.1.el7.x86_64/arch/x86/include/asm/current.h: 14
#4 [ffff9a32dbb9fe20] proc_cgroup_show at ffffffffb2d35146
    /usr/src/debug/kernel-3.10.0-1160.119.1.el7/linux-3.10.0-1160.119.1.el7.x86_64/kernel/cgroup.c: 4706
  • The line of cgroup.c referenced by proc_cgroup_show was 4706. The surrounding source code (provided by the kernel-debuginfo package) can be viewed from within the Crash Utility using the list (l) command:
crash> l /usr/src/debug/kernel-3.10.0-1160.119.1.el7/linux-3.10.0-1160.119.1.el7.x86_64/kernel/cgroup.c: 4706
4701
4702            retval = 0;
4703
4704            mutex_lock(&cgroup_mutex);
4705
4706            for_each_active_root(root) {
4707                    struct cgroup_subsys *ss;
4708                    struct cgroup *cgrp;
4709                    int count = 0;
4710

There is a global mutex, cgroup_mutex, used by the cgroup system. This should be taken by anything that wants to modify a cgroup. It may also be taken to prevent cgroups from being modified, but more specific locks may be more appropriate in that situation.

  • Since cgroups handle the allocation of resources, the mutex contention observed here points towards a lack of resources on the node.
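
It can also be useful to identify which task actually held cgroup_mutex at the time of the panic. The following is only a sketch: it assumes the kernel's struct mutex carries an owner field (present on typical RHEL 7 era SMP kernels, though it may be NULL in some code paths), and the placeholder addresses come from the preceding commands:

crash> p &cgroup_mutex
crash> struct mutex.owner <mutex_address_from_above>
crash> struct task_struct.pid,comm <owner_task_address_from_above>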

  • As an aside, if the mutex_lock code pointed to mutex_lock(&dentry->d_inode->i_mutex);, that would highlight an inode-related error on the storage medium and further investigation would be required there.

  • Finally, after running the log command to dump the system message buffer, multiple Hardware Error messages were observed related to the customer's memory DIMMs:

crash> log
<output_redacted>
[3968484.124456] {176586}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 20480
[3968484.124460] {176586}[Hardware Error]: It has been corrected by h/w and requires no further action
[3968484.124461] {176586}[Hardware Error]: event severity: corrected
[3968484.124463] {176586}[Hardware Error]:  Error 0, type: corrected
[3968484.124464] {176586}[Hardware Error]:  fru_text: DIMM# Sourced
[3968484.124465] {176586}[Hardware Error]:   section_type: memory error
[3968484.124466] {176586}[Hardware Error]:   error_status: 0x0000000000040400
[3968484.124468] {176586}[Hardware Error]:   physical_address: 0x000000f877d52000
[3968484.124470] {176586}[Hardware Error]:   node: 0 card: 3 module: 1 rank: 2 bank: 6 row: 57311 column: 0
[3968484.124471] {176586}[Hardware Error]:  Error 1, type: corrected
[3968484.124472] {176586}[Hardware Error]:  fru_text: DIMM# Sourced
[3968484.124473] {176586}[Hardware Error]:   section_type: memory error
[3968484.124474] {176586}[Hardware Error]:   error_status: 0x0000000000040400
[3968484.124475] {176586}[Hardware Error]:   physical_address: 0x000000f7f9b63180
[3968484.124477] {176586}[Hardware Error]:   node: 0 card: 3 module: 1 rank: 2 bank: 6 row: 56806 column: 16
[3968484.124501] [Hardware Error]: Corrected error, no action required.
[3968484.125383] [Hardware Error]: CPU:0 (17:31:0) MC1_STATUS[-|CE|-|-|AddrV|-|-|-|-]: 0x940000000000009f
[3968484.126221] [Hardware Error]: Error Addr: 0x000000f877d52000, IPID: 0x0000000000000000
[3968484.126907] [Hardware Error]: Instruction Fetch Unit Extended Error Code: 0
[3968484.127593] [Hardware Error]: Instruction Fetch Unit Error: microtag probe port parity error.
[3968484.128280] [Hardware Error]: cache level: L3/GEN, tx: RESV
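
As a complementary check on a live node, corrected memory errors like these are often visible in the EDAC counters exposed through sysfs, assuming the platform's EDAC driver is loaded (paths can vary by hardware); a sketch:

grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count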

Example issue resolution

Ultimately, it was found that the resource shortage stemmed from hardware errors in the customer's memory DIMMs. The resulting contention caused tasks to be blocked for more than 360 seconds, which in turn invoked the kernel panic.

Conclusion

vmcore dump analysis is a vast and complicated topic. There is an almost limitless number of ways the kernel can crash, and the root cause may differ from system to system. The author hopes that the example analysis of blocked tasks above helps demystify some of the research required and makes it easier to identify a root cause if you run into a similar situation.