Tips and Tricks for Analyzing a vmcore dump with the Crash Utility
Introduction
At CIQ, customers often create tickets after experiencing a kernel crash. As part of the investigation process, the Support Team asks for a vmcore dump to be uploaded to CIQ's Secure File Exchange (SFE) service for analysis.
For more information on how to upload a vmcore dump or other files to the SFE service, please consult the How to Upload Files to the CIQ Secure File Exchange (SFE) Service KB article.
This article will detail some of the steps that can be taken with the Crash Utility for vmcore dump analysis.
For further reading, the author highly recommends the following resources:
Example issue
A customer created a ticket with CIQ after their CentOS 7.9 node using CIQ Bridge had a kernel panic and rebooted. What follows is how to set up the Crash Utility and the steps taken to debug the issue.
Crash Utility environment setup
- Set up a Rocky Linux 9.x installation on a node of any size.
- Update all packages:
sudo dnf update -y
- Install the kexec-tools package:
sudo dnf install -y kexec-tools
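- If it is not already present, also install the crash package itself (it is not guaranteed to be pulled in as a dependency of kexec-tools):
sudo dnf install -y crash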
- From the node that went into a kernel panic state, find the kernel version:
uname -r
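In the CentOS 7.9 example referenced throughout this article, the command returned:
3.10.0-1160.119.1.el7.x86_64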
- With the kernel version identified, search for and download the equivalent version of the kernel-debuginfo-common package. In the CentOS 7.9 example listed above, the kernel version was 3.10.0-1160.119.1.el7.x86_64, thus the required package was kernel-debuginfo-common-x86_64-3.10.0-1160.119.1.el7.x86_64.rpm.
- Similarly, the same version of the kernel-debuginfo package is also needed. Again referencing the CentOS 7.9 example, downloading the kernel-debuginfo-3.10.0-1160.119.1.el7.x86_64.rpm package is required.
- Note for Rocky Linux users - to find the appropriate kernel-debuginfo-common and kernel-debuginfo packages, go to The Rocky Linux Vault. Using Rocky Linux 9.5 with the x86_64 architecture as an example, the kernel-debuginfo-common package is found in the devel repository and kernel-debuginfo is in the devel repository under debug.
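- As a hedged illustration only (the exact directory layout under the vault can change between releases, so browse the vault to confirm the path), the downloads could look similar to:
wget https://dl.rockylinux.org/vault/rocky/9.5/<path_to_repo>/kernel-debuginfo-common-x86_64-<kernel_version>.rpm
wget https://dl.rockylinux.org/vault/rocky/9.5/<path_to_debug_repo>/kernel-debuginfo-<kernel_version>.rpm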
- Install both the kernel-debuginfo-common and kernel-debuginfo packages using the localinstall command. The order in which the packages are installed also matters; install kernel-debuginfo-common first:
dnf localinstall -y <path_to_kernel-debuginfo-common_rpm>
dnf localinstall -y <path_to_kernel-debuginfo_rpm>
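- To confirm the packages installed correctly, check that the matching vmlinux file now exists:
ls /usr/lib/debug/lib/modules/<kernel_version>/vmlinux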
- Note for CentOS users - the vmcore dump will be generated as a vmcore.flat file. This is not natively readable by the Crash Utility and must first be converted into a vmcore dump file. Run the following command to convert the file:
/usr/sbin/makedumpfile -R ./vmcore < ./vmcore.flat
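- As an optional sanity check, the converted file can be inspected with the file command (the exact output will vary with the dump format and compression used when the dump was captured):
file ./vmcore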
- Finally, start the Crash Utility by passing the vmlinux file from your /usr/lib/debug/lib/modules/<kernel_version>/ directory as the first argument and the path to the vmcore dump as the second:
crash /usr/lib/debug/lib/modules/<kernel_version>/vmlinux <path_to_vmcore_dump>
Example crash dump analysis
- Starting the Crash Utility will bring up an overview of the vmcore dump, including the date the dump was taken, node name, memory availability, kernel version, and the kernel panic reason.
- An example kernel panic that CIQ often sees from customers running Kubernetes is blocked tasks:
"Kernel panic - not syncing: hung_task: blocked tasks"
- From the initial overview, memory was plentiful at the time of the kernel panic. However, as you will see later in this article, having plentiful memory does not rule out a memory-related issue:
MEMORY: 1011.7 GB
- For deeper analysis of memory usage and availability, the kmem -i command can be run and produces output similar to the example below:
crash> kmem -i
PAGES TOTAL PERCENTAGE
TOTAL MEM 261010406 995.7 GB ----
FREE 228447706 871.5 GB 87% of TOTAL MEM
USED 32562700 124.2 GB 12% of TOTAL MEM
SHARED 1626752 6.2 GB 0% of TOTAL MEM
BUFFERS 3277 12.8 MB 0% of TOTAL MEM
CACHED 20211378 77.1 GB 7% of TOTAL MEM
SLAB 765701 2.9 GB 0% of TOTAL MEM
TOTAL HUGE 0 0 ----
HUGE FREE 0 0 0% of TOTAL HUGE
TOTAL SWAP 1048575 4 GB ----
SWAP USED 0 0 0% of TOTAL SWAP
SWAP FREE 1048575 4 GB 100% of TOTAL SWAP
COMMIT LIMIT 131553778 501.8 GB ----
COMMITTED 26620006 101.5 GB 20% of TOTAL LIMIT
- Running the sys -i command shows more details about the hardware, as well as BIOS information and product tags. Sensitive information such as serial numbers has been removed from the following example:
crash> sys -i
DMI_BIOS_VENDOR: HPE
DMI_BIOS_VERSION: A43
DMI_BIOS_DATE: 12/03/2021
DMI_SYS_VENDOR: HPE
DMI_PRODUCT_NAME: ProLiant DL325 Gen10 Plus
DMI_PRODUCT_VERSION:
DMI_PRODUCT_SERIAL: -
DMI_PRODUCT_UUID: -
DMI_BOARD_VENDOR: HPE
DMI_BOARD_NAME: ProLiant DL325 Gen10 Plus
DMI_BOARD_VERSION:
DMI_BOARD_SERIAL: -
DMI_BOARD_ASSET_TAG: -
DMI_CHASSIS_VENDOR: HPE
DMI_CHASSIS_TYPE: 23
DMI_CHASSIS_VERSION:
DMI_CHASSIS_SERIAL: -
DMI_CHASSIS_ASSET_TAG: -
- The bt command produces a backtrace for the process that caused the kernel panic. The following example shows that process 650, running the command khungtaskd, was responsible for the kernel panic:
crash> bt
PID: 650 TASK: ffff9b2ffbb59080 CPU: 47 COMMAND: "khungtaskd"
#0 [ffff9b2ffb663cb0] machine_kexec at ffffffffb2c69854
#1 [ffff9b2ffb663d10] __crash_kexec at ffffffffb2d29f12
#2 [ffff9b2ffb663de0] panic at ffffffffb33ab713
#3 [ffff9b2ffb663e60] watchdog at ffffffffb2d56bfe
#4 [ffff9b2ffb663ec8] kthread at ffffffffb2ccb621
- More information on khungtaskd can be found in this article; a summary follows:
The kernel hung task check is based on a single kernel thread named khungtaskd, which monitors processes in the TASK_UNINTERRUPTIBLE status. If a process is stuck in the D state for longer than the period specified by kernel.hung_task_timeout_secs (defaults to 120 seconds), the stack information of the hung task process will be printed.
- To verify that the kernel is configured to panic when khungtaskd detects a hung task, the p command (to print the value of an expression) with the sysctl_hung_task_panic argument can be used:
crash> p sysctl_hung_task_panic
sysctl_hung_task_panic = $1 = 1
- Similarly, the timeout can also be checked using the p sysctl_hung_task_timeout_secs command:
crash> p sysctl_hung_task_timeout_secs
sysctl_hung_task_timeout_secs = $2 = 360
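- For reference, the same settings can be checked on a running system (outside of the Crash Utility) with sysctl:
sysctl kernel.hung_task_panic kernel.hung_task_timeout_secs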
- To view more information about a particular process, the ps command with the PID as an argument can be used:
crash> ps 650
PID PPID CPU TASK ST %MEM VSZ RSS COMM
> 650 2 47 ffff9b2ffbb59080 RU 0.0 0 0 [khungtaskd]
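- As an optional extra step, the parental hierarchy of the task can also be displayed by adding the -p flag to the ps command (output omitted here):
crash> ps -p 650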
- Finding out which tasks were blocked can be done with the log command and a grep for the word INFO. In the subsequent example, systemd is seen as being blocked:
crash> log | grep INFO
[3970064.996286] INFO: task systemd:1 blocked for more than 360 seconds.
- To show the stack traces of all blocked tasks in the TASK_UNINTERRUPTIBLE state, the foreach UN bt command is recommended:
foreach UN bt
- Further filtering of all blocked tasks is possible with the following command string. It picks stack frames #2, #3, and #4 from each process's backtrace, places them into three columns, counts the number of occurrences of each unique combination, and sorts from highest to lowest:
foreach UN bt | awk '/#[2-4] /{print $3,$5}' | paste - - - | sort | uniq -c | sort -nr
- An example is here:
crash> foreach UN bt | awk '/#[2-4] /{print $3,$5}' | paste - - - | sort | uniq -c | sort -nr
46 __mutex_lock_slowpath ffffffffb33b6ca7 mutex_lock ffffffffb33b602f proc_cgroup_show ffffffffb2d35146
33 __mutex_lock_slowpath ffffffffb33b6ca7 mutex_lock ffffffffb33b602f cgroup_lock_live_group ffffffffb2d2ed49
2 __mutex_lock_killable_slowpath ffffffffb33b6520 mutex_lock_killable ffffffffb33b65fe iterate_dir ffffffffb2e71d00
1 schedule_timeout ffffffffb33b5831 wait_for_completion ffffffffb33b805d wait_rcu_gp ffffffffb2cc828e
1 __mutex_lock_slowpath ffffffffb33b6ca7 mutex_lock ffffffffb33b602f proc_cgroupstats_show ffffffffb2d2f2be
1 __mutex_lock_slowpath ffffffffb33b6ca7 mutex_lock ffffffffb33b602f cgroup_release_agent ffffffffb2d2ffe5
1 __mutex_lock_slowpath ffffffffb33b6ca7 mutex_lock ffffffffb33b602f cgroup_free_fn ffffffffb2d34308
- From the above output, a high volume of __mutex_lock_slowpath messages can be seen. Researching this error attributes it to either a hardware issue or low resource availability (CPU / Memory / I/O).
- The full list of unique blocked tasks can be found by running foreach UN bt | grep COMMAND | awk '{print $NF}' | sort -u:
crash> foreach UN bt | grep COMMAND | awk '{print $NF}' | sort -u
"cadvisor"
"dockerd"
"filebeat"
"kubelet"
"kworker/113:0"
"kworker/31:1"
"process-exporte"
"runc"
"systemd"
"systemd-journal"
- To dig further into a blocked task, use the set command with the PID as the argument. The ensuing example sets the focus on the blocked task systemd:
crash> set 1
PID: 1
COMMAND: "systemd"
TASK: ffff9a32db050000 [THREAD_INFO: ffff9a32dbb9c000]
CPU: 42
STATE: TASK_UNINTERRUPTIBLE
- Then run bt -l for more verbose backtrace messages. In the above-mentioned __mutex_lock_slowpath example, the function that was executing before the mutex_lock was engaged was proc_cgroup_show:
crash> bt -l
#2 [ffff9a32dbb9fda8] __mutex_lock_slowpath at ffffffffb33b6ca7
/usr/src/debug/kernel-3.10.0-1160.119.1.el7/linux-3.10.0-1160.119.1.el7.x86_64/include/linux/spinlock.h: 337
#3 [ffff9a32dbb9fe08] mutex_lock at ffffffffb33b602f
/usr/src/debug/kernel-3.10.0-1160.119.1.el7/linux-3.10.0-1160.119.1.el7.x86_64/arch/x86/include/asm/current.h: 14
#4 [ffff9a32dbb9fe20] proc_cgroup_show at ffffffffb2d35146
/usr/src/debug/kernel-3.10.0-1160.119.1.el7/linux-3.10.0-1160.119.1.el7.x86_64/kernel/cgroup.c: 4706
- The line number in the cgroup.c file being referenced by proc_cgroup_show was 4706. It is possible to view the relevant source code from within the crash session using the list (or l) command:
crash> l /usr/src/debug/kernel-3.10.0-1160.119.1.el7/linux-3.10.0-1160.119.1.el7.x86_64/kernel/cgroup.c: 4706
4701
4702 retval = 0;
4703
4704 mutex_lock(&cgroup_mutex);
4705
4706 for_each_active_root(root) {
4707 struct cgroup_subsys *ss;
4708 struct cgroup *cgrp;
4709 int count = 0;
4710
- You can see from line 4704 that the mutex_lock is related to the Cgroup system. From the kernel.org documentation on Cgroups:
There is a global mutex, cgroup_mutex, used by the cgroup system. This should be taken by anything that wants to modify a cgroup. It may also be taken to prevent cgroups from being modified, but more specific locks may be more appropriate in that situation.
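- As a hedged follow-up (struct mutex layouts and field names differ between kernel versions, and the symbol must resolve from the loaded debuginfo), the mutex itself can be printed to inspect its state, including the owner field that points at the task currently holding it:
crash> p cgroup_mutex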
- Since Cgroups handle the allocation of resources, the mutex lock observed here points towards a lack of resources on the node.
- As an aside, if the mutex_lock code pointed to mutex_lock(&dentry->d_inode->i_mutex);, that would highlight an inode-related error on the storage medium and further investigation would be required there.
- Finally, after running the log command to dump the system message buffer, multiple Hardware Error messages were observed relating to the customer's memory DIMMs:
crash> log
<output_redacted>
[3968484.124456] {176586}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 20480
[3968484.124460] {176586}[Hardware Error]: It has been corrected by h/w and requires no further action
[3968484.124461] {176586}[Hardware Error]: event severity: corrected
[3968484.124463] {176586}[Hardware Error]: Error 0, type: corrected
[3968484.124464] {176586}[Hardware Error]: fru_text: DIMM# Sourced
[3968484.124465] {176586}[Hardware Error]: section_type: memory error
[3968484.124466] {176586}[Hardware Error]: error_status: 0x0000000000040400
[3968484.124468] {176586}[Hardware Error]: physical_address: 0x000000f877d52000
[3968484.124470] {176586}[Hardware Error]: node: 0 card: 3 module: 1 rank: 2 bank: 6 row: 57311 column: 0
[3968484.124471] {176586}[Hardware Error]: Error 1, type: corrected
[3968484.124472] {176586}[Hardware Error]: fru_text: DIMM# Sourced
[3968484.124473] {176586}[Hardware Error]: section_type: memory error
[3968484.124474] {176586}[Hardware Error]: error_status: 0x0000000000040400
[3968484.124475] {176586}[Hardware Error]: physical_address: 0x000000f7f9b63180
[3968484.124477] {176586}[Hardware Error]: node: 0 card: 3 module: 1 rank: 2 bank: 6 row: 56806 column: 16
[3968484.124501] [Hardware Error]: Corrected error, no action required.
[3968484.125383] [Hardware Error]: CPU:0 (17:31:0) MC1_STATUS[-|CE|-|-|AddrV|-|-|-|-]: 0x940000000000009f
[3968484.126221] [Hardware Error]: Error Addr: 0x000000f877d52000, IPID: 0x0000000000000000
[3968484.126907] [Hardware Error]: Instruction Fetch Unit Extended Error Code: 0
[3968484.127593] [Hardware Error]: Instruction Fetch Unit Error: microtag probe port parity error.
[3968484.128280] [Hardware Error]: cache level: L3/GEN, tx: RESV
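- When the log is long, the same grep filtering shown earlier can be used to gauge how many hardware error messages were recorded:
crash> log | grep -ci "hardware error"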
Example issue resolution
Ultimately, it was found that the lack of resources was due to hardware errors in the customer's memory DIMMs, which caused contention for resources, resulting in tasks being blocked for more than 360 seconds and therefore invoking a kernel panic.
Conclusion
vmcore dump analysis is a vast and complicated process. There is an almost limitless number of ways that the kernel can crash, and the root cause may not always be the same from system to system. The author hopes that the example analysis above for blocked tasks helps to demystify some of the research required and makes it easier to identify a root cause if you run into a similar situation.