Tips and Tricks for Analyzing a vmcore dump with the Crash Utility
Introduction
At CIQ, customers often open tickets after experiencing a kernel crash. As part of the investigation process, the Support Team asks for a vmcore dump to be uploaded to CIQ's Secure File Exchange (SFE) service for analysis. For more information on how to upload a vmcore dump or other files to the SFE service, please consult the How to Upload Files to the CIQ Secure File Exchange (SFE) Service KB article.

This article details some of the steps that can be taken with the Crash Utility for vmcore dump analysis.
For further reading, the author highly recommends the following resources:
Example issue
A customer created a ticket with CIQ after their CentOS 7.9 node using CIQ Bridge had a kernel panic and rebooted. What follows describes how to set up the Crash Utility and the steps taken to debug the issue.
Crash Utility environment setup
- Set up a Rocky Linux 9.x installation on a node size of your choice.
- Update all packages:
sudo dnf update -y
- Install the kexec-tools package:
sudo dnf install -y kexec-tools
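The Crash Utility itself is packaged separately from kexec-tools on Enterprise Linux systems, so install it as well if it is not already present (the package name below assumes the standard crash package):
sudo dnf install -y crash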
- From the node that went into a kernel panic state, find the kernel version:
uname -r
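For example, on the CentOS 7.9 node from the ticket described above, this returned the kernel version referenced throughout the rest of this article:
3.10.0-1160.119.1.el7.x86_64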
- With the kernel version identified, search for and download the matching version of the kernel-debuginfo-common package. In the CentOS 7.9 example listed above, the kernel version was 3.10.0-1160.119.1.el7.x86_64, thus the required package was kernel-debuginfo-common-x86_64-3.10.0-1160.119.1.el7.x86_64.rpm.
- Similarly, the same version of the kernel-debuginfo package is also needed. Again referencing the CentOS 7.9 example, downloading the kernel-debuginfo-3.10.0-1160.119.1.el7.x86_64.rpm package is required.
- Note for Rocky Linux users - to find the appropriate kernel-debuginfo-common and kernel-debuginfo packages, go to the Rocky Linux Vault. Using Rocky Linux 9.5 with the x86_64 architecture as an example, the kernel-debuginfo-common package is found in the devel repository, and kernel-debuginfo is in the devel repository under debug.
- Install both the kernel-debuginfo-common and kernel-debuginfo packages using the dnf localinstall command. The order in which the packages are installed matters:
dnf localinstall -y <path_to_kernel-debuginfo-common_rpm>
dnf localinstall -y <path_to_kernel-debuginfo_rpm>
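As a concrete illustration using the CentOS 7.9 package names from the example above (the ./ paths assume the RPMs were downloaded to the current directory):
dnf localinstall -y ./kernel-debuginfo-common-x86_64-3.10.0-1160.119.1.el7.x86_64.rpm
dnf localinstall -y ./kernel-debuginfo-3.10.0-1160.119.1.el7.x86_64.rpm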
- Note for CentOS users - the vmcore dump will be generated as a vmcore.flat file. This is not natively readable by the Crash Utility and thus has to be converted into a vmcore dump file. Run the following command to convert the file:
/usr/sbin/makedumpfile -R ./vmcore < ./vmcore.flat
- Finally, start the Crash Utility by passing the vmlinux file from your /usr/lib/debug/lib/modules/<kernel_version> directory as the first argument and the path to the vmcore dump file as the second:
crash /usr/lib/debug/lib/modules/<kernel_version>/vmlinux <path_to_vmcore_dump>
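For instance, with the CentOS 7.9 kernel version from the example above, the invocation would look similar to the following (the vmcore path is illustrative):
crash /usr/lib/debug/lib/modules/3.10.0-1160.119.1.el7.x86_64/vmlinux ./vmcore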
Example crash dump analysis
- Starting the Crash Utility will bring up an overview of the vmcore dump, including the date the dump was taken, the node name, memory availability, kernel version, and the kernel panic reason.
- An example kernel panic that CIQ often sees from customers running Kubernetes is blocked tasks:
"Kernel panic - not syncing: hung_task: blocked tasks"
- From the initial overview, the amount of available memory was plentiful at the time of the kernel panic. However, as shown later in this article, plentiful memory does not rule out a memory-related issue:
MEMORY: 1011.7 GB
- For deeper analysis of memory usage and availability, the kmem -i command can be run; it produces output similar to the example below:
crash> kmem -i
PAGES TOTAL PERCENTAGE
TOTAL MEM 261010406 995.7 GB ----
FREE 228447706 871.5 GB 87% of TOTAL MEM
USED 32562700 124.2 GB 12% of TOTAL MEM
SHARED 1626752 6.2 GB 0% of TOTAL MEM
BUFFERS 3277 12.8 MB 0% of TOTAL MEM
CACHED 20211378 77.1 GB 7% of TOTAL MEM
SLAB 765701 2.9 GB 0% of TOTAL MEM
TOTAL HUGE 0 0 ----
HUGE FREE 0 0 0% of TOTAL HUGE
TOTAL SWAP 1048575 4 GB ----
SWAP USED 0 0 0% of TOTAL SWAP
SWAP FREE 1048575 4 GB 100% of TOTAL SWAP
COMMIT LIMIT 131553778 501.8 GB ----
COMMITTED 26620006 101.5 GB 20% of TOTAL LIMIT
- Running the sys -i command shows more details about the hardware, including BIOS information and product tags. Sensitive information such as serial numbers has been removed in the following example:
crash> sys -i
DMI_BIOS_VENDOR: HPE
DMI_BIOS_VERSION: A43
DMI_BIOS_DATE: 12/03/2021
DMI_SYS_VENDOR: HPE
DMI_PRODUCT_NAME: ProLiant DL325 Gen10 Plus
DMI_PRODUCT_VERSION:
DMI_PRODUCT_SERIAL: -
DMI_PRODUCT_UUID: -
DMI_BOARD_VENDOR: HPE
DMI_BOARD_NAME: ProLiant DL325 Gen10 Plus
DMI_BOARD_VERSION:
DMI_BOARD_SERIAL: -
DMI_BOARD_ASSET_TAG: -
DMI_CHASSIS_VENDOR: HPE
DMI_CHASSIS_TYPE: 23
DMI_CHASSIS_VERSION:
DMI_CHASSIS_SERIAL: -
DMI_CHASSIS_ASSET_TAG: -
- The bt command produces a backtrace for the process that caused the kernel panic. The following example shows that PID 650, running the command khungtaskd, was responsible for the kernel panic:
crash> bt
PID: 650 TASK: ffff9b2ffbb59080 CPU: 47 COMMAND: "khungtaskd"
#0 [ffff9b2ffb663cb0] machine_kexec at ffffffffb2c69854
#1 [ffff9b2ffb663d10] __crash_kexec at ffffffffb2d29f12
#2 [ffff9b2ffb663de0] panic at ffffffffb33ab713
#3 [ffff9b2ffb663e60] watchdog at ffffffffb2d56bfe
#4 [ffff9b2ffb663ec8] kthread at ffffffffb2ccb621
- More information on khungtaskd can be found in this article; a summary is provided below:
The kernel hung task check is based on a single kernel thread named khungtaskd, which monitors processes in the TASK_UNINTERRUPTIBLE status. If a process is stuck in D state for longer than the period specified by kernel.hung_task_timeout_secs (defaults to 120 seconds), the stack information of the hung task process will be printed.
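As a side note, the same hung-task settings can also be checked on a live system with sysctl before a crash ever occurs (these are standard kernel sysctl names; they map to the variables inspected with the crash p command below):
sysctl kernel.hung_task_panic kernel.hung_task_timeout_secs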
- To verify that the kernel is configured to panic when khungtaskd detects a hung task, the p command (which prints the value of an expression) can be used with the sysctl_hung_task_panic symbol:
crash> p sysctl_hung_task_panic
sysctl_hung_task_panic = $1 = 1
- Similarly, the timeout can also be checked using the p sysctl_hung_task_timeout_secs command:
crash> p sysctl_hung_task_timeout_secs
sysctl_hung_task_timeout_secs = $2 = 360
- To view more information about a particular process, the ps command can be used with the PID as an argument:
crash> ps 650
PID PPID CPU TASK ST %MEM VSZ RSS COMM
> 650 2 47 ffff9b2ffbb59080 RU 0.0 0 0 [khungtaskd]
- Finding out which tasks were blocked can be done with the log command and a grep for the word INFO. In the following example, systemd is seen to be blocked:
crash> log | grep INFO
[3970064.996286] INFO: task systemd:1 blocked for more than 360 seconds.
- To show the stack traces of all blocked tasks in the TASK_UNINTERRUPTIBLE state, the foreach UN bt command is recommended:
foreach UN bt
- Further filtering of all blocked tasks is possible with the following command string. It picks stack frames #2, #3, and #4 from each task's backtrace, places them on a single line in three columns, counts the occurrences of each unique combination, and sorts from highest to lowest:
foreach UN bt | awk '/#[2-4] /{print $3,$5}' | paste - - - | sort | uniq -c | sort -nr
- Example output:
crash> foreach UN bt | awk '/#[2-4] /{print $3,$5}' | paste - - - | sort | uniq -c | sort -nr
46 __mutex_lock_slowpath ffffffffb33b6ca7 mutex_lock ffffffffb33b602f proc_cgroup_show ffffffffb2d35146
33 __mutex_lock_slowpath ffffffffb33b6ca7 mutex_lock ffffffffb33b602f cgroup_lock_live_group ffffffffb2d2ed49
2 __mutex_lock_killable_slowpath ffffffffb33b6520 mutex_lock_killable ffffffffb33b65fe iterate_dir ffffffffb2e71d00
1 schedule_timeout ffffffffb33b5831 wait_for_completion ffffffffb33b805d wait_rcu_gp ffffffffb2cc828e
1 __mutex_lock_slowpath ffffffffb33b6ca7 mutex_lock ffffffffb33b602f proc_cgroupstats_show ffffffffb2d2f2be
1 __mutex_lock_slowpath ffffffffb33b6ca7 mutex_lock ffffffffb33b602f cgroup_release_agent ffffffffb2d2ffe5
1 __mutex_lock_slowpath ffffffffb33b6ca7 mutex_lock ffffffffb33b602f cgroup_free_fn ffffffffb2d34308
- From the above output, a high volume of __mutex_lock_slowpath messages can be seen. Researching this error attributes it to either a hardware issue or low resource availability (CPU / memory / I/O).
- The list of unique blocked task command names can be found by running foreach UN bt | grep COMMAND | awk '{print $NF}' | sort -u:
crash> foreach UN bt | grep COMMAND | awk '{print $NF}' | sort -u
"cadvisor"
"dockerd"
"filebeat"
"kubelet"
"kworker/113:0"
"kworker/31:1"
"process-exporte"
"runc"
"systemd"
"systemd-journal"
- To dig further into a blocked task, use the set command with the PID as the argument. The following example sets the context to the blocked task systemd (PID 1):
crash> set 1
PID: 1
COMMAND: "systemd"
TASK: ffff9a32db050000 [THREAD_INFO: ffff9a32dbb9c000]
CPU: 42
STATE: TASK_UNINTERRUPTIBLE
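With the task context set, other context-sensitive crash commands can be used to dig further into this task; for example, files lists its open file descriptors and vm summarizes its virtual memory usage (both are standard crash commands, mentioned here only as pointers for further investigation):
crash> files
crash> vm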
- Then run bt -l for a more verbose backtrace with source file and line references. In the above-mentioned __mutex_lock_slowpath example, the function that called mutex_lock was proc_cgroup_show:
crash> bt -l
#2 [ffff9a32dbb9fda8] __mutex_lock_slowpath at ffffffffb33b6ca7
/usr/src/debug/kernel-3.10.0-1160.119.1.el7/linux-3.10.0-1160.119.1.el7.x86_64/include/linux/spinlock.h: 337
#3 [ffff9a32dbb9fe08] mutex_lock at ffffffffb33b602f
/usr/src/debug/kernel-3.10.0-1160.119.1.el7/linux-3.10.0-1160.119.1.el7.x86_64/arch/x86/include/asm/current.h: 14
#4 [ffff9a32dbb9fe20] proc_cgroup_show at ffffffffb2d35146
/usr/src/debug/kernel-3.10.0-1160.119.1.el7/linux-3.10.0-1160.119.1.el7.x86_64/kernel/cgroup.c: 4706
- The line number in the cgroup.c file being referenced by proc_cgroup_show was 4706. It is possible to view the corresponding source code (provided by the kernel-debuginfo package) with the list or l command:
crash> l /usr/src/debug/kernel-3.10.0-1160.119.1.el7/linux-3.10.0-1160.119.1.el7.x86_64/kernel/cgroup.c: 4706
4701
4702 retval = 0;
4703
4704 mutex_lock(&cgroup_mutex);
4705
4706 for_each_active_root(root) {
4707 struct cgroup_subsys *ss;
4708 struct cgroup *cgrp;
4709 int count = 0;
4710
- You can see from line 4704 that the mutex_lock is related to the Cgroup system. From the kernel.org documentation on Cgroups:
There is a global mutex, cgroup_mutex, used by the cgroup system. This should be taken by anything that wants to modify a cgroup. It may also be taken to prevent cgroups from being modified, but more specific locks may be more appropriate in that situation.
- Since Cgroups handle the allocation of resources, the mutex lock observed here points towards a lack of resources on the node.
- As an aside, if the mutex_lock code had pointed to mutex_lock(&dentry->d_inode->i_mutex), that would highlight an inode-related error on the storage medium, and further investigation would be required there.
- Finally, after running the log command to dump the system message buffer, multiple Hardware Error messages were observed relating to the customer's memory DIMMs:
crash> log
<output_redacted>
[3968484.124456] {176586}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 20480
[3968484.124460] {176586}[Hardware Error]: It has been corrected by h/w and requires no further action
[3968484.124461] {176586}[Hardware Error]: event severity: corrected
[3968484.124463] {176586}[Hardware Error]: Error 0, type: corrected
[3968484.124464] {176586}[Hardware Error]: fru_text: DIMM# Sourced
[3968484.124465] {176586}[Hardware Error]: section_type: memory error
[3968484.124466] {176586}[Hardware Error]: error_status: 0x0000000000040400
[3968484.124468] {176586}[Hardware Error]: physical_address: 0x000000f877d52000
[3968484.124470] {176586}[Hardware Error]: node: 0 card: 3 module: 1 rank: 2 bank: 6 row: 57311 column: 0
[3968484.124471] {176586}[Hardware Error]: Error 1, type: corrected
[3968484.124472] {176586}[Hardware Error]: fru_text: DIMM# Sourced
[3968484.124473] {176586}[Hardware Error]: section_type: memory error
[3968484.124474] {176586}[Hardware Error]: error_status: 0x0000000000040400
[3968484.124475] {176586}[Hardware Error]: physical_address: 0x000000f7f9b63180
[3968484.124477] {176586}[Hardware Error]: node: 0 card: 3 module: 1 rank: 2 bank: 6 row: 56806 column: 16
[3968484.124501] [Hardware Error]: Corrected error, no action required.
[3968484.125383] [Hardware Error]: CPU:0 (17:31:0) MC1_STATUS[-|CE|-|-|AddrV|-|-|-|-]: 0x940000000000009f
[3968484.126221] [Hardware Error]: Error Addr: 0x000000f877d52000, IPID: 0x0000000000000000
[3968484.126907] [Hardware Error]: Instruction Fetch Unit Extended Error Code: 0
[3968484.127593] [Hardware Error]: Instruction Fetch Unit Error: microtag probe port parity error.
[3968484.128280] [Hardware Error]: cache level: L3/GEN, tx: RESV
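When the message buffer is long, the log output can also be piped through grep (as was done earlier for the INFO lines) to surface only the hardware error messages; the pattern below is illustrative:
crash> log | grep -i "hardware error"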
Example issue resolution
Ultimately, it was found that the lack of resources was due to hardware errors in the customer's memory DIMMs. This caused contention for resources, resulting in tasks being blocked for over 360 seconds and therefore triggering a kernel panic.
Conclusion
vmcore dump analysis is a vast and complicated process. There is an almost limitless number of ways the kernel can crash, and the root cause may not be the same from system to system. The author hopes that the example analysis of blocked tasks above helps demystify some of the research required and makes it easier to identify a root cause if you run into a similar situation.