How to Perform Further SCSI Device Analysis with the PyKdump Crash Extension
Introduction
PyKdump is a powerful extension for the crash utility, designed to integrate Python scripting into the crash analysis workflow. It enhances the crash utility's capabilities by providing additional commands and tools that streamline the investigation of vmcore dumps.
PyKdump is particularly useful when examining SCSI device details. Its commands, such as scsishow --check, report SCSI command statuses, error codes, and device interactions that are not available in the standard crash utility.
This article covers installing the PyKdump extension and the recommended SCSI commands for analyzing a vmcore dump from a node whose filesystem has reached full capacity.
Problem
Kernel crashes can occur when a disk is filled to 100%, resulting in a prolonged process halt and eventual panic.
Depending on when it is taken, an sosreport does not always reveal filesystem overutilization. CIQ highly recommends taking the sosreport as soon as the issue occurs.
If you are unsure how to generate an sosreport, please see the full guide available here.
If the issue cannot be identified in the sosreport, further analysis of the vmcore dump is needed to pinpoint filesystem-related issues.
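While the node is still responsive, filesystem usage can also be checked directly before falling back to the vmcore. The following is a minimal sketch, assuming POSIX-format `df -P` output and mount points without spaces; the function name `full_filesystems` and the threshold argument are hypothetical and not part of any CIQ tooling.

```shell
#!/bin/bash
# Hypothetical helper: read `df -P` style output on stdin and print every
# mount point whose usage is at or above a threshold (default: 100 percent).
full_filesystems() {
    local threshold="${1:-100}"
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)              # strip the trailing percent sign
        if (use + 0 >= t) print $6, use "%"
    }'
}

# Usage on a live system:
#   df -P | full_filesystems 95
```

A threshold slightly below 100 gives some warning before the disk is completely full.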
Symptoms
- The system experiences a complete halt of processes for a period of time before the kernel panic ensues.
- Despite the severe impact, the post-crash sosreport does not show any evidence of a filesystem filling up, making the root cause difficult to track down from the sosreport alone.
Resolution
Prerequisites
- Rocky Linux 8.10
- This article assumes you are familiar with setting up and using the crash utility. For more information, please follow this guide.
PyKdump installation
- You will need to install Python 3.8.20 in order for PyKdump to compile successfully.
- In addition, you also need to build crash 8.0.5 from source for PyKdump to work.
- Install the Development Tools group of packages:
dnf group install -y "Development Tools"
- Run the Bash script below as root to compile Python 3.8.20 and crash, and to build PyKdump:
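#!/bin/bash
# Build Python 3.8.20, crash 8.0.5, and the PyKdump extension.
# Run as root: the paths below assume /root is the working directory.
set -e
cd /root

# Enable PowerTools and install the build dependencies (wget is used below).
dnf config-manager --set-enabled powertools
dnf install -y readline-devel texinfo lzo-devel wget

# Fetch the PyKdump sources.
git clone git://git.code.sf.net/p/pykdump/code pykdump

# Build a static, position-independent Python using PyKdump's module setup.
wget https://www.python.org/ftp/python/3.8.20/Python-3.8.20.tar.xz
tar -xf ./Python-3.8.20.tar.xz
cd Python-3.8.20/
./configure CFLAGS=-fPIC --disable-shared
cp ../pykdump/Extension/Setup.local-3.8 ./Modules/Setup.local
make
cd ..

# Build crash with LZO compression support.
wget https://github.com/crash-utility/crash/archive/refs/tags/8.0.5.tar.gz
tar -xf 8.0.5.tar.gz
cd crash-8.0.5/
make lzo

# Build the PyKdump extension against the Python and crash trees built above.
cd /root/pykdump/Extension
./configure -p /root/Python-3.8.20 -c /root/crash-8.0.5
make
cp mpykdump.so /root/mpykdump.so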
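Because the build runs for a long time and an individual step can fail without being noticed, it can help to confirm that the expected artifacts exist once the script finishes. The following is a minimal sketch; the function name `missing_artifacts` is hypothetical, and the example paths assume the script was run as root from /root, as its hardcoded paths suggest.

```shell
#!/bin/bash
# Hypothetical helper: report every path given as an argument that does not
# exist. Returns non-zero if at least one path is missing.
missing_artifacts() {
    local path
    local status=0
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "missing: $path"
            status=1
        fi
    done
    return "$status"
}

# Usage after the build (paths assumed from the script's own layout):
#   missing_artifacts /root/crash-8.0.5/crash /root/mpykdump.so
```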
PyKdump usage
- Copy the crash binary from the build directory to /usr/bin:
sudo cp /path/to/crash-8.0.5/crash /usr/bin
- Start the crash utility with a vmlinux file and a vmcore dump using this command:
crash /path/to/vmlinux /path/to/vmcore
- Enable the PyKdump extension by running the following in the crash utility CLI:
crash> extend /root/mpykdump.so
Setting scroll off while initializing PyKdump
/root/mpykdump.so: shared object loaded
- Use the recommended commands in the next section to find more information about SCSI devices.
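To avoid typing the extend command in every session, the same line can be placed in a .crashrc file, which the crash utility reads at startup. The following is a minimal sketch that appends the line only if it is not already present; the function name `add_extend_to_crashrc` is hypothetical, and the default of `$HOME/.crashrc` is an assumption about where the file should live for the user running crash.

```shell
#!/bin/bash
# Hypothetical helper: add an "extend <path>" line to a crash startup file.
#   $1 - path to the extension shared object
#   $2 - startup file to edit (assumed default: $HOME/.crashrc)
add_extend_to_crashrc() {
    local so_path="$1"
    local rc_file="${2:-$HOME/.crashrc}"
    local line="extend $so_path"
    # Append only when the exact line is not already in the file.
    if ! grep -qxF -- "$line" "$rc_file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}

# Usage:
#   add_extend_to_crashrc /root/mpykdump.so
```

Running the helper more than once leaves a single entry in the file.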
Recommended commands
To get a summary of all of your SCSI devices
scsishow --check
### Summary:
Task Errors/Warnings
------------------------------------------------
SCSI host checks: 0
SCSI device, command checks: 1
SCSI target checks: 0
** Execution took 0.05s (real) 0.05s (CPU)
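When many vmcores need to be triaged, the summary can be reduced to a single number. The following is a minimal sketch that totals the counts on the lines ending in "checks:", assuming the summary has been saved to a text file in the layout shown above; the function name `scsi_check_total` and the file path in the usage note are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: sum the Errors/Warnings column of a saved
# `scsishow --check` summary read from stdin.
scsi_check_total() {
    awk '/checks:/ { total += $NF } END { print total + 0 }'
}

# Usage, assuming the output was redirected to a file from the crash prompt:
#   crash> scsishow --check > /tmp/scsi_check.txt
#   $ scsi_check_total < /tmp/scsi_check.txt
```

For the summary shown above, the total is 1, coming from the "SCSI device, command checks" line.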
To get the I/O requests pending for a device
scsishow -r
fa41a307e427e3s0 (10:0:0:1) timeout: 30000 deadline: 11590928508
Requests found in SCSI layer: 1
** Execution took 0.04s (real) 0.04s (CPU)
To get the last in-flight SCSI commands that were running at the time of the crash
scsishow -c
scsi_cmnd ff41a507e427e4f8 on scsi_device 0xffa41a307e427e3s0 (10:0:0:1) jiffies_at_alloc: 11590898508
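The jiffies values in these outputs can be related to each other with simple arithmetic. In the examples above, the deadline reported by scsishow -r (11590928508) and the jiffies_at_alloc reported by scsishow -c (11590898508) differ by 30000, which matches the timeout value of 30000. The following is a minimal sketch of that calculation, assuming both lines describe the same request and that the kernel tick rate is 1000 Hz; the function name `jiffies_to_seconds` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: convert the distance between two jiffies values to
# seconds. The tick rate (third argument) is an assumption; default 1000 Hz.
jiffies_to_seconds() {
    local start="$1"
    local end="$2"
    local hz="${3:-1000}"
    echo $(( (end - start) / hz ))
}

# jiffies_at_alloc and deadline taken from the example output above.
jiffies_to_seconds 11590898508 11590928508   # prints 30
```

Under these assumptions, the command was allocated 30 seconds before its deadline.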
To check SCSI host adapters for any that are in a busy, blocked or failed state
scsishow -s | grep host
NAME NAME Scsi_Host shost_data hostdata
host0 ahci ffa41a307e427e3s0 0 ff21b221c1429b58
host_busy : 0
host_blocked : 0
host_failed : 0
host_self_blocked : 0
shost_state : SHOST_RUNNING
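On systems with many host adapters, reading every counter by eye is error-prone. The following is a minimal sketch that prints only the counters that are non-zero and any host state other than SHOST_RUNNING, assuming the output has been saved in the layout shown above; the function name `flag_unhealthy_hosts` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read saved `scsishow -s | grep host` output on stdin
# and print each non-zero host counter or non-running host state.
flag_unhealthy_hosts() {
    awk '
        { key = $1; sub(/:$/, "", key) }            # tolerate "name:" or "name :"
        key ~ /^host[0-9]+$/ { host = key; next }   # remember the current host
        key ~ /^host_(busy|blocked|failed|self_blocked)$/ && $NF + 0 != 0 {
            print host, key, $NF
        }
        key == "shost_state" && $NF != "SHOST_RUNNING" { print host, key, $NF }
    '
}

# Usage, assuming the output was saved to a file first:
#   flag_unhealthy_hosts < /tmp/scsi_hosts.txt
```

For the host0 output shown above the helper prints nothing, because every counter is 0 and the state is SHOST_RUNNING.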
To check for high IOERR-CNT (I/O error count) values
scsishow -d | awk '{print $1,$NF}' | sort -nrk2
sda 3712821
To get the IOREQ-CNT (I/O Request Count) and the IODONE-CNT (I/O Done Count)
scsishow -d | awk '{print $1,$8,$9}' | sort -nrk2
device / IOREQ-CNT / IODONE-CNT
sdw 10500750 10500750
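A device whose request count is higher than its done count still had I/O outstanding when the crash occurred. The following is a minimal sketch that computes that difference per device, assuming the three-column output above has been saved to a file; the function name `outstanding_io` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read "device IOREQ-CNT IODONE-CNT" lines on stdin and
# print each device whose counts differ, with the number of requests that
# had not completed. Non-numeric lines such as headers are skipped.
outstanding_io() {
    awk '$2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $2 != $3 { print $1, $2 - $3 }'
}

# Usage, assuming the output was saved to a file first:
#   outstanding_io < /tmp/scsi_counts.txt
```

For the sdw line shown above both counts are equal, so the helper prints nothing for that device.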
References & related articles
PyKdump User Documentation
Linux Kernel Crash Book by Igor Ljubuncic