
How to Perform Further SCSI Device Analysis with the PyKdump Crash Extension

Introduction

PyKdump is a powerful extension for the crash utility, designed to integrate Python scripting into the crash analysis workflow. It enhances the crash utility's capabilities by providing additional commands and tools that streamline the investigation of vmcore dumps.

PyKdump is particularly useful when investigating SCSI device details. Its commands, such as scsishow --check, report SCSI command statuses, error codes, and device interactions that the standard crash utility does not expose.

This article covers installing the PyKdump extension and the SCSI commands recommended for analyzing a vmcore dump from a node whose filesystem has reached full capacity.

Problem

Kernel crashes can occur when a disk is filled to 100%, resulting in a prolonged process halt and eventual panic.

Depending on when it is taken, an sosreport does not always reveal filesystem overutilization. CIQ strongly recommends taking the sosreport as soon as the issue occurs.

If you are unsure how to generate an sosreport, please see the full guide available here.

If the issue cannot be identified from the sosreport, further analysis of the vmcore dump is needed to pinpoint filesystem-related issues.

Symptoms

  • The system experiences a complete halt of processes for a period of time before the kernel panic ensues.

  • Despite the severe impact, the post-crash sosreport does not show any evidence of a filesystem filling up, so the root cause is difficult to track down from the sosreport alone.

Resolution

Prerequisites

  • Rocky Linux 8.10

  • This article assumes you are familiar with setting up and using the crash utility. For more information, please follow this guide.

PyKdump installation

  • PyKdump needs Python 3.8.20, built from source, in order to compile successfully.

  • PyKdump also needs crash 8.0.5, built from source, in order to work.

  • Install the Development Tools group of packages:

dnf group install -y "Development Tools"
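  • Run the following Bash script as root to compile Python 3.8.20 and crash 8.0.5, and to build PyKdump:
#!/bin/bash
# Stop at the first failing step so later steps do not run against a broken build.
set -e

# The paths used below assume that everything is built under /root.
cd /root

# Enable the PowerTools repository and install the build dependencies.
dnf config-manager --set-enabled powertools
dnf install -y readline-devel texinfo lzo-devel

# Fetch the PyKdump sources.
git clone git://git.code.sf.net/p/pykdump/code pykdump

# Build Python 3.8.20 as position-independent code, using PyKdump's module setup.
wget https://www.python.org/ftp/python/3.8.20/Python-3.8.20.tar.xz
tar -xf ./Python-3.8.20.tar.xz
cd Python-3.8.20/
./configure CFLAGS=-fPIC --disable-shared
cp ../pykdump/Extension/Setup.local-3.8 ./Modules/Setup.local
make
cd ..

# Build crash 8.0.5 with LZO compression support.
wget https://github.com/crash-utility/crash/archive/refs/tags/8.0.5.tar.gz
tar -xf 8.0.5.tar.gz
cd crash-8.0.5/
make lzo

# Build the PyKdump extension against the Python and crash trees built above.
cd /root/pykdump/Extension
./configure -p /root/Python-3.8.20 -c /root/crash-8.0.5
make

# Keep a copy of the extension in root's home directory for loading into crash.
cp mpykdump.so /root/mpykdump.so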
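
Before moving on, it can help to confirm that the build produced the two files used in the rest of this article. The following is a minimal sketch, not part of PyKdump or crash: check_build_outputs is a hypothetical helper, and its default paths assume the build was run as root under /root, with the crash binary left in the top level of the crash source tree.

```shell
#!/bin/bash
# Hypothetical helper: report whether the expected build outputs exist and are non-empty.
# Defaults assume the build was run as root under /root (an assumption, adjust as needed).
check_build_outputs() {
    local crash_bin="${1:-/root/crash-8.0.5/crash}"
    local ext_so="${2:-/root/mpykdump.so}"
    local f missing=0
    for f in "$crash_bin" "$ext_so"; do
        if [ -s "$f" ]; then
            echo "ok: $f"
        else
            echo "missing: $f"
            missing=1
        fi
    done
    # Non-zero exit status when at least one file is missing or empty.
    return "$missing"
}
```

Calling check_build_outputs with no arguments checks the default locations; pass two paths if you built somewhere else.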

PyKdump usage

  • Copy the crash binary from the crash-8.0.5 build directory to /usr/bin:
sudo cp /path/to/crash-8.0.5/crash /usr/bin
  • Start the crash utility against the vmlinux file and the vmcore dump:
crash /path/to/vmlinux /path/to/vmcore
  • Enable the PyKdump extension by running the following in the crash utility CLI:
crash> extend /root/mpykdump.so
Setting scroll off while initializing PyKdump
/root/mpykdump.so: shared object loaded
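
If you load the extension in every session, one option is to let crash do it at startup: crash reads commands from a .crashrc file in the user's home directory. The sketch below appends the extend line only if it is not already present. add_extend_line is a hypothetical helper name, and the second argument exists only so the target file can be overridden.

```shell
#!/bin/bash
# Hypothetical helper: add "extend <path>" to a crash startup file exactly once.
add_extend_line() {
    local so_path="$1"
    local rc_file="${2:-$HOME/.crashrc}"
    local line="extend $so_path"
    touch "$rc_file"
    # -q quiet, -x whole-line match, -F fixed string (no regex interpretation).
    if ! grep -qxF "$line" "$rc_file"; then
        echo "$line" >> "$rc_file"
    fi
}
```

For the layout used in this article, the call would be add_extend_line /root/mpykdump.so.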
  • The following commands are recommended for finding more information about SCSI devices:

Recommended commands

To get a summary of all of your SCSI devices

scsishow --check

### Summary:

    Task                             Errors/Warnings
    ------------------------------------------------
    SCSI host checks:                0
    SCSI device, command checks:     1
    SCSI target checks:              0

 ** Execution took   0.05s (real)   0.05s (CPU)
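
When the summary is saved to a file, the rows with a non-zero count can be picked out automatically. This is a minimal sketch that assumes the summary rows look like the ones shown above ("<label> checks:   <count>"); nonzero_checks is a hypothetical helper name, not a PyKdump command.

```shell
#!/bin/bash
# Hypothetical helper: read a saved "scsishow --check" summary on stdin and
# print only the check categories whose error/warning count is greater than zero.
nonzero_checks() {
    awk -F':' '
        /checks:/ {
            gsub(/^[ \t]+|[ \t]+$/, "", $1)   # trim whitespace around the label
            n = $2 + 0                        # numeric value of the count column
            if (n > 0) print $1 ": " n
        }'
}
```

One way to use it is to redirect the command output from inside crash, for example scsishow --check > /tmp/scsi_check.txt (the file name is only an example), and then run nonzero_checks < /tmp/scsi_check.txt in a shell.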

To get the I/O requests pending for a device

scsishow -r

fa41a307e427e3s0 (10:0:0:1)    timeout: 30000      deadline: 11590928508
Requests found in SCSI layer: 1

 ** Execution took   0.04s (real)   0.04s (CPU)

To get the last in-flight SCSI commands that were running at the time of the crash

scsishow -c

scsi_cmnd ff41a507e427e4f8 on scsi_device 0xffa41a307e427e3s0 (10:0:0:1) jiffies_at_alloc: 11590898508
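
The values in the two sample outputs above can be cross-checked with simple arithmetic. The sketch below assumes that the deadline from scsishow -r and jiffies_at_alloc from scsishow -c belong to the same request on device 10:0:0:1 and that both are expressed in jiffies. The conversion to seconds additionally assumes a kernel tick rate (HZ) of 1000, which may differ on your kernel; jiffies_to_seconds is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: convert a jiffies interval to whole seconds.
# The default HZ of 1000 is an assumption; pass the real value as the second argument.
jiffies_to_seconds() {
    local jiffies="$1"
    local hz="${2:-1000}"
    echo $(( jiffies / hz ))
}

# Values copied from the sample scsishow -r and scsishow -c output.
deadline=11590928508
jiffies_at_alloc=11590898508

delta=$(( deadline - jiffies_at_alloc ))
echo "deadline - jiffies_at_alloc = $delta jiffies"
echo "approximately $(jiffies_to_seconds "$delta") seconds at HZ=1000"
```

The difference is 30000 jiffies, which matches the timeout of 30000 reported by scsishow -r for the same device.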

To check SCSI host adapters for any that are in a busy, blocked, or failed state

scsishow -s | grep host

NAME      NAME                   Scsi_Host                shost_data               hostdata                
host0     ahci                   ffa41a307e427e3s0                       0         ff21b221c1429b58
   host_busy           : 0
   host_blocked        : 0
   host_failed         : 0
   host_self_blocked   : 0
   shost_state         : SHOST_RUNNING
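
With many host adapters, scanning this output by eye becomes tedious. The following is a minimal sketch that reads saved output in the layout shown above and prints only counters that are non-zero or a host state other than SHOST_RUNNING. flag_unhealthy_hosts is a hypothetical helper, and it assumes each host line starts with hostN and each field line has the form "name : value".

```shell
#!/bin/bash
# Hypothetical helper: read saved "scsishow -s | grep host" output on stdin and
# print host fields that look abnormal.
flag_unhealthy_hosts() {
    awk '
        # Remember which host the following field lines belong to.
        /^host[0-9]+/ { host = $1 }
        # Counters: anything other than 0 is reported.
        /host_(busy|blocked|failed|self_blocked)[ \t]*:/ {
            if ($3 + 0 != 0) print host ": " $1 "=" $3
        }
        # State: anything other than SHOST_RUNNING is reported.
        /shost_state[ \t]*:/ {
            if ($3 != "SHOST_RUNNING") print host ": " $1 "=" $3
        }'
}
```

For the sample output above the helper prints nothing, since all counters are 0 and the host is in SHOST_RUNNING.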

To check for devices with a high I/O error count (IOERR-CNT)

scsishow -d | awk '{print $1,$NF}' | sort -nrk2

sda 3712821

To get the IOREQ-CNT (I/O Request Count) and the IODONE-CNT (I/O Done Count)

scsishow -d | awk '{print $1,$8,$9}' | sort -nrk2

device / IOREQ-CNT / IODONE-CNT
sdw 10500750 10500750
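
Comparing the two counters per device is a quick way to look for I/O that was issued but never completed. The sketch below assumes input in the three-column form produced by the awk command above (device, IOREQ-CNT, IODONE-CNT) and treats a non-zero difference as outstanding I/O, which is an interpretation rather than something PyKdump reports directly. outstanding_io is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: read "<device> <IOREQ-CNT> <IODONE-CNT>" lines on stdin and
# print each device whose request and done counters differ, with the difference.
outstanding_io() {
    awk '
        # Skip headers and any line whose counter columns are not plain integers.
        $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $2 != $3 {
            print $1, $2 - $3
        }'
}
```

For the sample line above (sdw 10500750 10500750) the counters are equal, so nothing is printed.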

References & related articles

PyKdump User Documentation
Linux Kernel Crash Book by Igor Ljubuncic