
How to Identify Hardware Faults in Rocky Linux

Introduction

Hardware failure is inevitable, and it can present in hundreds of different ways.

This guide details troubleshooting steps for identifying hardware errors and explains how to rectify them.

Problem

A customer informed CIQ that a node in their production cluster had spontaneously crashed and then rebooted itself.

Symptoms

A review of /var/log/messages from the sosreport showed hundreds of [Hardware Error]: Hardware error from APEI Generic Hardware Error Source messages.

Resolution

Prerequisites

Access to the root user or a user with escalated privileges.

Creating an sosreport

When creating a ticket with CIQ, it is very helpful to generate an sosreport and attach it to the ticket. This allows the support team to start their investigation without having to ask the customer to generate one.

Install the sos package:

dnf install -y sos

Run the sos report --batch command to gather data from your system unattended.

Once complete, you will find the sosreport under /var/tmp/sosreport-hostname-yyyy-mm-dd-abcdefg.tar.xz. An example is below:

[root@Rocky-Linux-9-5-Test-Machine ~]# ls -l /var/tmp/ | grep sosreport
-rw-------. 1 root root 11094320 Dec 27 02:33 sosreport-Rocky-Linux-9-5-Test-Machine-2024-12-27-elpmjjq.tar.xz
-rw-r--r--. 1 root root       65 Dec 27 02:33 sosreport-Rocky-Linux-9-5-Test-Machine-2024-12-27-elpmjjq.tar.xz.sha256
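
Before uploading, it can be worth confirming that the archive matches the .sha256 file generated next to it. The following is a minimal sketch, not part of the sos package: verify_sosreport is a hypothetical helper name, and it assumes the .sha256 file stores the digest as its first field (the 65-byte size in the listing above is consistent with a bare 64-character digest plus a newline).

```shell
# verify_sosreport: hypothetical helper, not provided by the sos package.
# Compares the SHA-256 digest of an archive with the digest stored in
# the companion "<archive>.sha256" file.
verify_sosreport() {
    local archive="$1"
    local expected actual

    # Assumption: the digest is the first whitespace-separated field.
    expected="$(awk '{print $1; exit}' "${archive}.sha256")" || return 2
    [ -n "$expected" ] || return 2

    # sha256sum prints "<digest>  <filename>"; keep only the digest.
    actual="$(sha256sum "$archive" | awk '{print $1}')"

    if [ "$expected" = "$actual" ]; then
        echo "OK: $archive"
    else
        echo "MISMATCH: $archive" >&2
        return 1
    fi
}

# Example call (substitute the real file name):
# verify_sosreport /var/tmp/sosreport-hostname-yyyy-mm-dd-abcdefg.tar.xz
```

A mismatch usually means the archive was truncated or altered in transit, in which case regenerating or re-copying it before upload is the safer option.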

Please upload the sosreport to the ticket.

Analysis of the issue

When observing behavior such as crashes or random reboots, the best log source is /var/log/messages, which records the global messages generated by the various services running on your server.

The kernel usually marks hardware errors in /var/log/messages with a [Hardware Error] prefix. In this case, the messages came from the ACPI Platform Error Interface (APEI).
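
As a first check, counting the lines that carry this prefix gives a rough sense of scale. The sketch below assumes a plain-text log; count_hw_errors is a hypothetical helper name used only for illustration.

```shell
# count_hw_errors: hypothetical helper for illustration.
# Prints the number of lines in a log that contain the literal
# "[Hardware Error]" tag. Defaults to /var/log/messages.
count_hw_errors() {
    local log="${1:-/var/log/messages}"
    # -F treats the bracketed tag as a fixed string, not a regex;
    # -c prints the number of matching lines instead of the lines.
    grep -c -F '[Hardware Error]' "$log"
}

# Example call against an extracted sosreport (path is a placeholder):
# count_hw_errors ./sosreport-<NODE_NAME>-2024-12-24-gwwbeqs/var/log/messages
```

Note that grep exits with a non-zero status when the count is 0, which matters if the function is used in a script running with set -e.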

Using a combination of tools such as grep, awk, and uniq, you can quickly find and triage repeating errors. The following search of /var/log/messages from a customer's sosreport shows hundreds of memory-related errors:

grep "Hardware Error" ./sosreport-<NODE_NAME>-2024-12-24-gwwbeqs/var/log/messages | awk '{print $8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20}' | grep "memory error" | uniq -c
 710 section_type: memory error
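
The command above relies on fixed awk field positions and on matching lines being adjacent, since uniq -c only merges consecutive duplicates. A variant that does not depend on either is sketched below; summarize_section_types is a hypothetical helper name, and the sketch assumes each error record contains a "section_type: ..." string as in the output above.

```shell
# summarize_section_types: hypothetical helper for illustration.
# Counts every distinct "section_type: ..." value found on
# "[Hardware Error]" lines, most frequent first.
summarize_section_types() {
    local log="$1"
    grep -F '[Hardware Error]' "$log" \
        | grep -o 'section_type: .*' \
        | sed 's/[[:space:]]*$//' \
        | sort \
        | uniq -c \
        | sort -rn
}

# Example call (path is a placeholder):
# summarize_section_types ./sosreport-<NODE_NAME>-2024-12-24-gwwbeqs/var/log/messages
```

Sorting before uniq -c groups identical values even when other messages are interleaved, and the final sort -rn puts the most frequent error type at the top.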

Alongside Hardware Error messages, you will also encounter messages stating Corrected error, no action required. Although each such error has been corrected, a continuing stream of these messages is a sign of a more serious fault. In that scenario, it is best to contact the hardware vendor and replace the affected part outright.
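
To judge whether corrected errors are continuing over time, counting them per day can help. The sketch below assumes the traditional syslog timestamp layout, where the first two fields are the month and day (for example "Dec 24"); corrected_errors_per_day is a hypothetical helper name.

```shell
# corrected_errors_per_day: hypothetical helper for illustration.
# Prints "<count> <month> <day>" for each day on which the log
# contains "Corrected error" lines, in the order the days first appear.
corrected_errors_per_day() {
    local log="${1:-/var/log/messages}"
    awk '
        /Corrected error/ {
            # Assumption: fields 1 and 2 are the month and day.
            day = $1 " " $2
            if (!(day in count)) order[++n] = day
            count[day]++
        }
        END {
            for (i = 1; i <= n; i++) print count[order[i]], order[i]
        }
    ' "$log"
}

# Example call (path is a placeholder):
# corrected_errors_per_day ./sosreport-<NODE_NAME>-2024-12-24-gwwbeqs/var/log/messages
```

A count that stays steady or grows from one day to the next supports raising the issue with the hardware vendor.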

Recommendations

In the above memory fault example, the recommendation to the customer was to replace their memory modules. The customer then contacted their hardware team for further analysis.

References & related articles

awk man page: https://man7.org/linux/man-pages/man1/awk.1p.html

grep man page: https://man7.org/linux/man-pages/man1/grep.1.html

sosreport man page: https://linux.die.net/man/1/sosreport

uniq man page: https://man7.org/linux/man-pages/man1/uniq.1.html