How to Identify Hardware Faults in Rocky Linux
Introduction
Hardware eventually fails, and it can fail in many different ways.
This guide details troubleshooting steps for identifying hardware errors and explains how to rectify them.
Problem
A customer informed CIQ that a node in their production cluster had spontaneously crashed and then rebooted itself.
Symptoms
A check of /var/log/messages from the sosreport showed hundreds of [Hardware Error]: Hardware error from APEI Generic Hardware Error Source messages.
Resolution
Prerequisites
Access to the root user or a user with elevated privileges.
Creating an sosreport
When creating a ticket with CIQ, generating and attaching an sosreport to the ticket is very helpful. This allows the support team to start their investigation without first having to ask the customer to generate one.
Install the sos package:
dnf install -y sos
Run the sos report --batch command to perform an unattended data gather of your system.
Once complete, you will find the sosreport under /var/tmp/sosreport-hostname-yyyy-mm-dd-abcdefg.tar.xz. An example is below:
[root@Rocky-Linux-9-5-Test-Machine ~]# ls -l /var/tmp/ | grep sosreport
-rw-------. 1 root root 11094320 Dec 27 02:33 sosreport-Rocky-Linux-9-5-Test-Machine-2024-12-27-elpmjjq.tar.xz
-rw-r--r--. 1 root root 65 Dec 27 02:33 sosreport-Rocky-Linux-9-5-Test-Machine-2024-12-27-elpmjjq.tar.xz.sha256
Please upload the sosreport to the ticket.
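Before uploading, it can be worth confirming that the archive still matches the .sha256 file generated next to it. The following is a minimal sketch, not part of the sos tooling: verify_sos_checksum is a hypothetical helper name, and it assumes the .sha256 file starts with the hex digest (the 65-byte size in the listing above is consistent with a 64-character digest plus a newline).

```shell
# Sketch only: verify_sos_checksum is a hypothetical helper, not an sos command.
# Assumption: "<archive>.sha256" begins with the hex SHA-256 digest of the archive.
verify_sos_checksum() {
    archive="$1"
    # Digest recorded when the archive was created (first field of the file)
    expected="$(awk '{print $1; exit}' "${archive}.sha256")"
    # Digest of the archive as it exists on disk now
    actual="$(sha256sum "$archive" | awk '{print $1}')"
    if [ "$expected" = "$actual" ]; then
        echo "OK: checksum matches"
    else
        echo "MISMATCH: expected $expected, got $actual" >&2
        return 1
    fi
}
```

Pass the full path of the .tar.xz archive as the only argument. A non-zero exit status suggests the archive was altered or truncated after it was created.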
Analysis of the issue
When observing behavior such as crashes or random reboots, the best place to start is /var/log/messages, which records the global messages generated by the various services running on your server.
Hardware errors are usually reported by the kernel in the /var/log/messages log with a [Hardware Error] prefix. In the case described here, the messages come from the ACPI Platform Error Interface (APEI).
Using a combination of tools such as grep, awk, and uniq, you can quickly find and triage repeating errors. The following search of /var/log/messages from a customer's sosreport shows hundreds of memory-related errors:
grep "Hardware Error" ./sosreport-<NODE_NAME>-2024-12-24-gwwbeqs/var/log/messages | awk '{print $8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20}' | grep "memory error" | uniq -c
710 section_type: memory error
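The same triage can be sketched without counting awk fields, which also lets every section type appear in a single summary. This is an illustrative alternative rather than the command used in the original investigation: count_hw_error_sections is a hypothetical helper name, and it assumes each line of interest contains the text "section_type: " followed by the error type, as in the output above.

```shell
# Sketch only: count_hw_error_sections is a hypothetical helper name.
# Assumption: relevant lines contain "section_type: <type>" after the
# [Hardware Error] prefix, as in the example output above.
count_hw_error_sections() {
    # $1: path to a messages log, e.g. the copy inside an extracted sosreport
    grep -F 'Hardware Error' "$1" \
        | grep -o 'section_type: .*' \
        | sort \
        | uniq -c \
        | sort -rn
    # sort groups identical lines so uniq -c can count them;
    # sort -rn then lists the most frequent section type first.
}
```

For example, it could be run as count_hw_error_sections ./sosreport-<NODE_NAME>-2024-12-24-gwwbeqs/var/log/messages. Sorting before uniq -c matters because uniq only collapses adjacent duplicate lines.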
Alongside Hardware Error messages, you may also see Corrected error, no action required messages. Although each individual error was corrected at the time, a continuing stream of these messages is a sign of a more serious fault. In that scenario, it is best to contact the hardware vendor and replace the affected part outright.
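To judge whether corrected errors are continuing over time, one option is to count them per day. The sketch below is an illustration under stated assumptions: corrected_errors_per_day is a hypothetical helper name, and it assumes the log uses the traditional syslog timestamp layout in which the first two fields are the month and the day.

```shell
# Sketch only: corrected_errors_per_day is a hypothetical helper name.
# Assumption: each log line starts with "Mon DD HH:MM:SS" (traditional
# syslog timestamps), so fields 1 and 2 identify the day.
corrected_errors_per_day() {
    # $1: path to a messages log
    grep -F 'Corrected error, no action required' "$1" \
        | awk '{print $1, $2}' \
        | uniq -c
    # The log is already in time order, so uniq -c can count each day
    # directly and the days stay in chronological order.
}
```

A count that stays flat or grows from day to day fits the pattern described above, where the corrections are masking a part that is degrading.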
Recommendations
In the memory fault example above, the recommendation to the customer was to replace their memory modules. The customer then contacted their hardware team for further analysis.
References & related articles
awk man page: https://man7.org/linux/man-pages/man1/awk.1p.html
grep man page: https://man7.org/linux/man-pages/man1/grep.1.html
sosreport man page: https://linux.die.net/man/1/sosreport
uniq man page: https://man7.org/linux/man-pages/man1/uniq.1.html