Was this page helpful?

Diagnosis of a failing disk

Reviewed on 06 February 2025 • Published on 02 November 2021

Smartmontools is a set of tools that controls and monitors a disk using the SMART standard (Self-Monitoring, Analysis, and Reporting Technology System).

It consists of two parts:

smartd, a daemon that allows you to periodically check your hard drives
smartctl, a command line tool to view the status of the hard disk

The tool supports the vast majority of modern hard drives.

Before you startLink to this anchor

To complete the actions presented below, you must have:

A Dedibox account logged into the console
A Dedibox dedicated server

How to check a server with no RAID controllerLink to this anchor

Log into your server using SSH.

Run the following command from the root account (or precede it with sudo):

smartctl -a /dev/sdx

Note

The identifier sdx refers to your hard disk. Make sure that you replace x with a, b , c, etc. depending on the disk you want to test.

How to check a Dell multi-disk serverLink to this anchor

Dell PERC H200 controllerLink to this anchor

On these servers, the physical disks are referred to as sg* devices.

Log into your server using SSH.

Run the following commands:

smartctl -a -T permissive /dev/sg0
smartctl -a -T permissive /dev/sg1
smartctl -a -T permissive /dev/sg2

Tip

As the devices can be positioned a little further away, do not hesitate to test up to sg5 if you do not have conclusive results.

Dell PERC controller (H310, H700, H710, H730-P, LSI9361)Link to this anchor

Two possibilities exist for this type of controller:

megaclisas-status and
megacli

The first one displays the status of the RAID volume, whilst the second one displays the SMART status of the disks.

Log into your server using SSH.

Update the APT package lists cache, and install the required packages:

apt update
apt install megaclisas-status megacli

Run the following command to display the status of the RAID volume:

megaclisas-status

Run the following little script to retrieve the SMART values of your disks:

DEVICE=/dev/sda
for i in $(megacli -pdlist -a0 | grep Id | cut -d":" -f2); do
    echo "============================== $i =============================="
    smartctl -s on -a -d megaraid,${i} ${DEVICE} -T permissive
done

How to check an HP multi-disk server (P410, P420, P222)Link to this anchor

Log into your server using SSH.

Run the following command to display the status of the RAID:

ssacli ctrl all show config

An output similar to the following example displays:

Smart Array P410 in Slot 1                (sn: PACCRID111003N3)
array A (SATA, Unused Space: 0 MB)
    logicaldrive 1 (1.8 TB, RAID 1, OK)
    physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 2 TB, OK)
    physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 2 TB, OK)

Run the following command to display the SMART values of the disks:

smartctl -a -d cciss,0 /dev/sg0

then run:

smartctl -a -d cciss,1 /dev/sg0

How to use SMARTD to monitor your disksLink to this anchor

smartd allows you to monitor your disks and to be alerted (depending on the configuration) by email in case of failure.

Important

There is no guarantee as to the result of SMARTD and the time remaining before the disk fails completely. However, we strongly encourage you to back up and request a replacement rapidly.

How to configure SMARTDLink to this anchor

Below, you will find an example of a single-disk server installed on a Debian-like machine.

Note

The following commands are to be executed as root or via sudo.

Log into your server using SSH.

Enable basic SMART options:

smartctl -s on -o on -S on /dev/sda

Check that the disk is healthy:

smartctl -H /dev/sda

Edit the file /etc/smartd.conf, to set up automated tests:

Start by commenting out the following line:

DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner

Then add a line similar to the following example:

/dev/sda -a -d sat -o on -S on -s (S/../.././01|L/../1/03) -m root -M exec /usr/share/smartmontools/smartd-runner

The example above allows you to test your hard disk as follows:

A short test (S) every day at 1am (01)
A long test (L), every Monday (1) at 3am (03)

Activate the daemon by uncommenting the line start_smartd=yes in the file /etc/default/smartmontools.

Start the daemon by running the following command:

service smartmontools start

If a problem is detected, it will send a default mail to root (-m root). You can redirect the mails sent to the root user to your personal mailbox or send this mail directly to another address.

How to run tests manuallyLink to this anchor

To run SMART tests manually, use the following commands:

smartctl -t short /dev/sda to run a short test on your disk
smartctl -t long /dev/sda to run a long test on your disk

Once the tests are completed, you can check the results with the following command:

smartctl -l selftest /dev/sda

How to report disk failuresLink to this anchor

If you notice any errors when running a SMART diagnosis on your disk, open a support ticket and ask for the disk to be replaced, indicating the serial number with the result of the smartctl command:

=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG HD103UJ
Serial Number: S13PJ1KQ513170 <----------------------- Serial Number
Firmware Version: 1AA01113
User Capacity: 1 000 204 886 016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 3b
Local Time is: Fri Oct 29 11:20:27 2010 CEST

Tip

For more information on Smartmontools, refer to the official documentation.

The example below shows SMART data for the HDD storage type:

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Constellation ES.3
Device Model:     ST1000NM0033-9ZM173
Serial Number:    Z1W2P3WL
LU WWN Device Id: 5 000c50 0790721c5
Add. Product Id:  DELL(tm)
Firmware Version: GA0A
User Capacity:    1 000 204 886 016 bytes [1,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Jan 22 11:26:49 2025 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
          was completed without error.
          Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
          without error or no self-test has ever
          been run.
Total time to complete Offline
data collection:    (   90) seconds.
Offline data collection
capabilities:        (0x7b) SMART execute Offline immediate.
          Auto Offline data collection on/off support.
          Suspend Offline collection upon new
          command.
          Offline surface scan supported.
          Self-test supported.
          Conveyance Self-test supported.
          Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
          power-saving mode.
          Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
          General Purpose Logging supported.
Short self-test routine
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 115) minutes.
Conveyance self-test routine
recommended polling time:    (   3) minutes.
SCT capabilities:          (0x50bd) SCT Status supported.
          SCT Error Recovery Control supported.
          SCT Feature Control supported.
          SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x010f   079   063   044    Pre-fail  Always       -       90441339
  3 Spin_Up_Time            0x0103   096   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       26
  5 Reallocated_Sector_Ct   0x0133   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   093   060   030    Pre-fail  Always       -       2198492836
  9 Power_On_Hours          0x0032   094   011   000    Old_age   Always       -       5442
10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       18
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   071   061   045    Old_age   Always       -       29 (Min/Max 27/34)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0032   094   094   000    Old_age   Always       -       12859
194 Temperature_Celsius     0x0022   029   040   000    Old_age   Always       -       29 (0 22 0 0 0)
195 Hardware_ECC_Recovered  0x001a   046   015   000    Old_age   Always       -       90441339
196 Reallocated_Event_Count 0x0032   000   000   000    Old_age   Always       -       65535
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       62209 (42 197 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       75618145300
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       528734761477
SMART Error Log Version: 1
No Errors Logged

If total_uncorrected_errors or errors_corrected_by_rereads_rewrites is > 0, the disk is out of order.

The example below shows SMART data for the SSD storage type:

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron MX1/2/300, M5/600, 1100 Client SSDs
Device Model:     Micron_1100_MTFDDAK512TBN
Serial Number:    1709160C2354
LU WWN Device Id: 5 00a075 1160c2354
Firmware Version: M0MU031
User Capacity:    512 110 190 592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jan 22 11:24:34 2025 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x03) Offline data collection activity
          is in progress.
          Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
          without error or no self-test has ever
          been run.
Total time to complete Offline
data collection:    (  913) seconds.
Offline data collection
capabilities:        (0x7b) SMART execute Offline immediate.
          Auto Offline data collection on/off support.
          Suspend Offline collection upon new
          command.
          Offline surface scan supported.
          Self-test supported.
          Conveyance Self-test supported.
          Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
          power-saving mode.
          Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
          General Purpose Logging supported.
Short self-test routine
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    (   7) minutes.
Conveyance self-test routine
recommended polling time:    (   3) minutes.
SCT capabilities:          (0x0035) SCT Status supported.
          SCT Feature Control supported.
          SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       11
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       10
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       63309
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       12
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       1
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   060   060   000    Old_age   Always       -       610
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       6
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   068   047   000    Old_age   Always       -       32 (Min/Max 24/53)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       10
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Used   0x0030   060   060   001    Old_age   Offline      -       40
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       1
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       72065906327
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       2254963742
248 Bckgnd_Program_Page_Cnt 0x0032   100   100   000    Old_age   Always       -       15919135484
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2459
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       44
SMART Error Log Version: 1
No Errors Logged

If the RAW_VALUE column for Reallocated_Sector_Ct or Runtime_Bad_Block or Current_Pending_Sector is > 5, the disk can already be considered as unhealthy. If it is > 20, the disk is out of order.

The example below shows SMART data for the NVMe storage type:

=== START OF INFORMATION SECTION ===
Model Number:                       SKHynix_HFS512GEJ9X164N
Serial Number:                      4YC8N008713108B48
Firmware Version:                   51770C30
PCI Vendor/Subsystem ID:            0x1c5c
IEEE OUI Identifier:                0xace42e
Controller ID:                      1
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            ace42e 003abd04e2
Local Time is:                      Wed Jan 22 11:21:05 2025 CET
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     86 Celsius
Critical Comp. Temp. Threshold:     87 Celsius
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
0 +   4.5000W       -        -    0  0  0  0      100     100
1 +   3.0000W       -        -    1  1  1  1      200     200
2 +   0.6000W       -        -    2  2  2  2      400     400
3 -   0.0150W       -        -    3  3  3  3     2000    2000
4 -   0.0030W       -        -    4  4  4  4     5000   10000
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
0 +     512       0         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        42 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    5,718,407 [2.92 TB]
Data Units Written:                 9,717,865 [4.97 TB]
Host Read Commands:                 43,061,485
Host Write Commands:                142,156,172
Controller Busy Time:               5,906
Power Cycles:                       1,315
Power On Hours:                     2,261
Unsafe Shutdowns:                   56
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               44 Celsius
Temperature Sensor 2:               42 Celsius
Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
Read Self-test Log failed: Invalid Field in Command (0x002)

Note

If you encounter Health status: Failed or Failing Now, the disk is considered out of order. Make sure that you have backups, then open a support ticket and ask for the disk to be replaced, indicating the serial number with the result of the smartctl command.

Was this page helpful?