ydb-platform / nbs

Network Block Store
Apache License 2.0

[NBS] Background validation of HDD blocks that returned I/O errors #685

Open tpashkin opened 3 months ago

tpashkin commented 3 months ago

HDD I/O can fail in general, for example:

[Sat Mar  2 18:07:17 2024] ata1.00: exception Emask 0x0 SAct 0x2000000 SErr 0x0 action 0x0
[Sat Mar  2 18:07:17 2024] ata1.00: irq_stat 0x40000008
[Sat Mar  2 18:07:17 2024] ata1.00: failed command: READ FPDMA QUEUED
[Sat Mar  2 18:07:17 2024] ata1.00: cmd 60/18:c8:28:84:bd/03:00:22:00:00/40 tag 25 ncq dma 405504 in
                                    res 43/40:18:d8:84:bd/00:03:22:00:00/00 Emask 0x408 (media error) <F>
[Sat Mar  2 18:07:17 2024] ata1.00: status: { DRDY SENSE ERR }
[Sat Mar  2 18:07:17 2024] ata1.00: error: { UNC }
[Sat Mar  2 18:07:17 2024] ata1.00: configured for UDMA/133
[Sat Mar  2 18:07:17 2024] sd 0:0:0:0: [sda] tag#25 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Sat Mar  2 18:07:17 2024] sd 0:0:0:0: [sda] tag#25 Sense Key : Medium Error [current]
[Sat Mar  2 18:07:17 2024] sd 0:0:0:0: [sda] tag#25 Add. Sense: Unrecovered read error
[Sat Mar  2 18:07:17 2024] sd 0:0:0:0: [sda] tag#25 CDB: Read(16) 88 00 00 00 00 00 22 bd 84 28 00 00 03 18 00 00
[Sat Mar  2 18:07:17 2024] print_req_error: I/O error, dev sda, sector 582845656
[Sat Mar  2 18:07:17 2024] ata1: EH complete

We treat such errors as fatal and move the device (and, as a consequence, the disk) to the ERROR state. This might not be the best approach: sometimes these errors are transient, and the affected sector, let alone the rest of the disk, is safe to use.

It might be a good idea to run some kind of test on these bad blocks and, if they pass, bring them back online.
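
To make the idea concrete, here is a rough sketch of what such a check could look like. It is not NBS code: `parse_failed_sector`, `validate_range`, the 512-byte sector size and the "N clean re-reads" pass criterion are all assumptions made for illustration. A real check would probably also need to bypass the page cache (e.g. `O_DIRECT`) and might want a write-and-verify pass, which this sketch does not do.

```python
import os
import re

SECTOR_SIZE = 512  # assumption: the kernel log reports 512-byte sectors


def parse_failed_sector(log_line):
    """Extract (device, sector) from a kernel 'I/O error' line, or None."""
    m = re.search(r"I/O error, dev (\w+), sector (\d+)", log_line)
    return (m.group(1), int(m.group(2))) if m else None


def validate_range(path, first_sector, sector_count, attempts=3):
    """Re-read the sectors `attempts` times.

    Returns True only if every attempt reads the full range without an
    OSError. Hypothetical pass criterion, not an established policy.
    """
    offset = first_sector * SECTOR_SIZE
    length = sector_count * SECTOR_SIZE
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(attempts):
            try:
                data = os.pread(fd, length, offset)
            except OSError:
                # e.g. EIO from the device: the range is still bad
                return False
            if len(data) != length:
                # short read: treat as a failure as well
                return False
        return True
    finally:
        os.close(fd)
```

With the log above, `parse_failed_sector` applied to the `print_req_error` line would give `("sda", 582845656)`, and something like `validate_range("/dev/sda", 582845656, 8)` could then run in the background before deciding whether to bring the device back from ERROR.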

tpashkin commented 3 months ago

@qkrorlqr @sharpeye