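I was asleep, and when I woke up in the night to use the bathroom, a server was wailing away with an incredibly loud beep-beep-beep.
It is no exaggeration to say that this is what woke me up.
I went through the machines to isolate which server was crying, and found it.
Incidentally, the status indicators, that is, the LEDs, showed nothing abnormal.
Since it was beeping, I first assumed it was the UPS, but when I checked, it turned out to be the HP DL320.
Right, this is the one that, for various reasons, uses an LSI Logic RAID card rather than a SmartArray.
The OS is FreeBSD. When I looked, the logs showed...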
Aug 25 23:53:21 backup1 mfi0: 141334 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 70 13 da 00 00 26 00
Aug 25 23:53:21 backup1 mfi0: 141335 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 70 14 00 00 02 76 00
Aug 25 23:53:21 backup1 mfi0: 141336 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 17 44 9e 00 02 16 00
Aug 25 23:53:21 backup1 mfi0: 141337 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 17 3a 3f 00 00 6d 00
Aug 25 23:53:21 backup1 mfi0: 141338 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 70 28 6b 00 00 a5 00
Aug 25 23:53:21 backup1 mfi0: 141339 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 14 d5 95 00 00 b0 00
Aug 25 23:53:21 backup1 mfi0: 141340 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 6f c0 5d 00 00 c1 00
Aug 25 23:53:21 backup1 mfi0: 141341 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 17 34 9b 00 00 ad 00
Aug 25 23:53:21 backup1 mfi0: 141342 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 6f c9 c6 00 00 a6 00
Aug 25 23:53:21 backup1 mfi0: 141343 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 17 28 96 00 01 00 00
Aug 25 23:53:21 backup1 mfi0: 141344 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 6f d0 ba 00 00 cb 00
Aug 25 23:53:21 backup1 mfi0: 141345 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 6f b8 03 00 03 fd 00
Aug 25 23:53:21 backup1 mfi0: 141346 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 16 b9 eb 00 01 00 00
Aug 25 23:53:21 backup1 mfi0: 141347 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 70 0a 2b 00 01 90 00
Aug 25 23:53:21 backup1 mfi0: 141348 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 6f d8 01 00 03 ad 00
Aug 25 23:53:22 backup1 mfi0: 141349 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 70 20 67 00 00 bb 00
Aug 25 23:53:22 backup1 mfi0: 141350 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 6f d6 fe 00 00 c3 00
Aug 25 23:53:22 backup1 mfi0: 141351 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 17 49 60 00 02 a0 00
Aug 25 23:53:22 backup1 mfi0: 141352 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 17 40 d9 00 00 e3 00
Aug 25 23:53:22 backup1 mfi0: 141353 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 70 04 00 00 03 e5 00
Aug 25 23:53:22 backup1 mfi0: 141354 (493860690s/0x0002/WARN) - PD 00(e0xfc/s0) Path 1221000000000000 reset (Type 03)
Aug 25 23:53:22 backup1 mfi0: 141355 (493860691s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d5 14 17 74 00 00 60 00
Aug 25 23:53:22 backup1 mfi0: 141356 (493860691s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 2a 00 d4 17 3c 00 00 04 00 00
Aug 25 23:53:22 backup1 mfi0: 141357 (493860691s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 2a 00 d4 70 19 91 00 00 de 00
Aug 25 23:53:22 backup1 mfi0: 141358 (493860691s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 70 50 00 00 03 e0 00
Aug 25 23:53:22 backup1 mfi0: 141359 (493860691s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 70 63 d7 00 00 29 00
Aug 25 23:53:22 backup1 mfi0: 141360 (493860691s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d3 e6 d2 6e 00 00 d4 00
Aug 25 23:53:22 backup1 mfi0: 141361 (493860691s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 2a 00 d4 70 1e 2e 00 00 bb 00
Aug 25 23:54:44 backup1 mfi0: 141362 (493860691s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 2a 00 d4 70 38 1d 00 00 9b 00
Aug 25 23:54:44 backup1 mfi0: 141363 (493860691s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 70 64 00 00 00 b0 00
Aug 25 23:54:44 backup1 mfi0: 141364 (493860691s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d5 14 24 00 00 03 f7 00
Aug 25 23:54:44 backup1 mfi0: 141365 (493860691s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 70 48 e5 00 02 d4 00
Aug 25 23:54:44 backup1 mfi0: 141366 (493860691s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0) Path 1221000000000000, CDB: 28 00 d4 70 58 00 00 04 00 00
Aug 25 23:54:44 backup1 mfi0: 141367 (493860702s/0x0010/CRIT) - SAS topology error: Unaddressable device
Aug 25 23:54:44 backup1 mfi0: 141368 (493860703s/0x0010/CRIT) - SAS topology error: Unaddressable device
Aug 25 23:54:44 backup1 mfi0: 141369 (493860703s/0x0010/CRIT) - SAS topology error: Unaddressable device
Aug 25 23:54:44 backup1 swap_pager: indefinite wait buffer: bufobj: 0, blkno: 35569, size: 12288
Aug 25 23:54:44 backup1 mfi0: 141370 (493860716s/0x0002/WARN) - PD 110(e0x00/s16) Path 1221000000000000 reset (Type 03)
Aug 25 23:54:44 backup1 swap_pager: indefinite wait buffer: bufobj: 0, blkno: 34801, size: 20480
Aug 25 23:54:44 backup1 mfi0: 141371 (493860730s/0x0002/WARN) - PD 110(e0x00/s16) Path 1221000000000000 reset (Type 03)
Aug 25 23:54:44 backup1 swap_pager: indefinite wait buffer: bufobj: 0, blkno: 35569, size: 12288
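For anyone curious what those CDB bytes mean: a 10-byte CDB starting with 28 is a SCSI READ(10) and one starting with 2a is a WRITE(10), with the logical block address in bytes 2 to 5 and the transfer length in bytes 7 and 8, both big-endian. Here is a minimal sketch that decodes them; the function name `decode_cdb10` is my own invention for illustration, and it only knows the two opcodes that appear in this log.

```python
# Opcodes seen in the log excerpt; any other opcode is shown as raw hex.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_cdb10(cdb_hex):
    """Decode a 10-byte SCSI CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    return {
        "op": OPCODES.get(cdb[0], hex(cdb[0])),
        # Bytes 2-5: logical block address, big-endian.
        "lba": int.from_bytes(cdb[2:6], "big"),
        # Bytes 7-8: number of blocks to transfer, big-endian.
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


# First timed-out command in the log above.
print(decode_cdb10("28 00 d4 70 13 da 00 00 26 00"))
```

So the timeouts are ordinary reads and writes aimed at PD 00, which fits a disk that has simply stopped answering.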
Looks like an HDD failure.
In short, the array went degraded.
Aug 25 23:54:44 backup1 swap_pager: indefinite wait buffer: bufobj: 0, blkno: 35569, size: 12288
Aug 25 23:54:44 backup1 mfi0: 141374 (493860772s/0x0002/WARN) - PD 110(e0x00/s16) Path 1221000000000000 reset (Type 03)
Aug 25 23:54:44 backup1 mfi0: 141375 (493860773s/0x0002/WARN) - Removed: PD 00(e0xfc/s0)
Aug 25 23:54:46 backup1 mfi0: 141376 (493860773s/0x0002/info) - Removed: PD 00(e0xfc/s0) Info: enclPd=fc, scsiType=0, portMap=00, sasAddr=1221000000000000,0pass0 at mfi0 bus 0 scbus0 target 0 lun 0
Aug 25 23:54:46 backup1 pass0: <ATA ST2000DM001-1CH1 CC43> s/n Z1E0ZY4H detached
Aug 25 23:54:46 backup1 000000000000000
Aug 25 23:54:46 backup1 (pass0:mfi0:0:0:0): Periph destroyed
Aug 25 23:54:46 backup1 mfi0: 141377 (493860773s/0x0002/info) - State change on PD 00(e0xfc/s0) from ONLINE(18) to FAILED(11)
Aug 25 23:54:46 backup1 mfi0: 141378 (493860773s/0x0001/info) - State change on VD 00/0 from OPTIMAL(3) to DEGRADED(2)
Aug 25 23:54:46 backup1 mfi0: 141379 (493860773s/0x0001/CRIT) - VD 00/0 is now DEGRADED
Aug 25 23:54:46 backup1 mfi0: 141380 (493860773s/0x0001/info) - State change on VD 01/1 from OPTIMAL(3) to DEGRADED(2)
Aug 25 23:54:46 backup1 mfi0: 141381 (493860773s/0x0001/CRIT) - VD 01/1 is now DEGRADED
Aug 25 23:54:46 backup1 mfi0: 141382 (493860773s/0x0002/info) - State change on PD 00(e0xfc/s0) from FAILED(11) to UNCONFIGURED_BAD(1)
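The "State change" lines are the ones that actually tell the story, so here is a rough sketch for pulling them out of a pile of mfi0 messages. The regular expression is based only on the line format seen in this log; the names `STATE_CHANGE` and `state_changes` are made up for illustration, and other firmware or driver versions may word these messages differently.

```python
import re

# Matches e.g. "State change on PD 00(e0xfc/s0) from ONLINE(18) to FAILED(11)".
STATE_CHANGE = re.compile(
    r"State change on (?P<kind>PD|VD) (?P<dev>\S+) "
    r"from (?P<old>\w+)\(\d+\) to (?P<new>\w+)\(\d+\)"
)


def state_changes(lines):
    """Return (kind, device, old_state, new_state) for each state-change line."""
    changes = []
    for line in lines:
        m = STATE_CHANGE.search(line)
        if m:
            changes.append((m["kind"], m["dev"], m["old"], m["new"]))
    return changes


sample = [
    "Aug 25 23:54:46 backup1 mfi0: 141377 (493860773s/0x0002/info) - State change on PD 00(e0xfc/s0) from ONLINE(18) to FAILED(11)",
    "Aug 25 23:54:46 backup1 mfi0: 141378 (493860773s/0x0001/info) - State change on VD 00/0 from OPTIMAL(3) to DEGRADED(2)",
    "Aug 25 23:54:46 backup1 mfi0: 141379 (493860773s/0x0001/CRIT) - VD 00/0 is now DEGRADED",
]
for kind, dev, old, new in state_changes(sample):
    print(f"{kind} {dev}: {old} -> {new}")
```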
So I swapped in a spare HDD, and that was that.
Here is the timeline up to the replacement.
Aug 26 00:20:33 backup1 mfi0: 141383 (493862322s/0x0010/CRIT) - SAS topology error: Unaddressable device
Aug 26 00:28:37 backup1 mfi0: 141384 (493862805s/0x0002/info) - Inserted: PD 00(e0xfc/s0)
Aug 26 00:28:38 backup1 mfi0: 141385 (493862805s/0x0002/info) - Inserted: PD 00(e0xfc/s0) Info: enclPd=fc, scsiType=0, portMap=00, sasAddr=1221000000000000,0000000000000000
Aug 26 00:28:38 backup1 mfi0: 141386 (493862806s/0x0002/info) - State change on PD 00(e0xfc/s0) from UNCONFIGURED_BAD(1) to UNCONFIGURED_GOOD(0)
Aug 26 00:28:38 backup1 mfi0: 141387 (493862806s/0x0002/info) - State change on PD 00(e0xfc/s0) from UNCONFIGURED_GOOD(0) to OFFLINE(10)
Aug 26 00:28:38 backup1 mfi0: 141388 (493862806s/0x0002/info) - Rebuild automatically started on PD 00(e0xfc/s0)
Aug 26 00:28:38 backup1 mfi0: 141389 (493862806s/0x0002/info) - State change on PD 00(e0xfc/s0) from OFFLINE(10) to REBUILD(14)
Aug 26 00:28:38 backup1 mfi0: 141390 (493862806s/0x0020/info) - Patrol Read complete
Aug 26 00:28:38 backup1 mfi0: 141391 (493862807s/0x0020/WARN) - Patrol Read can't be started, as PDs are either not ONLINE, or are in a VD with an active process, or are in an excluded VD
Aug 26 00:28:39 backup1 mfi0: 141392 (493862808s/0x0020/WARN) - Patrol Read can't be started, as PDs are either not ONLINE, or are in a VD with an active process, or are in an excluded VD
Within 45 minutes of detecting the failure, the rebuild was under way!
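As a sanity check on the timeline, the controller's own seconds counter (the number before `s/` in each mfi0 line) can simply be subtracted. The sketch below uses only two events visible in the excerpts: the first command timeout shown and the automatic rebuild start. The excerpt may well not begin at the very first error, so treat the result as a lower bound on the elapsed time; `event_seconds` is a hypothetical helper name.

```python
import re

# Matches the "(493860690s/0x0002/WARN)" part of an mfi0 event line.
EVENT_SECONDS = re.compile(r"\((\d+)s/0x[0-9a-f]+/\w+\)")


def event_seconds(line):
    """Extract the controller's seconds counter from an mfi0 event line."""
    m = EVENT_SECONDS.search(line)
    if m is None:
        raise ValueError("no mfi event timestamp in line")
    return int(m.group(1))


first_timeout = "mfi0: 141334 (493860690s/0x0002/WARN) - Command timeout on PD 00(e0xfc/s0)"
rebuild_start = "mfi0: 141388 (493862806s/0x0002/info) - Rebuild automatically started on PD 00(e0xfc/s0)"

elapsed = event_seconds(rebuild_start) - event_seconds(first_timeout)
minutes, seconds = divmod(elapsed, 60)
print(f"{minutes} min {seconds} s between first shown timeout and rebuild start")
```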
[root@backup1 ~]# mfiutil show drives
mfi0 Physical Drives:
0 ( 1863G) REBUILD <ST2000DL003-9VT1 CC3C serial=6YD1QGH1> SATA E1:S0
1 ( 1863G) ONLINE <ST2000DM001-1CH1 CC43 serial=Z1E0ZY3C> SATA E1:S1
2 ( 1863G) ONLINE <ST2000DM001-1CH1 CC43 serial=Z1E0ZXAB> SATA E1:S2
3 ( 1863G) ONLINE <ST2000DM001-1CH1 CC43 serial=W1E0Z3RJ> SATA E1:S3
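Since the LEDs were no help this time, it might be worth having something check that output periodically. Below is a minimal sketch that parses drive lines in the format shown above and lists any drive that is not ONLINE. The regular expression is derived only from this one sample of `mfiutil show drives` output, so other mfiutil versions may need adjustments, and `DRIVE_LINE` and `not_online` are names invented for this example.

```python
import re

# One drive line, e.g.
# "0 ( 1863G) REBUILD <ST2000DL003-9VT1 CC3C serial=6YD1QGH1> SATA E1:S0"
DRIVE_LINE = re.compile(
    r"^\s*(?P<id>\d+)\s+\(\s*(?P<size>\S+)\)\s+(?P<state>\S+)\s+"
    r"<(?P<model>[^>]+)>\s+(?P<bus>\S+)\s+(?P<slot>\S+)\s*$"
)


def not_online(output):
    """Return (drive id, state, slot) for every drive whose state is not ONLINE."""
    problems = []
    for line in output.splitlines():
        m = DRIVE_LINE.match(line)
        if m and m["state"] != "ONLINE":
            problems.append((int(m["id"]), m["state"], m["slot"]))
    return problems


sample_output = """mfi0 Physical Drives:
0 ( 1863G) REBUILD <ST2000DL003-9VT1 CC3C serial=6YD1QGH1> SATA E1:S0
1 ( 1863G) ONLINE <ST2000DM001-1CH1 CC43 serial=Z1E0ZY3C> SATA E1:S1
2 ( 1863G) ONLINE <ST2000DM001-1CH1 CC43 serial=Z1E0ZXAB> SATA E1:S2
3 ( 1863G) ONLINE <ST2000DM001-1CH1 CC43 serial=W1E0Z3RJ> SATA E1:S3
"""
print(not_online(sample_output))
```

In real use the text would come from running the command (for example via `subprocess.run`) instead of a hard-coded string.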
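Interim fix done in 45 minutes. For a hardware failure, that beats a 3-hour on-site response! Ahem.
Still, it is loud... It is not going to stop crying until the rebuild finishes, is it...
Now all that is left is to pray that another drive does not die during the rebuild...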