PPRuNe Forums - View Single Post - Great PC mistakes that you have made <Nerd Mode>
Old 22nd Nov 2007, 00:37
  #36
Bushfiva
Hippopotomonstrosesquipidelian title
 
Join Date: Oct 2006
Location: is everything
Posts: 1,826
Had a drive fail in a RAID array. For those that may not know, it's lots of drives pretending, typically, to be one drive. If one drive fails, there's enough error correction data spread across the other drives to rebuild the data onto a new drive. This unit was 10 drives, of which 2 were "hot spares": 8 drives for the data, 2 ready to take over automagically if one of the 8 failed.
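For anyone curious how the "rebuild the data onto a new drive" part can work, here is a minimal sketch. It assumes simple single-parity XOR (RAID 5 style), which may well not be what this particular unit used, and the block sizes, the 7-data-plus-1-parity layout and the names `xor_blocks` and `rebuild` are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical stripe across 8 drives: 7 data blocks plus 1 parity block.
data = [bytes([i] * 4) for i in range(1, 8)]
parity = xor_blocks(data)
stripe = data + [parity]

def rebuild(stripe, failed_index):
    """Recreate the block from the failed drive by XOR-ing all the survivors."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)

# Pretend drive 3 died: its contents come back from the other seven.
recovered = rebuild(stripe, 3)
```

Because XOR-ing everything in the stripe gives zero, any single missing block equals the XOR of the rest, which is why one dead drive is survivable and two at once (in this simple scheme) are not.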

Anyway, this was when RAID arrays were still quite a novel idea. It was a big animal. So a drive failed. No problem, one of the hot spares took over and the array started rebuilding itself. When I had a moment after the array was rebuilt, I pulled the dead drive and inserted what would now be the new hot spare. Took the old drive away and subjected it to the company's destruction policy, which involved a hammer.

Checked the RAID array some time later, and it's reporting a dead drive, but no worries, a hot spare's taken over. Hmmm. Pulled the dead drive, inserted the new drive, took the old drive away and destroyed it, ordered a new drive by overnight courier.

Checked the RAID array some time later, and it's reporting a dead drive, but no worries, a hot spare's taken over... A longer Hmmmmmmmmm this time. After the rebuild, pulled the dead drive, inserted the new drive, and watched this time. After about a minute, RAID reports failure, but no worries, a hot spare's taken over.

So, I checked the pulled drive: it's fine (this actually took a day or so since no one knew how to test a drive from a RAID array without risking the data on it). Then we shut down the RAID array, and over the next two days checked for dust, cleaned the filters, looked for signs of overheating, unseated chips (some were still socketed at this time), loose connectors, strained wires, pulled the power supplies (two of everything: this thing was supposed to work no matter what) and did a load check, tested earth paths, power cords, had the powerco check the wall outlets for power quality, scoped the network cabling for floating wires and everything else we could think of.

The RAID failure lights were wired up wrong.