View Full Version : Imminent RAID 5/6 failures



Flatty
8th January 2010, 06:10 AM
Blinky seems to think I know squat about computers (yeah, dude - I detect the hints of sarcasm :p).

Anyway, I was just sitting here browsing, working out the costings for my home-built standalone, SAN/NAS, whatever, when I stumbled across this article, which basically puts a damper on any plans I had (though nothing of any importance that I possess isn't backed up safely on a DVD). I don't trust drives & can't wait until Bluray writers become a viable option.

Basically it says that the chances that you won't be able to rebuild your RAID when you replace a faulty drive are increased as drive capacity increases. If two drives fail, well - you should have been using RAID 6, shouldn't you? Unfortunately RAID 6 is also affected by the exponential increase in failure rate as drive capacity increases.

You'll be looking at RAID 1+0, or 0+1 if you really want to be safe(r), but my advice is that if you have anything you really want to keep safe for any length of time - put it on disk & store it somewhere sensible.

Unless of course you're loaded & can afford hectic corporate setups.

Here is the article, though you might want to go to the actual article to follow the links at the bottom:


Why RAID 5 stops working in 2009 (http://blogs.zdnet.com/storage/?p=162)

The storage version of Y2k? No, it’s a function of capacity growth and RAID 5’s limitations. If you are thinking about SATA RAID for home or business use, or using RAID today, you need to know why.

RAID 5 protects against a single disk failure. You can recover all your data if a single disk breaks. The problem: once a disk breaks, there is another increasingly common failure lurking. And in 2009 it is highly certain it will find you.

Disks fail
While disks are incredibly reliable devices, they do fail. Our best data - from CMU and Google - finds that over 3% of drives fail each year in the first three years of drive life, and then failure rates start rising fast.

With 7 brand new disks, you have ~20% chance of seeing a disk failure each year. Factor in the rising failure rate with age and over 4 years you are almost certain to see a disk failure during the life of those disks.

But you’re protected by RAID 5, right? Not in 2009.

Reads fail
SATA drives are commonly specified with an unrecoverable read error rate (URE) of 10^14. Which means that once every 100,000,000,000,000 bits, the disk will very politely tell you that, so sorry, but I really, truly can’t read that sector back to you.

One hundred trillion bits is about 12 terabytes. Sound like a lot? Not in 2009.

Disk capacities double
Disk drive capacities double every 18-24 months. We have 1 TB drives now, and in 2009 we’ll have 2 TB drives.

With a 7 drive RAID 5 disk failure, you’ll have 6 remaining 2 TB drives. As the RAID controller is busily reading through those 6 disks to reconstruct the data from the failed drive, it is almost certain it will see an URE.

So the read fails. And when that happens, you are one unhappy camper. The message “we can’t read this RAID volume” travels up the chain of command until an error message is presented on the screen. 12 TB of your carefully protected - you thought! - data is gone. Oh, you didn’t back it up to tape? Bummer!
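
A rough sketch of the arithmetic behind that claim, for anyone who wants to plug in their own numbers. It assumes the spec means one unrecoverable error per 10^14 bits read, that errors are independent of each other, and that 1 TB = 10^12 bytes. The function name `p_ure_during_rebuild` is made up for illustration and does not come from the article.

```python
import math

def p_ure_during_rebuild(surviving_drives, drive_tb, bits_per_ure=1e14):
    """Chance of at least one URE while reading every surviving drive in full.

    Assumes independent bit errors at the spec'd rate and 1 TB = 10**12 bytes.
    """
    bits_read = surviving_drives * drive_tb * 1e12 * 8
    # 1 - (1 - p)**n, written with log1p/expm1 so the tiny p does not round away
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_ure))

print(f"6 x 2 TB at 10^14: {p_ure_during_rebuild(6, 2):.0%}")
print(f"6 x 1 TB at 10^14: {p_ure_during_rebuild(6, 1):.0%}")
print(f"6 x 2 TB at 10^15: {p_ure_during_rebuild(6, 2, 1e15):.0%}")
```

Under these assumptions the 6 x 2 TB case works out to roughly 62%. Real drives may do better or worse than the spec sheet figure, so treat the output as a ballpark only.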

So now what?
The obvious answer, and the one that storage marketers have begun trumpeting, is RAID 6, which protects your data against 2 failures. Which is all well and good, until you consider this: as drives increase in size, any drive failure will always be accompanied by a read error. So RAID 6 will give you no more protection than RAID 5 does now, but you’ll pay more anyway for extra disk capacity and slower write performance.

Gee, paying more for less! I can hardly wait!

The Storage Bits take
Users of enterprise storage arrays have less to worry about: your tiny costly disks have less capacity and thus a smaller chance of encountering an URE. And your spec’d URE rate of 10^15 also helps.

There are some other fixes out there as well, some fairly obvious and some, I’m certain, waiting for someone much brighter than me to invent. But even today a 7 drive RAID 5 with 1 TB disks has a 50% chance of a rebuild failure. (2007) RAID 5 is reaching the end of its useful life.

Update: I’ve clearly tapped into a rich vein of RAID folklore. Just to be clear I’m talking about a failed drive (i.e. all sectors are gone) plus an URE on another sector during a rebuild. With 12 TB of capacity in the remaining RAID 5 stripe and an URE rate of 10^14, you are highly likely to encounter a URE. Almost certain, if the drive vendors are right.



The key point that seems to be missed in many of the comments is that when a disk fails in a RAID 5 array and it has to rebuild there is a significant chance of a non-recoverable read error during the rebuild (BER / UER). As there is no longer any redundancy the RAID array cannot rebuild, this is not dependent on whether you are running Windows or Linux, hardware or software RAID 5, it is simple mathematics. An honest RAID controller will log this and generally abort, allowing you to restore undamaged data from backup onto a fresh array.

Thus my comment about hoping you have a backup.

Mr. Newcombe, just as I was beginning to like him, then took me to task for stating that “RAID 6 will give you no more protection than RAID 5 does now”. What I had hoped to communicate is this: in a few years - if not 2009 then not long after - all SATA RAID failures will consist of a disk failure + URE.

RAID 6 will protect you against this quite nicely, just as RAID 5 protects against a single disk failure today. In the future, though, you will require RAID 6 to protect against single disk failures + the inevitable URE and so, effectively, RAID 6 in a few years will give you no more protection than RAID 5 does today. This isn’t RAID 6’s fault. Instead it is due to the increasing capacity of disks and their steady URE rate. RAID 5 won’t work at all, and, instead, RAID 6 will replace RAID 5.

Originally the developers of RAID suggested RAID 6 as a means of protecting against 2 disk failures. As we now know, a single disk failure means a second disk failure is much more likely - see the CMU pdf Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? for details - or check out my synopsis in Everything You Know About Disks Is Wrong. RAID 5 protection is a little dodgy today due to this effect and RAID 6 - in a few years - won’t be able to help.

Finally, I recalculated the AFR for 7 drives using the 3.1% AFR from the CMU paper, using the formula suggested by a couple of readers - 1 - 0.969^(# of disks) - and got 19.8%. So I changed the ~23% number to ~20%.
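
The recalculation above can be checked in a couple of lines. This is only a sketch: it assumes every drive fails independently at the same 3.1% annual rate, and the function name `p_any_drive_fails` is invented for the example.

```python
def p_any_drive_fails(n_drives, afr=0.031):
    """Chance that at least one of n_drives fails within a year.

    Assumes independent failures, all at the same annual failure rate (AFR).
    """
    return 1 - (1 - afr) ** n_drives

print(f"{p_any_drive_fails(7):.1%}")  # 19.8%
```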

---------- Post added at 06:10 AM ---------- Previous post was at 04:25 AM ----------

Soooo....

Seeing as it is quite apparent you shouldn't entrust your valuable collection of legal :p music, movies & software to your lovely expensive SAN/NAS/whatever - what are your options?

Blinkers did ask me what the pricing for a tape drive backup system was, so I did some digging. If I still had my Axxess account it would be a lot cheaper.

Just an example:

HP StorageWorks DAT 320 USB External Tape Drive (AJ825A)

Capacity (per tape): 320 GB Maximum, compressed 2:1
Transfer Rate: 86.4 GB/hr Maximum, compressed 2:1

http://h10003.www1.hp.com/digmedialib/prodimg/lowres/c00664523.jpg

For the drive: $1224.00 / R9100.00
For each 320GB tape: $45.00 / R334

So, for Blinker's 2.5 TB of data, you need 8 of these tapes.

Total cost: R11772.00
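
The sum above as a small sketch, in case you want to try other data sizes or prices. The defaults are the rand figures quoted in this post; the function name `tape_backup_cost` is made up for the example.

```python
import math

def tape_backup_cost(data_gb, tape_gb=320, drive_rand=9100, tape_rand=334):
    """Return (number of tapes, total cost in rand) for one full backup."""
    tapes = math.ceil(data_gb / tape_gb)  # a part-filled tape still costs a whole tape
    return tapes, drive_rand + tapes * tape_rand

tapes, total = tape_backup_cost(2500)
print(tapes, total)  # 8 11772
```

Bear in mind the 320 GB figure is the compressed (2:1) capacity. For data that is already compressed, such as music and movies, `tape_gb=160` is probably the safer assumption, and that doubles the tape count.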

AND there, chaps, is the safe solution to ensuring that you have a backup of your inevitably failing RAID

Then again, what if the tape itself fails? Shouldn't you have 2 of each, wait, make that 3, no, better make it 4 just to be safe :p

OR be like me & not give a toss because I have everything backed up on DVD :D

Vortex
8th January 2010, 07:23 AM
Why not an array of USB2 drives that you do a backup on? Faster than tape.. More useful for recoveries etc. too..

Did I mention cheaper?

EyeBall
8th January 2010, 07:32 AM
Watch out for tape drives, you need to replace your tapes every 6 - 12 months, otherwise you might think you have a backup, but you don't.

We're changing over to removable HDDs for backup, the transfer rates are much better than tape drives and they're much cheaper.

You can get away with about 1200 bucks for a 750GB HDD

Tape backup units can cost up to R 950 for a 320GB.

Arbythep00nage
8th January 2010, 07:48 AM
I have already converted my tape stuff to external HDD at my sites. Tape is the most unreliable way. In the end you are pretty poked!
I trust HDD more than tape, and you can't store data for a long period of time on optical media, so despite the possible failures it's still best to have it on HDD.

Flatty
8th January 2010, 08:39 AM
Which makes sense, RB.

Essentially you want to get away from the thinking that a RAID is any form of redundancy when it comes to storing your precious data.

A hard disk can pack up at any time, as far as I understand, so I can't see how that would be a lasting solution - though if the thing is only ever plugged in when you need to restore a backup the chances it will fail in your lifetime are slim.

As for tape - mmmm, if it's stored in the recommended manner, it should last your lifetime. You're taking it out the machine & storing it, remember - not running it as a readily accessible backup.

Though - on second thoughts - buying a pair of external HDD's for every set of data I really wanted to keep would also leave me sleeping a lot easier at night - so, yeah - tape is out - even internals that you just plug in quickly for the job would make sense as an additional storage medium - why buy externals if you're unplugging them & storing them?

My cost on a 1TB drive = R609 excluding VAT :D

So, Blinkers - Back it up on internals for min moolah ;)

senorblinky
8th January 2010, 09:31 AM
Thanks Flatty. Pretty extensive. But your first line baffled me, i trust your advice and judgement when it comes to PCs, i KNOW you know much more than i do and you know what you're doing - now i need to go reread what i wrote, i can't imagine that i made any biting comments, or sarcastic ones at that.

WingNut
8th January 2010, 10:06 AM
Bah, that whole article is a bunch of FUD, I remember reading it when Harris first posted it and it still doesn't add up.

URE rate applies to a single disk. When you have 6 2TB disks the URE rate doesn't just magically add up and guarantee that you're going to experience a read error. It still only applies to individual disks.

A lot of it depends on your hardware as well, one vendor's RAID card may cause the whole RAID to fail when encountering an URE, while another may just log the fault, continue with the rest of the blocks to be recovered, and you'll end up with a recovered disk with 99.9% of your data, except for the one block that couldn't be recovered.

As for RAID 6 the ONLY way that URE is going to cause your RAID 6 array to fail is if you experience 2 URE's, on separate disks, while recovering a certain block... which is extremely unlikely to happen.
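
To put a rough number on "extremely unlikely", here is a back-of-envelope sketch for a 7-drive RAID 6 that has lost one 2 TB drive. It counts a stripe as lost only when two of the six surviving drives both return a URE inside that same stripe. The 64 KiB chunk size is purely an assumption (controllers differ), errors are assumed independent, and the function name `p_raid6_rebuild_loss` is invented for the example.

```python
import math

def p_raid6_rebuild_loss(surviving_drives=6, drive_bytes=2e12,
                         chunk_bytes=64 * 1024, bits_per_ure=1e14):
    """Rough chance that a single-failure RAID 6 rebuild hits an unrecoverable stripe.

    Assumes independent bit errors and an illustrative 64 KiB chunk per drive per stripe.
    """
    # Chance that one chunk on one drive contains at least one URE
    p_chunk = -math.expm1(chunk_bytes * 8 * math.log1p(-1.0 / bits_per_ure))
    # Leading-order chance that two drives are both bad in the same stripe
    p_stripe_lost = math.comb(surviving_drives, 2) * p_chunk ** 2
    stripes = drive_bytes / chunk_bytes
    # Chance that at least one stripe across the whole rebuild is lost
    return -math.expm1(stripes * math.log1p(-p_stripe_lost))

print(f"{p_raid6_rebuild_loss():.1e}")
```

With these assumed numbers the result is on the order of one in a hundred million, though a different chunk size or correlated errors on ageing drives would change it.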

The only sensible thing he says in the whole article is that you shouldn't depend solely on your array's redundancy, you should still backup your important data.

If you're worried about RAID, you may want to take a look at ZFS, a file system by Sun. It's AMAZING. It has some really cool capabilities, and with their RAID-Z system, no URE can ever cause you to lose your array.

sss
8th January 2010, 10:32 AM
they just need to raid a raid 5 collection!

Flatty
8th January 2010, 11:06 AM
... i can't imagine that i made any biting comments, or sarcastic ones at that.

Not that I am aware of :p


they just need to raid a raid 5 collection!

That sounds sensible.

Anyway - I wouldn't trust my NB data to a RAID without an additional backup.

senorblinky
8th January 2010, 11:27 AM
Anyway chaps, for those who care, i slotted in the new 1TB drive this morning, and all the lights on the drobo started flashing green and orange (which means it was repairing all the drives, rebuilding all the info), we just checked it now, and although it's still rebuilding, everything is accessible again, so the movie folder that went "unreadable" is now in tip top shape again and my cronies here at the office are having a leech.


What relief!


Flatty, you confuse me...

rainy
8th January 2010, 11:47 AM
Glad to hear you're sorted Blinks. When can I come and leech?

Flatty
8th January 2010, 12:17 PM
...

Flatty, you confuse me...

Dude, I confuse myself, how can you possibly expect me not to confuse you :p ?


Bah, that whole article is a bunch of FUD, I remember reading it when Harris first posted it and it still doesn't add up.

URE rate applies to a single disk. When you have 6 2TB disks the URE rate doesn't just magically add up and guarantee that you're going to experience a read error. It still only applies to individual disks.
...

Yeah, Wingers - now that I have thought about it & actually reread your post, it seems you are quite correct:

That 12TB of data he is using to do his calculations applies to a single disk only. So, on the 2TB disk, you will have a URE every 12TB worth of data transfers on one single disk only. So for the 6, you have 72TB worth of data transfers happening before a URE could occur on each disk & not necessarily at the same time.

Also - the odds of the URE happening while you are rebuilding an array are very slim. If it happens while the array is up, the data will just be rewritten to a clean sector, using redundancy from the other drives.

Panic Mechanic - I read it at 3 am & it made sense at the time.

This article: Four Disk RAID 10 fails with 3 good disks (http://communities.intel.com/message/77655;jsessionid=A373D786D0F79881D95D938184AB8401.node5COMS) highlights that although you should be able to rebuild your RAID under any circumstances, there is the issue of controller failure to contend with, as you pointed out.

If it were my precious collection of music, movies & pronz - I'd just get some internals and do a manual backup just in case the whole backup RAID device decides to throw a wobbly. 3 x R609 + VAT for peace of mind.

AND then again - this stupid moron on the intel forum started swapping his drives around in their bays just to see if that wouldn't help - uduh :D

But, you get my point, I hope?

Arbythep00nage
8th January 2010, 01:10 PM
Old box and readynas, tis all im saying.