Down Time and Data Loss
The truth about hard drives and managing the risk.
Hard Drive Failure
The hard drive is the part in your server that holds all your data. You may have 2 hard drives in your server: one holding the operating system, and one holding your data. Either way, the risk is the same - if you are relying on one hard disk to hold your data, you are running a huge risk of sudden and potentially catastrophic down time and data loss. It is easily the biggest risk you need to manage, and the one that most IT suppliers and installers manage worst.
Anyone who works in IT knows that hard drives fail all the time. Sometimes the cause is external: sudden impact, electrical surges, or environmental conditions (though there is debate about whether heat, often considered a factor in hard drive failure, actually makes any difference). Sometimes it is mechanical failure inside the drive. A gradual failure degrades the drive's performance and integrity, causing system instability and data loss. A sudden, catastrophic failure makes your system unusable immediately, and requires a rebuild and restore.
What are the chances of suffering a hard drive failure?
It is difficult to get hold of accurate statistics for this - as you can imagine, hard drive manufacturers don't readily offer this kind of information. They claim a failure rate of 1 or 2 percent over the life of the drive. Some real-world tests put this figure much higher, at 3 - 7 percent. A recent Carnegie Mellon University study of around 100,000 hard drives showed a replacement rate of between 2 and 4 percent per year, and up to 13 percent on some systems. Customers expect their computer systems to last at least 3 - 5 years, and an annual rate like this compounds over the life of the system, so you can see that putting your trust in a single hard drive can be very risky indeed. Our own experience varies - some recent batches of Maxtor hard drives that we used in-house have had failure rates of over 50 percent.
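To put the yearly figures into the 3 - 5 year context, here is a rough sketch in Python of how an annual failure rate adds up over the life of a server. It assumes the rate stays constant from year to year and that each year is independent, which is a simplification (real drives tend to fail more often as they age). The function name is ours, chosen for illustration.

```python
def cumulative_failure_probability(annual_rate: float, years: int) -> float:
    """Chance that a single drive fails at least once within `years`.

    Assumption: the annual failure rate is constant and each year is
    independent of the others. Real drives do not behave this neatly.
    """
    # Probability of surviving every year, subtracted from 1.
    return 1.0 - (1.0 - annual_rate) ** years


# Annual rates taken from the study figures quoted above.
for rate in (0.02, 0.04, 0.13):
    for years in (3, 5):
        p = cumulative_failure_probability(rate, years)
        print(f"{rate:.0%} per year over {years} years: {p:.1%}")
```

Under these assumptions, a 2 percent annual rate gives roughly a 6 percent chance of failure within 3 years, and a 4 percent annual rate gives roughly 18.5 percent within 5 years. At 13 percent per year, the chance of failure within 5 years is about even.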
When your hard drive fails, you then have to turn to your backup, and hope that:
1. The person responsible for changing the tapes or other media has actually been doing so
2. The backup software is actually backing up all the data you need (when did you last check?)
Even if your backup is 100 percent reliable, you are still going to suffer down time while the data is restored and systems are reconfigured. And if your hard drive fails at, say, 4.30 PM, you could lose all of that day's data, even if your backup from last night is intact.
It is entirely possible that you have never suffered a hard drive failure. You have been lucky, so far. In 4 years with Net Therapy I have seen hundreds of hard drive failures. Indeed our own business customers have had hard drive failures, but none of them have suffered down time or data loss as a result - we have not even had to turn their server off.
Why not?
All our business customers with servers supplied by Net Therapy have servers that support hot-swappable drives in a RAID 5 configuration with one hot spare available. This means that the server does not rely on a single hard drive, and is built in anticipation of hard drive failure. The data, together with parity information, is spread across the hard drives in such a way that if one of them fails, no data is lost and the system stays up. The faulty drive needs to be swapped out, but this can be done live, without shutting the machine down or opening it up: it is pulled out of the front of the server, and a new drive inserted. Within minutes, the system starts rebuilding data onto the replacement drive, and once the rebuild completes your system is back to full strength. Furthermore, with a hot spare the rebuild starts automatically as soon as a drive fails, so our systems can actually suffer up to 2 hard drive failures without any down time or data loss, provided the second failure comes after the rebuild onto the spare has finished.
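For readers curious how a server can lose a drive without losing data, the following Python sketch illustrates the parity idea behind RAID 5. It is a deliberate simplification: it uses a single stripe of three short data blocks plus one parity block, whereas a real RAID 5 controller works on fixed-size stripes and rotates the parity across all the drives. The block contents and the `xor_blocks` helper are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three "drives" each hold one data block of the same length.
data_blocks = [b"ACCOUNTS", b"PAYROLL!", b"INVOICES"]

# A fourth block holds the parity: the XOR of all the data blocks.
parity = xor_blocks(data_blocks)

# Simulate the failure of the second drive: its block is gone.
surviving = [data_blocks[0], data_blocks[2], parity]

# XOR-ing everything that survived reproduces the missing block.
rebuilt = xor_blocks(surviving)
print(rebuilt)  # b'PAYROLL!'
```

Because the parity only covers one missing block per stripe, this scheme survives one failed drive at a time. A second drive failing before the first has been rebuilt would leave two unknowns and nothing to solve them with.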
