Your guide to storage (Part 2)

If you would like to be notified when Scott Lowe releases the next part of this article series, please sign up for the Real time article update newsletter.


In part 1 of this series, you learned about IOPS and some enterprise storage features. In this part, you will learn about RAID basics and why RAID 5 has become a controversial topic. In the next part of this series, you will learn about many of the RAID levels still commonly in use.


RAID has been around for a long time and, while once heralded as the most important storage feature around, has since begun a rapid decline as storage workloads surpass RAID’s ability to reasonably cope with failures. Today, while RAID remains in use, it’s often just one part of a larger set of data protection systems.

There are a whole lot of different RAID levels and, at one time, each RAID level was used in certain situations. Today, though, some RAID levels have fallen into disuse and are rarely, if ever, seen in the wild. As such, the discussion in this article is limited to those RAID levels that are still commonly in use in the data center.

Before getting started with a discussion on RAID levels, though, it’s important to understand why RAID itself has been so important. As you probably know, hard drives tend to fail every so often. Nothing is worse than a hard drive failing and taking with it all of an organization’s data assets. In fact, major data loss has traditionally been a reason that many businesses fail. So, failing to protect data really is a “bet the business” proposition.

But why do hard drives fail in the first place? It comes down to physics. Traditional hard drives have a whole lot of moving parts inside. The disk platters spin at insane speeds and drive heads literally float on a minuscule cushion of air, seeking out data stored on those platters. When you think about it in macro terms, the fact that this even works is kind of amazing.

But, like all things mechanical, things can go wrong. Heads can crash into the disk, scraping the platters, or platters can fail to spin. Bad spots can develop on the platters. A disk can be dropped or otherwise damaged. In short, hard drives will sometimes fail.

Every hard drive comes with a set of manufacturer-provided specifications regarding failure. The first specification is MTBF, or Mean Time Between Failures. Expressed in hours, MTBF is a way to tell consumers how long their drives will last… on average. MTBF is an “in a perfect world” figure and doesn’t always take operating conditions into account. There are all kinds of environmental factors that can result in a drive failing long before it should, including dust, vibration, human error, and more.
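To make an MTBF figure a bit more tangible, here is a minimal sketch that converts it into a rough annualized failure rate, assuming a constant failure rate over the drive's service life. The 1,200,000-hour MTBF used below is a hypothetical spec-sheet value, not a figure from this article:

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days

def annualized_failure_rate(mtbf_hours: float) -> float:
    """Approximate probability that a single drive fails within one year,
    assuming a constant (exponential) failure rate."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

afr = annualized_failure_rate(1_200_000)  # roughly 0.73% per year
```

That sub-1% figure is the "perfect world" number; real-world conditions like heat and vibration push the observed rate higher.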

But, when it comes to RAID, it’s the second major metric that’s of real interest. This metric is known as the Bit Error Rate (BER) and is sometimes called the unrecoverable read error (URE) rate. Simply put, the BER is a measure of the number of bits that can be read before one of the bits is not readable. The bits in this case are the various units of information stored on the hard drive. Bit error rates are expressed in terms such as “1 in 10^15 bits read” or “1 in 10^16 bits read”. What this means is that, eventually, a disk may encounter a situation in which it’s not able to read a bit from the media. When this happens, the information stored there is not recoverable. Basically, that data is lost.
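Treating each bit read as an independent trial, the chance of hitting at least one URE can be sketched as follows. The 12 TB read size and the "1 in 10^15" rate below are illustrative assumptions, not figures from this article:

```python
import math

def ure_probability(bits_read: float, ber: float) -> float:
    """Probability of at least one unrecoverable read error when reading
    bits_read bits, assuming independent per-bit errors at rate ber."""
    # log1p keeps the arithmetic numerically stable when ber is tiny (e.g. 1e-15)
    return 1 - math.exp(bits_read * math.log1p(-ber))

# Reading a hypothetical 12 TB drive end to end (12e12 bytes = 9.6e13 bits)
# at a "1 in 10^15" spec gives roughly a 9% chance of hitting a URE:
p = ure_probability(9.6e13, 1e-15)
```

The takeaway: the more data you read in one go, the more likely you are to trip over an unreadable bit, which matters enormously for RAID rebuilds.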

In typical everyday operation, UREs don’t happen a whole lot. However, when a disk is undergoing intense read cycles, the chance of a URE increases. And that’s where RAID and BER intersect in a potentially bad way.

Let’s take a look at just one RAID level. RAID 5 was very commonly used in the data center and still enjoys wide use. RAID 5 provides customers with some data protection at relatively little capacity cost. In a RAID 5 array, total capacity overhead equates to one disk’s worth of storage. So, if you have 8 x 4 TB disks in your RAID 5 array, you would have seven disks’ worth of overall capacity, or 28 TB. That last disk’s worth of capacity is used to store parity information, which allows you to rebuild data in the event of a disk failure. This parity information is spread across all eight disks in the array.
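The capacity arithmetic is simple enough to sketch in a few lines. The function name and the parity_disks parameter are my own labels for illustration:

```python
def usable_capacity_tb(disk_count: int, disk_tb: float, parity_disks: int = 1) -> float:
    """Usable capacity of a parity RAID array: raw capacity minus an
    overhead equal to parity_disks drives (1 for RAID 5, 2 for RAID 6)."""
    return (disk_count - parity_disks) * disk_tb

# The 8 x 4 TB RAID 5 array from the article:
print(usable_capacity_tb(8, 4))                  # 28 TB usable
print(usable_capacity_tb(8, 4, parity_disks=2))  # 24 TB if run as RAID 6
```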

But here’s the issue. With massive disks, such as that aforementioned 4 TB behemoth, when a disk fails in the array, it takes an incredibly long time to rebuild that disk using the parity information that exists on the rest of the disks in the array. While a rebuild operation is taking place, the array is unprotected; the loss of another disk will result in the loss of all of the data in the array. The larger the disk, the longer the rebuild time and the higher the likelihood that another disk will fail in what is called a “double disk fault” situation. Worse, it is during this time that every disk in the array undergoes extremely heavy read traffic as parity information is scanned to rebuild the lost data. If the rebuild process happens to hit a URE, that’s it; the rebuild process stops.
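Putting the BER model and the array above together gives a rough sense of how risky a big RAID 5 rebuild is. This is a sketch under an independent per-bit error model; the two BER specs are typical published figures assumed here for illustration:

```python
import math

BITS_PER_TB = 8e12  # decimal terabytes, the way drive vendors count capacity

def rebuild_ure_probability(surviving_disks: int, disk_tb: float, ber: float) -> float:
    """Chance of at least one URE while a RAID 5 rebuild reads every
    surviving disk end to end (independent per-bit error model)."""
    bits_read = surviving_disks * disk_tb * BITS_PER_TB
    return 1 - math.exp(bits_read * math.log1p(-ber))

# Rebuilding the 8 x 4 TB array above means reading the 7 surviving disks:
p_desktop = rebuild_ure_probability(7, 4, 1e-14)     # "1 in 10^14" spec: roughly 89%
p_enterprise = rebuild_ure_probability(7, 4, 1e-15)  # "1 in 10^15" spec: roughly 20%
```

Numbers like these are exactly why large-disk RAID 5 rebuilds make storage practitioners nervous.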

It is for this reason that storage practitioners do not recommend using RAID 5 with particularly large disks and, in particular, with SATA disks, which are an order of magnitude more susceptible to UREs than SAS disks, although some enterprise-grade SATA disks do match SAS reliability. In fact, an entire movement has been built around eliminating the use of RAID 5 in the data center. However, there are also those who believe that the fear of RAID 5 has been overblown. If you’d like to learn more about this topic, take a look at this excellent article. It’s really up to the storage architect to determine how much risk the organization can withstand and to ensure that senior IT management is aware of that risk prior to implementation.

As a result of the fear of RAID 5, many recommend RAID 6 as the next logical step. Whereas RAID 5 can withstand the loss of only one disk in the array, RAID 6 can withstand the loss of up to two disks. Like RAID 5, RAID 6 uses a distributed parity scheme to protect data, but it does so twice. As you might guess, this means that RAID 6 requires more capacity overhead than RAID 5. To be exact, RAID 6 requires capacity overhead equaling two disks’ worth of capacity in the array.

Unfortunately, this isn’t the only overhead imposed by a RAID 6 array. In fact, capacity generally isn’t nearly as impacted as write performance. Here’s why: every write request sent to a RAID 6 array requires six I/O operations on storage (the existing data and both parity blocks must be read, then all three written back), although custom hardware processors can often help ensure that these operations are performed quickly. In addition, there are some dual-parity schemes that can provide RAID 6-like protection without the massive performance overhead.
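The write penalty translates directly into how many front-end writes an array can sustain. A minimal sketch, using the commonly cited penalties of 4 I/Os per write for RAID 5 and 6 for RAID 6; the 150-IOPS-per-disk figure is a hypothetical value for illustration:

```python
RAID5_WRITE_PENALTY = 4  # read old data + old parity, write new data + new parity
RAID6_WRITE_PENALTY = 6  # plus a read and a write for the second parity block

def effective_write_iops(raw_iops: float, write_penalty: int) -> float:
    """Front-end write IOPS an array can sustain once each logical write
    fans out into write_penalty back-end I/O operations."""
    return raw_iops / write_penalty

# A hypothetical 8-disk array at 150 IOPS per disk = 1,200 raw back-end IOPS:
raid5_writes = effective_write_iops(1200, RAID5_WRITE_PENALTY)  # 300.0
raid6_writes = effective_write_iops(1200, RAID6_WRITE_PENALTY)  # 200.0
```

In other words, moving the same spindles from RAID 5 to RAID 6 trades a third of your write throughput for the second disk’s worth of protection.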


With some RAID basics understood, in Part 3, we’ll go on an in-depth journey into what makes various RAID levels work.


If you would like to read the first part in this article series please go to Your guide to storage (Part 1).
