Avoiding storage failures
If you would like to read previous articles in this series, please go to:
- Enterprise hard drives
- Choosing a storage solution
- Data tiering and overprovisioning
- Data tiering strategies
- Data tiering and service level agreements
As we discussed in the previous article in this series, establishing service level agreements (SLAs) for each tier of a data tiering solution can help motivate the business units of your organization to buy into such a solution. You should obtain that buy-in before you try to implement your data tiering solution, because successful data tiering requires migrating most business data to Tier 2 storage and beyond, and the business units have to agree to that since they're the ones that own the data.
But there's another important criterion for successful data tiering, namely that IT must be able to deliver on the promises it makes in its data tiering SLAs. The reality is that storage failures will happen, and IT needs to be able to anticipate them, avoid them if possible, and deal with them effectively when they occur.
Figure 1 illustrates a four-pronged approach to managing storage failures, and the sections below provide some details concerning each of these approaches. The analysis presented here isn't exhaustive; it is merely intended to get you thinking seriously about the subject.
Figure 1: How to manage storage failures.
Start by identifying the types of failures that might cause problems accessing storage in your environment. Remember that storage failure is a general concept. In other words, anything that can prevent an application, service or user from storing or accessing data on a storage array or device can be considered a form of storage failure.
In addition, anything that prevents the terms of a storage SLA from being met can also be considered a form of storage failure. For example, if your SLA for Tier 1 storage guarantees a maximum of 50 ms latency but actual latency has risen to 100 ms, then this should be considered a storage failure. Or if your SLA for Tier 2 storage guarantees that requests for additional storage will be processed, approved and provisioned within seven working days, but after two weeks the requested storage still hasn't been provisioned, then this too should be considered a storage failure.
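The two SLA examples above can be expressed as simple checks. What follows is only a minimal sketch in Python: the function names are hypothetical, the thresholds are the example figures from the text, and working days are assumed to be Monday through Friday with no holiday calendar.

```python
from datetime import date, timedelta

TIER1_MAX_LATENCY_MS = 50      # example Tier 1 SLA limit from the text
TIER2_MAX_PROVISION_DAYS = 7   # example Tier 2 SLA limit, in working days


def latency_breached(measured_ms: float, limit_ms: float = TIER1_MAX_LATENCY_MS) -> bool:
    """Return True when measured latency exceeds the SLA limit."""
    return measured_ms > limit_ms


def working_days_between(start: date, end: date) -> int:
    """Count Monday-to-Friday days after `start`, up to and including `end`.

    Assumption: no holiday calendar is applied.
    """
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days += 1
    return days


def provisioning_breached(requested: date, today: date,
                          limit_days: int = TIER2_MAX_PROVISION_DAYS) -> bool:
    """Return True when an unfulfilled request has been open longer than the SLA allows."""
    return working_days_between(requested, today) > limit_days
```

Under these assumptions, a measured latency of 100 ms counts as a failure, and so does a storage request that is still open two calendar weeks (ten working days) after it was submitted.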
In other words, the idea is that storage "failure" fundamentally refers to failure to meet the terms of the SLA for the storage system. Such failures can be caused by a number of different things such as:
- Failure of an actual storage device, such as a hard disk drive or a RAID controller card.
- Failure of an interconnect, such as a Fibre Channel cable, SATA cable, or even an Ethernet network cable.
- Failure of related software, for example a compatibility issue arising from upgrading storage management software, a problem related to a software update applied to an operating system, a problem arising from installing the latest version of a device driver, a problem arising from flashing the BIOS of a system, and so on.
- Human error, for example misconfiguration of the snapshot functionality of a storage array or failure to properly terminate a cabling connection.
- Environmental issues, such as failure of an air conditioning system causing a storage system to overheat and crash.
- Design issues, such as failure to install sufficient air conditioning or power to meet the needs of a rack-mounted storage array.
- An accident, such as accidental deletion of a file, folder, volume, LUN, backup, or configuration data.
What other possible types of storage failure can you identify in your environment? Stop reading and think about it for a few moments.
Avoiding problems is the best way of dealing with them. For hardware failures, this means taking steps to minimize their possible effect. The key concepts here include redundancy, failover, and hot spares.
Some of the ways you can prevent problems with your storage infrastructure include:
- Implementing failover clustering so that when one node goes down the other can assume the load.
- Implementing redundancy so that when one technology or pathway fails, data can be routed through another technology or pathway. Dual storage controllers are a must-have here.
- Implementing replication so that hot data can be stored at more than one site.
- Having spares handy, including spare drives, spare controllers, spare cabling, and even entire spare systems.
- Making sure you have UPS systems, portable air conditioners, diesel generators, and tanks of fuel handy.
- Testing everything to make sure it works!
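As an illustration of testing redundancy, the sketch below flags any host that is left with fewer than two working storage paths. It is a hedged example only: the function name and the data layout are hypothetical, and it assumes the up/down state of each path has already been collected by whatever monitoring or multipath tooling you use.

```python
def hosts_lacking_redundancy(paths_by_host, minimum_healthy=2):
    """Return hosts with fewer than `minimum_healthy` working paths, sorted by name.

    `paths_by_host` maps a host name to a dict of path name -> True if the path is up.
    The layout is an assumption made for this example.
    """
    at_risk = []
    for host, paths in paths_by_host.items():
        healthy = sum(1 for up in paths.values() if up)
        if healthy < minimum_healthy:
            at_risk.append(host)
    return sorted(at_risk)


# Hypothetical path states for two hosts, each with two controller paths.
example_paths = {
    "db01": {"controller-a": True, "controller-b": True},
    "db02": {"controller-a": True, "controller-b": False},
}
at_risk_hosts = hosts_lacking_redundancy(example_paths)
```

A host that appears in the result is still serving data, but it has lost its redundancy and the next path failure would cause an outage.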
What other technologies or steps can you implement to prevent storage failures from occurring in your environment? Stop reading and think about it for a few moments.
Document everything related to your storage infrastructure. This includes:
- Documenting all of the products and technologies you are using, both hardware and software. This includes where they are installed and how they are configured.
- Labeling everything that might need to be identified quickly in the event of a failure. For example, label the cables and connections on your storage array.
- Documenting the processes for configuring, maintaining, updating, and recovering everything relating to your storage solution. For example, write down the steps for replacing a failed hard drive in your storage array. Write these steps in a way that anyone who has a minimal technical understanding can follow them.
- Keeping all documentation and labeling up to date. If something changes, revise the documentation or replace the labeling immediately. Each piece of documentation or labeling should record the name of the person who last modified it.
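One way to keep such records consistent is to store them in a structured form. The following is a minimal sketch, not a prescribed format: the field names are illustrative, and the 90-day review interval is an assumption made for the example rather than a recommendation from this series.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DocRecord:
    """One documented item; all field names are illustrative."""
    item: str             # e.g. a storage array or a written procedure
    location: str         # where it is installed or where the document lives
    last_modified: date   # when the record was last revised
    modified_by: str      # person who last changed the record


def stale_records(records, today, max_age_days=90):
    """Return records not revised within `max_age_days` (an assumed review interval)."""
    return [r for r in records if (today - r.last_modified).days > max_age_days]
```

Running a check like this on a schedule gives you a list of documents to review, along with the name of the person who last touched each one.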
Once you've documented your storage infrastructure, you also need to make sure that this documentation is easily accessible and that everyone who might need it knows where it can be found.
What other steps should you perform to ensure your storage solution is properly documented so you can quickly recover from a storage failure? Stop reading and think about it for a few moments.
The final pillar of managing storage failure is practice. You need to practice everything to make sure your products, processes, and documentation work as expected. Tasks you can practice include:
- Replacing a failed drive in a DAS, NAS or SAN storage array.
- Replacing a controller card.
- Properly installing cabling.
- Restoring a volume from a SAN snapshot or tape backup set.
- Recovering deleted files.
- Restoring the configuration of a storage server.
- Taking a cluster node offline for maintenance.
- Verifying all pathways in a redundant solution.
- Updating the nodes of a cluster.
What other activities should you practice to ensure you can quickly resolve storage failures in your environment? Stop reading and think about it for a few moments.
In the next article of this series I'll summarize some best practices you can implement that can increase storage performance, availability, searchability, and reliability in your IT environment.