A few months ago, a colleague alerted me to a critical bulletin released by the Hewlett Packard Enterprise Support Center. The bulletin warned about a firmware defect that had been detected in certain models of solid-state drives (SSDs) used in several different HPE systems and appliances. The title of the bulletin was difficult enough to parse at first: “HPE SAS Solid State Drives - Critical Firmware Upgrade Required for Certain HPE SAS Solid State Drive Models to Prevent Drive Failure at 32,768 Hours of Operation.” The bulletin was originally released in November and has since been updated four times, most recently late last month. You can read the full bulletin here.
What it all boils down to is that if you purchased one of the affected HPE systems and powered it on, you can expect the SSD in it to fail catastrophically after exactly 32,768 hours of operation, which works out to 3 years, 270 days, and 8 hours of powered-on time. Well, at least it’s nice to know when something is going to fail so you won’t have your socks knocked off when it happens.
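To make the arithmetic concrete, here is a minimal Python sketch that converts the 32,768-hour figure into years, days, and hours, and estimates how much runway a drive has left. It assumes you can already read a drive's power-on hours from its SMART data or from your vendor's tooling. The function names are hypothetical and exist only for illustration. The code also checks that 32,768 equals 2**15, which is simple arithmetic and not a cause stated in the bulletin title quoted above.

```python
from __future__ import annotations

from datetime import datetime, timedelta

# Failure point quoted in the bulletin title, in power-on hours.
FAILURE_THRESHOLD_HOURS = 32_768


def breakdown(hours: int) -> tuple[int, int, int]:
    """Split an hour count into (years, days, hours), using 365-day years."""
    total_days, rem_hours = divmod(hours, 24)
    years, days = divmod(total_days, 365)
    return years, days, rem_hours


def hours_remaining(power_on_hours: int) -> int:
    """Hours left before the threshold; 0 if the drive is already past it."""
    return max(FAILURE_THRESHOLD_HOURS - power_on_hours, 0)


def projected_failure(power_on_hours: int, now: datetime) -> datetime:
    """Estimated failure time, assuming the drive stays powered on from now."""
    return now + timedelta(hours=hours_remaining(power_on_hours))


if __name__ == "__main__":
    print(breakdown(FAILURE_THRESHOLD_HOURS))   # (3, 270, 8)
    print(FAILURE_THRESHOLD_HOURS == 2 ** 15)   # True
    # Hypothetical drive that has been running for 30,000 hours.
    print(hours_remaining(30_000))              # 2768
    print(projected_failure(30_000, datetime.now()))
```

The projection only holds for a drive that runs around the clock. A system that is powered off part of the time will reach the threshold later on the calendar, but at the same power-on hour count.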
Of course, faulty firmware isn’t the only thing that can cause problems with SSDs. Even SSDs that have seen only minimal use can fail suddenly and unexpectedly under certain kinds of loads. With hard disk drives (HDDs), at least you could get SMART errors warning you that your drive was in danger of bugging out pretty soon. SSDs, on the other hand, can fail prematurely without generating any SMART error conditions. Still, the incredible speed advantage that SSDs have over slower “spinning rust” technologies has led many companies to migrate much of their storage from HDDs to SSDs where their budget has allowed. And SSD prices continue to fall and are fast closing in on parity with the cost of HDDs.
But the question remains: How can you prepare your datacenter so a firmware issue like this won’t take down your servers and other appliances? I talked with several colleagues about this and have distilled their consensus below as a series of best practices or tips you should follow.
Sign me up!
The first thing you should do if you use SSDs, or have systems or appliances deployed that contain them, is to sign up for your vendor’s support alerts mailing list if they have one. And don’t purchase anything from a vendor that doesn’t offer a mailing list providing alerts about issues with their products. Unfortunately, with some vendors it can be hard to find out where to sign up for these kinds of support alerts or bulletins. For example, HP lets you sign up for Driver and Support eAlerts on this page, as well as other announcements that are more marketing oriented, simply by specifying your name, company, and email address. Dell lets you subscribe here to receive driver and firmware update notifications, but this requires that you first create a Dell account on MyAccount. For other vendors, however, you either have to Google terms like “support bulletins” or “subscribe to alerts,” or just go digging around on their website for information on how to subscribe (and whether they even have a list you can subscribe to).
Befriend your TAM
If you are an enterprise customer, then you have probably been assigned a technical account manager, or TAM, who works at the vendor and whose job is to help you get answers when you need them (and to convince you to buy more of their products). My advice is to try to build a good working relationship with your TAM and not just treat them as another grasping appendage of your vendor’s sales department. A good TAM can be a lifesaver in many kinds of difficult situations. A TAM you feel comfortable talking with, and who feels comfortable reaching out to you without worrying that they’re intruding or being viewed as too pushy, is just the person you need on your side when something like a critical firmware problem is discovered in one of their products. Ask your TAM to notify you if anything like this comes up on their radar, and tell them you’d appreciate a text or a direct call without delay if it does. A good TAM can not only warn you when there’s a firmware issue but can also help you find and possibly even deploy the needed firmware update once your vendor has released it. Or at least your TAM can connect you with someone on your vendor’s support team who actually knows their game and is not just following a script that was handed to them.
Make regular backups
My final word of advice should be a no-brainer, as it applies to anything in computing or networking that is storage-related: make sure you regularly back up the storage on all your systems. With server systems, this should be straightforward and there’s no need to discuss it any further. Network appliances are a different kettle of fish, however, because some of them may have SSD storage embedded within them but may not expose that storage externally, except perhaps to the vendor’s own authorized support personnel. In such cases, you may need to build some kind of load balancing capability into the part of your network where the device sits so that if the device unexpectedly fails, its workload can be handled by another device on your network. But don’t forget the importance of doing backups wherever they are possible.