Data is the lifeblood of business in today’s world. This means that when you are unable to access your company’s data for some reason, your business is in trouble. There are two basic ways you can respond when a data availability issue arises. First, you could panic. That usually doesn’t help, however, so it’s better to follow the second approach: take a few deep breaths and begin troubleshooting to find and fix what went wrong.
In the old days, which from an IT perspective means about a decade or so ago, troubleshooting data availability problems was fairly straightforward for several reasons. First, your organization usually owned its own data infrastructure hardware. Smaller enterprises and departments had their own file servers and tape drives, and large enterprises typically had storage area networks (SANs) and automated tape libraries. So when something went wrong, someone on your IT team could physically walk over to the server room and start pulling wires and flipping switches. Which brings me to the second reason why troubleshooting these issues was easier back then: organizations didn’t outsource as much as they do now, so they needed and maintained qualified IT staff.
Today, however, the data infrastructure of a typical business is different. Cloud computing has made deep inroads into becoming a key part of your organization’s data infrastructure. For smaller enterprises this change often brings greater simplicity in how they store, access, retrieve, and protect their business data. But for larger enterprises it has usually meant increased complexity because of the combination of legacy storage systems and cloud services they employ for storing and managing their data. Cost pressures, driven partly by the high savings expected from migrating to cloud solutions, have also thrown a wrench into the works. The relentless push to drive down costs and boost profits has led many companies to reduce IT staff by outsourcing much or all of their infrastructure maintenance. The reality, however, is that most organizations will still rely on a hybrid approach for supplying most of their IT needs, which means there are still practical steps you can take on premises to troubleshoot data availability problems when they occur. To illustrate this, let’s look at a few common problems and how to approach them.
Let’s say that a user in your organization has reported that they can’t access data resource X, which they need in order to perform their work. What could have gone wrong? The resource might have moved, perhaps because of a server migration or a change to its name or address during a network reconfiguration. Or maybe the user says they can read from X but can’t write to X, so they’re worried they can’t save the report they’re currently working on. This might be the result of a quota issue where the user’s limit has been reached. Or maybe there’s no more free space on the device, or the permissions have been changed because of a change in company policy. In such cases we can see that problems involving data availability can be much broader than simply issues with the hardware where your data is stored.
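For the read-but-can’t-write scenario, a few of these checks can be automated from a script on the affected machine. Here’s a minimal sketch in Python; the function name and messages are my own illustration, quota checks are platform-specific and left out, and the path passed in stands for whatever share or directory the user reported:

```python
import os
import shutil

def diagnose_write_failure(path):
    """Run basic local checks when a user can read but not write a resource."""
    findings = []
    if not os.path.exists(path):
        # The resource may have been moved or renamed during a migration
        findings.append("resource missing: it may have been moved or renamed")
        return findings
    # Out of free space on the volume?
    if shutil.disk_usage(path).free == 0:
        findings.append("volume is out of free space")
    # Permissions changed by policy?
    if not os.access(path, os.W_OK):
        findings.append("no write permission: check ACLs or policy changes")
    if not findings:
        findings.append("local checks passed: look at quotas or the network")
    return findings

print(diagnose_write_failure("/tmp"))
```

None of these checks is conclusive on its own, but together they quickly tell you whether the problem is local to the volume or lies further out in the infrastructure.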
Inability to access a NAS device could be caused by a hardware failure of the device, a firmware update gone wrong, a problem with your network, or an issue with the user’s workstation. Problems can also arise when a new device is introduced into your data infrastructure. For example, the IP address, drive letter, LUN, or mount point of a new storage volume or folder may conflict with an existing component of your data infrastructure. Difficulties accessing data stored on cloud volumes can also stem from DNS-related issues, changes in your perimeter firewall configuration, or expired certificates, all of which may need to be investigated.
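Ruling DNS in or out is usually quick. A sketch like the following, using only Python’s standard library, tells you whether a storage endpoint’s hostname resolves at all; the hostname you pass in is whatever your NAS or cloud volume is addressed by:

```python
import socket

def check_dns(hostname):
    """Try to resolve a hostname and report ('ok', addresses) or a failure.

    A failure here points at DNS rather than the storage service itself.
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
        # Collect the unique IP addresses the name resolves to
        addrs = sorted({info[4][0] for info in infos})
        return ("ok", addrs)
    except socket.gaierror as err:
        return ("dns-failure", str(err))
```

If the name resolves but access still fails, move on to the firewall and certificate checks; if it doesn’t, you’ve narrowed the fault to name resolution before touching the storage device at all.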
Reinitializing a storage device can often prove the quickest way of resolving problems involving data unavailability. Unfortunately, a device can sometimes become stuck in a state where it can’t be reinitialized. In such a situation, if Linux commands won’t execute, you can try running them with sudo or performing a hard reset on the device. Checking the logs on any devices and systems on the network path between your data and your users may provide additional clues about what’s gone wrong and how to address it.
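When you’re combing through logs from several devices, a small filter saves time. This is a hedged sketch, not a universal tool: the keyword pattern is my own starting point, and in practice you would feed it lines read from /var/log/syslog or whatever log export your device provides:

```python
import re

# Keywords that commonly flag trouble in storage and network logs
ERROR_PATTERN = re.compile(r"\b(error|fail(ed|ure)?|timeout|denied)\b", re.I)

def scan_log(lines):
    """Return (line_number, text) pairs for log lines that look like trouble."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if ERROR_PATTERN.search(line):
            hits.append((number, line.rstrip()))
    return hits
```

Run it over each hop on the path between user and data; the device whose log lights up first is usually where to focus.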
Troubleshooting access to data stored in databases is actually more straightforward than dealing with difficulties accessing unstructured legacy or cloud data storage solutions. With databases, one of the first things you should do is make sure that your database server is still up and running and that the database instance still exists and hasn’t been renamed or migrated somewhere. Check too that the tempdb database hasn’t become full or that some other database has reached its size limit. If the database has been behaving sluggishly of late, this may be an indication that the log files or main database tables are too large or have some other problem. If you have control over your database server, you should also check that it has enough memory allocated for its buffers. If your database is running in a virtual machine in Azure or AWS, you might need to allocate an additional logical processor to the machine.
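The reachability-and-size check can be scripted. The sketch below uses SQLite purely as a stand-in so it runs anywhere; on SQL Server you would instead query system views such as sys.databases (and tempdb’s size), and on MySQL the information_schema tables. Only the shape of the approach carries over:

```python
import sqlite3

def check_database(path=":memory:"):
    """Minimal reachability and size check, using SQLite as a stand-in."""
    report = {}
    try:
        conn = sqlite3.connect(path)
        conn.execute("SELECT 1")  # is the instance reachable at all?
        # Database size = page count x page size
        page_count = conn.execute("PRAGMA page_count").fetchone()[0]
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
        report["reachable"] = True
        report["size_bytes"] = page_count * page_size
        conn.close()
    except sqlite3.Error as err:
        report["reachable"] = False
        report["error"] = str(err)
    return report
```

A scheduled run of something like this gives you a baseline, so when a user reports trouble you can tell at a glance whether the database has become unreachable or has ballooned in size.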
If none of these approaches helps, try focusing your troubleshooting efforts on networking issues. For example, make sure that endpoints are configured properly and that network access ports are responding. And if the network seems to be up and running properly, investigate whether your problem might stem from an underlying user authentication issue. If your directory service goes down, whether it’s an on-premises domain controller or a cloud identity management service like Azure Active Directory, you can be sure that users will not only be unable to access their data but will also experience other kinds of issues.
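“Is the port responding?” is the same question `telnet host port` or `nc -z` answers; here is one way to ask it from Python. The host and port are whatever your storage or database endpoint uses; a refused or timed-out connection points at firewalls, routing, or a stopped service rather than the data itself:

```python
import socket

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection handles name resolution and the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_is_open("fileserver01", 445)` would test SMB reachability to a hypothetical file server, and `port_is_open("dbserver01", 1433)` a SQL Server endpoint.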
Finally, there are some general troubleshooting tips that can help you narrow down your problem by eliminating what isn’t the cause. First, ask yourself this: What’s still working properly? Are authentication and identity management working? Is the Internet accessible to users? Has your cloud service provider reported any problems at their end? Has your systems management platform flagged anything unusual? Once you’ve taken these steps, ask a second question: What has changed recently in your environment? Did you introduce a new server? Create a new database instance? Reconfigure a router or subnet? Attach a new UPS? Hire someone new on your IT staff? Anytime a change is made, the potential for fallout is present. Take a few breaths and review the immediate past for clues.
A good resource for learning more about this subject is a new book from Greg Schulz, “Data Infrastructure Management: Insights and Strategies.” Greg is a well-known expert on storage technologies and the founder and senior analyst of StorageIO. I highly recommend Greg’s book for IT decision-makers and professionals who want to learn more about managing the wide range of different data infrastructure solutions available for today’s businesses.