Patching Exchange servers, whether you’re on Exchange 2013, 2016, or 2019, is critical to improve security and close off exploits. Unfortunately, the January 2022 Windows updates had a devastating effect on some environments: domain controllers entered constant reboot loops, and volumes formatted with the Resilient File System (ReFS) became inaccessible, leaving the Exchange databases they hosted apparently empty.
In any patch roll-out process, testing ensures all functionality works as expected. Testing isn’t only for migrations or upgrades; you need it for every update and patch. Yet many administrators trusted Microsoft’s testing to do the work for them. Recovering the affected servers is no easy task and left administrators with a major problem!
In this article, I delve into this Challenger-level catastrophe and how to fix it. Let’s get started with what exactly ReFS is!
What is ReFS?
Resilient File System (ReFS) is Microsoft’s newest file system, available on Windows Server 2012 and later. That’s right, many companies are still running Windows Server 2012! ReFS is designed to maximize data availability and scale efficiently, helping organizations manage large data sets and larger file sizes. It also provides better data integrity and reduces corruption at these larger data sizes. In short, ReFS expands the set of storage scenarios possible within an organization and can support R&D ventures.
Now that we know what ReFS is, let’s circle back for the rookies and cover what patches and updates actually are. Knowing this will help you spot the potential errors that can arise from them!
What are Patches and Updates?
Microsoft releases updates in two flavors: security updates (SUs), which patch vulnerabilities, and cumulative updates (CUs), which also fix other issues companies raise, such as calendars not forwarding (a current issue with Exchange 2013). The terms patch and update are used interchangeably.
All administrators must remember that vendors roll out each patch in a hurry to stop exploits; the software development and testing teams are always racing against the clock. Microsoft is a large company that hires the best and brightest to create software, including updates. This stops a lot of poorly conceived patches and inadequate testing, but sometimes it fails.
The software development workflow always compresses the time available to test a solution. The January 2022 patches may be a good example of this.
Now that we know about update roll-out challenges, let’s turn our attention to the January issue in more detail!
The ReFS Storage Issue
Customers running Exchange reported the ReFS issue on the Microsoft Tech Community forum. They found that ReFS volumes hosting Exchange databases appeared empty after the update; the storage volumes showed as containing no data at all. This caused major downtime, as end users couldn’t connect to Exchange because no database existed.
It’s difficult to determine whether the January 2022 Patch Tuesday updates actually broke something. I have a test lab with all the Windows Server and Exchange versions, which I use to test updates and advise clients on what action to take, if anything can be done in such circumstances.
Applying the monthly patches to Exchange 2013, Exchange 2016, and Exchange 2019 servers didn’t reproduce the issue. Yet several people reported it, so something unique to those servers must have caused it. It was likely a driver issue combined with the updates, or older servers such as Windows Server 2012.
I tried removing older Windows updates and then running the January updates, but this failed to reproduce the issue. On my other lab machines, which are 3 months behind on patches, I still couldn’t reproduce it either. My freshly formatted ReFS volumes remained intact after the patches.
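One way to make this kind of before-and-after check repeatable is to snapshot the files on a database volume before patching and diff the result after the reboot. Below is a minimal Python sketch of that idea; the function names are my own, and in practice you would point it at the drive letters hosting your Exchange databases:

```python
from pathlib import Path

def snapshot_volume(root: str) -> dict:
    """Record every file under a volume root with its size, so the
    pre-patch and post-patch states can be compared."""
    return {
        str(p.relative_to(root)): p.stat().st_size
        for p in Path(root).rglob("*")
        if p.is_file()
    }

def compare_snapshots(before: dict, after: dict) -> dict:
    """Return files that vanished or changed size after patching.
    An empty 'missing' list means the volume still holds its data."""
    missing = sorted(set(before) - set(after))
    changed = sorted(k for k in before if k in after and before[k] != after[k])
    return {"missing": missing, "changed": changed}
```

Run `snapshot_volume` against the volume before installing the update, run it again after the post-patch reboot, and compare; any missing database or log files show up immediately instead of waiting for users to report mount failures.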
Two things come to mind with this issue: hardware and firmware. As the problem was specific to ReFS, a particular combination of hardware and the firmware it relies on could well have been the cause. As we know, data stored on a drive is read and written through the device’s own firmware and driver stack.
Operating system updates sit on top of these layers, so it’s possible that particular hardware and firmware combinations are incompatible with an update. That’s why test systems should be identical to the production system: there are thousands of possible configurations of these factors.
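A quick way to verify a test system really matches production is to compare their driver and firmware inventories component by component. This is a minimal sketch; the inventory dictionaries here are illustrative, and on Windows you would populate them from something like `Get-WmiObject Win32_PnPSignedDriver`:

```python
def config_drift(prod: dict, test: dict) -> list:
    """List every component where the test system's driver or firmware
    version differs from production (None = absent on the test box)."""
    drift = []
    for component, prod_ver in sorted(prod.items()):
        test_ver = test.get(component)
        if test_ver != prod_ver:
            drift.append((component, prod_ver, test_ver))
    return drift
```

Any non-empty result means a patch validated on the test system may still behave differently in production.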
The issue clearly caused downtime and pain for several customers. As much as vendors stress that patches are tested, nobody can say with 100% certainty that they’ll work properly.
Backups are essential in this situation. Depending on the solution used, the database is easily restored. Some data loss may still occur depending on the backup system, which is why an accurate test system is vital before the roll-out.
If you need to roll back, you can also export Exchange data from clients and then re-import it back into the mailbox.
Testing, Testing, and More Testing!
I have never come across an issue where an update breaks Exchange, other than where a non-supported version of the .NET Framework was installed.
Any patch released needs to be tested, as there are scenarios where the patch breaks something. Sometimes everything works fine on a test system, but the production server still falls over. Most of these situations occur because of an out-of-date test system or poor documentation of the solution during testing!
With Exchange, I used to stay one version behind the latest CU released. Unfortunately, due to the 2021 cyberattacks on Exchange, I’ve revised this policy. I now tell customers to keep fully up-to-date.
What I’ve also been doing is updating one server in a database availability group (DAG) and monitoring it for two weeks, so that end users can provide feedback.
If you are behind a load balancer, you can take the least-patched servers out of rotation and leave the patched one in. This way, if you do encounter an error with the update, you still have one, two, or more known-good servers in your environment. Recovering one server is much better than recovering the entire production environment!
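The staged approach above can be sketched as a simple rollout schedule: a small pilot wave first, then the rest of the fleet after a soak period in which the pilot is monitored. This is an illustrative Python sketch, with the two-week soak matching the monitoring window described above (server names are hypothetical):

```python
from datetime import date, timedelta

def rollout_schedule(servers, start, soak_days=14, pilot_size=1):
    """Build a staged patch schedule: a pilot wave first, then the
    remaining servers once the soak period has passed cleanly."""
    waves = [servers[:pilot_size], servers[pilot_size:]]
    schedule = []
    when = start
    for wave in waves:
        if wave:
            schedule.append((when, wave))
            when += timedelta(days=soak_days)
    return schedule

# Example: patch EX01 first; EX02/EX03 follow two weeks later.
plan = rollout_schedule(["EX01", "EX02", "EX03"], date(2022, 2, 1))
```

If the pilot server misbehaves during the soak window, the second wave simply never happens, and only one server needs recovery.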
I do want to end this article by saying that I know many customers don’t have the hardware to spin up a test environment. After reading this article, you should understand how risky that is!
If you’re in this situation, you may be able to afford an Azure subscription and pay for a machine only when it’s in use. This at least gives you the opportunity to test patches, SUs, and CUs before rolling them out to production.
Also, look at patching one server and monitoring it before doing the rest. If there are any issues, you’ve reduced both your risk exposure and the workload to remediate the situation. Users will quickly tell you if something is wrong. You may want to enable extra logging during this period, but remember to stop it and clear the logs out once you’re done!
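During that monitoring window, you don’t have to rely solely on user complaints: scanning the server’s exported logs for failure keywords lets a post-patch error spike stand out against the pre-patch baseline. A minimal sketch, assuming you’ve already dumped the relevant event-log entries to text:

```python
def error_summary(log_lines, patterns=("error", "failed", "exception")):
    """Count log lines matching common failure keywords so a spike
    after patching can be compared against the pre-patch baseline."""
    counts = {p: 0 for p in patterns}
    for line in log_lines:
        lowered = line.lower()
        for p in patterns:
            if p in lowered:
                counts[p] += 1
    return counts
```

Run it once against logs captured before the patch and once a day afterwards; a sudden jump in any counter is your cue to investigate before patching the remaining servers.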
Remember, from an auditing perspective, you may need to roll out patches quickly to ensure your business can keep working with others. This can give you a good reason to ask for a test system if you don’t have one!
What are patches?
Patches are updates that deal with security vulnerabilities found in the wild. A vendor creates patches to resolve known issues in supported software. Some updates also come with fixes or improvements to existing features. In general, the terms are used interchangeably.
What is ReFS?
Resilient File System (ReFS) is Microsoft’s file system for Windows Server 2012 and later. It’s designed to improve data availability, integrity, and scalability for large data sets, and to improve throughput when working with larger files.
What caused the Exchange server update issue?
So far, it’s a mystery. I personally suspect a hardware and firmware compatibility issue. Yet, as there are thousands of possible configurations, it’s effectively impossible to prove on a test system.
Do I need a test system?
Yes, you need a test system. Some SMEs either don’t have the money to create one or don’t see the benefit. Hopefully, reading this article helps you understand the risks of not having one, including a production environment falling over.
Should I roll-out a patch all at once?
If your test system is a good reflection of the production environment, the odds are the risk is lower. That said, there is always some risk, and rolling out a patch in stages helps reduce the risk to your production system. Try updating one server at a time and watching each over the space of two weeks.
TechGenix’s Exchange 2010 Windows Azure Test System Article
Create an Azure test system for Exchange 2010 with this article here.
TechGenix’s Azure Lab Article
Get an Azure lab up and running fast with this article.
TechGenix’s Exchange 2016 Nimbus Storage Article
Read this article on Exchange 2016 and Nimbus storage.
TechGenix’s Exchange 2010 Upgrade Article
Discover how to upgrade Exchange 2010 in this article.
TechGenix’s Exchange Backup Article
Find one of the best ways of backing up an Exchange server here.