Solutions for Virtualizing Domain Controllers (Part 6)

If you would like to read the other parts in this article series please go to:

Introduction

Throughout this article series, I have been discussing the pros and cons of various methods for arranging domain controllers in a virtualized environment. My original intent had been to discuss domain controller placement in forests consisting of multiple domains. However, I want to put that topic on hold until Part 7, and use this article to talk about something that recently happened to me.

About two weeks ago, my network was severely damaged by a lightning strike. The recovery process that followed taught me an extremely valuable lesson about domain controller placement. Even though my story will probably make me look like an idiot, I have decided to go ahead and share it with you in the hopes that you can benefit from my experiences.

That being said, I need to start out by giving you a little bit of background information on my network configuration. Being that I work out of my home writing books and articles for a living, I turned the entire second floor of my home into a data center. Most of my computers are lab machines that I use when I write about various topics. For example, I just wrote an article for Microsoft on SharePoint 2010, so I deployed SharePoint 2010 on a lab machine so that I could verify the techniques that I wrote about.

Although most of my computers are lab machines, I have a production network as well. My production network contains a few domain controllers, a file server (for storing all of my articles, invoices, etc.), and a couple of Exchange Servers.

As you can imagine, all of the computer hardware that I use leads to some astronomical electric bills. Not only do the computers consume electricity, but the air conditioner has to work hard to overcome all of the heat that the computers give off. Because I live in the deep south, I sometimes even have to run the air conditioner in the winter.

About two years ago, I decided to see if I could reduce my electric bills by virtualizing the servers on my production network. The process of doing so went far more smoothly than I expected, and my virtual servers have proven to be stable and reliable ever since.

A couple of weeks ago, however, my house was struck by lightning. Even though I make full use of surge protectors and UPSs, the lightning strike did quite a bit of damage to my network. Some of the damage was immediately obvious. A couple of UPSs were blown, as was a 24-port gigabit switch. Some of the damage was a bit more subtle though.

After I repaired the most obvious damage, I decided to do a health check on each of my servers to see if there were any problems that initially went undetected. In doing so, I discovered that one of the drives on a storage array used for several production virtual machines was bad. I knew that I needed to fix the problem, but I really wasn’t overly concerned. The array was set up to use RAID 5, and the remaining disks were still functional.
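For readers wondering why a single bad drive in a RAID 5 array is not normally a cause for panic, the following is a minimal Python sketch of the XOR parity idea behind it. It is only an illustration: the block contents and the `xor_blocks` helper are made up for the example, and a real controller works on rotating parity stripes rather than three tiny byte strings.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread across three data disks (contents invented for illustration).
data_blocks = [b"mail", b"file", b"dcdb"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data_blocks)

# Simulate losing the second disk, then rebuild its block
# from the surviving data blocks plus the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_blocks[1])
```

Note that the rebuild is only as good as the surviving blocks. If the data on the remaining disks is itself damaged, XOR will faithfully reconstruct garbage.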

The next day, I removed the bad drive from the array and replaced it with a new one. I have had to replace drives in the array before, so I expected the process to go as smoothly as it has in the past. When I replaced the drive however, I received an error message telling me that the main volume on the array was corrupt and that synchronization with the new disk was impossible.

All of the virtual servers whose virtual hard drives were located on the array still seemed to be operating normally, so I decided that I needed to make a backup before things got any worse.

Since the host server was running Hyper-V, I decided to shut down the virtual machines and then export them to a NAS box. My plan was to delete the corrupt volume from the array, recreate it, and then import the virtual machines. As you have probably already guessed though, things didn’t go as smoothly as I planned.

The server in question contained four virtual servers. Two of the virtual servers were exported without incident, while the other two generated errors.

After some further diagnosis, I determined that one of the virtual servers that could not be exported could be saved. That virtual machine was acting as a file server, and as such had three virtual hard drives attached to it. Only one of the virtual hard drives was corrupt, and I had a backup of the drive, so I simply detached the corrupt virtual hard drive and then exported the virtual machine. When it came time to import this virtual machine, I imported it in the normal manner, manually recreated the volume, and restored a backup.

The fourth virtual machine wasn’t so simple. It was running Exchange Server and was also acting as a domain controller. Now I should tell you up front that Microsoft strongly discourages running Exchange on a domain controller, and has done so since the very beginning. As you will recall though, this server was once running on physical hardware. Being that I am operating a one man shop, I simply could not justify the cost of the hardware and software licenses required to run Exchange on a dedicated server. Once I had virtualized the server there was really nothing stopping me from moving Exchange to a dedicated virtual machine. After all, I have two other production Exchange Servers running within dedicated virtual machines. It was just one of those tasks that I had never gotten around to.

At any rate, the virtual server that was acting as a domain controller and as an Exchange Server was completely corrupt (although it had somehow continued to function). I tried using several different techniques to back the server up, but given the degree of corruption, backups were impossible. Since I did have a backup from the day before the lightning strike, I decided to take my chances and try using CHKDSK to repair the volume. Ultimately though, CHKDSK rendered the volume unbootable.

At that point, I knew that I was going to have to resort to deleting the virtual machine and restoring a backup. I really didn’t expect the restoration process to be a big deal, but my virtual domain controller placement ended up causing some major problems.

Don’t get me wrong… I did do a few things right. I had another functional domain controller that was running on another host server. That domain controller was also acting as a DNS server and as a global catalog server. I also had a perfectly good backup.

So what was the problem? Well, I use Microsoft’s System Center Data Protection Manager 2007 (DPM) to back up my network. I hadn’t enabled the System Recovery Tool (SRT), which facilitates bare metal restoration, because I was backing up virtual servers. Even that shouldn’t have been an issue though.

When a DPM protected server fails and needs to be restored without the use of SRT, there is a fairly simple and straightforward way of restoring the backup. You simply reset the server’s computer account in Active Directory, install Windows (using the same machine name as before), join the machine to the Active Directory, deploy the DPM agent to the server, and then begin the restore. The problem is that Windows does not allow you to reset the computer account for a domain controller.

Had I planned more carefully, even this wouldn’t have been a big deal. As I said, I had another domain controller up and running. I could have just deleted the computer account, deployed a fresh copy of Windows, and then promoted the machine to act as a domain controller.

The problem was that the domain controller was also acting as an Exchange Server. Exchange Server is designed to store almost all of its configuration information in the Active Directory. If an Exchange Server fails to the point that it has to be manually rebuilt, then you have to reset the computer account, install Windows, join it to the domain (using the same computer name), and then run Exchange Setup using a special command line switch that tells Setup to rebuild the server using information from the Active Directory.

So here was the dilemma… I couldn’t restore my backup, because doing so would have required me to reset the computer account in the Active Directory. I couldn’t reinstall Exchange (and salvage my configuration), because doing so would have required me to reset the computer account. I couldn’t just delete and recreate the computer account, because that would have deleted my Exchange Server configuration data, and would have prevented me from being able to restore my backup.
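To make the dilemma a little more concrete, here is a small Python sketch that encodes the three recovery options described above as simple rules. This is a toy model, not anything DPM, Exchange, or Active Directory actually exposes: the `Server` class, the option names, and the `blocked_reason` function are all hypothetical labels chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Server:
    name: str
    is_domain_controller: bool
    runs_exchange: bool


# Hypothetical labels for the three recovery paths discussed in the text.
OPTIONS = (
    "restore_dpm_backup",           # needs the computer account to be reset
    "exchange_recovery_setup",      # also needs the computer account to be reset
    "delete_and_recreate_account",  # fine for a plain DC, destructive for Exchange
)


def blocked_reason(option: str, server: Server) -> Optional[str]:
    """Return why a recovery option is unavailable, or None if it is usable."""
    if option == "exchange_recovery_setup" and not server.runs_exchange:
        return "not applicable: server does not run Exchange"
    needs_account_reset = option in ("restore_dpm_backup", "exchange_recovery_setup")
    if needs_account_reset and server.is_domain_controller:
        return "the computer account of a domain controller cannot be reset"
    if option == "delete_and_recreate_account" and server.runs_exchange:
        return "deleting the account discards the Exchange configuration in Active Directory"
    return None


def usable_options(server: Server) -> list:
    """List the recovery options that are not blocked for this server."""
    return [o for o in OPTIONS if blocked_reason(o, server) is None]


# Same hypothetical server name, three different role combinations.
for srv in (
    Server("MAIL1", is_domain_controller=True, runs_exchange=True),
    Server("MAIL1", is_domain_controller=False, runs_exchange=True),
    Server("MAIL1", is_domain_controller=True, runs_exchange=False),
):
    print(srv, "->", usable_options(srv))
```

In this model, a server that holds only one of the two roles always has at least one way out. It is the combination of both roles on one machine that leaves the list empty.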

As you can see, I was in a no-win situation that could have easily been avoided if I had planned the placement of my virtual domain controllers more carefully. In the end, I was able to fix the problem, but my solution wasn’t pretty. Before I tell you what I did, I need to tell you that my fix is not supported by Microsoft. In fact, they specifically discourage my solution, but I didn’t have any choice but to go rogue.

My Exchange Server wasn’t configured as a mailbox server, so I didn’t have to worry about losing any data. That being the case, I shut down all of my Exchange Servers and then used ADSIEdit (found in the Windows Support Tools) to manually remove any references to the Exchange Server from the Active Directory (I had backups of my other domain controllers in case anything went horribly wrong). After doing so, I used the Active Directory Users and Computers console to delete the computer account.

After giving my remaining domain controllers ample time to replicate, I created a new virtual server that used the same name as my failed server, and joined it to the Active Directory. After doing so, I installed Exchange on the new VM and applied all of the patches that had previously been running on my failed server. Once the new Exchange Server was in place, I used my network documentation to manually reconfigure it in an identical manner to the failed server. Finally, I created an additional virtual server, and configured it to act as a domain controller. That way, Exchange would no longer be running on a domain controller.

Conclusion

I am happy to report that my repairs worked the way that I hoped that they would, and my network now functions just as well as it did prior to the lightning strike. Even though this story has a happy ending, it also comes with a moral (which I learned the hard way). The moral of the story is that you should never run applications on your domain controllers. Running applications on domain controllers can have completely unexpected consequences if you should ever have to recover the server from a failure. Therefore, as you plan the domain controller placement within your virtual datacenter, I would strongly recommend ensuring that your host servers have sufficient resources to dedicate an entire VM to each domain controller.
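If you keep even a rough inventory of your virtual machines, a check along these lines can catch the mistake before a failure does. The Python sketch below is only an illustration under a few assumptions: the inventory format, the VM and host names, and the role labels are all invented, and treating DNS and global catalog as acceptable companions to a domain controller is my own simplification for the example.

```python
# Hypothetical inventory: VM name -> (host server, set of roles).
inventory = {
    "vm-dc1": ("host-a", {"domain_controller", "dns", "global_catalog"}),
    "vm-mail1": ("host-b", {"domain_controller", "exchange"}),
    "vm-files": ("host-b", {"file_server"}),
}

# Roles treated here as part of the directory service rather than
# as separate applications (an assumption made for this sketch).
DIRECTORY_ROLES = {"domain_controller", "dns", "global_catalog"}


def audit_placement(inventory):
    """Return a list of human-readable findings about domain controller placement."""
    findings = []
    dc_hosts = set()
    for vm, (host, roles) in sorted(inventory.items()):
        if "domain_controller" not in roles:
            continue
        dc_hosts.add(host)
        extra = sorted(roles - DIRECTORY_ROLES)
        if extra:
            findings.append(f"{vm}: domain controller also runs {', '.join(extra)}")
    # A second domain controller on a different host is what keeps the
    # directory available while a failed VM is being rebuilt.
    if len(dc_hosts) < 2:
        findings.append("domain controllers are not spread across at least two hosts")
    return findings


for finding in audit_placement(inventory):
    print(finding)
```

Run against the sample inventory, the check flags the VM that combines Exchange with the domain controller role, which is exactly the configuration that caused my trouble.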

In Part 7, I want to go back to my original plan and wrap up the series by talking about virtual domain controller placement in more complex networks.
