Active Directory Insights (Part 8) – Virtual domain controllers and disaster recovery

If you would like to read the other parts in this article series please go to:

Until Windows Server 2012 was released, Microsoft always recommended that organizations have at least one domain controller deployed on physical hardware. In other words, while the option of virtualizing domain controllers (running them as virtual machines on Microsoft Hyper-V or VMware ESX hosts) has become increasingly attractive for many organizations, the “official” guidance from Microsoft has been to keep at least one physical domain controller in your environment. For example, Microsoft Knowledge Base article KB888794 says:

“Note: Always have at least one DC that is on physical hardware so that failover clusters and other infrastructure can start.”

This recommendation existed mainly because virtualizing domain controllers running Windows Server 2003 or Windows Server 2008 carried certain risks, in particular the risk of USN rollback. USN rollback happens when the normal advancement of update sequence numbers (USNs), which are used to keep track of replication of data between domain controllers, is unintentionally circumvented. When this happens, the domain controller starts issuing USNs that are lower than ones its replication partners have already seen from it, so the partners ignore the new changes and Active Directory replication errors can result. One way to prevent USN rollback with virtual domain controllers is to avoid taking or using snapshots of such virtual machines. This is one of the main reasons Microsoft has traditionally advised against using Hyper-V snapshots in production environments. Another way to prevent USN rollback is to avoid exporting running virtual domain controllers, for example in an attempt to clone them to create new virtual domain controllers to deploy in your environment.
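To make the mechanism concrete, here is a minimal Python sketch of USN rollback. It is a toy model and not how Active Directory is actually implemented: the class names, the single high-watermark per partner, and the sample changes are all assumptions made for illustration. It shows a domain controller being reverted to a snapshot, reusing a USN, and its replication partner silently skipping the new change.

```python
from dataclasses import dataclass, field


@dataclass
class DomainController:
    """Toy model of a DC: a USN counter plus the changes it has originated."""
    name: str
    usn: int = 0
    changes: dict = field(default_factory=dict)  # usn -> description

    def write(self, description: str) -> int:
        # Every originating write consumes the next USN.
        self.usn += 1
        self.changes[self.usn] = description
        return self.usn

    def snapshot(self):
        # Capture the state the way a VM snapshot would.
        return self.usn, dict(self.changes)

    def restore(self, snap) -> None:
        # Reverting the VM silently rewinds the USN counter.
        self.usn, self.changes = snap[0], dict(snap[1])


@dataclass
class ReplicationPartner:
    """Tracks the highest USN already received from one source DC."""
    high_watermark: int = 0
    received: list = field(default_factory=list)

    def pull_from(self, source: DomainController) -> None:
        # Ask only for changes above the watermark.
        for usn in sorted(source.changes):
            if usn > self.high_watermark:
                self.received.append(source.changes[usn])
        self.high_watermark = max(self.high_watermark, source.usn)


dc = DomainController("VDC-01")
partner = ReplicationPartner()

dc.write("create user alice")
snap = dc.snapshot()              # snapshot taken at USN 1
dc.write("create user bob")
dc.write("create user carol")
partner.pull_from(dc)             # partner is now up to date through USN 3

dc.restore(snap)                  # roll the VM back: USN drops to 1
dc.write("create user dave")      # reuses USN 2
partner.pull_from(dc)             # partner skips it: 2 is not above 3

missing_on_partner = "create user dave" not in partner.received
missing_on_dc = "create user bob" not in dc.changes.values()
print(missing_on_partner, missing_on_dc)
```

In this model the two replicas end up permanently disagreeing: the partner never learns about the change made after the revert, and the reverted domain controller never gets back the changes it lost, because each side believes it is already up to date.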

Windows Server 2012 was intended to resolve this problem with the inclusion of new safeguards that would help prevent USN rollback on virtual domain controllers and provide the ability to safely clone virtual domain controllers. These enhancements for Windows Server 2012 are described fully in a series of TechNet articles titled Active Directory Domain Services (AD DS) Virtualization, and if you’re new to the idea of virtualizing domain controllers I recommend that you start by reading these articles as background information.
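As background for those articles, the safeguard relies on the hypervisor exposing an identifier (the VM-Generation ID) that changes whenever the virtual machine is reverted to a snapshot or copied. A Windows Server 2012 domain controller compares that value with the one it saved earlier and, if they differ, takes on a new invocation ID so that its replication partners treat its subsequent changes as coming from a new source. The Python sketch below illustrates only that comparison step. The class, method, and ID values are hypothetical, and the other actions a real domain controller takes at this point are not modeled.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class VirtualDC:
    """Hypothetical model of the generation-ID check; not a real AD DS API."""
    saved_generation_id: str
    invocation_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def check_generation_id(self, hypervisor_generation_id: str) -> bool:
        """Return True if a rollback or copy was detected and handled."""
        if hypervisor_generation_id == self.saved_generation_id:
            return False  # the VM state has not been reverted or copied
        # The VM was reverted or copied: adopt a fresh replication
        # identity and remember the new generation ID.
        self.invocation_id = str(uuid.uuid4())
        self.saved_generation_id = hypervisor_generation_id
        return True


dc = VirtualDC(saved_generation_id="gen-A")
original_invocation_id = dc.invocation_id

normal_boot = dc.check_generation_id("gen-A")    # same VM state
after_revert = dc.check_generation_id("gen-B")   # snapshot applied
print(normal_boot, after_revert, dc.invocation_id != original_invocation_id)
```

Because partners track what they have received per invocation ID, a fresh invocation ID means the reused USNs no longer collide with ones the partners have already seen.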

What the above TechNet articles fail to clarify, however, is whether Microsoft still recommends having at least one physical domain controller in your infrastructure even if all of your domain controllers are running Windows Server 2012 or Windows Server 2012 R2. The “official” guidance on this matter still seems to be that contained in the TechNet article titled Running Domain Controllers in Hyper-V, which was last updated in April 2008 and which specifically applies to Windows Server 2008 and Windows Server 2008 R2. The article adds that “this topic will be updated in order to make the guidance applicable to Windows Server 2012” but unfortunately we don’t know when this will happen. In any case, the section “Avoid creating single points of failure” in this article specifically says you should:

“Maintain physical domain controllers in each of your domains. This mitigates the risk of a virtualization platform malfunction that affects all host systems that use that platform.”

So until Microsoft updates this article for Windows Server 2012, this remains their “official” guidance on whether you should keep some of your domain controllers physical instead of virtualizing all of them. But even if virtualizing all of your Windows Server 2012 domain controllers is perfectly safe from a technical point of view, is it safe from a business perspective? And if so, are there any special considerations you should watch for or best practices you should follow? We’ll examine the issues involved in this article and the next, in particular whether there might be any disaster recovery or business continuity implications of having all of your organization’s domain controllers virtualized.

Understanding what single point of failure can mean

As IT admins we generally realize the crucial importance of having redundancy and avoiding a single point of failure in our environments. But we may not always be aware of exactly what a single point of failure might be in certain scenarios. Let’s start by considering some obvious examples.

1. Contoso Ltd. has a single domain network with only one domain controller.

Does the above scenario indicate that there is a single point of failure? Of course! Contoso has only one domain controller so if that fails there will be major problems for the users of the corporate network. What’s the solution here? Have at least two domain controllers per domain to create some redundancy.

2. Contoso Ltd. has a single domain with two virtual domain controllers both running on the same physical Hyper-V host machine.

Is there a single point of failure here? Yes, because if the Hyper-V host box dies then neither of the two virtualized domain controllers will be present and once again the users may have problems accessing resources on the network. What’s the solution? If you’re going to have only virtual domain controllers in your environment then make sure you have at least two of them and that they are running on different virtualization hosts. For example you could have the virtual machine VDC-01 running on physical server HOST-01 and virtual machine VDC-02 running on HOST-02.

The next scenario is trickier:

3. Contoso Ltd. has a single domain with two virtual domain controllers each running on separate physical host machines that are both running Windows Server 2012 R2 Hyper-V. Both host machines are Dell PowerEdge R320 Rack Servers with identical hardware and purchased at the same time.

There are two problems with the above scenario from the perspective of single point of failure. First, both physical host systems have identical hardware, i.e. the same processor, storage and network hardware. This can point to several potential problems. For example, since both host systems were purchased at the same time from Dell, the hard disk drives (or solid state drives) for these machines are likely from the same batch. If a manufacturing problem with this batch increased the likelihood of sudden disk failure, you might expect an increased risk of the disks in both of your host systems failing nearly simultaneously (although that risk is probably very small).

More importantly, however, since both of your host systems have identical storage hardware they would almost certainly have identical storage drivers installed on them. Now imagine for a moment that the vendor (Dell in this case, but in reality the same risk may be present with any system vendor) shipped those drivers with a bug in them, and that under certain conditions the bug can trigger corruption of the data stored on the disks the drivers are associated with. What could happen in that case is that the storage on both of your virtualization hosts could become corrupted at the same time. And if this happens then the virtual machines (the virtual domain controllers) running on those hosts would likely crash, leaving your infrastructure without any working domain controllers. I’ve actually heard of this happening to one company that lost almost all of its domain controllers and almost had to rebuild its domain from scratch. Fortunately they had a few domain controllers that were running on different system hardware and the storage on those domain controllers didn’t get corrupted, so they were able to save the domain by getting the hardware vendor to provide them with an updated driver that fixed the storage problem, after which they rebuilt all of the failed domain controllers from scratch. In this particular case the domain controllers involved were all physical ones and not virtual, but the same principle applies to both physical and virtual domain controllers. That principle is this:

Ensure you have hardware diversity across your domain controllers. 

So if we changed scenario 3 to read as follows:

3A. Contoso Ltd. has a single domain with two virtual domain controllers each running on separate physical host machines that are both running Windows Server 2012 R2 Hyper-V. One of the host machines is a Dell PowerEdge R320 Rack Server while the other is an HP ProLiant DL580 Gen9 Rack Server.

Would it prevent such problems from happening? Does scenario 3A adhere to the diversity requirement needed to avoid having a single point of failure? Not quite, for while we’ve now ensured hardware diversity for our virtual domain controllers, we haven’t ensured software diversity, since both of our domain controllers have been virtualized within the Microsoft Hyper-V virtualization environment. This might seem like overkill to some readers, and Microsoft might object that we’re being paranoid in saying this, but to achieve true diversity (that is, diversity of both hardware and software) you could change the scenario to something like this:

3B. Contoso Ltd. has a single domain with two virtual domain controllers each running on separate physical host machines. One of the host machines is a Dell PowerEdge R320 Rack Server running Windows Server 2012 R2 Hyper-V. The other host machine is an HP ProLiant DL580 Gen9 Rack Server running VMware ESXi 6.

The reasoning behind the above choice is to avoid a situation where Microsoft releases a patch for Windows Server 2012 R2 that causes a problem for the Hyper-V server role. If this should happen (and there have been some instances in the past where Microsoft has released problematic patches that have caused major problems for some customers) and all of your virtualization hosts are running the same version of Hyper-V, then all of your virtualized domain controllers could be put out of commission, leaving you with major problems on your hands. Of course let’s hope that never happens, but what the above scenarios really pose is the following question, which your IT management should ponder carefully:

How much risk are you willing to tolerate that your Active Directory infrastructure might fail?
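One way to make that question concrete is to list which layers all of your domain controllers have in common. The Python sketch below encodes scenarios 2, 3, 3A and 3B from this article and reports every layer (host, hardware model, hypervisor) that is shared by all domain controllers. The class and function names are invented for illustration, the hardware model used for scenario 2 is an assumption since the scenario doesn't specify one, and a real assessment would also cover storage, networking, power and site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where one virtual domain controller runs (illustrative fields only)."""
    name: str
    host: str
    hardware: str
    hypervisor: str


def shared_layers(dcs):
    """Return the layers on which every DC depends on the same thing."""
    if len(dcs) < 2:
        return ["domain controller"]
    shared = []
    for layer in ("host", "hardware", "hypervisor"):
        # A layer is a single point of failure if all DCs share one value.
        if len({getattr(dc, layer) for dc in dcs}) == 1:
            shared.append(layer)
    return shared


DELL = "Dell PowerEdge R320"
HP = "HP ProLiant DL580 Gen9"
HYPERV = "Windows Server 2012 R2 Hyper-V"
ESXI = "VMware ESXi 6"

scenarios = {
    # Scenario 2 names no hardware model; DELL is assumed here.
    "2": [Placement("VDC-01", "HOST-01", DELL, HYPERV),
          Placement("VDC-02", "HOST-01", DELL, HYPERV)],
    "3": [Placement("VDC-01", "HOST-01", DELL, HYPERV),
          Placement("VDC-02", "HOST-02", DELL, HYPERV)],
    "3A": [Placement("VDC-01", "HOST-01", DELL, HYPERV),
           Placement("VDC-02", "HOST-02", HP, HYPERV)],
    "3B": [Placement("VDC-01", "HOST-01", DELL, HYPERV),
           Placement("VDC-02", "HOST-02", HP, ESXI)],
}

results = {label: shared_layers(dcs) for label, dcs in scenarios.items()}
for label, layers in results.items():
    print(f"Scenario {label}: shared layers = {layers or 'none'}")
```

Only scenario 3B comes back with no shared layers, which matches the reasoning above: each step from 2 to 3B removes one common dependency, and each step also adds something more to manage.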

Since greater hardware/software diversity also means greater management overhead, we simply have the age-old tradeoff to consider between manageability and reliability/security. And it’s up to you how you decide to balance the two sides of this equation. We’ll continue our discussion of virtual domain controllers in the next article of this series.

Still got questions about Active Directory?

If you have any questions about domain controller hardware planning, the best place to ask them is the Active Directory Domain Services forum on TechNet. If you don’t get the help you need there, you can try sending your question to [email protected] so we can publish it in the Ask Our Readers section of our newsletter and see whether any of the almost 100,000 IT pro subscribers of our newsletter have any suggestions concerning your problem.
