Many larger organizations still rely heavily upon Microsoft’s Active Directory services platform. Active Directory functions as the central lookup and security access control watchdog for your Windows-based network. Without a reliable, functioning Active Directory service in place, users will have trouble logging on to their machines and accessing shared network resources like file repositories, printers, and other services. A key to how Active Directory works in a larger organization is its replication feature, and it’s not uncommon to experience Active Directory replication delays.
While Active Directory manifests itself to users and devices as a centralized service, in reality its directory of objects is distributed across multiple systems called domain controllers. Each domain controller needs to maintain an up-to-date and accurate catalog of the directory objects needed by users and devices that might access it. So when AD replication fails for some reason or doesn’t perform as expected, Active Directory replication delays start to happen and impact users trying to get their work done.
From my experience and that of my colleagues who work as consultants for different organizations, one of the biggest challenges is to ensure that replication takes place reliably in large hub-and-spoke Active Directory environments. Organizations like these typically have a datacenter at their corporate headquarters that acts as the hub or center of their Active Directory infrastructure. This headquarters is then connected by various slow WAN links to remote branch offices, which act as the spokes in the hub-and-spoke wheel of how AD is deployed across the company.
This picture of a wheel with a single hub and spokes going out to the rim basically outlines the larger network topology that the organization uses to unify the LANs at the head office and all branch offices into a single larger network. And while an Active Directory forest can be logically divided in different ways into domains and trees of domains, most organizations take the simplest option of deploying a single domain across most of their sites. Yet even when you try to keep your AD deployment as simple as possible, ensuring reliable replication across all domain controllers can still be a challenge. Things can get even more challenging when an organization has more than one hub in its AD deployment.
Troubleshooting Active Directory replication delays
This challenge was brought home to me recently by a colleague who shared his experience helping an organization resolve some intermittent issues that they suspected might have something to do with AD replication. I’ve changed some of the details of this scenario to anonymize the organization involved, but the basic problem is this. A company we’ll call Contoso has three hub sites and more than 500 remote sites all tied together using Active Directory in Windows Server 2012 R2. Suspecting that AD replication errors might be happening, an administrator began running the repadmin /replsummary command periodically to get some quick summaries of the replication health of their AD infrastructure. This command gives you useful information, like the total number of replications that have been attempted recently, how many of these attempts have failed, and the largest replication deltas. What she found was that on some occasions the last replication times were around 15 minutes, while at other times they might be an hour or even several hours. Since no replication errors were displayed, and considering the large distributed nature of the AD infrastructure, the longer replication times were not too great of a concern.
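The largest-delta column is the figure worth tracking from run to run. As a minimal sketch, assuming the delta strings take the forms commonly seen in repadmin /replsummary output (such as 14m:32s, 01h:02m:10s, or 03d.05h:12m:01s; the exact format may differ on your systems), a small Python helper can convert them to seconds so that unusually long deltas stand out. The function names and the one-day threshold are hypothetical and not part of any Microsoft tooling.

```python
import re

# Assumed delta format: optional "<days>d.", optional "<hours>h:",
# optional "<minutes>m:", then "<seconds>s". Verify against your own output.
_DELTA_RE = re.compile(r"(?:(\d+)d\.)?(?:(\d+)h:)?(?:(\d+)m:)?(\d+)s")


def delta_to_seconds(delta: str) -> int:
    """Convert a largest-delta string such as '01h:02m:10s' to seconds."""
    # Short deltas are sometimes printed with a leading colon (":32s").
    match = _DELTA_RE.fullmatch(delta.strip().lstrip(":"))
    if match is None:
        raise ValueError(f"unrecognized delta format: {delta!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def is_stale(delta: str, threshold_seconds: int = 24 * 60 * 60) -> bool:
    """Return True when the delta exceeds the (hypothetical) threshold."""
    return delta_to_seconds(delta) > threshold_seconds
```

Logging these values on a schedule makes it easier to spot blocks of domain controllers whose deltas drift from minutes into days even though no errors are reported.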
After running repadmin /replsummary for several weeks, the admin noticed something strange though: A few blocks of their domain controllers were experiencing Active Directory replication delays of several days or more, again without any reported replication errors. She then talked with the head of her organization’s network infrastructure team and discovered that they were in the middle of a project to implement network quality of service (QoS) across the organization’s network infrastructure, and as part of this initiative they had assigned a lower priority to DC replication traffic compared to some other types of network traffic. The admin suspected that this QoS change had increased the WAN latency for DC replication traffic, so she asked the networking team to increase the priority for replication traffic. This was a reasonable step to take, but unfortunately it didn’t resolve the problem.
Trying another approach
After talking with several colleagues who had expertise with Active Directory, the admin then decided to check whether replication attempts might be queuing on the domain controllers located in the hub sites. To determine this, she used the repadmin /queue command to see whether the domain controllers in the hub sites were being overloaded with replication operations. The result from running this command showed that the queue length was indeed highly variable and ranged from 50 to more than 500 objects at different times, so she tried forcing a replication between the hub site and one of the remote sites to see what would happen, and found once again that replication eventually completed with no errors.
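To track the queue length over time rather than eyeballing it, the captured command output can be reduced to a single number. This is only a sketch: it assumes the output of repadmin /queue contains a summary line of the form "Queue contains N items.", which you should confirm on your own domain controllers, and the function name is hypothetical.

```python
import re
from typing import Optional

# Assumed summary line, e.g. "Queue contains 137 items."
_QUEUE_RE = re.compile(r"Queue contains (\d+) items?", re.IGNORECASE)


def queue_length(repadmin_output: str) -> Optional[int]:
    """Return the queued-operation count, or None if no summary line is found."""
    match = _QUEUE_RE.search(repadmin_output)
    return int(match.group(1)) if match else None
```

Saving the output of the command to a text file on a schedule and passing each file's contents through this function gives a simple time series of queue depth per hub DC.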
The next thing the admin tried was monitoring some performance counters on the domain controllers in the hub sites. She used Performance Monitor, the built-in tool in Windows Server, for this purpose and collected the standard counters for processor, network, disk, and memory for a period of time. Analysis of the resulting logged data, however, showed that the hub DCs were clearly not experiencing any bottlenecks in regard to any of these resources. So whatever was causing replication attempts to become queued up on the hub DCs, it was clearly not that these DCs were overloaded.
Resolution of the problem
Feeling a bit up against the wall, the admin finally reached out to another AD admin at a different large organization. It turned out her colleague had also experienced similar problems and had researched it thoroughly with the help of someone from Microsoft Consulting Services. What he had discovered was that this kind of replication scenario turned out to be quite common with large distributed hub-and-spoke AD deployments and was rooted in the way Active Directory replication occurs under the hood. This is because inbound AD replication is serial in nature, which means only one incoming DC/NC combination can be active at any time. And because replication is serial, your DCs will spend most of their time just waiting for replication attempts to complete, which is why monitoring these DCs will show no significant load on them.
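A bit of back-of-the-envelope arithmetic shows why serial processing matters at this scale. The sketch below is a deliberately simplified model, not a description of how the replication engine schedules work: it assumes every partner/naming-context pair is pulled once per cycle, one at a time, and that each pull takes the same wall-clock time. All of the numbers in the usage note are illustrative assumptions.

```python
def serial_cycle_seconds(partners: int, naming_contexts: int,
                         seconds_per_sync: float) -> float:
    """Wall-clock time for one DC to pull every partner/NC pair once, serially."""
    return partners * naming_contexts * seconds_per_sync


def busy_fraction(cpu_seconds_per_sync: float, seconds_per_sync: float) -> float:
    """Share of each sync the DC spends working rather than waiting on the WAN."""
    return cpu_seconds_per_sync / seconds_per_sync
```

For example, a hub DC with 170 inbound partners, 5 naming contexts, and 20 seconds per sync over a slow link would need serial_cycle_seconds(170, 5, 20) = 17000 seconds, or roughly 4.7 hours, to get through one full cycle. If only 1 of those 20 seconds is actual work, busy_fraction(1, 20) is 0.05, which is consistent with performance counters showing no load while the queue keeps growing.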
This also means, of course, that if the WAN links between your hub and spoke sites suffer any serious latency, the whole replication process can start experiencing significant delays. Added to this is the problem that different AD partitions have different replication priorities, with the schema having the highest priority and any DNS application partitions having the lowest priority. Because of this, the DNS replication performed by Active Directory in such environments can sometimes become so delayed that it fails to occur, which of course can lead to various problems such as TLS authentication failing for web applications.
Resolving the problem, then, turned out to be fairly straightforward. First, the admin increased the number of AD bridgehead servers. When you link two sites like a hub site and a spoke site for AD replication to occur between them, AD uses specially designated DCs called bridgehead servers to collect and manage replication changes efficiently. In other words, bridgehead servers function as the contact point for the exchange of directory information between sites in an AD deployment. These bridgehead servers can either be chosen automatically by an Active Directory feature called the Knowledge Consistency Checker (KCC) or they can be manually designated by the administrator. What the Contoso admin discovered was that her AD infrastructure didn’t have enough bridgeheads because of the very high ratio of remote sites to hub sites. After she doubled the number of bridgehead servers, replication was found to occur with somewhat less delay.
The second thing the Contoso admin did was to increase the intersite replication interval until the replication queues for all bridgeheads had shrunk to the point that they reached zero at least once each day. The intersite replication interval is an important tuning parameter for AD replication that specifies how often a domain controller that is acting as a bridgehead server in a site requests changes from its source replication partner in a different site. To configure the intersite replication frequency for AD replication, see this TechNet page.
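The "reaches zero at least once each day" criterion is easy to check mechanically once queue lengths are being logged. As a minimal sketch, assuming you have collected (timestamp, queue length) samples for a bridgehead by whatever means you prefer, the hypothetical helper below lists the calendar days on which the queue never fully drained.

```python
from datetime import date, datetime
from typing import Dict, Iterable, List, Tuple


def days_without_drain(samples: Iterable[Tuple[datetime, int]]) -> List[date]:
    """Return the calendar days on which the queue never dropped to zero."""
    minimum_by_day: Dict[date, int] = {}
    for taken_at, length in samples:
        day = taken_at.date()
        # Track the smallest queue length observed on each day.
        minimum_by_day[day] = min(length, minimum_by_day.get(day, length))
    return sorted(day for day, low in minimum_by_day.items() if low > 0)
```

An empty result for every bridgehead suggests the interval is long enough for the queues to drain. Note that the answer is only as good as the sampling: a queue that briefly hits zero between samples will be reported as never draining.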