Deploying Exchange 2007 Multi-site CCR Clusters – Do’s and Don’ts (part 2)

If you would like to be notified of when Henrik Walther releases the next part in this article series please sign up to our MSExchange.org Real-Time Article Update newsletter.

 

If you would like to read the first part of this article series please go to Deploying Exchange 2007 Multi-site CCR Clusters – Do’s and Don’ts (part 1).

 

Introduction

 

In part 1 of this four part article series, we took a look at what operating system to use for your multi-site CCR clusters. We looked at stretched and non-stretched subnets as well as the network card configurations.

 

In this part 2, we will continue where we left off in part 1. Particularly, we will look at stretched Active Directory site strategies as well as recommended Network Latency and Heartbeat timeout values.

 

Stretched Active Directory site

 

Regardless of whether you use Windows Server 2003 or Windows Server 2008 as the underlying operating system for your Exchange 2007 servers, the CCR cluster nodes must always be located in the same Active Directory (AD) site. This means that although the nodes can be located on separate subnets, you must still stretch the AD site they belong to between the two datacenters. Some of you may have heard that Windows Server 2008 failover clusters support nodes located in different AD sites, and although this is true, Exchange 2007 does not support CCR cluster nodes placed in separate AD sites.

 

Because the CCR cluster nodes must belong to the same AD site, the AD site needs to be stretched between the datacenters. With regards to the Hub Transport server role, this means that messages sent or received by users that have a mailbox stored on the CMS located in the primary datacenter can theoretically be received from or sent to Hub Transport (HT) servers in the backup datacenter. The same is true for Exchange client requests such as Autodiscover, OWA, EAS, POP3, and IMAP requests/connections to Client Access Servers (CAS). In addition, LDAP/authentication requests to Global Catalog servers (GCs) can also go to servers in the backup datacenter. As you can imagine, this can result in a lot of traffic between the datacenters. This is especially true because HT servers, CAS servers, and Outlook clients use MAPI over RPC to communicate with Exchange 2007 Mailbox servers.

 

However, you can block connections/requests going to the servers in the backup datacenter. In regards to message submissions from the CMS to any HT servers in the AD site, you can use the Set-MailboxServer cmdlet with the SubmissionServerOverrideList parameter to specify which Hub Transport servers should be used. This way you can exclude the HT servers located in the backup datacenter even though they belong to the same AD site.
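As an illustration of this approach (the CMS and Hub Transport server names below are placeholders for your own servers), the override list can be set from the Exchange Management Shell:

```powershell
# Restrict message submission for the CMS to the HT servers in the
# primary datacenter (CMS01, HT01 and HT02 are placeholder names)
Set-MailboxServer -Identity CMS01 -SubmissionServerOverrideList HT01,HT02

# Verify the setting
Get-MailboxServer -Identity CMS01 | Format-List Name,SubmissionServerOverrideList
```

During a failover, you would run the first command again with the HT servers in the backup datacenter instead.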

 

If/when a disaster strikes in the primary datacenter resulting in a failover to the backup datacenter, just update the submission override list, so that it only includes the HT servers in the backup datacenter.

 

In order to block CAS server requests/connections from hitting servers in the backup datacenter, you can take advantage of load balancing mechanisms. If you have a large environment, chances are you have implemented either a hardware-based or WNLB-based solution (see this previous article of mine on how to load balance CAS servers). If this is the case, here is what to do:

 

 

  • Create an NLB array which includes all CAS servers located in the primary datacenter. Configure the internal URL for Autodiscover, OWA, OAB, EWS and UM to point to the FQDN (namespace.company.com) of the load balancing solution. All requests that hit the specified FQDN will then go to CAS servers in the primary datacenter.
  • Now, create another NLB array for CAS servers in the backup datacenter, use the same FQDN but use another IP address than the one you used for the first NLB array. In addition, make sure you only create a record in DNS for the first NLB array otherwise requests will be load balanced between the two NLB arrays.
  • If/when a disaster strikes in the primary datacenter resulting in a failover to the backup datacenter, just update the NLB record in DNS to point to the IP address of the NLB array created for the CAS servers in the backup datacenter. Then wait for the change to propagate to all DNS servers in your organization.
  • In regards to GCs, you can configure Outlook clients to use GCs located in the primary datacenter. The steps necessary to implement this behavior are explained in this KB article.
  • Although the above will block requests/connections from hitting the servers in the backup datacenter, it adds complexity to your environment. But fear not, there is also another viable method which I will describe in the next section.
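To give an idea of what the internal URL configuration in the first step could look like (namespace.company.com is the load-balanced FQDN from the example above; the server name CAS01 is a placeholder), the following Exchange Management Shell commands point the internal URLs at the NLB namespace:

```powershell
# Point the internal URLs at the load-balanced namespace
# (CAS01 is a placeholder; namespace.company.com is the NLB array FQDN)
Set-ClientAccessServer -Identity CAS01 -AutoDiscoverServiceInternalUri "https://namespace.company.com/Autodiscover/Autodiscover.xml"
Set-OwaVirtualDirectory -Identity "CAS01\owa (Default Web Site)" -InternalUrl "https://namespace.company.com/owa"
Set-OabVirtualDirectory -Identity "CAS01\OAB (Default Web Site)" -InternalUrl "https://namespace.company.com/OAB"
Set-WebServicesVirtualDirectory -Identity "CAS01\EWS (Default Web Site)" -InternalUrl "https://namespace.company.com/EWS/Exchange.asmx"
```

Repeat for each CAS server in the array; because only the first NLB array has a DNS record, all of these URLs resolve to the primary datacenter until you change the record.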

 

Using two Active Directory Sites in the Backup Datacenter

 

In order to eliminate the chance of having servers and clients in the primary datacenter communicate with Exchange 2007 servers and GCs in the backup datacenter, you can create an additional AD site in the backup datacenter and then move all servers except the Windows Failover Cluster on which the passive Mailbox role is installed to this AD site as depicted in Figure 1.

 


Figure 1: Two AD sites in the Backup Datacenter

 

When disaster strikes in the primary datacenter, resulting in a failover to the backup datacenter, the next step is to move the servers from the second AD site (AD site 2) into the stretched AD site (AD site 1), either by giving each server an IP address on the subnet of the stretched AD site or by changing the AD site definitions in the Active Directory Sites and Services MMC snap-in. Some of you may think it would be easier to simply move the CMS to the second AD site, but this would make it impossible for an HT server to re-submit messages to the CMS, which would result in data loss during a failover and is not a supported method.

 

Network Latency and Heartbeat timeout values

 

When deploying a multi-site CCR cluster, you should always keep the network latency between the datacenters below 500 milliseconds (ms). If you are deploying the solution in a LORG (large organization) with full utilization of storage groups/databases and a lot of mailbox activity, it is recommended to keep the network latency under 50 milliseconds (ms). Otherwise, there is a chance of experiencing issues such as large copy queues. With that said, you can adjust the aggressiveness of the heartbeat timeouts, which helps avoid unnecessary failovers during temporary network problems. By default, the tolerance settings for missed cluster heartbeats are configured to 5 missed heartbeats, both for nodes located on the same subnet and for nodes on different subnets (Figure 2). When dealing with multi-site clusters, it is recommended that you change this setting to 10 missed heartbeats (approximately 12 seconds).

 


Figure 2: Default value for Subnet Thresholds

 

To change the CrossSubnetThreshold property to ten missed heartbeats instead of the default of five, use the following command:

 

cluster ClusterName /prop CrossSubnetThreshold=10

 

You can verify the new heartbeat threshold values by entering the following command:

 

Cluster.exe /cluster:<ClusterName> /prop

 


Figure 3: Subnet Thresholds changed to 10 missed heartbeats
Note:
Later on in this article series, we will pause the passive cluster node, which means that a failover will not happen automatically even though you have two votes (one cluster node plus the File Share Witness (FSW)) available. If you do not use the pause cluster method, the above heartbeat settings are strongly recommended, unless you want a failover to occur out of the blue.

 

DNS Time to Live values

 

When the CMS is moved (via a planned handover or a failover) from a cluster node on one subnet to a cluster node on another, and the IP address and Network Name resource come online again, the failover cluster starts a timer. After the Network Name resource and IP address have been online on the cluster node for 10 minutes, a DNS record update is issued.

 

By default, the DNS Time to Live (TTL) value for the Network Name resource is 20 minutes. This means that when the 10 minute timer has expired, clients may need to wait up to another 20 minutes for the cached DNS record to expire (Figure 4). Add to this that the updated record must propagate to domain controllers throughout the organization, and that the client-side resolver cache on the machines running Outlook also needs time before the update is reflected.

 


Figure 4: Default DNS TTL value for CMS in DNS Manager

 

A total of 30 minutes is considered a long time in most environments, so the best practice recommendation is to change the DNS TTL value to 5 minutes. To do so, we first need to find the name of the CMS cluster Network Name resource. This can be done by opening a command prompt on one of the cluster nodes and entering the following command:

 

cluster /cluster:<Name of CMS> res

 


Figure 5: Finding the CMS cluster network name resource name

 

Now that we have the cluster network’s name, let us change the TTL to 5 minutes. We do so using the following command:

 

Cluster.exe res <CMSNetworkNameResource> /priv HostRecordTTL=300

 


Figure 6: Changing the TTL to 300 seconds (5 minutes)

 

Finally, stop and start the Clustered Mailbox Server (CMS) using the Stop-ClusteredMailboxServer and Start-ClusteredMailboxServer cmdlets or the Manage Clustered Mailbox Server wizard, so that the new TTL takes effect.
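A minimal example of the restart from the Exchange Management Shell (CMS01 is a placeholder for the name of your Clustered Mailbox Server; the stop reason text is just an example):

```powershell
# Take the CMS offline and bring it online again so the new
# HostRecordTTL value takes effect (CMS01 is a placeholder name)
Stop-ClusteredMailboxServer -Identity CMS01 -StopReason "Applying new DNS HostRecordTTL"
Start-ClusteredMailboxServer -Identity CMS01
```

Note that Stop-ClusteredMailboxServer takes the CMS offline for all users, so schedule this during a maintenance window.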

 

Note:
Although some may feel tempted to do so, do not try to change the DNS TTL via the property page of the DNS record in the DNS Manager, as the setting will be overwritten with the value configured for HostRecordTTL on the cluster nodes every time the DNS record is refreshed. The record is refreshed when the CMS is started, moved, or brought online after a failure or failover.

 

Now let us verify that the TTL for the DNS record has been changed from 20 to 5 minutes. We do so by opening the property page of the CMS cluster network name resource DNS record in the DNS Manager on a DNS server as shown in Figure 7.

 


Figure 7:
TTL value changed
We have reached the end of part 2 in the article series covering the do's and don'ts of multi-site CCR clusters. Part 3 will be published in the near future. Until then, enjoy!

 


 

If you would like to read the other parts of this article series please go to:
Deploying Exchange 2007 Multi-site CCR Clusters – Do’s and Don’ts (part 1)
Deploying Exchange 2007 Multi-site CCR Clusters – Do’s and Don’ts (part 3)
Deploying Exchange 2007 Multi-site CCR Clusters – Do’s and Don’ts (part 4)

 
