Designing a Site Resilient Exchange 2010 Solution (Part 3)

Introduction

In Part 2 of this article series, I described a second Exchange 2010 site resilient scenario: an active/passive datacenter model where a unique namespace was used for each datacenter. All Client Access Servers (CAS) in the Exchange organization used the same SAN certificate. The trick to having Outlook Anywhere clients behave nicely during a database *over (switchover or failover) or a site failover was to configure the CAS servers in both datacenters to use the same certificate principal name using the Set-OutlookProvider cmdlet. This scenario is close to ideal, so why continue with a part 3? Well, there are also enterprises that want to have active users in both datacenters at the same time.

In this part 3, I’ll describe a scenario that is nearly identical to scenario 2. The big difference is that both datacenters will have active users. This has an effect on how we should go about designing our database availability group (DAG) infrastructure. Doing this the wrong way can have catastrophic consequences!

Credits:
I would like to give special thanks to Greg Taylor and Ross Smith IV, both Senior Program Managers on the Exchange Customer Experience team in the Exchange Product group at Microsoft. Both have provided me with invaluable information about Client Access Server (CAS) and Database Availability Group (DAG) high availability and site resiliency over the last couple of years. Without their help, this multi-part article wouldn’t exist. Thanks guys!

Scenario 3: Active/Active Model with Different Namespaces

The third datacenter model we are going to delve into is an active/active model where each datacenter has a unique namespace. An active/active model with a unique namespace for each datacenter is probably the most appealing model when designing a site resilient Exchange 2010 solution with two datacenters each with active users.

This model is depicted in the following illustration.

Figure 1: Scenario 3: Active/Active model

Scenario 3, which is covered in this article, is an active/active model with different namespaces (mail.eu.exchangelabs.dk and mail.us.exchangelabs.dk) and active users (active database copies) in both datacenters.

Note that this scenario includes two DAGs – one DAG with active databases in the EU datacenter and one with active databases in the US datacenter. In addition, we have a dedicated witness server in each datacenter. There’s of course a very specific reason for this design, and I will explain it in the “Database Availability Group design” section later in this article.

Unless you have LAN quality communication between the two datacenters, it’s recommended to use a separate AD site for each datacenter instead of spanning a single AD site. To avoid unnecessary broadcast traffic etc. between the datacenters, it’s also recommended to use different subnets for each. We don’t have LAN quality communication between the two datacenters in this specific scenario so we have an AD site in each datacenter.

Client Access Server Infrastructure

The scenario depicted in Figure 1 includes 2 CAS servers in each datacenter. A redundant hardware load balancer and an RPC Client Access array have been deployed in each datacenter and are used to distribute client traffic evenly between the CAS servers.

As you can see in the figure, the same SAN certificate is used in both datacenters. The SAN certificate holds the following FQDNs:

mail.eu.exchangelabs.dk (certificate principal name)
mail.us.exchangelabs.dk
autodiscover.exchangelabs.dk

All internal and external web service URLs (OWA, ECP, EWS, OAB) on the CAS servers in the EU datacenter point to mail.eu.exchangelabs.dk, which resolves to the virtual IP address (VIP) of the load balancer in this datacenter. The internal and external web service URLs on the CAS servers in the US datacenter point to mail.us.exchangelabs.dk, which in turn resolves to the VIP of the load balancer in that datacenter.
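
As a rough sketch, the URLs for the EU datacenter could be set from the Exchange Management Shell along these lines. The CAS server names (EU-CAS01, EU-CAS02) and the URL paths are assumptions for illustration; they are not taken from the scenario, so adjust them to your environment. The US datacenter would use the same commands with mail.us.exchangelabs.dk.

```powershell
# Hypothetical CAS server names; replace with the actual EU CAS servers.
$euCas  = "EU-CAS01", "EU-CAS02"
$euBase = "https://mail.eu.exchangelabs.dk"

foreach ($server in $euCas) {
    # Point the internal and external URL of each web service at the EU namespace.
    Get-OwaVirtualDirectory -Server $server |
        Set-OwaVirtualDirectory -InternalUrl "$euBase/owa" -ExternalUrl "$euBase/owa"

    Get-EcpVirtualDirectory -Server $server |
        Set-EcpVirtualDirectory -InternalUrl "$euBase/ecp" -ExternalUrl "$euBase/ecp"

    Get-WebServicesVirtualDirectory -Server $server |
        Set-WebServicesVirtualDirectory -InternalUrl "$euBase/EWS/Exchange.asmx" -ExternalUrl "$euBase/EWS/Exchange.asmx"

    Get-OabVirtualDirectory -Server $server |
        Set-OabVirtualDirectory -InternalUrl "$euBase/OAB" -ExternalUrl "$euBase/OAB"
}
```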

Although we have active users connecting to both datacenters, the autodiscover record (autodiscover.exchangelabs.dk) in external DNS points to the load balancer in the EU datacenter. The internal “AutoDiscoverServiceInternalUri” on the CAS servers in both datacenters has been configured with a value of https://mail.eu.exchangelabs.dk/autodiscover/autodiscover.xml. Now you could in theory point the AutoDiscoverServiceInternalUri on CAS servers in the US datacenter at “https://autodiscover.us.exchangelabs.dk/autodiscover/autodiscover.xml”, but you can easily end up in a situation where SCPs aren’t reachable during a site failover. Also, cross-site traffic caused by Autodiscover has only a minor impact on the WAN link, since autodiscover requests and responses consist of small XML-based text files.
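
A minimal sketch of that configuration is shown below. It assumes every CAS server in the organization belongs to one of the two datacenters, so the same value can be applied to all of them in one pass.

```powershell
# Set the same Autodiscover SCP value on every CAS server in the organization.
Get-ClientAccessServer |
    Set-ClientAccessServer -AutoDiscoverServiceInternalUri "https://mail.eu.exchangelabs.dk/autodiscover/autodiscover.xml"

# Review the result.
Get-ClientAccessServer | Format-Table Name, AutoDiscoverServiceInternalUri -AutoSize
```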

The RPC Client Access arrays have been configured with different values. We have an RPC CA array named outlook-eu.exchangelabs.dk in the EU datacenter and one named outlook-us.exchangelabs.dk in the US datacenter. The FQDN of each RPC CA array is only used by internal Outlook clients and doesn’t need to be included on the SAN list in the certificate, as RPC traffic doesn’t use or require a certificate in order to be encrypted.
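
The two arrays could be created roughly as follows. The array names and the AD site names (EU-Site and US-Site) are placeholders I made up for this sketch; only the FQDNs come from the scenario.

```powershell
# One RPC Client Access array per AD site (site names are hypothetical).
New-ClientAccessArray -Name "outlook-eu" -Fqdn "outlook-eu.exchangelabs.dk" -Site "EU-Site"
New-ClientAccessArray -Name "outlook-us" -Fqdn "outlook-us.exchangelabs.dk" -Site "US-Site"

# Review the arrays.
Get-ClientAccessArray | Format-Table Name, Fqdn, Site -AutoSize
```

Each FQDN must also resolve in internal DNS to the VIP of the load balancer in its datacenter.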

As I mentioned earlier, we use the same SAN certificate on all CAS servers, which means that the certificate principal name (mail.eu.exchangelabs.dk) is the same for all CAS servers. By default this will break Outlook connectivity when either a database *over or a site failover occurs. This is because when Outlook Anywhere is enabled for a CAS server, the FQDN (Outlook Proxy Endpoint) specified will be used as the value for the “msstd” as well (Figure 2). If the “msstd” value doesn’t match the certificate principal name, Outlook Anywhere clients will not be able to connect.

Figure 2: Default Exchange Proxy Settings

Then why not use a different certificate for each datacenter, where the certificate principal name is mail.eu.exchangelabs.dk in the EU datacenter and mail.us.exchangelabs.dk in the US datacenter? Wouldn’t this fix the Outlook connectivity problem that occurs during cross-site *overs and site failovers? Unfortunately it won’t, or at least not for all Outlook client versions.

However, as long as you use the approach described in this article, Outlook clients will connect just fine. What we do here is use the same certificate in both datacenters. In addition, we configure the “msstd” value to be the same in both datacenters. To do so, we can use the “Set-OutlookProvider” cmdlet with the “CertPrincipalName” parameter, so that the “msstd” value is configured identically for CAS servers in both datacenters. In this example, we would use the following command:

Set-OutlookProvider EXPR -CertPrincipalName msstd:mail.eu.exchangelabs.dk
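
To check the value afterwards, something like the following should do. This is only a quick verification sketch and is not part of the original procedure.

```powershell
# Show the certificate principal name configured for the EXPR (Outlook Anywhere) provider.
Get-OutlookProvider -Identity EXPR | Format-List Name, CertPrincipalName
```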

Note:
Some of you might wonder why the RPC CA array isn’t given the same name as the external namespace (for example mail.domain.com) so that complexity is reduced a little. There’s a very good reason why it isn’t. If you use the same FQDN, Outlook Anywhere clients will experience a delay of approximately 30 seconds every time they connect, since Outlook by default tries to connect using TCP/IP before trying HTTP. Put another way, don’t name the RPC CA array the same as the external namespace or anything else that’s resolvable in external DNS.

Hub Transport Infrastructure

The scenario depicted in Figure 1 includes 2 Hub Transport (HT) servers in each datacenter. Traffic coming from external SMTP servers and from internal LOB applications goes to the hardware load balancer which distributes it evenly among the HT servers. Inbound messages only go to the primary datacenter. In case of a site failover to the failover datacenter, inbound mail flow is routed to the failover datacenter.

There’s often a lot of confusion around whether it’s supported to load balance SMTP traffic going to Exchange 2007/2010 Hub Transport servers, and the answer is in the details. It’s supported to load balance HT servers using a hardware load balancer (HLB) or Windows Network Load Balancing (WNLB), but it isn’t supported to load balance connections between HT servers on your internal corporate production network using HLB or WNLB. You may only load balance inbound SMTP connections from applications (such as LOB applications, MOSS, and SCOM 2007) and other non-Exchange sources, as well as client connections. Steps on how to do this can be found in a previous article of mine.

Distributing inbound messages (using MX records or a service such as FOPE) between the datacenters means that the traffic on the WAN between them will increase, since some messages that enter the EU datacenter will be destined for recipients located in the US datacenter and vice versa. So please make sure the WAN connection can handle all the traffic that will occur between the datacenters. That includes log replication and e-mail messages between Hub Transport servers, as well as other non-Exchange traffic.

Database Availability Group design

The scenario depicted in Figure 1 includes 2 Mailbox servers in each datacenter. In addition it includes two DAGs both stretched between the datacenters. DAG1 has active database copies in the EU datacenter and DAG2 has active databases in the US datacenter.

Because we have an even number of DAG members (two to be precise) in each DAG, the file share witness for each DAG is located in the datacenter where that DAG’s active database copies reside. The reason is that if the network fails between the two datacenters, the mailbox databases will stay mounted in the “primary datacenter”, since it has majority (1 DAG member plus a witness = 2 votes versus 1 DAG member = 1 vote in the other datacenter).
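
A sketch of how the two DAGs and their witness servers might be created is shown below. All server names and witness directories are hypothetical, and the example assumes the DAGs get their IP addresses via DHCP (otherwise the DatabaseAvailabilityGroupIpAddresses parameter would be needed).

```powershell
# DAG1: active copies in the EU datacenter, so its witness lives in the EU datacenter.
New-DatabaseAvailabilityGroup -Name DAG1 -WitnessServer EU-FSW01 -WitnessDirectory C:\DAG1FSW
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EU-MBX01
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer US-MBX01

# DAG2: active copies in the US datacenter, so its witness lives in the US datacenter.
New-DatabaseAvailabilityGroup -Name DAG2 -WitnessServer US-FSW01 -WitnessDirectory C:\DAG2FSW
Add-DatabaseAvailabilityGroupServer -Identity DAG2 -MailboxServer US-MBX02
Add-DatabaseAvailabilityGroupServer -Identity DAG2 -MailboxServer EU-MBX02
```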

Figure 3: Two stretched DAGs

But why two stretched DAGs instead of one? Well, because we have active users in both datacenters. Hmm, yes, and? Let me take it one step further. What do you think would happen if you only had one DAG and the WAN connection between the datacenters was lost? Correct! The mailbox databases would mount in the datacenter where the DAG had majority, which would be fine for the users connected to this datacenter. But what about the users that connect to the other datacenter? Yes, their Outlook clients would disconnect, as there wouldn’t be any active database copies in the other datacenter. Not a good situation, right?

With two stretched DAGs, majority can be retained in each datacenter. This keeps all mailbox databases mounted in their primary datacenter, so all clients stay connected no matter which datacenter they are connected to.
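
The vote arithmetic behind this can be illustrated with a small helper function. Test-Quorum is a made-up name for this sketch and not an Exchange or cluster cmdlet; it simply applies the majority rule (more than half of all votes) to the numbers used in this scenario.

```powershell
# Illustrative helper (not a real cmdlet): does one side of a split still hold majority?
function Test-Quorum {
    param(
        [int]$TotalVotes,      # DAG members plus the file share witness
        [int]$ReachableVotes   # votes still reachable on this side of the WAN break
    )
    # Majority means strictly more than half of all votes.
    $majority = [math]::Floor($TotalVotes / 2) + 1
    return ($ReachableVotes -ge $majority)
}

# DAG1 during a WAN outage: 2 members + 1 witness = 3 votes in total.
Test-Quorum -TotalVotes 3 -ReachableVotes 2   # EU side (member + witness): True
Test-Quorum -TotalVotes 3 -ReachableVotes 1   # US side (member only): False
```

For DAG2 the numbers are mirrored, because its witness sits in the US datacenter.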

Lastly, the “RpcClientAccessServer” value for all EU mailbox databases (databases that are active in the EU datacenter) has been configured to: outlook-eu.exchangelabs.dk. US mailbox databases have been configured to: outlook-us.exchangelabs.dk.
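
As a sketch, this could be configured as follows. The database names EU-DB01 and US-DB01 are placeholders; repeat the command for each database in the respective datacenter.

```powershell
# Databases that are normally active in the EU datacenter (hypothetical name).
Set-MailboxDatabase -Identity "EU-DB01" -RpcClientAccessServer "outlook-eu.exchangelabs.dk"

# Databases that are normally active in the US datacenter (hypothetical name).
Set-MailboxDatabase -Identity "US-DB01" -RpcClientAccessServer "outlook-us.exchangelabs.dk"

# Review the result.
Get-MailboxDatabase | Format-Table Name, RpcClientAccessServer -AutoSize
```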

Note:
With Exchange 2010 SP1, two-member DAGs stretched between AD sites can now be configured to use datacenter activation coordination (DAC) mode. I won’t cover DAC in detail in this article, but will do so in another article that will be released here on MSExchange.org in the not too distant future.
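
For reference, DAC mode is enabled per DAG with a single command. The DAG names below follow the ones used in this article.

```powershell
# Enable datacenter activation coordination (DAC) mode on both stretched DAGs.
Set-DatabaseAvailabilityGroup -Identity DAG1 -DatacenterActivationMode DagOnly
Set-DatabaseAvailabilityGroup -Identity DAG2 -DatacenterActivationMode DagOnly
```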

Database Switch & Fail-Overs

When an active database copy is moved from one DAG member server in let’s say the EU datacenter to the other DAG member server in the US datacenter, the “RpcClientAccessServer” property configured on the mailbox database isn’t changed. In addition, as long as the CAS servers are available in the EU datacenter, clients will continue to connect to the CAS servers in this datacenter and the CAS servers will then connect directly to the DAG member servers in the US datacenter now hosting the active database copies using RPC.
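
A cross-site switchover of a single database, followed by a check that the RpcClientAccessServer value is unchanged, might look roughly like this. The database and server names are hypothetical.

```powershell
# Activate the copy of EU-DB01 on the DAG member in the US datacenter.
Move-ActiveMailboxDatabase -Identity "EU-DB01" -ActivateOnServer "US-MBX01"

# The active server changes, but RpcClientAccessServer still points at the EU array.
Get-MailboxDatabase -Identity "EU-DB01" | Format-List Name, Server, RpcClientAccessServer

# Check the health of all copies after the move.
Get-MailboxDatabaseCopyStatus -Identity "EU-DB01"
```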

How does this affect the miscellaneous Exchange clients?

Outlook – Outlook Anywhere clients connected to a mailbox in the EU datacenter will connect using mail.eu.exchangelabs.dk as the RPC Proxy Endpoint and also have this FQDN specified in the msstd box. Outlook 2003 clients will continue to connect to the original RPC Proxy Endpoint (mail.eu.exchangelabs.dk), since they don’t support the autodiscover service. Outlook 2007 clients will receive new connection information (EWS URLs and RPC proxy endpoint) but will ignore the new RPC proxy endpoint (mail.us.exchangelabs.dk); since the CAS servers in the EU datacenter are still available, Outlook 2007 clients will still be able to connect. Outlook 2010 will accept the new connection settings and connect to mail.us.exchangelabs.dk. Also be aware that the WAN traffic between the two datacenters will increase significantly because of the cross-site RPC traffic between CAS and Mailbox servers.
Mobile devices (EAS) – The external URL configured for EAS on the CAS servers in the EU datacenter is mail.eu.exchangelabs.dk. When an *over has occurred, mobile devices will get an HTTP 451 from the CAS servers in the primary datacenter, and be told to use mail.us.exchangelabs.dk instead. So as long as the mobile device supports an HTTP 451 (redirect), things will be fine.
OWA – The external URL configured for OWA on CAS servers in the EU datacenter is mail.eu.exchangelabs.dk. When an *over has occurred, the CAS server will tell the OWA client that it went to the wrong site and tell the user to go to mail.us.exchangelabs.dk. So things are fine for OWA clients as well.

As you can see, the way an *over affects Exchange clients in this scenario is acceptable, although each Outlook client version reacts differently to an *over. By the way, the Outlook 2007 issue described above will be fixed via an Outlook update in the near future.

And if a database should *over from the US to the EU datacenter, the same thing will happen just in the opposite order.

Complete Site failover

If the EU datacenter is destroyed or is unavailable for some other reason, and a site failover to the failover datacenter is performed by repointing DNS to this datacenter, how will this affect Exchange clients?

Outlook – Outlook Anywhere clients connected to a mailbox in the EU datacenter will connect using mail.eu.exchangelabs.dk as the RPC Proxy Endpoint and also have this FQDN specified in the msstd box. Outlook 2003/2007/2010 will connect just fine when DNS has been repointed to the other datacenter, since the certificate principal name matches the msstd value. The FQDN of the RPC Client Access array changes, but that doesn’t affect Outlook clients as long as the RpcClientAccessServer property on the mailbox databases isn’t changed, which it normally shouldn’t be during a site failover in this scenario (changing it will create other problems that you don’t want to deal with during a site failover).
Mobile devices (EAS) – The external URL configured for EAS on the CAS servers in the EU datacenter is mail.eu.exchangelabs.dk. Mobile devices will connect just fine when DNS has been repointed to the other datacenter since mail.us.exchangelabs.dk is included on the SAN list of the certificate. Keep in mind though that some old devices expect the certificate principal name to match the external URL specified for Exchange ActiveSync.
OWA – The external URL configured for OWA on CAS servers in the EU datacenter is mail.eu.exchangelabs.dk. OWA clients will connect just fine when DNS has been repointed to the other datacenter since mail.us.exchangelabs.dk is included in the SAN list of the certificate.

One important thing to keep in mind during a complete site failover is DNS delays. DNS updates can take from minutes to several hours to propagate, depending on the topology and the TTL values specified for the DNS records used by Exchange. To reduce the delays, it’s important that you configure the internal and external DNS records used for Exchange with a low TTL value (five minutes is a good best practice).
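
One way to check the TTL that clients currently see is sketched below. It assumes a management workstation with the Resolve-DnsName cmdlet (Windows 8 / Windows Server 2012 or later), which isn’t available on the servers of the Exchange 2010 era; on older systems nslookup can be used instead.

```powershell
# Look up the A records for both namespaces and show the remaining TTL in seconds.
"mail.eu.exchangelabs.dk", "mail.us.exchangelabs.dk" | ForEach-Object {
    Resolve-DnsName -Name $_ -Type A | Select-Object Name, Type, TTL, IPAddress
}
```

A TTL of 300 seconds or less corresponds to the five-minute recommendation above.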

Okay, scenario 3 has now been covered, and as you can see this scenario is just as good as scenario 2, as long as you use 2 DAGs instead of one.

We have reached the end of part 3, and since we have now covered the most typical site resilient scenarios involving two datacenters, this is also the last part of this multi-part article.

About The Author

Henrik Walther
