Designing a Site Resilient Exchange 2010 Solution (Part 1)

If you would like to read the other parts in this article series please go to:

- Designing a Site Resilient Exchange 2010 Solution (Part 2)

- Designing a Site Resilient Exchange 2010 Solution (Part 3)

Introduction

Just as many of us became relatively comfortable with deploying highly available (and often site resilient) Exchange 2007 based messaging solutions, the Exchange product team released Exchange 2010. Fortunately though, the Exchange 2010 high availability story is exactly the same as with Exchange 2007, right? Before you wonder what on earth I’ve been smoking, let me stress that I’m of course joking.

With Exchange 2007, we typically used Cluster Continuous Replication (CCR) to achieve local mailbox resiliency within a datacenter, and Standby Continuous Replication (SCR) to make the solution site resilient by replicating log files to one or more clustered or non-clustered SCR target servers located in another datacenter.

Note:
Although the recommendation was to use SCR, it was also possible to use CCR for site resiliency (read more about this topic here). Bear in mind though that using CCR as the basis for a site resilient solution wasn’t a best practice recommendation.

Load balancing and high availability for the Client Access server (CAS) role within a datacenter was usually achieved using either Windows Network Load Balancing (WNLB) or a load balancer solution from a 3rd party vendor. With Exchange 2007 most went with WNLB (including Microsoft IT). I cover this approach here. If Exchange 2007 was deployed across multiple datacenters, each with its own Internet connection, the recommended approach for the Client Access server role was to use a unique namespace for each datacenter (for instance mail.contoso.com and mail.standby.contoso.com). If a disaster made the primary datacenter unavailable, the external and internal DNS records for client access were pointed to the CAS servers in the failover datacenter.

Resiliency was designed into the Hub Transport role so that Hub Transport server to Hub Transport server communication inside an organization automatically load balances between the available Hub Transport servers in an Active Directory site. To load balance inbound SMTP connections from external SMTP servers/clients, internal LOB applications, and miscellaneous network devices, you could use WNLB or a load balancer solution from a 3rd party vendor (I cover the WNLB approach here). For inbound SMTP traffic from external servers/clients, you could also use the good old MX record trick which I describe in my Edge Transport server article series (more specifically part 6). During a failover from one datacenter to another, mail flow could be redirected to the other datacenter by changing the MX record, or by reconfiguring a cloud-based service such as FOPE, to point at the Hub Transport or Edge Transport servers in the failover datacenter.

With the release of Exchange 2010, the high availability story changed again, and not just a little: the introduction of Database Availability Groups (DAGs) and RPC Client Access (RPC CA) arrays significantly changes the way we deploy Exchange in a highly available/site resilient fashion.

Most of you are probably aware that I have already covered the new high availability related improvements and changes, such as RPC CA arrays, hardware load balancing, and DAGs, in other article series published over the last year here on MSExchange.org.

If you haven’t yet read them, now would be a good time to do so before you read on.

The intention with this specific article series is to explain the different Exchange 2010 site resiliency scenarios that are available to an organization that has two datacenters at its disposal. I’ll also talk about the pros and cons of each scenario in terms of how database fail/switch-overs (aka *overs) and complete site failures affect the miscellaneous Exchange clients.

Note:
Topics such as proxying between Internet-facing CAS servers in one AD site and CAS servers that are not Internet-facing in another AD site are outside the scope of this article series. Instead I’ll refer you to the excellent documentation already available on Microsoft TechNet.

Important Notes Before we Continue:
An Exchange 2010 deployment involving multiple datacenters is a complex topic, and all organizations have different Active Directory/Exchange topologies, network infrastructures, and, just as important, needs. Each scenario described in this article series shares the following characteristics:

- Pure Exchange 2010 organizations (no coexistence with Exchange 2003/2007)
- Consists of two datacenters (both with Internet connectivity)
- No proxy sites (AD sites with no Internet connectivity)
- Uses split-DNS (same namespace internally and externally)
- An Active Directory site in each datacenter (no spanned AD sites)
- Each DAG includes member servers from each datacenter (stretched DAGs)
- Hardware Load Balancer deployed in each datacenter
- SAN certificates are used (no wildcard certs)

Scenario 1: Active/Passive Model with Shared Namespace

The first datacenter model we’re going to delve into is the active/passive model where the same namespace is used in both datacenters. An active/passive model with a shared namespace is probably the most appealing model when designing a site resilient Exchange 2010 solution with two datacenters. The reason is that it’s the model with the least configuration complexity, and it’s somewhat similar to what many organizations have done with previous Exchange versions over the years.

This model is depicted in the following illustration.

Figure 1: Scenario 1 – Active/Passive model

When using an active/passive model with a shared namespace, you only have active users (active database copies) in one datacenter and share the same namespace (for instance mail.domain.com) between the two. It’s important to note that seen from the user distribution model, the failover datacenter is passive but from the namespace perspective both datacenters are active.

Unless you have LAN quality communication between the two datacenters, it’s recommended to use a separate AD site for each datacenter instead of spanning a single AD site. To avoid unnecessary broadcast traffic etc. between the datacenters, it’s also recommended to use different subnets for each. We don’t have LAN quality communication between the two datacenters in this specific scenario so we have an AD site in each datacenter.

Client Access Server Infrastructure

The scenario depicted in Figure 1 includes 2 CAS servers in each datacenter. A redundant hardware load balancer and an RPC Client Access array have been deployed in each datacenter and are used to distribute client traffic evenly between the CAS servers.

As you can also see in the figure, the same SAN certificate is installed on all four CAS servers. The SAN certificate holds the following FQDNs:

- mail.exchangelabs.dk (certificate principal name)
- autodiscover.exchangelabs.dk

In addition, all Web service URLs (OWA, ECP, EWS, OAB) and the AutoDiscoverServiceInternalUri have been configured with the same values in both datacenters. Both the internal and external URL values for each virtual directory point at a load balancer.

Only the RPC Client Access arrays have been configured with different values. This is because you can only have one RPC Client Access array per AD site, and when you have multiple RPC CA arrays they cannot be configured with the same FQDN. So an RPC CA array has been created for each datacenter, named outlook-1.exchangelabs.dk and outlook-2.exchangelabs.dk respectively. The FQDN of each RPC CA array is only used by internal Outlook clients and doesn’t need to be included on the SAN list in the certificate, as RPC traffic doesn’t use or require a certificate in order to be encrypted.

Note:
Some of you might wonder why the RPC CA array isn’t named the same as the external namespace (mail.domain.com) so that complexity is reduced a little. There’s a very good reason why it isn’t. If you use the same FQDN, Outlook Anywhere clients will experience a delay of approximately 30 seconds every time they connect, since Outlook by default tries to connect using TCP/IP before trying HTTP. Put another way, don’t name the RPC CA array the same as the external namespace or anything else that’s resolvable in external DNS.
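The naming rules above lend themselves to a quick sanity check. The following is a minimal Python sketch, not an Exchange tool: the function name, the site labels (Datacenter-1, Datacenter-2) and the idea of feeding it a hand-maintained set of externally resolvable names are all assumptions made for illustration. Only the FQDNs come from this scenario.

```python
def check_rpc_ca_arrays(array_fqdn_by_site, externally_resolvable):
    """Return a list of naming problems; an empty list means the layout passes.

    array_fqdn_by_site: one RPC CA array FQDN per AD site (hypothetical site labels).
    externally_resolvable: lowercase names known to resolve in external DNS.
    """
    problems = []
    seen = {}  # array FQDN -> first site that used it
    for site, fqdn in array_fqdn_by_site.items():
        name = fqdn.lower()
        if name in seen:
            problems.append(f"{site} and {seen[name]} share the array FQDN {name}")
        else:
            seen[name] = site
        if name in externally_resolvable:
            problems.append(f"{site} array FQDN {name} is resolvable in external DNS")
    return problems


# Values from this scenario; the site labels are placeholders.
arrays = {
    "Datacenter-1": "outlook-1.exchangelabs.dk",
    "Datacenter-2": "outlook-2.exchangelabs.dk",
}
external = {"mail.exchangelabs.dk", "autodiscover.exchangelabs.dk"}

print(check_rpc_ca_arrays(arrays, external))  # prints []
```

Using a dictionary keyed by AD site reflects the one-array-per-site rule; the two checks cover the uniqueness rule and the "not resolvable externally" advice from the note.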

Hub Transport Infrastructure

The scenario depicted in Figure 1 includes 2 Hub Transport (HT) servers in each datacenter. Traffic coming from external SMTP servers and from internal LOB applications goes to the hardware load balancer which distributes it evenly among the HT servers. Inbound messages only go to the primary datacenter. In case of a site failover to the failover datacenter, inbound mail flow is routed to the failover datacenter.

There’s often a lot of confusion around whether it’s supported to load balance SMTP traffic going to Exchange 2007/2010 Hub Transport servers. The answer is in the details. It’s supported to load balance HT servers using an HLB or WNLB, but it isn’t supported to load balance connections between HT servers on your internal corporate production network using an HLB or WNLB. You may only load balance inbound SMTP connections from applications (such as LOB applications, MOSS, and SCOM 2007) and other non-Exchange sources, as well as client connections. Steps on how to do this can be found in this previous article of mine.

Database Availability Group design

The scenario depicted in Figure 1 includes 2 Mailbox servers in each datacenter. Only a single DAG is used and the DAG is stretched between the datacenters. We only have active database copies in the primary datacenter and each database has three copies.

Because we have an even number of DAG members, the witness server is configured in the primary datacenter. The reason is that if the network fails between the two datacenters, the mailbox databases will stay mounted in the primary datacenter since it has majority (2 DAG members plus a witness = 3 votes, versus 2 DAG members = 2 votes in the failover datacenter).
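The vote arithmetic can be written out as a small Python sketch. It is a simplified model of majority voting, not the actual cluster quorum implementation: the function names are made up for illustration, and it assumes that during a network split the witness can only be reached from the datacenter it sits in.

```python
def has_majority(reachable_votes, total_votes):
    """A side keeps quorum only if it holds more than half of all votes."""
    return reachable_votes > total_votes / 2


def sides_with_quorum(members_primary, members_failover, witness_site):
    """Return the datacenters that keep quorum when the inter-site link fails.

    Assumption: the witness is reachable only from its own datacenter.
    """
    total = members_primary + members_failover + 1  # +1 for the witness vote
    votes = {"primary": members_primary, "failover": members_failover}
    votes[witness_site] += 1
    return [site for site, count in votes.items() if has_majority(count, total)]


# The scenario in Figure 1: 2 + 2 DAG members, witness in the primary datacenter.
print(sides_with_quorum(2, 2, "primary"))   # prints ['primary']
# Placing the witness in the failover datacenter would flip the outcome.
print(sides_with_quorum(2, 2, "failover"))  # prints ['failover']
```

With five votes in total, three are needed for majority, which is why the location of the witness decides which datacenter keeps its databases mounted.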

The “RpcClientAccessServer” value for all mailbox databases has been configured to: outlook-1.exchangelabs.dk.

Since we don’t have any users in either of the datacenters, this DAG design approach works well.

Note:
Delving into DAG sizing and database layout is outside the scope of this article.

Database switch & fail-overs

When an active database copy is moved from one DAG member server in the primary datacenter to another DAG member server in the failover datacenter, the “RpcClientAccessServer” property configured on the mailbox database isn’t changed. In addition, as long as the CAS servers are available in the primary datacenter, clients will continue to connect to the CAS servers in this datacenter and the CAS servers will then connect directly to the DAG member servers in the failover datacenter hosting the active database copies using RPC.

How does this affect the miscellaneous Exchange clients?

Outlook Anywhere – Outlook Anywhere connects using mail.exchangelabs.dk as the RPC Proxy Endpoint and also has this FQDN specified in the “msstd” box. Since Outlook will continue to connect to the original RPC Proxy Endpoint, Outlook 2003/2007/2010 users won’t be affected by an *over between the datacenters in this scenario. The only thing to be aware of is that the WAN traffic between the two datacenters will increase significantly. If there are clients connecting directly via MAPI, these would also continue to work.
Mobile devices (EAS) – The external URL configured for EAS is mail.exchangelabs.dk. When an *over has occurred, it will create problems for all types of mobile devices. More specifically, CAS will send an HTTP 451 to the device and tell it to connect to the external URL in the failover datacenter. Since the same value is set for the EAS vdir in both datacenters, this creates a loop and the device won’t get access.
OWA – The external URL configured for OWA is mail.exchangelabs.dk. When an *over has occurred, the CAS server will tell the OWA client that it went to the wrong site and issue a redirect to mail.exchangelabs.dk, which results in the OWA client being prompted for credentials over and over again.
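To make the EAS loop easier to see, here is a deliberately simplified Python model of the redirect behavior described above. It is not how CAS is implemented: the function, the dictionaries and the hop limit are illustrative assumptions, and the second hostname (mail-standby.exchangelabs.dk) is purely hypothetical, added only to contrast the shared namespace with a per-datacenter one.

```python
def connect_eas(start_host, dns, external_host_by_site, mailbox_site, max_hops=5):
    """Follow redirects until the device reaches the mailbox's site or gives up.

    dns: hostname -> datacenter the name currently points at.
    external_host_by_site: datacenter -> hostname in the EAS external URL.
    Returns (hosts tried, whether access was obtained).
    """
    host = start_host
    tried = []
    for _ in range(max_hops):
        tried.append(host)
        if dns[host] == mailbox_site:
            return tried, True
        # Wrong site: the device is told to use the external host of the
        # datacenter that now hosts the active database copy.
        host = external_host_by_site[mailbox_site]
    return tried, False


# Shared namespace (this scenario): the redirect points back at the same name.
shared_dns = {"mail.exchangelabs.dk": "primary"}
shared_urls = {"primary": "mail.exchangelabs.dk", "failover": "mail.exchangelabs.dk"}
print(connect_eas("mail.exchangelabs.dk", shared_dns, shared_urls, "failover")[1])
# prints False

# Hypothetical contrast: a distinct name for the failover datacenter.
split_dns = {"mail.exchangelabs.dk": "primary", "mail-standby.exchangelabs.dk": "failover"}
split_urls = {"primary": "mail.exchangelabs.dk", "failover": "mail-standby.exchangelabs.dk"}
print(connect_eas("mail.exchangelabs.dk", split_dns, split_urls, "failover"))
# prints (['mail.exchangelabs.dk', 'mail-standby.exchangelabs.dk'], True)
```

In the shared-namespace case every redirect leads back to the name that still resolves to the primary datacenter, so the device never reaches the site holding the active copy.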

As you can understand, the way an *over affects Exchange clients in this scenario is far from ideal. So when using this scenario, you should stay away from *overs to the failover datacenter. Actually, it’s a good idea to block the database copies in the failover datacenter from being activated. This can be done at the server or database level. I describe the process in part 4 of my article series covering DAGs.

Complete Site Failover

If the primary datacenter is destroyed or is unavailable for some other reason, a site failover to the failover datacenter is performed by repointing DNS to this datacenter and reconfiguring the DAG to point to a witness server in the failover site. How will a site failover affect Exchange clients in an active/passive datacenter model with a shared namespace?

Outlook – Outlook Anywhere connects using mail.exchangelabs.dk as the RPC Proxy Endpoint and also has this FQDN specified in the “msstd” box. Outlook will connect just fine when DNS has been repointed to the other datacenter, since all settings in this datacenter are the same as in the primary datacenter. Only the FQDN of the RPC Client Access array changes, but that doesn’t affect Outlook clients as long as the RpcClientAccessServer property on the mailbox databases isn’t changed, which it shouldn’t be during a site failover in this scenario.
Mobile devices (EAS) – The external URL configured for EAS is mail.exchangelabs.dk. Mobile devices will connect just fine when DNS has been repointed to the other datacenter since all settings in this datacenter are the same as the primary datacenter.
OWA – The external URL configured for OWA is mail.exchangelabs.dk. OWA clients will connect just fine when DNS has been repointed to the other datacenter since all settings in this datacenter are the same as the primary datacenter.

One important thing to keep in mind during a complete site failover is DNS delays. DNS updates can take from minutes to several hours to reach all clients, depending on the topology and the TTL values specified for the DNS records used by Exchange. To reduce the delays, it’s important that you configure the internal and external DNS records used for Exchange with a low TTL value (five minutes is a good best practice).
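A resolver that cached a record just before the change may keep serving the old answer for up to the full TTL, so it can be worth auditing the TTLs ahead of time. The Python sketch below flags records above the five minute guideline. The function name and the sample TTL values are made up for illustration, and in practice the values would come from your DNS zones.

```python
RECOMMENDED_TTL_SECONDS = 300  # the five minute guideline mentioned above


def records_needing_lower_ttl(ttl_by_record, limit=RECOMMENDED_TTL_SECONDS):
    """Return the record names whose TTL exceeds the limit, sorted by name."""
    return sorted(name for name, ttl in ttl_by_record.items() if ttl > limit)


# Hypothetical TTLs for the names used in this scenario.
sample_ttls = {
    "mail.exchangelabs.dk": 3600,
    "autodiscover.exchangelabs.dk": 300,
}

print(records_needing_lower_ttl(sample_ttls))  # prints ['mail.exchangelabs.dk']
```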

Okay, scenario 1 has now been covered, and as you can see this scenario isn’t that bad as long as you avoid unnecessary database *overs to the failover datacenter. As a site resilient solution where you need to be able to perform a complete site failover, this scenario works quite well.

We have reached the end of part 1, but you can look forward to more scenarios in the coming parts.
