Planning, Deploying, and Testing an Exchange 2010 Site-Resilient Solution sized for a Medium Organization (Part 9)

If you would like to read the other parts of this article series please go to:

Introduction

In part 8 of this multi-part article, we added the four Exchange 2010 multi-role servers to our newly created DAG. After the servers were added as DAG members, we collapsed the DAG networks, enabled Datacenter Activation Coordination (DAC) mode and redistributed the active database copies across the two Exchange 2010 multi-role servers located in the primary datacenter. With that, we ended up with a fully configured site-resilient Exchange 2010 solution sized for a medium organization.

In this part 9, I’ll explain what local and site level switchovers and failovers (aka *overs) are and at what levels they can occur. After describing the different high availability and disaster recovery terms, I’ll simulate a disk failure on EX01 resulting in a database *over from EX01 to EX03. Lastly, we’ll take a look at how a database *over from one DAG member server to another within the same datacenter affects the three most popular Exchange clients: Outlook 2007/2010, Outlook Web App (OWA) and Exchange ActiveSync devices.

Important note:
The client access array level failovers described in this article may differ from the results you see during your own testing, as this depends heavily on the load balancer solution used in the respective Exchange 2010 environment. As mentioned earlier in this multi-part article, I use a load balancer solution based on LoadMaster devices from KEMP Technologies in each datacenter. The one in the primary datacenter is built on physical LoadMaster devices and the one in the secondary datacenter uses two Hyper-V based virtual LoadMaster appliances.

Switchovers versus Failovers

Let’s begin with the basic terminology. When it comes to high availability and site resilience in Exchange 2010, we have two types of so-called *overs. We have:

  • Switchovers   Switchovers are initiated manually by an Exchange administrator (usually prior to a service or maintenance window). For instance, this could be when a new Exchange service pack or roll-up update has to be applied to a specific Exchange server or a set of Exchange servers. In this case, we would need to move any active database copies to another server in the DAG. From the Client Access and Hub Transport server perspective, it’s usually also a good idea to take the Exchange 2010 multi-role server(s) out of the CAS array and exclude them from receiving SMTP requests (done via the load balancer solution) during the maintenance window. Even though most load balancer solutions can detect service failures, there are still situations where clients and SMTP traffic can be directed to a server that’s partially down.
  • Failovers   Failovers are typically initiated automatically by Exchange when a service becomes unavailable and needs to be restored in an automatic fashion. For instance, if one of our Exchange 2010 multi-role servers hosted one or more active database copies and suddenly crashed, the PAM (Primary Active Manager) would initiate an automatic failover to one of the other three servers in the DAG. This would be done using the best copy selection process (state of the content index, copy queue and replay queue length and the activation preference set on the database copies). From the Client Access server perspective, most load balancer solutions include failure detection and failover mechanisms that will exclude a server with the Client Access and/or Hub Transport server roles installed from the CAS array, so that client or SMTP traffic isn’t directed towards a failed server.

When it comes to Exchange 2010 database *overs, they can occur at the following three levels:

  • Database level   If the disk holding a database on a Mailbox server in a DAG becomes corrupt, the particular database would be activated on another server in the DAG.
  • Server level   If a Mailbox server that is part of a DAG crashes, all active databases on the server will need to be activated on other member servers in the DAG.
  • Site level   If the primary datacenter becomes a smoking hole, all databases will need to be restored in the failover datacenter.

Client Access and Hub Transport *overs can occur at the following two levels:

  • Server level   If a server with the Client Access and Hub Transport roles installed crashes, other Client Access and Hub Transport servers within the primary datacenter will need to take over. Most load balancer solutions will make sure this happens in an automatic fashion.
  • Site level   If the primary datacenter becomes a smoking hole, the DNS records pointing to the Client Access servers (CAS array) and Hub Transport servers in the primary datacenter will need to be updated so they point to the Client Access servers (CAS array) and Hub Transport servers in the failover datacenter. Depending on the TTL values of the DNS records, the DNS client cache and the complexity/size of the Active Directory topology, this can take a substantial amount of time.
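To get a feel for how long a site level redirection can take, the delays mentioned above can simply be added up. The following Python snippet is a rough sketch only: the function name and all timing values are hypothetical and were not measured in this environment, and it ignores clients or resolvers that do not honour the TTL.

```python
def worst_case_redirect_minutes(update_minutes, replication_minutes, ttl_minutes):
    """Rough upper bound before every client resolves the failover datacenter.

    A client that cached the old record just before the change reached its
    DNS server keeps using it for a full TTL, so the TTL is added on top of
    the time spent updating and replicating the records.
    """
    return update_minutes + replication_minutes + ttl_minutes


# Hypothetical values: 10 minutes to update the records, 15 minutes for the
# change to replicate to all DNS servers, and a TTL of 5 or 60 minutes.
short_ttl = worst_case_redirect_minutes(10, 15, 5)
long_ttl = worst_case_redirect_minutes(10, 15, 60)
print(short_ttl, long_ttl)
```

With these made-up numbers, the TTL alone accounts for most of the difference between the two cases, which is why the TTL on the relevant records deserves attention when planning for a site level *over.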

Okay, you should now have an idea of what switchovers and failovers are and how they relate to the different Exchange 2010 server roles.

Quick Recap of the Environment

Before we move on and perform the actual failover simulation, let’s quickly recap the environment. This is an active/passive user distribution datacenter model, where only the primary datacenter has active mailboxes. The diagram shown below illustrates the scenario. We have two Internet-facing datacenters with two Exchange 2010 multi-role servers in each datacenter. Inbound client and SMTP traffic goes to the primary datacenter. We have a stretched DAG (with DAC enabled) with four database copies per mailbox database. Active database copies are spread across the two servers (EX01 and EX03) in the primary datacenter. Lastly, we have a CAS array and load balancer solution configured in each datacenter.


Figure 1: Datacenter model used in this multi-part article

Simulating Database Failure

Okay, let’s first take a look at what happens when the database disk in EX01, located in the primary datacenter, fails. As can be seen in Figure 2, mailbox database 1 through 6 are currently active on EX01.


Figure 2: Mailbox Database 1 through 6 active on EX01

To simulate a failure of the disk holding the active databases, I’ll simply take it offline via the Disk Management tool in the Server Manager console as shown in Figure 3.


Figure 3: Taking the Database Disk Offline

When the disk is offline, it’s obviously no longer visible in Windows Explorer. Because the databases on it are no longer accessible, Exchange 2010, or more specifically the Active Manager, will initiate a database failover to EX03, as the database copies on this server have an activation preference set to “2”. Remember though that it isn’t only the activation preference the Active Manager will look at, but also the state of the content index and the copy queue and replay queue length of any available passive database copies. This means that one or more of the databases could potentially be activated on EX02 or EX04 in the failover datacenter.
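The selection described above can be illustrated with a small Python model. This is a deliberately simplified sketch and not the actual Active Manager logic: the DatabaseCopy class, the pick_copy function, the queue length thresholds and the activation preferences assumed for EX02 and EX04 are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DatabaseCopy:
    server: str
    activation_preference: int
    content_index_healthy: bool
    copy_queue_length: int
    replay_queue_length: int


def pick_copy(copies, max_copy_queue=10, max_replay_queue=50):
    """Return the acceptable passive copy with the lowest activation preference.

    The thresholds are illustrative assumptions for this sketch.
    """
    acceptable = [
        c for c in copies
        if c.content_index_healthy
        and c.copy_queue_length < max_copy_queue
        and c.replay_queue_length < max_replay_queue
    ]
    if not acceptable:
        return None
    return min(acceptable, key=lambda c: c.activation_preference)


# Passive copies of one database after the disk in EX01 has failed.
# Preferences 3 and 4 for EX02 and EX04 are assumed values.
copies = [
    DatabaseCopy("EX03", 2, True, 0, 0),
    DatabaseCopy("EX02", 3, True, 0, 0),
    DatabaseCopy("EX04", 4, True, 0, 0),
]
print(pick_copy(copies).server)

# If the EX03 copy lags behind, a copy in the failover datacenter is chosen.
copies[0].copy_queue_length = 500
print(pick_copy(copies).server)
```

In this model the activation preference only decides between copies that are otherwise in good shape, which matches the point made above: a preference of “2” does not guarantee that EX03 is chosen.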

Okay, in this example, the database copies on EX03 are all healthy, and as can be seen in Figure 4, all databases are now active on EX03. As expected, the copy status for the database copies on EX01 is now “Failed and Suspended”, as the disk holding the databases is gone.


Figure 4: Databases activated on EX03

Client Behaviour

So how do the top three Exchange client types (Outlook 2007/2010, Outlook Web App and Exchange ActiveSync) behave when the databases are activated on another DAG member server in the primary datacenter?

Outlook

Outlook clients will stay connected regardless of whether they are connected to the CAS array via EX01 or EX03. This is because the current RPC connections to the CAS array aren’t affected by a database level failover. Remember that with Exchange 2010, all clients, including Outlook MAPI clients, use the CAS array as the connection endpoint. Read more about these architectural changes in another multi-part article I wrote here on MSExchange.org.


Figure 5: Outlook Clients stay connected

Outlook Web App (OWA)

If a refresh in an existing OWA session occurs during the database level failover, the end user usually gets an error similar to the one shown in Figure 6.


Figure 6: Error if OWA refreshes during Database *over

Since this is a database level failover, the OWA cookie won’t be lost and as soon as the failover has completed and the browser is refreshed thereafter, the end user will get back into the current OWA session.


Figure 7: End User keeps current OWA session after a database level failover

Exchange ActiveSync devices

Users with Exchange ActiveSync devices will not notice a database level failover.

Now that we have simulated a database level failure, let’s bring the disk in EX01 back online and update the database copies by right-clicking on each database copy on EX01 and selecting “Update Database”. If you have many databases in the environment, I recommend you instead use the Update-MailboxDatabaseCopy cmdlet.
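When many copies need to be updated, the command lines can be generated rather than typed by hand. The Python sketch below only builds Update-MailboxDatabaseCopy command lines that can be pasted into the Exchange Management Shell. The database names are assumed from the figure captions and may differ from the actual names in your environment, and the helper function name is hypothetical.

```python
def update_copy_commands(databases, server):
    """Build one Update-MailboxDatabaseCopy command per database copy.

    A database copy is identified as "<database name>\\<server name>".
    """
    return [
        f'Update-MailboxDatabaseCopy -Identity "{db}\\{server}"'
        for db in databases
    ]


# Assumed names: "Mailbox Database 1" through "Mailbox Database 6" on EX01.
databases = [f"Mailbox Database {n}" for n in range(1, 7)]
commands = update_copy_commands(databases, "EX01")
for command in commands:
    print(command)
```

Review the generated lines before running them, since each one reseeds the copy on the named server from the active copy.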

When the database copies have been updated, you can redistribute the active mailbox databases across EX01 and EX03 using the RedistributeActiveDatabases script I showed you back in part 7 of this multi-part article.
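To illustrate what a redistribution amounts to, here is a small Python model that assumes the goal is to put each database back on its most preferred healthy server. It is a sketch of the idea only and not the script’s actual logic; the function name, the database names DB1 and DB2 and their activation preferences are hypothetical.

```python
def moves_by_activation_preference(active_on, preferences, healthy_servers):
    """List (database, from_server, to_server) moves that put each database
    on its most preferred healthy server (lowest activation preference).
    """
    moves = []
    for db, current in sorted(active_on.items()):
        candidates = [s for s in preferences[db] if s in healthy_servers]
        if not candidates:
            continue  # no healthy copy to move to, leave the database alone
        target = min(candidates, key=lambda s: preferences[db][s])
        if target != current:
            moves.append((db, current, target))
    return moves


# Hypothetical state after the failover: both databases are active on EX03.
active_on = {"DB1": "EX03", "DB2": "EX03"}
preferences = {
    "DB1": {"EX01": 1, "EX03": 2},
    "DB2": {"EX03": 1, "EX01": 2},
}
moves = moves_by_activation_preference(active_on, preferences, {"EX01", "EX03"})
print(moves)
```

In this made-up example only DB1 is moved, because DB2 is already active on the server where its copy has the lowest activation preference.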


Figure 8: Redistributing Active Mailbox Databases across EX01 and EX03

As you can see from the above, a database level failover to a DAG member server in the same datacenter is fully automatic and almost invisible to end users. We did not go through a database level switchover in this article, as it has the same effect on the end user clients as a failover.

We have now reached the end of part 9. Until next time, have fun.
