Planning, Deploying, and Testing an Exchange 2010 Site-Resilient Solution sized for a Medium Organization (Part 10)

If you would like to read the other parts of this article series please go to:

Planning, Deploying, and Testing an Exchange 2010 Site-Resilient Solution sized for a Medium Organization (Part 1)
Planning, Deploying, and Testing an Exchange 2010 Site-Resilient Solution sized for a Medium Organization (Part 2)
Planning, Deploying, and Testing an Exchange 2010 Site-Resilient Solution sized for a Medium Organization (Part 3)
Planning, Deploying, and Testing an Exchange 2010 Site-Resilient Solution sized for a Medium Organization (Part 4)
Planning, Deploying, and Testing an Exchange 2010 Site-Resilient Solution sized for a Medium Organization (Part 5)
Planning, Deploying, and Testing an Exchange 2010 Site-Resilient Solution sized for a Medium Organization (Part 6)
Planning, Deploying, and Testing an Exchange 2010 Site-Resilient Solution sized for a Medium Organization (Part 7)
Planning, Deploying, and Testing an Exchange 2010 Site-Resilient Solution sized for a Medium Organization (Part 8)
Planning, Deploying, and Testing an Exchange 2010 Site-Resilient Solution sized for a Medium Organization (Part 9)
Planning, Deploying, and Testing an Exchange 2010 Site-Resilient Solution sized for a Medium Organization (Part 11)
Planning, Deploying, and Testing an Exchange 2010 Site-Resilient Solution sized for a Medium Organization (Part 12)
Planning, Deploying, and Testing an Exchange 2010 Site-Resilient Solution sized for a Medium Organization (Part 13)

Introduction

In part 9 of this multi-part article, I explained what local and site level switchovers and failovers (aka *overs) are and at what levels they can occur. After describing the different high availability and disaster recovery terms, I simulated a disk failure on EX01, resulting in a database failover from EX01 to EX03, which, like EX01, is located in the primary datacenter. In addition, we had a look at how a database level failover from one DAG member server to another within the same datacenter affects the three most popular Exchange client types: Outlook 2007/2010, Outlook Web App (OWA) and Exchange ActiveSync devices.

In this part 10, we’ll continue where we left off in part 9. We will take things a step further and simulate a server level failure. That is, we will fail EX01 so that a failover occurs at both the database level and the client access array level, and then see how it affects the three most popular Exchange client types: Outlook 2007/2010, Outlook Web App (OWA) and Exchange ActiveSync devices.

Important:
The client access array level failovers described in this article may differ from the results you see during your own testing, as this depends heavily on the load balancer solution used in the respective Exchange 2010 environment. As mentioned earlier in this multi-part article, I use a load balancer solution based on LoadMaster devices from KEMP Technologies in each datacenter. The one in the primary datacenter is built on physical LoadMaster devices and the one in the secondary datacenter uses two Hyper-V based virtual LoadMaster appliances.

Simulating A Single Server Failure

Okay, let’s simulate a failure of server “EX01” in the primary datacenter. As can be seen in Figure 1, mailbox databases 1 through 6 are currently active on this server.

Figure 1: Mailbox Database 1 through 6 active on EX01

As can be seen on the statistics page of our load balancer solution in the primary datacenter, we have several Outlook MAPI, Outlook Anywhere, Outlook Web App and Exchange ActiveSync client connections to the CAS array here (Figure 2).

Figure 2: Current connections to the CAS array in the primary datacenter

The connections have been load balanced across EX01 (192.168.2.221) and EX03 (192.168.2.222) as shown in Figure 3.

Figure 3: Client connections to the CAS array are load balanced across EX01 and EX03

Bear in mind that even though a user mailbox is located in a database that’s active on, let’s say, EX01, it doesn’t mean that the client opening this mailbox will make an RPC or SSL connection to that server. The connection can land on any server in the CAS array configured for the primary datacenter (depending, of course, on the persistence method used).
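
To make this point concrete, here is a minimal Python sketch of source IP persistence, which is just one of several persistence methods a load balancer may offer. It is not how the load balancers in this lab are implemented: the hashing scheme, the function name and the client address are assumptions made for illustration. Only the two real server addresses are taken from Figure 3.

```python
import hashlib

# Real servers behind the CAS array in the primary datacenter (see Figure 3).
REAL_SERVERS = {"EX01": "192.168.2.221", "EX03": "192.168.2.222"}


def pick_cas_server(client_ip, available=None):
    """Hypothetical source IP persistence.

    The same client IP always maps to the same CAS server as long as the
    set of available servers does not change. Note that the location of
    the active mailbox database plays no role in the decision.
    """
    names = sorted(available if available is not None else REAL_SERVERS)
    if not names:
        raise RuntimeError("no real servers available")
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    index = int.from_bytes(digest[:4], "big") % len(names)
    return names[index]


# 10.0.0.5 is a made-up client address.
print(pick_cas_server("10.0.0.5"))
# With EX01 down, every client ends up on the only remaining real server.
print(pick_cas_server("10.0.0.5", available=["EX03"]))
```

The sketch only looks at the client address and the list of available CAS servers, which is why a mailbox in a database active on EX01 can just as well be opened through EX03.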

We can verify this using several methods. In this article, I’ll show you how to verify this using the “About” page in OWA. In OWA we can click on the question mark in the upper right corner and then “About” in the dropdown menu (Figure 4).

Figure 4: About option in Outlook Web App

The “About” page shows all kinds of useful information, such as which Exchange Client Access server in the CAS array OWA is connected to. It also shows the name of the mailbox server that holds the active copy of the database in which the mailbox is stored. Figure 5 shows the “About” page for a user who has an OWA session against EX01 and also has his mailbox in a database that is currently active on EX01. Figure 6 shows the “About” page for another user who has an OWA session against EX03 and his mailbox in a database that is active on EX01.

Figure 5: User connected to OWA via EX01 and Mailbox stored in database currently active on EX01

Figure 6: User connected to OWA via EX03 and Mailbox stored in database currently active on EX01

So why is this important? Well, because the end user experience is slightly different depending on which server the client is connected to when server “EX01” fails. More about this in the “Client Behaviour” section.

Alrighty, it’s time to kill server “EX01”. This can be accomplished using several different methods. However, the easiest method in my particular environment is to simply turn off the Hyper-V virtual machine. This is done by clicking the stop button in the toolbar of the virtual machine as shown in Figure 7.

Figure 7: Turning off the Hyper-V based Virtual Machine

Exchange 2010, or more specifically Active Manager, will now initiate a database failover to EX03, as the database copies on this server have an activation preference set to “2”. Remember, though, that Active Manager doesn’t only look at the activation preference, but also at the state of the content index and the copy queue and replay queue lengths of any available passive database copies. This means that one or more of the databases could potentially be activated on EX02 or EX04 in the failover datacenter.
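
The interplay between activation preference and copy health can be sketched with a deliberately simplified Python model. This is not the actual Active Manager algorithm, which considers more than this sketch does. The class, the function and the two queue thresholds are assumptions made for illustration, and so are the activation preferences given to EX02 and EX04.

```python
from dataclasses import dataclass


@dataclass
class DatabaseCopy:
    server: str
    activation_preference: int
    content_index_healthy: bool
    copy_queue_length: int
    replay_queue_length: int


def pick_copy_to_activate(copies, max_copy_queue=10, max_replay_queue=50):
    """Simplified model: filter on health first, then use the preference.

    The threshold defaults are illustrative assumptions, not values taken
    from this article.
    """
    eligible = [
        c for c in copies
        if c.content_index_healthy
        and c.copy_queue_length < max_copy_queue
        and c.replay_queue_length < max_replay_queue
    ]
    if not eligible:
        return None
    # Among acceptable copies, the lowest activation preference wins.
    return min(eligible, key=lambda c: c.activation_preference)


# Passive copies of one database after EX01 has been turned off.
# The preferences for EX02 and EX04 are hypothetical.
copies = [
    DatabaseCopy("EX03", 2, True, 0, 0),
    DatabaseCopy("EX02", 3, True, 0, 0),
    DatabaseCopy("EX04", 4, True, 0, 0),
]
print(pick_copy_to_activate(copies).server)
```

In this model the copy on EX03 is chosen as long as it passes the health checks. If, say, the content index on EX03 were not healthy, the copy on EX02 in the failover datacenter would be picked instead, which mirrors the caveat above.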

Okay, in this example the database copies on EX03 are all in a fine state, and as can be seen in Figure 8, all databases are now active on EX03. As expected, the copy status for the database copies on EX01 is now “ServiceDown” since the server is unavailable.

Figure 8: Databases activated on EX03

If we turn our attention to the load balancer solution in the primary datacenter, we can see that although the virtual services (except one) are up, only one real server (target server), EX03, is available for each virtual service (Figure 9).

Note:
Some of you might wonder why one of the virtual services is down. Well, this is because the load balancer has reverse SSL (SSL bridging) enabled. When using reverse SSL, it’s necessary to create a back-end virtual service for each real server (target server) so that the load balancer can inspect the content of the HTTPS packets. The back-end virtual service that points to EX01 therefore goes down when EX01 is turned off.

Figure 9: Current status of the virtual services on the load balancer after EX01 has been turned off

Client Behaviour

So how do the top three Exchange client types (Outlook 2007/2010, Outlook Web App and Exchange ActiveSync) behave when the server in the primary datacenter on which the databases currently are active becomes unavailable?

Outlook:

If the Outlook client has established a connection to EX03 in the CAS array, the Outlook client will stay connected and the end user will not notice anything. This is true for Outlook MAPI clients (Figure 10) as well as Outlook Anywhere (RPC over HTTP) clients (Figure 11).

Figure 10: Outlook MAPI clients stay connected

Figure 11: Outlook Anywhere clients stay connected

However, if an Outlook MAPI client has established a connection to EX01 in the CAS array, Outlook will disconnect and prompt the end user to enter his password when a new session is established to EX03 in the CAS array (Figure 12). Outlook Anywhere clients will not be prompted for credentials. They will silently fail over to server “EX03” in the CAS array.

Figure 12: End user using an Outlook MAPI client prompted for password after a failover on the client access level

Outlook Web App:

If the end user has established an OWA session to EX03 in the CAS array and a refresh in an existing OWA session occurs during the database level failover, the end user usually gets an error similar to the one shown in Figure 13.

Figure 13: Error if OWA refreshes during Database *over

Since the end user is connected to his mailbox via OWA using EX03, the loss of server “EX01” will only be seen as a database level failover. The OWA cookie won’t be lost, and as soon as the failover has completed and the browser is refreshed, the end user will get back into the current OWA session.

Figure 14: End User keeps current OWA session after a database level failover

If the end user has established an OWA session to EX01 in the CAS array, the loss of server “EX01” will be seen as a database level failover as well as a CAS array level failover. Because of the CAS array level failover, the user will lose his existing SSL session (unlike Outlook Anywhere) and a new session needs to be established against EX03. This will result in the user being taken back to the FBA logon page as shown in Figure 15.

Figure 15: End user is taken back to the FBA logon page after a CAS array level failover

Exchange ActiveSync devices:

Users with Exchange ActiveSync devices will not notice a complete server failure in the primary datacenter, regardless of whether they are connected to the CAS array via server “EX01” or “EX03”.
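
The observations in this section can be condensed into a small lookup table. The Python sketch below only restates what I saw in this particular lab; the names are made up for illustration, and as noted in the introduction, your results may differ depending on the load balancer solution used.

```python
# (client type, CAS server the client was connected to when EX01 failed)
# -> behaviour observed in this lab.
BEHAVIOUR = {
    ("Outlook MAPI", "EX03"): "stays connected",
    ("Outlook MAPI", "EX01"): "disconnects, password prompt when reconnecting to EX03",
    ("Outlook Anywhere", "EX03"): "stays connected",
    ("Outlook Anywhere", "EX01"): "silently fails over to EX03",
    ("Outlook Web App", "EX03"): "possible error on refresh during the failover, session kept",
    ("Outlook Web App", "EX01"): "session lost, back to the FBA logon page",
    ("Exchange ActiveSync", "EX03"): "nothing noticed by the user",
    ("Exchange ActiveSync", "EX01"): "nothing noticed by the user",
}


def expected_behaviour(client_type, cas_server):
    """Look up what the end user saw for a given client and CAS server."""
    return BEHAVIOUR[(client_type, cas_server)]


for (client_type, cas_server), outcome in sorted(BEHAVIOUR.items()):
    print(f"{client_type} via {cas_server}: {outcome}")
```

A table like this is handy as a checklist when you repeat the test in your own environment and want to compare results client by client.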

Now let’s get EX01 back online by turning it on again. When it’s up and running, we may need to update the database files on EX01 from one of the other DAG member servers like we also did back in part 9. However, since the disk wasn’t lost (the server was only turned off in this example), there’s a good chance the database copies will come back in a healthy state.

When the database copies have been updated, you can redistribute the active mailbox databases across EX01 and EX03 using the RedistributeActiveDatabases script I showed you back in part 7 of this multi-part article.

Figure 16: Redistributing Active Mailbox Databases across EX01 and EX03