Uncovering Exchange 2010 Database Availability Groups (DAGs) (Part 1)

A Bit of History

Let us begin with a walk down memory lane. Prior to the release of Exchange 2007, the high availability and disaster recovery features included with the Exchange Server product were quite limited. Previous versions of Exchange (Exchange 2003 and earlier) could take advantage of Microsoft Cluster Services (MSCS), but this only provided redundancy at the hardware level since the nodes shared the same storage subsystem. If the active cluster node suddenly became unavailable, the Exchange Virtual Server (EVS) and any relevant cluster resources would fail over to the passive node and the end users could then continue their work.

But you wouldn’t want the storage subsystem to fail, as it was a single point of failure. To achieve redundancy at the storage level, organizations were forced to invest in replication solutions provided by third-party software vendors and/or storage hardware vendors. Since solutions provided by third parties are not supported by Microsoft and are typically quite expensive to implement, the Exchange Product group wanted to provide better high availability and disaster recovery features natively in the Exchange Server product.

Most of us probably agree that with the release of Exchange 2007 those visions became a reality! Exchange 2007 gave us a whole slew of brand new high availability and disaster recovery features, such as Local Continuous Replication (LCR), which targeted small organizations, and Cluster Continuous Replication (CCR), which targeted medium and large organizations. Later on (with Exchange 2007 SP1) came Standby Continuous Replication (SCR), targeted at organizations of pretty much all sizes. All three features used a new asynchronous replication technology, which worked by shipping log files to a passive copy of a storage group, inspecting them, and then replaying them into that passive copy.

Although LCR provided redundancy at the storage level, the feature never really got much attention. The reason was that the storage group copies had to be stored on a volume local to the Mailbox server, so the server itself remained a single point of failure at the hardware level. CCR, on the other hand, has been a huge success ever since Exchange 2007 was released. The interesting thing about CCR was that it combined the new asynchronous replication technology introduced by Exchange 2007 with Windows Failover Clustering technology, thereby providing redundancy at the hardware level as well as at the storage level. The result was a true high availability solution without any single point of failure.

CCR cluster nodes could be located in separate datacenters in order to provide site-level redundancy, but since CCR was not developed with site resiliency in mind, there were too many complexities involved with a multi-site CCR cluster solution (for details on multi-site CCR cluster deployment take a look at a previous article series of mine). This made the Exchange Product group think about how they could provide a built-in feature geared towards offering site resilience functionality with Exchange 2007.

When Exchange 2007 SP1 was released, we got exactly that: a feature called Standby Continuous Replication (SCR), which made it possible to ship log files to another Exchange 2007 Mailbox server. Because SCR did not require Windows Failover Clustering, the log files could be shipped from both clustered and non-clustered Mailbox servers (the SCR source) to clustered and non-clustered Mailbox servers (the SCR target). What was really interesting with SCR was that you could specify a log replay lag time of up to 7 days, which made it possible to fix most database/store related issues before they hit the SCR target located in another datacenter.

Note:
Exchange 2007 Service Pack 2 is mainly a service pack that prepares an existing Exchange 2007 organization for deployment of Exchange 2010, so we did not see any additional changes or improvements to high availability or site resilience functionality in this service pack.

With features such as CCR and SCR already at our disposal, one would think we wouldn’t need any major improvements or changes to the high availability and site resilience story in the latest and greatest version of Exchange Server to date, Exchange 2010, right?

Well, the Exchange Product Group has been busy developing Exchange 2010 over the last couple of years, and a significant part of that time has been spent on improving the native high availability and site resilience features.

High Availability Changes in Exchange Server 2010?

With Exchange 2010, we no longer have the concept of Local Continuous Replication (LCR), Single Copy Clusters (SCC), Cluster Continuous Replication (CCR) or Standby Continuous Replication (SCR) for that matter. WHAT!? I hear some of you yell! Yes, I am not kidding here. But to be more specific, only LCR and SCC have been removed from the Exchange Server product. CCR and SCR have been combined and have evolved into a more unified high availability framework in which the new Database Availability Group (DAG) acts as the base component. This means that no matter whether you are going to deploy a local or a site-level highly available or disaster recoverable solution, you use a DAG. To make myself clear: with Exchange 2010, your one and only method of protecting mailbox databases is a DAG.

Figure 1: Mailbox databases protected by DAG

The primary component in a DAG is a new component called Active Manager. Instead of the Exchange cluster resource DLL (exres.dll) and the associated cluster service resources that were required when clustering Exchange 2007 and previous versions of Exchange, Exchange 2010 relies on Active Manager to manage switch-overs and fail-overs (*-overs) between the Mailbox servers that are part of a DAG. Active Manager runs on all Mailbox servers in a given DAG. There are two Active Manager roles: the Primary Active Manager (PAM) and the Standby Active Manager (SAM). For a detailed explanation of the PAM and the SAM roles, please see the relevant Exchange 2010 Online documentation over at Microsoft TechNet.

So what’s interesting about a DAG then? Well, there are many things; the most notable are listed below:

Limited dependency on Windows Failover Clustering – A DAG only uses a limited set of the clustering features provided by the Windows Failover Clustering component: the cluster database, heartbeat, and file share witness functionality. With Exchange 2007 (and earlier versions), Exchange was an application operated by a Windows Failover Cluster. This is no longer the case with Exchange 2010. The Exchange cluster resource DLL (exres.dll) and all the cluster resources it created when it was registered have been removed from the Exchange 2010 code.

Figure 2: DAG still relies on cluster database, heartbeat and replication of the Windows Failover Cluster component

Figure 3: No Exchange Cluster Resources in the Windows Failover Cluster

Incremental deployment – Because DAGs still use some of the WFC components such as the cluster database, heartbeat and file share witness functionality, Windows Server 2008 SP2 or R2 Enterprise edition is required in order to be able to configure Exchange 2010 Mailbox servers in a DAG. But Exchange 2010 supports an incremental deployment approach meaning that you don’t need to form a cluster prior to installing Exchange 2010. You can install the Exchange 2010 Mailbox servers, and then create a DAG and add the servers and any databases to the DAG when needed.

Co-existence with other Exchange roles – With CCR you could not install other Exchange Server roles on the mailbox servers (cluster nodes) that were protected using CCR. With DAG, a Mailbox server that is part of a DAG can have other Exchange roles installed. This is especially beneficial for small organizations: because a DAG-protected Mailbox server can co-exist with other Exchange roles, you can have a fully redundant Exchange 2010 solution with only two machines dedicated as Exchange servers. Of course, you need to configure a file share witness, but this could be any server in your environment. The file share witness does not need to run the same Windows version as the Exchange 2010 servers. It just needs to run Windows Server 2003 or later. Another thing you should bear in mind is that if you go down the path of using two Exchange 2010 servers and you want a fully redundant solution, you must use an external hardware or software based load balancing solution for Client Access services.

Managed 100% via Exchange tools – With CCR in Exchange 2007, you had to configure and manage CCR clusters using a combination of Exchange and cluster management tools. With DAGs in Exchange 2010, you no longer need to use cluster management tools for either the initial configuration or management. You can manage DAGs fully using the Exchange Management tools. This means that the Exchange administrators within an enterprise no longer need to have cluster experience (although this still could be a good idea).

Figure 4: DAG, replication networks, and FSW settings etc. managed via Exchange tools

Replication at the database level – To support the new DAG feature, databases in Exchange 2010 have been moved to the organizational level instead of the server level, where they existed in Exchange 2007 and earlier versions. This also means that Exchange 2010 no longer has the concept of storage groups. Now there are simply databases, each with its own associated log stream. One drawback of CCR was that if just one database failed on the active node, all active databases on the clustered mailbox server (CMS) were failed over to the passive CCR node. This meant that every user with a mailbox on that CMS was affected. With replication handled per database in a DAG, a *-over can be limited to the database that failed.

Figure 5: Databases on the organization level
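To make the difference in failover scope concrete, here is a small Python sketch. It is a toy model and not Exchange code: the database names and both functions are made up for illustration, and it assumes that a DAG activates only the failed database on another member, which is what replication at the database level implies.

```python
def ccr_failover(active_databases, failed_db):
    """CCR-style scope: one failed database causes every active
    database on the clustered mailbox server to fail over."""
    if failed_db not in active_databases:
        raise ValueError(f"{failed_db} is not active on this server")
    return set(active_databases)


def dag_failover(active_databases, failed_db):
    """DAG-style scope (assumed model): only the failed database
    is activated on another DAG member."""
    if failed_db not in active_databases:
        raise ValueError(f"{failed_db} is not active on this server")
    return {failed_db}


# Hypothetical databases that are active on one Mailbox server.
databases = ["MDB01", "MDB02", "MDB03"]

print(sorted(ccr_failover(databases, "MDB02")))  # ['MDB01', 'MDB02', 'MDB03']
print(sorted(dag_failover(databases, "MDB02")))  # ['MDB02']
```

The returned set is the group of databases whose users would notice the event, which is why the per-database model matters on servers that host many databases.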

Support for up to 16 members in each DAG – Now that you can add up to 16 Mailbox servers to a DAG and potentially have 16 copies of each mailbox database, Exchange 2010 had to support a larger number of mailbox databases than Exchange 2007 did. So the maximum has been upped from 50 to 100 mailbox databases per server in the Exchange 2010 Enterprise edition. However, the Standard edition still only supports up to 5 databases per Mailbox server.
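The limits above lend themselves to a quick sanity check when sketching out a design on paper. The following Python snippet is only an illustration of that idea: the function name and its parameters are hypothetical, and it assumes the per-server database limit counts every database hosted on a server.

```python
# Limits as stated in this article.
MAX_DAG_MEMBERS = 16
MAX_DATABASES_PER_SERVER = {"Enterprise": 100, "Standard": 5}


def check_dag_design(member_count, edition, databases_per_server):
    """Return a list of limit violations; an empty list means the
    design stays within the limits quoted above."""
    if edition not in MAX_DATABASES_PER_SERVER:
        raise ValueError(f"unknown edition: {edition!r}")

    problems = []
    if member_count > MAX_DAG_MEMBERS:
        problems.append(
            f"{member_count} members exceeds the limit of {MAX_DAG_MEMBERS}"
        )

    limit = MAX_DATABASES_PER_SERVER[edition]
    if databases_per_server > limit:
        problems.append(
            f"{databases_per_server} databases per server exceeds "
            f"the {edition} limit of {limit}"
        )
    return problems


# Hypothetical design: 4 members, Standard edition, 8 databases each.
for problem in check_dag_design(4, "Standard", 8):
    print(problem)
```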

Switch/Fail-overs much quicker than in the past – Because of the improvements made with Exchange 2010 DAG, we will now experience much quicker switch-overs and fail-overs (*-overs) between mailbox database copies. They will typically occur in under 30 seconds, compared to several minutes with CCR in Exchange 2007. In addition, because Outlook MAPI clients now connect to the RPC Client Access service on the Client Access servers, end users will rarely notice that a *-over occurred. You can read more about the RPC Client Access service in a previous article series of mine here.

Go backup-less with 3+ DB copies – When you have three or more copies of a mailbox database, it becomes feasible to go backup-less. This means that you basically enable circular logging on all mailbox databases protected by the DAG and no longer perform backups as we know them. Of course, this requires enterprise organizations to change their mindset about how mailbox databases should be protected.

Support for DAG members in separate AD sites – Unlike CCR cluster nodes, you can have DAG member servers located in different Active Directory sites. This should be a welcome addition to those of you who do not have the option of using the same AD site across physical locations (datacenters). It should be noted though, that you cannot place Mailbox servers protected by the same DAG in different domains within your Active Directory forest.

Log shipping via TCP – In Exchange 2007, the Microsoft Exchange Replication Service copied log files to the passive database copy (LCR), passive cluster node (CCR) or SCR target over Server Message Block (SMB), which meant you needed to open port 445 in any firewall between the CCR cluster nodes (typically when deploying multisite CCR clusters) and/or SCR source and targets. Those of you who work for or with a large enterprise organization know that convincing network administrators to open port 445/TCP between two datacenters is far from a trivial exercise. With Exchange 2010 DAG, the asynchronous replication technology no longer relies on SMB. Exchange 2010 uses TCP/IP for log file copying and seeding and, even better, it provides the option of specifying which port you want to use for log file replication. By default, DAG uses port 64327, but you can specify another port if required.
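Since replication now runs over a single configurable TCP port, a plain TCP connection test is enough to verify that a firewall between two datacenters lets the traffic through. The Python sketch below shows one way to do that from any machine. The function name is made up, and the check only proves that the port accepts connections, not that the replication service is healthy.

```python
import socket

# Default DAG replication port mentioned above.
DEFAULT_DAG_REPLICATION_PORT = 64327


def replication_port_reachable(host, port=DEFAULT_DAG_REPLICATION_PORT,
                               timeout=3.0):
    """Return True if a TCP connection to host:port can be opened
    within the timeout, otherwise False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False
```

A call such as `replication_port_reachable("mbx02.example.com")`, where the host name is a placeholder for a DAG member in the other datacenter, returns False when a firewall drops or rejects the connection. Pass a different `port` if the DAG has been configured to use a non-default one.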

Log file compression – With Exchange 2010 DAGs you can enable compression for seeding and replication over one or more networks in a DAG. This is a property of the DAG itself, not of a DAG network. The default setting is InterSubnetOnly, and the available settings are the same as those of the network encryption property.

Log file encryption – Exchange 2010 DAG supports the use of encryption, whereas log files in Exchange 2007 are copied over an unencrypted channel (unless IPsec has been configured). More specifically, DAG leverages the encryption capabilities of Windows Server 2008; that is, DAG uses Kerberos authentication between the Mailbox servers that are members of the respective DAG. Network encryption is a property of the DAG itself, not of a DAG network. Settings for a DAG’s network encryption property are: Disabled (network encryption not in use), Enabled (network encryption enabled for seeding and replication on all networks existing in a DAG), InterSubnetOnly (the default setting, meaning network encryption is in use on DAG networks across subnets), and SeedOnly (network encryption in use for seeding on all networks in a DAG).
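The four settings shared by the compression and encryption properties can be read as a small decision table. The Python sketch below encodes the descriptions given above. The enum, the function, and the `crosses_subnet` flag are illustrative names only and do not correspond to any Exchange API.

```python
from enum import Enum


class DagNetworkSetting(Enum):
    DISABLED = "Disabled"
    ENABLED = "Enabled"
    INTER_SUBNET_ONLY = "InterSubnetOnly"  # the default
    SEED_ONLY = "SeedOnly"


def setting_applies(setting, operation, crosses_subnet):
    """Decide whether compression or encryption is used.

    operation      -- "seeding" or "replication"
    crosses_subnet -- True if source and target sit on different subnets
    """
    if operation not in ("seeding", "replication"):
        raise ValueError(f"unknown operation: {operation!r}")

    if setting is DagNetworkSetting.DISABLED:
        return False
    if setting is DagNetworkSetting.ENABLED:
        return True
    if setting is DagNetworkSetting.INTER_SUBNET_ONLY:
        return crosses_subnet
    if setting is DagNetworkSetting.SEED_ONLY:
        return operation == "seeding"
    raise ValueError(f"unknown setting: {setting!r}")


# With the default, log replication inside one subnet is left alone.
print(setting_applies(DagNetworkSetting.INTER_SUBNET_ONLY,
                      "replication", crosses_subnet=False))  # False
```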

Up to 14 day lagged copies – With Standby Continuous Replication, which was included in Exchange 2007 SP1, the concept of lagged database copies was introduced. With this feature we could delay the point at which log files that had been copied to the SCR target were replayed into the databases on the SCR target. We also had the option of specifying a so-called truncation lag time, which allowed us to delay the point at which log files that had been copied to the SCR target and replayed into the copy of the database were truncated. With both options we could specify a lag time of up to 7 days. With Exchange 2010 DAG, we can now specify a replay lag time and a truncation lag time of up to 14 days each, which is extra interesting when you choose to go backup-less.
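The interplay of the two lag times is easier to see with actual dates. The following Python sketch is a simplified model, not Exchange logic: the function names are hypothetical, it applies the 14-day ceiling to both the replay lag and the truncation lag, and it assumes the truncation lag starts counting once a log file has been replayed.

```python
from datetime import datetime, timedelta

MAX_LAG = timedelta(days=14)  # ceiling discussed above


def validate_lag(lag):
    """Reject negative lags and lags above the 14-day ceiling."""
    if lag < timedelta(0) or lag > MAX_LAG:
        raise ValueError(f"lag must be between 0 and 14 days, got {lag}")
    return lag


def lagged_copy_schedule(copied_at, replay_lag, truncation_lag):
    """Return the earliest replay time and the earliest truncation time
    for a log file that reached the passive copy at copied_at."""
    validate_lag(replay_lag)
    validate_lag(truncation_lag)
    replay_at = copied_at + replay_lag
    # Assumption: truncation lag is measured from the replay time.
    truncate_at = replay_at + truncation_lag
    return replay_at, truncate_at


copied = datetime(2010, 1, 1, 12, 0)
replay_at, truncate_at = lagged_copy_schedule(
    copied, timedelta(days=7), timedelta(days=14)
)
print(replay_at)    # 2010-01-08 12:00:00
print(truncate_at)  # 2010-01-22 12:00:00
```

In this model a log file copied on 1 January stays available on the lagged copy for three weeks, which illustrates why long lag times are attractive when going backup-less.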

Seeding from a DB copy – Unlike CCR in Exchange 2007, we can now perform a seed by specifying a database copy as the source database. This means that a new seed or a re-seed of an existing mailbox database no longer has any impact on the active database copy.

Figure 6: Seeding from a selective source server

Public folder databases not protected by DAG – Unlike CCR in Exchange 2007, we cannot protect public folder databases using DAG. Public folder databases must be protected using traditional public folder replication mechanisms. The positive thing about this is that we no longer are limited to only one public folder store in the Exchange organization, if it is stored on a DAG member server. With CCR we could only have one public folder store, when it was protected by CCR.

Improved Transport Dumpster – The transport dumpster we know from Exchange 2007 has also been improved, so that messages are re-delivered even when a lossy database failover occurs between database copies stored in different Active Directory sites. In addition, once a message has been replicated to all database copies, it is deleted from the transport dumpster.

And with that, part one of this multi-part article ends. In the next part, we will begin deploying a DAG in our lab environment and look at the various DAG-specific settings and so forth. Until then, have a nice one.
