If you would like to read the other parts of this article series please go to:
- Testing SCR in a Production Environment (Part 2)
- Testing SCR in a Production Environment (Part 3)
- Testing SCR in a Production Environment (Part 4)
Introduction
A while ago I installed and configured an infrastructure that consisted of a Cluster Continuous Replication (CCR) environment as well as a single-node standby cluster that had been configured with Standby Continuous Replication (SCR). The infrastructure was working perfectly and transaction logs were shipping from the CCR environment to the standby cluster. I then had a need to prove that the SCR standby cluster was able to take over the role of the main mailbox server. In other words, although the transaction logs were copying correctly to the SCR standby cluster, how would I really know that this standby cluster would actually work correctly when I needed it to? Of course, the only real way to find out is to test the failover process periodically. Since this was a production environment with live users, proving this failover process had to be a properly coordinated exercise. I also had to make sure that the existing CCR environment remained unaffected by the failover exercise and that all data was maintained.
In this article series I am going to detail the process that I used to perform a test of the SCR standby cluster without destroying the CCR environment. To do this I built a lab environment consisting of a number of Windows 2008 servers. In my original project the operating system in use was Windows 2003, but since Windows 2008 has been out for some time and is increasingly likely to be deployed, I decided to update the process for use on Windows 2008. The servers I constructed in my lab were:
- DCHUBCAS: This is a domain controller also running the Hub Transport and Client Access Server roles.
- CCRA and CCRB: These are the two nodes of the CCR environment.
- CLUSTER-P: This is the name of the CCR environment cluster.
- E2K7: This is the Clustered Mailbox Server (CMS) name, the name that the Outlook clients connect to.
- SCR: This is the name of the single node in the standby cluster.
- CLUSTER-S: This is the name of the standby cluster itself.
The aim of this article is to show you how to recover the CMS, called E2K7, onto the SCR standby cluster without affecting the contents of the CCR environment databases. In other words, once the CMS has been recovered to the SCR standby cluster and accessibility proven, the original CCR environment will be brought back online for the users to connect to as normal.
There are a few other points I would like to make about this configuration and how it affects the article contents:
- Before you do any tests in your production environment, you really should be doing these tests in a lab environment first. It is not difficult to construct a lab environment using virtual machines and you really do not want to be finding out any issues for the first time using your production environment. I cannot stress this point enough.
- In my example there is only a single mailbox database and a single public folder database. If you have more databases than this, you will obviously need to check each database individually.
- I am only recovering the CMS role to a standby cluster. Of course, in a real-world scenario, the separate disaster recovery site containing the SCR standby cluster will also contain other roles required for a full disaster recovery implementation, such as domain controllers, DNS servers, Hub Transport servers and Client Access Servers.
- Another important point to make is that once the CCR environment had been brought back online, I chose to re-seed the databases back to the SCR standby cluster so that it was again ready in case of a site failover requirement. This was mainly because at that time there was only a single 30GB mailbox database along with a single, small, public folder database to worry about. Your circumstances may mean that you would prefer to enable SCR back to the original CCR environment after doing the SCR failover test. I have covered this type of process in a previous article series here on MSExchange.org. Note, however, that the previous article used Windows 2003 as its operating system whereas in this article I am using Windows 2008, and there are some subtle differences between the two processes. The point is that you have options in how you want to bring back the production CCR environment.
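If you do have more than one storage group or database, a quick way to see what needs checking is to list them from the Exchange Management Shell. The following is only a minimal sketch. It assumes the lab names used in this article (E2K7 for the CMS and SCR for the standby node) and Exchange 2007 SP1 or later, which is where SCR and the StandbyMachine parameter were introduced. The variable names and the choice of output columns are my own and are not part of the original procedure.

```powershell
# Names below come from the lab in this article; replace them with your own.
$cms     = "E2K7"   # clustered mailbox server (CMS) name
$standby = "SCR"    # SCR target node in the standby cluster

# List every mailbox and public folder database hosted by the CMS,
# together with the storage group each one belongs to.
Get-MailboxDatabase -Server $cms | Format-Table Name,StorageGroup -AutoSize
Get-PublicFolderDatabase -Server $cms | Format-Table Name,StorageGroup -AutoSize

# For each storage group on the CMS, show the SCR copy status as seen
# against the standby node (copy and replay queue lengths included).
Get-StorageGroup -Server $cms | ForEach-Object {
    Get-StorageGroupCopyStatus -Identity $_.Identity -StandbyMachine $standby |
        Format-List Identity,SummaryCopyStatus,CopyQueueLength,ReplayQueueLength
}
```

Each storage group that appears in the output is one that you would need to account for during the test and any later re-seed.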
So you have decided that you want to test your SCR configuration using your production CCR environment. The first thing that you need to do is to inform your users well in advance that the system should not be used during the testing. Obviously the best time to do this test is outside of core business hours, which typically means an evening, overnight or over a weekend. It is vital that the users do not use the system as there is a chance that they will lose email messages that they send during the test.
Let me explain why this is. During the failover test that I am going to cover in this article, I am going to be recovering the CMS so that it is running on a standby cluster. The CMS will therefore be alive and well and will respond to connection attempts made to it. Whilst it is running on the standby cluster, the CMS will happily process messages via the Hub Transport server that is also running in the disaster recovery site. Once the test has finished, I am effectively going to discard the database on the standby cluster and restart the original production CCR environment.
In other words, any messages that users have sent whilst the CMS is running on the standby cluster will not be present in the databases on the production CCR environment when it is restarted. That will be confusing for the users to say the least, since they will no longer have some messages in their folders that they thought they had. So, keep users well away from the system during the test.
Enough of the talking – let us get going with the process. I have detailed each step in numerical sequence.
Step 0 – Take a Backup
I originally planned to call this step 1 but in fact I consider it to be such a fundamental process that it should occur before you take any steps at all. Before you do any other steps, take a backup of your system and verify it.
Step 1 – Disable Internet Connectivity
With the production CCR environment humming along quite nicely, the first step to take is to prevent inbound Internet email from being delivered to the user mailboxes whilst the test is performed. Therefore, at this point you should consider shutting down your SMTP gateway server, such as an Exchange 2007 Edge Transport server, which means that any inbound Internet email messages will queue on the sending SMTP server. Most SMTP servers should attempt redelivery for at least 48 hours, so you have plenty of time to perform your tests.
Step 2 – Check Redundant Machines
The next step is not strictly a required step of the overall process but it can be extremely useful in furthering your understanding of the process. This step highlights an important configuration element that you will come across later in this article series. Using the Exchange Management Shell, run the Get-MailboxServer cmdlet and inspect the contents of the RedundantMachines attribute. An example is shown in Figure 1, where you can see that I have run Get-MailboxServer against my CMS name, E2K7, and piped the results into the Format-List cmdlet. You should find that the RedundantMachines attribute is set to the names of the two nodes of the CCR environment, which in my case are CCRA and CCRB. The RedundantMachines attribute is modified by the Exchange 2007 setup.com program during the SCR site failover process and I will discuss this later on in this article series. For now, just remember the RedundantMachines attribute and what it is set to in your environment.
Figure 1: RedundantMachines Attribute
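For reference, the command behind Figure 1 can be sketched as follows. This is a minimal example that assumes the lab CMS name E2K7 used in this article, so substitute your own CMS name. Showing the Name property alongside RedundantMachines is my own choice to keep the output short.

```powershell
# Show the RedundantMachines value for the clustered mailbox server.
# E2K7 is the CMS name from the lab in this article; replace it with your own.
Get-MailboxServer -Identity E2K7 | Format-List Name,RedundantMachines
```

In the lab described here the value should list the two CCR nodes, CCRA and CCRB. It is worth making a note of the output so that you can compare it with the value after the failover steps in the later parts of this series.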
Summary
That is it for part one of this article, where we have mainly set the scene and performed some initial preparation of the existing environment ready for the SCR failover test. In part two, I will be covering the CCR shutdown process and the steps required to restore the storage groups to the SCR standby cluster.