Exchange Server disaster recovery: Step-by-step planning

Sponsored by Stellar Data Recovery

Exchange Server administrators have to deal with various downtime situations, ranging from planned outages for maintenance and upgrade to unplanned downtime due to the failure of a disk, server, database, or network.

Further, situations involving large-scale disruption of critical Exchange services across an entire site or data center can lead to prolonged outages with no defined SLA for restoration. Such crisis-like situations are placed under the “Exchange Disaster” category and are amongst the biggest worries of any Exchange Server administrator.

Thankfully, several provisions address such disasters: built-in Exchange Server features and configuration options, backup and recovery solutions, and third-party tools. However, whether these provisions meet your Service Level Agreements (SLAs) depends on diligent planning and on testing for “on the ground” responsiveness and feasibility.

This article presents step-by-step planning guidance for Exchange Server disaster recovery. Before you begin, be aware that any disaster recovery program focuses on restoring the “critical services.” Start by identifying the critical Exchange Server services, their dependencies, and the data your organization needs in order to restore essential services.

Here are the steps to create the Exchange Server disaster recovery plan:

Step 1: Define the success metrics for disaster recovery

Before planning the recovery paths and mechanisms for restoring critical Exchange services, you need to define the success metrics for the exercise. Doing so helps you restore service within the agreed timeline (SLA) and to a database state that meets your organization’s standards:

A) Recovery Point Objective (RPO):

This is the “point in time” before the disaster event to which the server state and operations can be restored. For example, if you take a daily full backup at 1:00 a.m. and the failure happens at 5:00 a.m., the RPO is 1:00 a.m. To be successful, you should be able to restore the Exchange database and other critical services to their state at 1:00 a.m., which means the four hours of changes made since then are lost. The RPO thus establishes the maximum amount of data your organization can afford to lose in a disaster, and it provides essential input for defining the “backup frequency and schedule.” Meeting a tight RPO may also involve other mechanisms, such as high availability, to prevent data loss or shorten the data loss period.
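As a rough illustration of how the RPO relates to the backup schedule, the sketch below computes the data-loss window for the example above and checks whether a fixed backup interval can satisfy a target RPO. The function names and the calendar date are assumptions made for illustration; they are not part of any Exchange tooling.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Return how much work is lost if the latest restorable state is last_backup."""
    if failure < last_backup:
        raise ValueError("failure time precedes the last backup")
    return failure - last_backup


def schedule_meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """In the worst case a failure occurs just before the next backup,
    so the maximum loss equals the full backup interval."""
    return backup_interval <= rpo


# Example from the text: daily full backup at 1:00 a.m., failure at 5:00 a.m.
# The date itself is arbitrary.
last_backup = datetime(2024, 1, 10, 1, 0)
failure = datetime(2024, 1, 10, 5, 0)
print(data_loss_window(last_backup, failure))  # 4:00:00
```

If the organization could only tolerate one hour of lost mail, `schedule_meets_rpo(timedelta(hours=24), timedelta(hours=1))` would return `False`, signaling that a daily backup alone is not enough.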

B) Recovery Time Objective (RTO):

This is the “maximum time allowed” to restore the Exchange services and mailbox database after a disaster event. The RTO is a vital component of the restoration SLA and provides useful input for defining the backup and restore processes. It is expressed in units of time (seconds, minutes, hours, or days). The RTO also helps determine the “backup size” you should plan for against the SLA. For example, if it takes approximately 1.5 hours to back up and restore 300GB of data and the SLA is six hours, your backup size should be less than 1.2TB (6 ÷ 1.5 = 4, and 4 × 300GB = 1.2TB).
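The sizing arithmetic in the RTO example can be sketched as follows. This is a minimal calculation that assumes backup-and-restore time scales linearly with data size, which real environments may not honor; the function name is hypothetical.

```python
def max_backup_size_gb(sample_size_gb: float, sample_hours: float, rto_hours: float) -> float:
    """Scale a measured backup-and-restore rate linearly to the RTO window."""
    if sample_hours <= 0 or rto_hours <= 0:
        raise ValueError("durations must be positive")
    throughput_gb_per_hour = sample_size_gb / sample_hours
    return throughput_gb_per_hour * rto_hours


# Figures from the text: 300GB takes 1.5 hours, and the SLA is six hours.
limit = max_backup_size_gb(300, 1.5, 6)
print(f"{limit:.0f} GB")  # 1200 GB, i.e. 1.2TB
```

Measuring the sample rate during an actual restore test, rather than estimating it, keeps the resulting limit honest.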

Step 2: Choose the Exchange backup strategy

This step focuses on choosing the type of backup for an on-premises Exchange Server. The goal is to preserve one or multiple point-in-time copies of the on-premises Exchange mailbox database and services to allow restoration in case of a failure event.

The RPO and RTO specified in Step 1 provide the essential input for choosing a suitable backup strategy. As a rule of thumb, the “time to restore” is approximately twice the “time to back up,” because the restore duration also includes the time to notice the failure event, troubleshoot the issue, locate and initialize the backup media, restore the backup, and replay the transaction logs.

The following traditional backup and restore options are available for Exchange:

A) Full backup copies the entire target database and its transaction log files. After completing a full backup, Exchange Server truncates the old log files of the database. Full backups take the longest and consume the most space of the available options.

B) Copy backup is the same as a full backup, except it does not truncate the transaction logs after copying the database.

C) Incremental backup copies only the specific changes in the transaction logs since the last full backup or incremental backup. The purpose of an incremental backup is to minimize the “frequency” of full backups and the “space” consumed by backup data.

D) Differential backup also captures a partial dataset, like an incremental backup. However, it copies the changes since the last full backup, regardless of any incremental or differential backups taken in between. A differential backup aims to minimize the “number of recovery operations” needed to restore a database. It occupies more space than an incremental backup but can be restored faster. It doesn’t truncate the saved transaction logs.
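To make the difference in “number of recovery operations” concrete, the sketch below works out which backup sets must be restored for a given backup history. It is a simplified model based only on the descriptions above: it assumes a strategy uses either incremental or differential backups after a full backup, not a mix, and the labels and function name are hypothetical.

```python
from typing import List, Tuple

# (label, kind): kind is "full", "incremental", or "differential"
Backup = Tuple[str, str]


def restore_chain(history: List[Backup]) -> List[str]:
    """Return labels of the backup sets to restore, oldest first.

    history must be ordered from oldest to newest.
    """
    full_positions = [i for i, (_, kind) in enumerate(history) if kind == "full"]
    if not full_positions:
        raise ValueError("history contains no full backup")
    start = full_positions[-1]  # only the most recent full backup matters
    full_label = history[start][0]
    later = history[start + 1:]
    kinds = {kind for _, kind in later}

    if not later:
        return [full_label]
    if kinds == {"incremental"}:
        # Every incremental since the full backup is needed, in order.
        return [full_label] + [label for label, _ in later]
    if kinds == {"differential"}:
        # Only the newest differential is needed; it covers all changes since the full.
        return [full_label, later[-1][0]]
    raise ValueError("this sketch does not model mixed or unknown backup kinds")


incremental_week = [("sun", "full"), ("mon", "incremental"),
                    ("tue", "incremental"), ("wed", "incremental")]
differential_week = [("sun", "full"), ("mon", "differential"),
                     ("tue", "differential"), ("wed", "differential")]
print(restore_chain(incremental_week))   # ['sun', 'mon', 'tue', 'wed']
print(restore_chain(differential_week))  # ['sun', 'wed']
```

The incremental schedule needs four restore operations by Wednesday, while the differential schedule needs two, which is the trade-off against the extra space each differential consumes.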

Comparison table for the Exchange database backup types

Backup Type	Backup Time	Backup Space	Backup Frequency	Restore Time	Reliability
Full Backup	High	High	Need-basis	Low	High
Incremental Backup	Low	Low	Low	High	Low
Differential Backup	Intermediate	Intermediate	Depends on full backup	Low	Medium

Important note on transaction log truncation and circular logging

Exchange Server truncates the log files only after a full or incremental backup completes. As a result, the transaction log files can keep growing between backups, consuming disk space and increasing the chance of log file corruption. However, having the full database backup together with the latest log files offers high recoverability in a disaster, provided the backup is restorable.

The circular logging feature in the Exchange Extensible Storage Engine (ESE) limits the growth of log files: once the transactions in a log file have been committed to the database, the file is automatically purged from disk. This saves disk space by reducing the number of saved log files. However, because the latest transaction log files are no longer available, it limits database recoverability to the last full backup. It is therefore better to keep circular logging disabled, especially if you are not replicating the mailbox databases in a Database Availability Group (DAG), to facilitate mailbox restoration as part of your Exchange Server disaster recovery plan.
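The truncation behavior described in this note can be illustrated with a toy model. The sketch below only counts log files and deliberately simplifies circular logging to “committed logs do not accumulate”; the event format and function name are assumptions for illustration and do not reflect how ESE is implemented.

```python
TRUNCATING_BACKUPS = {"full", "incremental"}
NON_TRUNCATING_BACKUPS = {"copy", "differential"}


def logs_on_disk(events, circular_logging=False):
    """Count transaction log files left on disk after a sequence of events.

    events is a list of tuples:
      ("write", n)      n new log files were generated
      ("backup", kind)  a backup of the given kind completed
    """
    on_disk = 0
    for action, value in events:
        if action == "write":
            # Simplification: with circular logging, committed logs are purged,
            # so nothing accumulates between backups.
            on_disk = 0 if circular_logging else on_disk + value
        elif action == "backup":
            if value in TRUNCATING_BACKUPS:
                on_disk = 0
            elif value not in NON_TRUNCATING_BACKUPS:
                raise ValueError(f"unknown backup kind: {value}")
        else:
            raise ValueError(f"unknown event: {action}")
    return on_disk


week = [("write", 100), ("backup", "differential"), ("write", 50)]
print(logs_on_disk(week))                         # 150
print(logs_on_disk(week, circular_logging=True))  # 0
```

The second result shows why circular logging saves space, and also why nothing is left to replay beyond the last full backup.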

Nevertheless, restoring the Exchange mailbox database can be challenging even when a backup is available. Database corruption, missing recent log files, or an old or corrupted backup can all prevent a successful recovery.

Having a third-party Exchange database recovery tool such as Stellar Toolkit for Exchange can be extremely useful. The software repairs corrupt EDB files and recovers the database, including individual mail items with original integrity. It also extracts mailboxes from corrupt backup files and VHDX files. Further, it allows Exchange database recovery without needing the transaction log files. Professional Exchange database recovery software can help you design a failsafe Exchange Server disaster recovery strategy.

Step 3: Choose the backup destination

There are two main destinations for backup data: local storage and the cloud. Here is an overview of these Exchange backup options:

A) Local backup: This option copies the backup data to local storage media such as a tape drive, hard drive, storage server, or network-attached storage (NAS). The pros and cons of local backup are as follows:

Pros:

Better control over data governance
No or limited third-party dependencies
No need for Internet connectivity to back up and restore

Cons:

Upfront investment in storage hardware
Low scalability to serve quick changes in storage needs
High vulnerability to natural or manmade disasters
Possibility of storage media and data corruption
Responsibility for securing the backups and arranging off-site storage

B) Cloud backup: The cloud option creates copies of the backup data on a remote cloud server managed by a third party that is responsible for the overall upkeep of the data. Major cloud storage and backup solutions include Microsoft Azure and Amazon S3. The pros and cons of cloud backup are as follows:

Pros:

Native availability of technology such as compression, encryption, virtualization, etc.
Global accessibility via the Internet
Scalability to meet storage needs on the go
Managed service and support

Cons:

Need to trust a third party with business data
Dependency on Internet connectivity and bandwidth, which determine backup and restore times
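The connectivity dependency listed above can be estimated before committing to a cloud destination. The sketch below is a back-of-the-envelope calculation; the 100 Mbps link speed and the 80 percent efficiency factor are assumed values for illustration, and the function name is hypothetical.

```python
def transfer_hours(size_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move size_gb over a link of link_mbps.

    efficiency is an assumed fraction of nominal bandwidth actually achieved.
    Sizes use decimal units (1GB = 1000MB).
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency within (0, 1]")
    size_megabits = size_gb * 1000 * 8
    seconds = size_megabits / (link_mbps * efficiency)
    return seconds / 3600


# Restoring a 300GB backup over an assumed 100 Mbps link
print(round(transfer_hours(300, 100), 1))  # 8.3
```

Comparing the result with the RTO shows whether a cloud-only restore is feasible or whether a local copy is also needed.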

Step 4: Back up the critical components

You should consider the following critical components within the scope of your backup strategy for Exchange disaster recovery:

A) Active Directory (AD):

Exchange Server depends upon Active Directory to organize, store, authenticate, and serve the information of all the network objects to users and administrators. Active Directory stores the configuration settings for Mailbox servers and Client Access services. A comprehensive backup of Active Directory is crucial for Exchange Server disaster recovery.

Note: An Active Directory backup must be performed online, on a server where Active Directory Domain Services is installed. Restoration of the AD server, by contrast, must be performed offline.

B) System state:

This component involves backing up the server operating system, registry, IIS database, cluster server configuration (if a DAG is deployed), and local security settings to allow a fast restoration with applicable customizations.

C) File system:

Backing up the file system preserves crucial log files, configuration data, and other vital information, such as memory usage thresholds, needed at the time of system restoration.

D) Mailbox database:

This component comprises the mailbox items, including emails, attachments, calendar, tasks, etc., essential for business communication.

Step 5: Address the critical dependencies

Your Exchange Server disaster recovery plan should also consider the dependencies on Exchange Server core components, network, and any third-party utilities, as follows:

A) Network dependencies:

Evaluate the dependencies on subnets, IP addresses, domain name system, DHCP services, switch configurations, router settings, etc.

B) Active Directory services:

Undertake a detailed assessment of AD services to ensure that all the Exchange services dependent on Active Directory can function in a disaster recovery scenario. For example, check the dependencies for Exchange replication services.

C) Third-Party applications:

Consider the dependencies that any third-party apps installed for Exchange monitoring, backup, archival, and similar purposes have on the messaging services. Ensure that these dependencies are duly addressed in your Exchange disaster recovery plan.
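One way to turn the dependencies identified in this step into an actionable recovery sequence is a topological sort, sketched below with Python's standard `graphlib` module (Python 3.9 or later). The dependency map is a hypothetical example; your own map should come from the assessment described above.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each component lists what must be running first.
DEPENDENCIES = {
    "network": set(),
    "dns": {"network"},
    "active_directory": {"dns"},
    "exchange_mailbox": {"active_directory"},
    "monitoring_agent": {"exchange_mailbox"},
}


def recovery_order(dependencies):
    """Return components in an order where every prerequisite comes first.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


print(recovery_order(DEPENDENCIES))
# ['network', 'dns', 'active_directory', 'exchange_mailbox', 'monitoring_agent']
```

A circular dependency raises an error at planning time, which is a far better moment to discover it than during an actual recovery.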

Step 6: Document the disaster recovery plan

After determining the components of your Exchange Server disaster recovery plan, the next step is to document all the contingencies and map them to a definite response. The following are the elements of a disaster recovery plan document:

A) Roles and responsibilities:

The plan should designate specific people with defined roles in disaster recovery. It should document the contact details of the designated people.

B) Schematics:

The document should have a graphical layout of the Exchange system to facilitate clear and efficient communication.

C) Physical assets and configuration:

It should document the vital details such as server name, IP configuration, disk configuration, hardware specifications, etc., to facilitate efficient recovery after a disaster.

D) RTO and RPO:

The plan should document the agreed values for RTO and RPO. As outlined earlier in this article, the Recovery Time Objective (RTO) is the maximum time allowed to restore Exchange services after a failure event, and the Recovery Point Objective (RPO) is the point in time to which a system can be restored after failure.

E) Dependencies:

The document should have a section that captures the dependencies on the network, AD services, and third-party apps configured with Exchange Server. This information helps the disaster response team anticipate problems instead of being surprised by them.

F) Response mapping:

The plan should document a specific response and ensuing steps against every contingency. This will help in quick decision making and mobilizing the action to remediate the problem. The plan’s responsiveness can be increased by using a grid representation of all contingencies mapped to the response and categorizing them based on the severity levels.
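The grid representation mentioned above could be kept in a simple structure like the one sketched below. Every contingency, owner, and action shown is a made-up placeholder, and the severity scale (1 is most severe) is an assumption; the point is only to show contingencies mapped to responses and ordered by severity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    severity: int  # assumed scale: 1 is the most severe
    owner: str
    action: str


# Placeholder entries for illustration only.
RESPONSE_GRID = {
    "single disk failure": Response(3, "storage admin", "replace the disk and rebuild the volume"),
    "mailbox database corruption": Response(2, "Exchange admin", "restore the latest backup and replay logs"),
    "site-wide outage": Response(1, "DR lead", "activate the recovery site"),
}


def by_severity(grid):
    """Return (contingency, response) pairs, most severe first."""
    return sorted(grid.items(), key=lambda item: item[1].severity)


for contingency, response in by_severity(RESPONSE_GRID):
    print(f"S{response.severity} | {contingency} | {response.owner} | {response.action}")
```

Keeping the grid as data rather than free text makes it easy to print, review, and update after each recovery test.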

Step 7: Test the recovery plan

A plan is no good if it cannot perform as expected when an actual need arises. So, real-world testing of the Exchange disaster recovery plan is crucial to ensure its effectiveness and efficiency. The exercise can also help identify any gaps in planning and surface new or unaccounted points of failure.

For example, testing can evaluate the restorability of the backups in terms of two critical aspects: whether you can restore the backups at all, and how long the restoration actually takes. Testing can also identify the root cause of an issue, such as backup file corruption or problems with backup media such as tape.

The best approach to test the recovery plan is by simulating the disaster through the actual staging of failure events documented in the plan. The simulation, depending upon your specific Exchange Server deployment setup, would help determine the efficacy and gaps in the disaster recovery plan.

Run such tests on a regular schedule, for example every six months or once a year, to confirm that the plan still works as expected.
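The outcome of each test can be recorded and compared with the RTO from Step 1. The sketch below shows one possible way to do that; the record fields, scenario names, and result strings are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrillResult:
    scenario: str
    restored: bool       # did the backup restore successfully?
    duration: timedelta  # measured time from failure to restored service


def evaluate(result: DrillResult, rto: timedelta) -> str:
    """Classify a recovery drill against the Recovery Time Objective."""
    if not result.restored:
        return "FAIL: backup could not be restored"
    if result.duration > rto:
        return "FAIL: restore exceeded the RTO"
    return "PASS"


rto = timedelta(hours=6)
drills = [
    DrillResult("database restore from tape", True, timedelta(hours=7)),
    DrillResult("database restore from disk", True, timedelta(hours=4)),
]
for drill in drills:
    print(drill.scenario, "->", evaluate(drill, rto))
```

A failed drill is useful input: it points either to a backup that needs fixing or to an RTO that needs renegotiating.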

Exchange Server disaster recovery: Best practices

The following are some of the best practices to consider in a disaster recovery plan:

Use a diligent, traceable change management process, backed by correct and up-to-date documentation, to track and reproduce the changes made in local web.config files, custom registry settings, and other configuration files. This practice will help you reapply the customized settings after restoring the system.
Consider deploying a multi-site Database Availability Group (DAG) for enabling a highly available, resilient infrastructure to reduce data loss and downtime.
Create a lagged mailbox database copy in the DAG to allow restoration to a previous point in time if the existing replicated copies become corrupted. A lagged database copy can facilitate faster recovery than restoring from a backup.
Disable circular logging for mailbox databases not replicated in a DAG. This step will facilitate backing up all the transaction logs before truncation.
Consider enabling circular logging temporarily for tasks such as mailbox migration that generate a large number of logs, to avoid exhausting the disk space, and disable it again once the task is complete.
Define the maximum mailbox database backup size based on the restoration time allowed by the RTO. Generally, the upper size limit for a mailbox database that undergoes backup and restore is 200GB.
Allow a margin of approximately 15 percent for deleted items when calculating the upper size limit of the mailbox database for backup and restoration.
Store your transaction log files on a separate RAID volume for redundancy.
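The sizing guidance in the list above can be checked with a short calculation. The sketch below assumes that the 15 percent margin is added on top of the live mailbox data, which is one reasonable reading of the guidance rather than a documented formula; the function names are hypothetical.

```python
def projected_database_size_gb(mailbox_data_gb: float, deleted_items_margin: float = 0.15) -> float:
    """Add the deleted-items margin on top of the live mailbox data."""
    return mailbox_data_gb * (1 + deleted_items_margin)


def fits_backup_limit(mailbox_data_gb: float, limit_gb: float = 200,
                      deleted_items_margin: float = 0.15) -> bool:
    """Check the projected database size against the backup size limit."""
    return projected_database_size_gb(mailbox_data_gb, deleted_items_margin) <= limit_gb


print(fits_backup_limit(170))  # True: about 195.5GB with the margin
print(fits_backup_limit(180))  # False: about 207GB with the margin
```

If the RTO calculation from Step 1 yields a smaller limit than 200GB, pass that value as `limit_gb` instead.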

Endnotes

Exchange Server disaster recovery needs rigorous planning to address all the contingencies that can potentially disrupt the mail services. The plan document needs to be specific and actionable to allow efficient remediation of issues within the defined SLAs.

This article presented in-depth planning guidance for Exchange disaster recovery and highlighted two points in particular: the importance of testing the recovery plan, and the complexity of database backup. Backing up the database involves choices such as the backup type, destination, and frequency, as well as the scenarios that arise from log file truncation and circular logging. Careful consideration of the mailbox database backup strategy is therefore crucial to the effectiveness of the overall disaster recovery plan. It would also be wise to include third-party software in the plan to allow recovery of mailboxes in unforeseen circumstances, such as database or backup file corruption, which can happen for numerous reasons.
