Monitor DAG Database Failover

Introduction

With the introduction of Database Availability Groups [DAG] in Exchange 2010, Microsoft greatly improved and simplified the availability and resilience of the Exchange Mailbox Server role.

As a brief description, a DAG is a group of up to 16 Mailbox servers that hosts a set of databases. It provides automatic database-level recovery from failures that affect individual servers or databases by replicating databases between servers.

Ensuring that these database copies are healthy is vital for daily messaging operations and to guarantee high availability. As such, proactively monitoring them is crucial.

Note:
Although in this article we will be discussing only the monitoring of databases, it is also very important to monitor the server’s hardware, the Windows operating system, Exchange itself, etc… After all, what good are healthy databases if Windows or Exchange services keep crashing?

To help administrators, Exchange includes a set of tools, scripts and features that can be used. Let’s briefly discuss them and then proceed to writing our own script:

Get-MailboxDatabaseCopyStatus cmdlet

This cmdlet is probably the most used by administrators. It displays valuable status information about databases configured in a DAG:

Figure 1: Database copy status using the Get-MailboxDatabaseCopyStatus cmdlet

Test-ReplicationHealth cmdlet

This cmdlet reports on the status of mailbox replication. It is extremely useful in proactive monitoring, as it checks all aspects of the replication status to provide a complete overview of a specific Mailbox server in a DAG.

Figure 2: Different checks performed by the Test-ReplicationHealth cmdlet
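
As a rough sketch of how this output could feed a monitoring job, the function below keeps only the checks that did not pass. It assumes each result exposes the Server, Check, Result and Error properties; the function name Get-FailedReplicationCheck is hypothetical, and the sample objects merely stand in for real cmdlet output.

```powershell
# Hypothetical helper: return the replication checks whose Result is not "Passed".
function Get-FailedReplicationCheck {
    param([Parameter(ValueFromPipeline = $true)] $InputObject)
    process {
        # Result is converted to a string so the comparison also works
        # when the property is an object rather than plain text.
        if ("$($InputObject.Result)" -ne 'Passed') {
            $InputObject | Select-Object Server, Check, Result, Error
        }
    }
}

# Sample data (an assumption) standing in for: Test-ReplicationHealth -Identity MBX1
$sample = @(
    (New-Object PSObject -Property @{ Server = 'MBX1'; Check = 'ClusterService';  Result = 'Passed'; Error = $null })
    (New-Object PSObject -Property @{ Server = 'MBX1'; Check = 'DBCopySuspended'; Result = 'Failed'; Error = 'Copy suspended' })
)
$failed = @($sample | Get-FailedReplicationCheck)
```

On a Mailbox server, the sample data would be replaced by piping Test-ReplicationHealth straight into the function.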

Event Logs

Windows Server 2008 introduced another category of event logs. Besides the usual Windows logs (Application, Security and System, plus two new logs called Setup and ForwardedEvents), it includes the Applications and Services logs, which store events from a single application or component. As you probably guessed, Exchange is part of these logs. Here you will find three logs:

HighAvailability that contains events related to the startup and shutdown of the Microsoft Exchange Replication service and all the components that run within it, and events such as a database mount, log truncation or events related to the DAG’s underlying cluster;
MailboxDatabaseFailureItems is used to log events associated with any failures that affect a replicated mailbox database;
Troubleshooters is used by some Exchange troubleshooting scripts such as the Content Index Troubleshooter (Troubleshoot-CI.ps1) to log warnings and failures.

Figure 3: New Applications and Services logs
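
These logs can also be read from PowerShell with Get-WinEvent. The sketch below only builds the filter; the channel names are assumptions (confirm them with Get-WinEvent -ListLog *Exchange* on your server) and the function name New-DagEventFilter is hypothetical.

```powershell
# Hypothetical helper: build a Get-WinEvent filter for errors and warnings
# logged during the last N hours in the given channels.
function New-DagEventFilter {
    param(
        [string[]] $LogName = @(
            'Microsoft-Exchange-HighAvailability/Operational',            # assumed channel name
            'Microsoft-Exchange-MailboxDatabaseFailureItems/Operational'  # assumed channel name
        ),
        [int] $Hours = 24
    )
    @{
        LogName   = $LogName
        Level     = 2, 3                          # 2 = Error, 3 = Warning
        StartTime = (Get-Date).AddHours(-$Hours)
    }
}

$filter = New-DagEventFilter -Hours 12

# On a Mailbox server the filter would be used like this:
# Get-WinEvent -FilterHashtable $filter -ComputerName MBX1 |
#     Select-Object TimeCreated, Id, LevelDisplayName, Message
```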

CollectOverMetrics.ps1 Script

This excellent script (found in the Exchange Scripts folder) reads event logs to gather information about database operations (such as database mounts, moves and failovers) over a specific period of time. For each of these operations it will record the following information:

Identity of the database;
The time the operation began and ended;
Servers on which the database was mounted at the start and finish of the operation;
Reason for the operation;
If the operation was successful, including the error details if the operation failed.

The script, which supports parameters to customize its behaviour, creates a .CSV file with all this information and can even create a nice HTML report:

CollectOverMetrics.ps1 -DatabaseAvailabilityGroup DAG1 -StartTime "06/15/2012" -EndTime "06/16/2012" -GenerateHTMLReport -ShowHTMLReport

Figure 4: HTML report with the result of CollectOverMetrics.ps1 script

CollectReplicationMetrics.ps1 Script

This script provides another form of active monitoring as it collects metrics from performance counters related to database replication in real time. Some examples of counters are: the amount of time each copy was failed or suspended, the average copy or replay queue length, the amount of time that copies were outside of their failover criteria, etc.

The data is collected from each server in the DAG (or individual servers specified) and written to a file called CounterData.<ServerName>.<TimeStamp>.csv. The summary report is then generated and written to HaReplPerfReport.<DAGName>.<TimeStamp>.csv or HaReplPerfReport.<TimeStamp>.csv if you didn’t run the script with the DagName parameter.

CollectReplicationMetrics.ps1 -DagName DAG1 -ReportPath E:\Reports -Duration "00:10:00" -Frequency "00:01:00"

Figure 5: Data collected and saved in the CounterData.<ServerName>.<TimeStamp>.csv file

Figure 6: Final report generated by the CollectReplicationMetrics.ps1 script

CheckDatabaseRedundancy.ps1 Script

This last script monitors the redundancy of replicated mailbox databases by validating that there are at least two configured, healthy and current copies, and alerts administrators when only a single healthy copy of a replicated database exists.

Unlike the previous scripts, this one is automatically configured by Exchange to run every 60 minutes as a scheduled task named Database One Copy Alert when the Mailbox server role is installed (if the Mailbox server is not a member of a DAG the script will exit straight away). If a database is in a “bad” state for over 20 minutes (in duration, not consecutive) in the hour-long run of the script, an EventID 4113 is generated and logged into the local server’s Application log. If the database is “ok” for 10 consecutive minutes, an EventID 4114 is logged.

You can also run the script manually and specify a database or a server name to check, and it will output its CurrentState on the screen. Additionally, you can use the ShowDetailedErrors parameter to get more details about any errors, as well as the SendSummaryMailTos parameter to send hourly summary reports by e-mail to administrators.

Figure 7: Checking database redundancy with the CheckDatabaseRedundancy.ps1 script
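
Since events 4113 and 4114 work as an alert and an all-clear, a monitoring job can look at whichever came last. The following is a minimal sketch under that assumption; Get-RedundancyAlertState is a hypothetical name, the sample events are made up, and the function does not separate events per database.

```powershell
# Hypothetical helper: report whether the most recent redundancy event
# is an alert (4113) or an all-clear (4114).
function Get-RedundancyAlertState {
    param([object[]] $EventRecord)
    $latest = $EventRecord |
        Where-Object { $_.Id -eq 4113 -or $_.Id -eq 4114 } |
        Sort-Object TimeCreated -Descending |
        Select-Object -First 1
    if (-not $latest)        { return 'NoEvents' }
    if ($latest.Id -eq 4113) { return 'AlertOpen' }
    return 'Cleared'
}

# Sample events (an assumption) standing in for:
# Get-WinEvent -FilterHashtable @{ LogName = 'Application'; Id = 4113, 4114 }
$sample = @(
    (New-Object PSObject -Property @{ Id = 4113; TimeCreated = [datetime]'2012-06-15T02:00:00' })
    (New-Object PSObject -Property @{ Id = 4114; TimeCreated = [datetime]'2012-06-15T03:00:00' })
)
$state = Get-RedundancyAlertState -EventRecord $sample
```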

Database copy status

Before we start writing our own script, we need to understand all the possible values a mailbox database copy state can have. These are:

Failed – the database copy isn’t suspended, but it isn’t able to copy or replay log files;
Seeding – the mailbox database copy is being seeded, the content index for the mailbox database copy is being seeded or both are being seeded. Upon successful completion of seeding, the copy status should change to Initializing;
SeedingSource – it is being used as a source for a database copy seeding operation;
Suspended – as a result of an administrator manually suspending the database copy by running the Suspend-MailboxDatabaseCopy cmdlet;
Healthy – the database copy is successfully copying and replaying log files or it has already successfully copied and replayed all available log files;
ServiceDown – the Microsoft Exchange Replication service isn’t available or running on the server that hosts the mailbox database copy;
Initializing – when a database copy has been created, when the Microsoft Exchange Replication service is starting or has just been started, and during transitions from Suspended, ServiceDown, Failed, Seeding, SinglePageRestore, LostWrite or Disconnected to another state. While in this state, the system is verifying that the database and log stream are in a consistent state;
Resynchronizing – the mailbox database copy and its log files are being compared with the active copy of the database to check for any divergence between the two copies;
Mounting – the active copy is coming online and not yet accepting client connections;
Mounted – the active copy is online and accepting client connections;
Dismounting – the active copy is going offline and terminating client connections;
Dismounted – the active copy is offline and not accepting client connections;
DisconnectedAndHealthy – the database copy is no longer connected to the active database copy and it was in the Healthy state when the loss of connection occurred;
DisconnectedAndResynchronizing – the database copy is no longer connected to the active database copy and it was in the Resynchronizing state when the loss of connection occurred;
FailedAndSuspended – the Failed and Suspended states have been set simultaneously by the system because a failure was detected and because resolution of the failure explicitly requires administrator intervention;
SinglePageRestore – this state indicates that a single page restore operation is occurring on the mailbox database copy.

To check the Status and ContentIndexState attributes of a database copy, we use the Get-MailboxDatabaseCopyStatus cmdlet as mentioned previously.
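
To make the list above easier to act on, the states can be grouped into a coarse severity. The grouping below is purely an assumption for illustration (it is not an Exchange rule), and the function name Get-CopyStatusSeverity is hypothetical.

```powershell
# Hypothetical helper: map a copy status to a coarse severity.
# The grouping is an assumption made for illustration, not an Exchange rule.
function Get-CopyStatusSeverity {
    param([string] $Status)
    $ok           = 'Mounted', 'Healthy'
    $transitional = 'Seeding', 'SeedingSource', 'Initializing', 'Resynchronizing',
                    'Mounting', 'Dismounting', 'SinglePageRestore'
    if ($ok -contains $Status)           { return 'OK' }
    if ($transitional -contains $Status) { return 'Transitional' }
    # Everything else (Failed, Suspended, ServiceDown, Dismounted, ...) needs a look
    return 'Problem'
}

$severity = Get-CopyStatusSeverity 'FailedAndSuspended'

# With real data it could be used like this:
# Get-MailboxDatabaseCopyStatus -Server MBX1 |
#     Select-Object Name, Status, @{ n = 'Severity'; e = { Get-CopyStatusSeverity $_.Status } }
```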

Building our own script

All these scripts and tools are great for keeping a close eye on our databases and ensuring everything is running smoothly. But what if Exchange decides to fail over some databases in the middle of the night because the switch that a server is connected to had a momentary failure and the Witness server couldn’t contact it? The next morning you might get to work and find everything looking normal, with all your databases healthy and all your servers OK, and you might miss the SCOM alerts because everything is back to normal. However, the active databases are not where you expect them to be! This can be an issue if you have a DAG with servers in different sites and the databases are now mounted in a site where you are using manual redirection for OWA, for example.

Besides the Status attribute, databases in a DAG, and therefore with multiple copies, also have the ActivationPreference attribute that shows which servers have preference over the others to mount the database in case of a disaster or a manual switchover.

The output below is just an example of what you will get if you run the following command in an environment with at least one DAG and multiple database copies:

Get-MailboxDatabase | Sort Name | Select Name, ActivationPreference

Name ActivationPreference
---- --------------------
ADB1 {[MBXA1, 1], [MBXA2, 2]}
ADB2 {[MBXA1, 1], [MBXA2, 2]}
ADB3 {[MBXA1, 1], [MBXA2, 2]}
…
MDB1 {[MBX1, 1], [MBX2, 2], [MBX3, 3], [MBX4, 4]}
MDB2 {[MBX1, 3], [MBX2, 1], [MBX3, 2], [MBX4, 4]}
MDB3 {[MBX1, 2], [MBX2, 3], [MBX3, 1], [MBX4, 4]}
…

The above output shows that database MDB1, for example, has the highest activation preference on server MBX1, so that is where we always want it to be mounted. In case of a problem with MBX1, Exchange will try to mount it on server MBX2 first, then on MBX3 and finally on MBX4 if all previous ones failed.
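
The order described above can be derived by sorting the preference pairs by their value. This is a minimal sketch that assumes each entry has a Key (with a Name) and a Value, as in the output shown earlier; the names Get-ActivationOrder and New-SamplePreference are hypothetical and the sample data mimics MDB2.

```powershell
# Hypothetical helper: list server names in order of activation preference.
function Get-ActivationOrder {
    param([object[]] $ActivationPreference)
    $ActivationPreference |
        Sort-Object Value |
        ForEach-Object { $_.Key.Name }
}

# Builds a stand-in for one [server, preference] pair (sample data only)
function New-SamplePreference($Name, $Value) {
    New-Object PSObject -Property @{
        Key   = (New-Object PSObject -Property @{ Name = $Name })
        Value = $Value
    }
}

# Stand-in for (Get-MailboxDatabase MDB2).ActivationPreference
$mdb2 = @(
    (New-SamplePreference 'MBX1' 3)
    (New-SamplePreference 'MBX2' 1)
    (New-SamplePreference 'MBX3' 2)
    (New-SamplePreference 'MBX4' 4)
)
$order = @(Get-ActivationOrder -ActivationPreference $mdb2)
# $order[0] is the server where we want MDB2 to be mounted (MBX2)
```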

To check if a database is mounted where we want it to be (on a server with an ActivationPreference of 1), we can use the following script:

Get-MailboxDatabase | Where {$_.Recovery -eq $False} | Sort Name | ForEach {
    $db = $_.Name
    $curServer = $_.Server.Name
    $ownServer = $_.ActivationPreference | ? {$_.Value -eq 1}

    # Warn when the database is not active on its preferred server
    If ($curServer -ne $ownServer.Key.Name) {
        Write-Host "$db on $curServer should be on $($ownServer.Key.Name)!" -ForegroundColor Red
    }
}

This code compares the server where the database is currently active with the server that has an ActivationPreference of 1. If they are different, it prints a message in red to let the administrator know.

But how do we detect if $curServer and $ownServer servers are on different Active Directory [AD] sites? To take this into consideration, we will create a function called Get-ExchangeServerADSite that receives a server name and returns its AD site. Then, we just need to update the previous code to check if both servers are on the same site (please see final code at the end of the article).

Now, we also want to check for the status of the database and the state of its content index as we did previously. Based on all the possible values, we want the Status attribute of all mailbox databases to be either Mounted (for the server where the database is mounted) or Healthy (for the servers that hold a copy of it). For the ContentIndexState attribute we want it to be always Healthy.

But a Healthy database copy with a high CopyQueueLength or ReplayQueueLength is not really useful: if you force it to mount you will lose data, and a long queue is also a possible indication of issues with your storage. So let’s add all this together:

Get-MailboxDatabase | Where {$_.Recovery -eq $False} | Sort Name | Get-MailboxDatabaseCopyStatus | ForEach {
    # The patterns are anchored so that Dismounted or DisconnectedAndHealthy are not accepted by mistake
    If ($_.Status -notmatch "^(Mounted|Healthy)$" -or $_.ContentIndexState -notmatch "^Healthy$" -or $_.CopyQueueLength -ge 200 -or $_.ReplayQueueLength -ge 200) {
        Write-Host "Something is wrong with database $($_.Name)!" -ForegroundColor Red
    }
}

Now, let’s put everything together and tell the script to send an e-mail to the administrator if something is wrong with any database. This way we can create a scheduled task to run this script every 2 minutes, for example. When it finds an issue, the script sends an e-mail and disables the scheduled task (otherwise you would receive e-mails every 2 minutes until everything is back to normal, which is also an option).

Function Get-ExchangeServerADSite ([String] $excServer)
{
    # We could use WMI to check for the domain, but I think this method is better
    # Get-WmiObject Win32_NTDomain -ComputerName $excServer

    $configNC = ([ADSI] "LDAP://RootDse").configurationNamingContext
    $search = New-Object DirectoryServices.DirectorySearcher([ADSI] "LDAP://$configNC")
    $search.Filter = "(&(objectClass=msExchExchangeServer)(name=$excServer))"
    $search.PageSize = 1000
    [Void] $search.PropertiesToLoad.Add("msExchServerSite")

    Try {
        # msExchServerSite holds the distinguished name of the site; keep only the first component
        $adSite = [String] ($search.FindOne()).Properties.Item("msExchServerSite")
        Return ($adSite.Split(",")[0]).Substring(3)
    } Catch {
        Return $null
    }
}

[Bool] $bolFailover = $False
[String] $errMessage = $null

# Check if all databases are currently mounted on the server with ActivationPreference of 1
Get-MailboxDatabase | Where {$_.Recovery -eq $False} | Sort Name | ForEach {
    $db = $_.Name
    $curServer = $_.Server.Name
    $ownServer = $_.ActivationPreference | ? {$_.Value -eq 1}

    # Compare the server where the DB is currently active to the server where it should be
    If ($curServer -ne $ownServer.Key.Name) {
        # Get the AD sites of both servers
        $siteCur = Get-ExchangeServerADSite $curServer
        $siteOwn = Get-ExchangeServerADSite $ownServer.Key.Name

        # Check if both servers are on different AD sites
        If ($siteCur -ne $null -and $siteOwn -ne $null -and $siteCur -ne $siteOwn) {
            $errMessage += "`n$db on $curServer should be on $($ownServer.Key.Name) (DIFFERENT AD SITE: $siteCur)!"
        } Else {
            $errMessage += "`n$db on $curServer should be on $($ownServer.Key.Name)!"
        }

        $bolFailover = $True
    }
}

$errMessage += "`n`n"

# Check the Status of all databases including Content Index and Queues
Get-MailboxDatabase | Where {$_.Recovery -eq $False} | Sort Name | Get-MailboxDatabaseCopyStatus | ForEach {
    # The patterns are anchored so that Dismounted or DisconnectedAndHealthy are not accepted by mistake
    If ($_.Status -notmatch "^(Mounted|Healthy)$" -or $_.ContentIndexState -notmatch "^Healthy$" -or $_.CopyQueueLength -ge 200 -or $_.ReplayQueueLength -ge 200) {
        $errMessage += "`n`n$($_.Name) - Status: $($_.Status) - Copy QL: $($_.CopyQueueLength) - Replay QL: $($_.ReplayQueueLength) - Index: $($_.ContentIndexState)"
        $bolFailover = $True
    }
}

If ($bolFailover) {
    # Disable the scheduled task that runs the script and send e-mail to administrator
    Schtasks.exe /Change /TN "MonitorDAG" /DISABLE
    Send-MailMessage -From "[email protected]" -To "[email protected]" -Subject "DAG NOT Healthy!" -Body $errMessage -Priority High -SmtpServer "smtp.letsexchange.com" -DeliveryNotificationOption OnFailure
}

If the script finds anything wrong with any database, the administrator(s) will receive an e-mail similar to this:

Figure 8: Warning e-mail generated by the script
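
The scheduled task itself could be created along these lines. This is only a sketch: the script path is hypothetical, the task name must match the one the script disables, and it assumes the script loads the Exchange cmdlets on its own (for example with Add-PSSnapin Microsoft.Exchange.Management.PowerShell.E2010 at the top). You will probably also want to add /RU and /RP so the task runs under an account with Exchange permissions.

```powershell
$taskName   = 'MonitorDAG'                  # same name the script passes to Schtasks.exe /Change
$scriptPath = 'C:\Scripts\MonitorDAG.ps1'   # hypothetical location of the script
$action     = "powershell.exe -NonInteractive -File $scriptPath"

# Run every 2 minutes (/SC MINUTE with a modifier of 2)
$taskArgs = '/Create', '/TN', $taskName, '/SC', 'MINUTE', '/MO', '2', '/TR', $action

# Review the arguments, then create the task on the Mailbox server:
# & schtasks.exe $taskArgs

# Re-enable the task once an alert has been dealt with:
# & schtasks.exe /Change /TN $taskName /ENABLE
```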

Please note that there are more attributes that can and should be monitored! For example, we could incorporate the result of the Test-ReplicationHealth cmdlet into this script. In the end, it is all about your preferences and what you need as an administrator.

We could also add a line to move a database back to where it should be if it is found to be mounted on the “wrong” server. However, if Exchange failed the database over to another server, it had a reason to do so and you should investigate it before moving databases back to where they were.
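
In that spirit, rather than moving anything automatically, a script could simply print the commands an administrator might run after investigating. The sketch below does just that; Get-MoveBackCommand and New-SampleDatabase are hypothetical names, the sample objects stand in for Get-MailboxDatabase output, and the generated commands include -WhatIf so nothing is moved until it is removed.

```powershell
# Hypothetical helper: emit (but do not run) the command that would move each
# misplaced database back to the server with ActivationPreference 1.
function Get-MoveBackCommand {
    param([object[]] $Database)
    foreach ($db in $Database) {
        $preferred = ($db.ActivationPreference | Where-Object { $_.Value -eq 1 }).Key.Name
        if ($preferred -and $db.Server.Name -ne $preferred) {
            "Move-ActiveMailboxDatabase '$($db.Name)' -ActivateOnServer '$preferred' -WhatIf"
        }
    }
}

# Builds a stand-in for one database object (sample data only)
function New-SampleDatabase($Name, $ActiveOn, $Preferred) {
    New-Object PSObject -Property @{
        Name                 = $Name
        Server               = (New-Object PSObject -Property @{ Name = $ActiveOn })
        ActivationPreference = @(
            (New-Object PSObject -Property @{
                Key   = (New-Object PSObject -Property @{ Name = $Preferred })
                Value = 1
            })
        )
    }
}

$sample = @(
    (New-SampleDatabase 'MDB1' 'MBX2' 'MBX1')   # active on the wrong server
    (New-SampleDatabase 'MDB2' 'MBX2' 'MBX2')   # already where it should be
)
$commands = @(Get-MoveBackCommand -Database $sample)
```

With real data, $sample would be replaced by Get-MailboxDatabase | Where {$_.Recovery -eq $False}.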

You can download the script here.

Conclusion

Although the Exchange Team provides administrators with an arsenal of tools and scripts to manage and monitor Exchange, there might be times when we need something customized to our Exchange environment and to our preferences and needs. Hopefully this script will prove useful to many fellow administrators.

About The Author

Nuno Mota
