In early August of this year, the Windows SE team released the following Knowledge Base (KB) article and accompanying software hotfix regarding an issue in Windows Server 2008 R2 failover clusters:
This hotfix is strongly recommended for all databases availability groups that are stretched across multiple datacenters. For DAGs that are not stretched across multiple datacenters, this hotfix is good to have, as well. The article describes a race condition and cluster database deadlock issue that can occur when a Windows Failover cluster encounters a transient communication failure. There is a race condition within the reconnection logic of cluster nodes that manifests itself when the cluster has communication failures. When this occurs, it will cause the cluster database to hang, resulting in quorum loss in the failover cluster.
As described on TechNet, a database availability group (DAG) relies on specific cluster functionality, including the cluster database. In order for a DAG to be able to operate and provide high availability, the cluster and the cluster database must also be operating properly.
Microsoft has encountered scenarios in which a transient network failure occurs (a failure of network communications for about 60 seconds) and as a result, the entire cluster is deadlocked and all databases are within the DAG are dismounted. Since it is not very easy to determine which cluster node is actually deadlocked, if a failover cluster deadlocks as a result of the reconnect logic race, the only available course of action is to restart all members within the entire cluster to resolve the deadlock condition.