Hyper-V Windows Failover Cluster and IsAlive Operation (Part 3)

In Part 1 of this article series, we provided an overview of the Hyper-V cluster issue and explained the DNS network name registration process invoked by the “IsAlive” call, which is executed during failover cluster operation. Because the error and warning messages we saw on the cluster nodes related to DNS, Winlogon, Group Policy processing, NetLogon, and Kerberos, it was hard to decide whether those events should be correlated with the failover cluster failure at all. It took us hours to investigate and fix what turned out to be a simple DNS issue. Part 1 also explained the method we used to fix it.

In Part 2 of this article series, we explained how the Windows Failover cluster interacts with resources and how failover clustering monitors the resources in a cluster. As explained there, by default the Windows Failover cluster monitors resources in a single RHS.exe process unless you configure a standard cluster resource or an application resource to run in a separate RHS process.

The Resource Host Subsystem (RHS) communicates with Resource DLLs to monitor the status of the resources those DLLs manage. A Resource DLL implements the cluster-specific functions that are executed against its resources. While a single RHS process can maintain the status of all resources in the cluster, it is important to understand that a single RHS process can bring down the whole cluster if one particular resource fails.

During troubleshooting, we made a few changes to the Windows Failover cluster, all suggested by people on the troubleshooting call. Note that certain changes may cause downtime for the failover cluster. While there are other configuration items to consider when implementing a Windows Failover cluster, in this final part of the series we explain single RHS vs. multiple RHS, whether to disable dynamic DNS registration on cluster nodes, and how to configure the PendingTimeout and DeadlockTimeout properties for specific cluster resources.

Single RHS vs. Multiple RHS

We configured all resources to run in their own RHS process. In a single RHS scenario, the whole Windows Failover cluster operation is maintained by a single RHS.exe process, which you can see in Task Manager. RHS.exe must receive a response from every cluster resource in order to keep resources running in the cluster. If RHS.exe fails to receive a response from a cluster resource, it might terminate itself, causing all other cluster resources to fail.

In the multiple RHS scenario, each RHS process is responsible for maintaining the health of specific resources. If a cluster resource fails, the RHS process for that resource also fails, but the other cluster resources, which are hosted by other RHS processes, do not go down with it.
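The difference between the two scenarios can be illustrated with a small model. This is only a sketch of the reasoning above, not of how RHS.exe is implemented: the resource names and the "rhs-N" process labels are hypothetical, and the model simply assumes that an RHS process terminates when one of its resources stops responding.

```python
from collections import defaultdict


def failed_resources(hosting, unresponsive):
    """Return the resources that go down when `unresponsive` stops answering.

    `hosting` maps a resource name to the label of the RHS process hosting it.
    Model assumption: the RHS process hosting the unresponsive resource
    terminates and takes every resource it hosts down with it.
    """
    by_process = defaultdict(set)
    for resource, process in hosting.items():
        by_process[process].add(resource)
    return by_process[hosting[unresponsive]]


# Hypothetical resources in a Hyper-V cluster.
resources = ["Cluster Name", "Cluster IP Address", "VM1", "VM2"]

# Single RHS: every resource shares one process.
single = {r: "rhs-0" for r in resources}
# Multiple RHS: each resource runs in its own process.
multiple = {r: f"rhs-{i}" for i, r in enumerate(resources)}

single_impact = failed_resources(single, "VM1")
multiple_impact = failed_resources(multiple, "VM1")

print(sorted(single_impact))    # all four resources
print(sorted(multiple_impact))  # only VM1
```

In the single RHS mapping, one unresponsive virtual machine takes down every resource. In the multiple RHS mapping, the impact is limited to the resource that failed.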

You can configure a cluster resource to run in its own RHS process by configuring the “Run Resource in a Separate RHS Process” setting on the property page of a cluster resource. However, before configuring Windows Failover cluster to run in the multiple RHS scenario, you need to consider a few points outlined below:

Function of the Windows Failover Cluster: Whether to implement the multiple RHS scenario depends on the function of the Windows Failover Cluster. For example, if a Windows Failover cluster is implemented to provide SQL Server services, there is no need to implement the multiple RHS scenario. You would not want to configure all SQL Server resources to run in their own RHS process.
System Performance: Every RHS.exe process consumes system resources on cluster nodes, because each one is responsible for executing the “IsAlive” and “LooksAlive” functions to check the status of cluster resources. This, in turn, might cause system performance issues, so you might not want to implement the multiple RHS scenario if your cluster nodes are running low on system resources.
Difficult to troubleshoot RHS related issues: In a multiple RHS scenario, it is difficult to troubleshoot RHS related issues, because there is no straightforward way to determine which resource is maintained by which RHS.exe process.

Avoid Disabling Dynamic DNS Registration on Cluster Nodes

As stated in Part 1 of this article series, we disabled the ability of cluster nodes to perform dynamic DNS registration for Network Name resources, which helped us resolve the issue. However, disabling dynamic DNS registration on cluster nodes is not recommended for networks that are used for client communications. Note that the existing DNS record is still registered as a dynamic DNS record on the DNS Server. When the scavenging process runs on the DNS Server, it deletes all DNS records that have not been refreshed within the default refresh period. As a result, the Cluster Network Name DNS records will be deleted from the DNS Server. If you nevertheless want to disable dynamic DNS registration for cluster nodes, make sure you have manually created a host record for the cluster network name that points to the cluster IP address. You must uncheck “Register this connection’s DNS Addresses” on network cards that are used for cluster communication, storage communication and live migration traffic.
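To see why an unrefreshed dynamic record disappears while a manually created host record survives, consider the following sketch of a scavenging pass. It is a simplified model, not the DNS Server's actual implementation. It assumes the Windows DNS default of a 7-day no-refresh interval plus a 7-day refresh interval, and the record names are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed defaults: 7-day no-refresh interval plus 7-day refresh interval.
NO_REFRESH = timedelta(days=7)
REFRESH = timedelta(days=7)


def scavenge(records, now):
    """Return the records that survive one scavenging pass.

    Each record is a (name, last_refresh) tuple. last_refresh is None for a
    manually created (static) record, which this model never removes.
    """
    survivors = []
    for name, last_refresh in records:
        if last_refresh is None or now - last_refresh <= NO_REFRESH + REFRESH:
            survivors.append((name, last_refresh))
    return survivors


now = datetime(2024, 1, 31)
records = [
    ("cluster1-dynamic", datetime(2024, 1, 1)),  # dynamic, not refreshed for 30 days
    ("node1", datetime(2024, 1, 29)),            # dynamic, refreshed 2 days ago
    ("cluster1-static", None),                   # manually created host record
]

remaining = [name for name, _ in scavenge(records, now)]
print(remaining)
```

In this model the stale dynamic record for the cluster name is removed, while the recently refreshed node record and the static host record remain. That is the reason to create the host record manually before disabling dynamic registration.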

Consider configuring PendingTimeout and DeadlockTimeout for Cluster Resources

Before a cluster resource can be considered online, it performs a thorough check to confirm that certain conditions are met. While the resource is still performing this check, RHS.exe might ask it to report its status. If the resource does not report its status within the specified time, RHS might time out, which can result in the resource or the cluster being terminated. If you have a few cluster resources that take longer than usual to come online, consider raising the PendingTimeout and DeadlockTimeout properties for those resources. Both values are specified in milliseconds, so 600000 corresponds to 10 minutes. You can configure them by executing the commands below:

Cluster.exe Resource <ResourceName> /Prop DeadlockTimeout=600000
Cluster.exe Resource <ResourceName> /Prop PendingTimeout=600000
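Because both properties are set in milliseconds, it is easy to mistype a value. The following helper is a minimal sketch that converts minutes to milliseconds and prints command lines in the same form as above. The function name and the resource name "Virtual Machine VM1" are hypothetical, and the script only builds the text of the commands. It does not run them.

```python
def timeout_commands(resource_name, pending_minutes, deadlock_minutes):
    """Build the two cluster.exe command lines for one resource.

    Both properties take milliseconds, so minutes are converted first.
    """
    pending_ms = pending_minutes * 60 * 1000
    deadlock_ms = deadlock_minutes * 60 * 1000
    return [
        f'Cluster.exe Resource "{resource_name}" /Prop DeadlockTimeout={deadlock_ms}',
        f'Cluster.exe Resource "{resource_name}" /Prop PendingTimeout={pending_ms}',
    ]


# Hypothetical resource name; 10 minutes for both timeouts.
commands = timeout_commands("Virtual Machine VM1", pending_minutes=10, deadlock_minutes=10)
for line in commands:
    print(line)
```

With 10 minutes for both arguments, the helper produces the value 600000 used in the commands above.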

Conclusion

Microsoft’s position is that non-critical resources should not bring the cluster down, so the Failover cluster developers provided the “Run Resource in a Separate RHS Process” setting for cluster resources. That doesn’t necessarily mean you should implement the multiple RHS scenario for all resources running in the cluster. In this article, we explained the points you should consider before choosing between a single and a multiple RHS configuration, and why you should not disable dynamic DNS registration on production cluster networks.

About The Author

Nirmal Sharma
