Fire Drill definition

In the corporate world, the meaning of fire drill has been modified to suggest that any activity that is a waste of time is called a “fire drill”.

Tuesday, May 21, 2013

Cluster nodes behaving badly? (BSOD)


Before applying a hotfix for"0x0000009E" Stop error when you add an extra storage disk to a failover cluster in Windows Server 2008 R2:  http://support.microsoft.com/kb/2520235

You might want to check out the 2008 R2 default HangRecoveryAction setting.  In our case we changed it to 1 until the hotfix could be applied.
http://blogs.technet.com/b/askcore/archive/2009/06/12/why-is-my-2008-failover-clustering-node-blue-screening-with-a-stop-0x0000009e.aspx

tl;dr version:
HangRecoveryAction

This property controls the action to take if the user-mode processes have stopped responding. For the HangRecoveryAction, we actually have 4 different settings with 3 being the default.

0 = Disables the heartbeat and monitoring mechanism.
1 = Logs an event in the system log of the Event Viewer.
2 = Terminates the Cluster Service.
3 = Causes a Stop error (Bugcheck) on the cluster node.  <<-- default for 2008

If you want to change the setting, you would issue the command:
cluster /cluster:clustername /prop HangRecoveryAction=x

With this setting instead of a BSOD, you get Event ID 4869 repeated every 60 seconds:

Event ID: 4869
Source:  Microsoft-Windows-FailoverClustering
Description:  User mode health monitoring has detected that the system is not being responsive. The Failover cluster virtual adapter has lost contact with the 'C:\Windows\Cluster\clussvc.exe' process with a process ID '%1', for '%2' seconds. Please use Performance Monitor to evaluate the health of the system and determine which process may be negatively impacting the system.
* where %2 is the value of ClusSvcHangTimeout
* where %1 is the Process ID you would see in Task Manager

No comments:

Post a Comment