Fire Drill definition

In the corporate world, "fire drill" has come to mean any activity that is a waste of time.

Tuesday, May 21, 2013

Cluster nodes behaving badly? (BSOD)


Before applying the hotfix for the "0x0000009E" Stop error when you add an extra storage disk to a failover cluster in Windows Server 2008 R2 (http://support.microsoft.com/kb/2520235), you might want to check the default HangRecoveryAction setting in 2008 R2. In our case we changed it to 1 until the hotfix could be applied.
http://blogs.technet.com/b/askcore/archive/2009/06/12/why-is-my-2008-failover-clustering-node-blue-screening-with-a-stop-0x0000009e.aspx

tl;dr version:
HangRecoveryAction

This property controls the action to take if the user-mode processes have stopped responding. For the HangRecoveryAction, we actually have 4 different settings with 3 being the default.

0 = Disables the heartbeat and monitoring mechanism.
1 = Logs an event in the system log of the Event Viewer.
2 = Terminates the Cluster Service.
3 = Causes a Stop error (Bugcheck) on the cluster node.  <<-- default for 2008

If you want to change the setting, you would issue the command:
cluster /cluster:clustername /prop HangRecoveryAction=x
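
If you have to make this change on more than one cluster, the command can be wrapped in a small script. This is only a rough sketch: the function names are made up for illustration, it assumes cluster.exe is on the PATH of the machine you run it from, and it does nothing more than assemble the same command shown above after checking that the value is one of the four listed settings.

```python
import subprocess

# The four settings and their meanings, as listed above.
HANG_RECOVERY_ACTIONS = {
    0: "Disables the heartbeat and monitoring mechanism",
    1: "Logs an event in the system log of the Event Viewer",
    2: "Terminates the Cluster Service",
    3: "Causes a Stop error (Bugcheck) on the cluster node",
}


def build_hang_recovery_command(cluster_name, action):
    """Return the cluster.exe argument list for setting HangRecoveryAction.

    Hypothetical helper: it only formats the command from the post.
    """
    if action not in HANG_RECOVERY_ACTIONS:
        raise ValueError("HangRecoveryAction must be 0, 1, 2 or 3")
    return [
        "cluster",
        "/cluster:" + cluster_name,
        "/prop",
        "HangRecoveryAction=" + str(action),
    ]


def set_hang_recovery_action(cluster_name, action):
    """Run the command. Assumes cluster.exe is available on this machine."""
    command = build_hang_recovery_command(cluster_name, action)
    return subprocess.run(command, check=True)
```

Calling build_hang_recovery_command("clustername", 1) gives you the argument list without running anything, which is handy for a dry run before touching a production cluster.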

With HangRecoveryAction set to 1, instead of a BSOD you get Event ID 4869 repeated every 60 seconds:

Event ID: 4869
Source:  Microsoft-Windows-FailoverClustering
Description:  User mode health monitoring has detected that the system is not being responsive. The Failover cluster virtual adapter has lost contact with the 'C:\Windows\Cluster\clussvc.exe' process with a process ID '%1', for '%2' seconds. Please use Performance Monitor to evaluate the health of the system and determine which process may be negatively impacting the system.
* where %2 is the value of ClusSvcHangTimeout
* where %1 is the Process ID you would see in Task Manager
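
If you export these events and want to pull the two placeholders out of the description text, something like the following would do it. It is a sketch, not a tested log parser: the function name is made up, and it assumes the logged text keeps the single quotes around the substituted values exactly as they appear in the template above. The sample values in the usage note are invented.

```python
import re

# Matches the "%1" (process ID) and "%2" (seconds) slots in the
# Event ID 4869 description. Assumes both are logged as quoted integers.
EVENT_4869_PATTERN = re.compile(
    r"process ID '(?P<pid>\d+)', for '(?P<seconds>\d+)' seconds"
)


def parse_event_4869(description):
    """Return (process_id, seconds) from the description, or None if no match."""
    match = EVENT_4869_PATTERN.search(description)
    if match is None:
        return None
    return int(match.group("pid")), int(match.group("seconds"))
```

For example, a description containing "with a process ID '1234', for '60' seconds" (made-up values) would give you the clussvc.exe process ID and the ClusSvcHangTimeout value as a pair of integers.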

NIC Binding Order in Windows Server 2008 R2

While working on an Enterprise Vault issue, one of the Symantec KB articles cited improper NIC order as one of the possible causes of the Event ID, among a truckload of other optimizations:

http://www.symantec.com/business/support/index?page=content&id=TECH62307

Most of the Symantec docs were written with Windows 2003 in mind, so the instructions listed were ambiguous at best.

It took some digging, but a Microsoft TechNet DNS article squared me away:
http://technet.microsoft.com/en-us/library/dd391967(WS.10).aspx

Here's the tl;dr version:

  1. Click Start, click Network, click Network and Sharing Center, and then click Manage Network Connections.
  2. Press the ALT key, click Advanced, and then click Advanced Settings. If you are prompted for an administrator password or confirmation, type the password or provide confirmation.
  3. Click the Adapters and Bindings tab, and then, under Connections, click the connection you want to modify.
  4. Under Bindings for <connection name>, select the protocol that you want to move up or down in the list, click the up or down arrow button, and then click OK.
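
To double-check the result without clicking through the GUI, you can read the order back from the registry. Treat this as an assumption-laden sketch: as far as I know, the TCP/IP binding order on these Windows versions is stored in the Bind value under HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Linkage, with one \Device\{GUID} entry per adapter in binding order, but verify that on your own system. The function names are made up for illustration.

```python
def guids_from_bind(bind_entries):
    # Strip the leading \Device\ prefix from each entry, keeping the order.
    prefix = "\\Device\\"
    return [
        entry[len(prefix):] if entry.startswith(prefix) else entry
        for entry in bind_entries
    ]


def read_tcpip_bind():
    """Read the Bind multi-string value. Windows only.

    Assumption: the binding order lives in this key and value.
    """
    import winreg  # imported here so the helper above also loads on non-Windows

    key_path = r"SYSTEM\CurrentControlSet\Services\Tcpip\Linkage"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        value, _value_type = winreg.QueryValueEx(key, "Bind")
    return list(value)
```

On the server itself, guids_from_bind(read_tcpip_bind()) returns the adapter GUIDs in order, with the first entry being the adapter bound first. Mapping GUIDs back to connection names is left out here.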