Friday, January 4, 2008

Adjusting the Default Timeout Used for Failure & Isolation Detection in HA

Have you ever wished you had more than 15 seconds before VMware HA begins to do its thing? I know I have. After a little digging I found out how to do it. The HA response time can be changed from the default of 15 seconds (15000 ms in VMware talk) by performing the following steps:

  1. Right-click the cluster -> Edit Settings -> VMware HA -> Advanced Options
  2. Add the das.failuredetectiontime option to the cluster’s settings, with its value set to the desired timeout in milliseconds. For example, 60 seconds would equal 60000 milliseconds. You can do the math from there.
  3. Next click the OK button and your configuration will be complete.
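
If you'd rather not do the seconds-to-milliseconds math by hand, here is a minimal Python sketch of it. The function name is my own invention for illustration and is not part of any VMware tool; it only builds the option/value pair you would type into the Advanced Options dialog.

```python
# Hypothetical helper (not a VMware API): build the option/value pair
# to enter in the cluster's Advanced Options dialog.
DEFAULT_TIMEOUT_MS = 15000  # the 15-second default mentioned in the post


def failure_detection_option(seconds):
    """Return (option, value) for a timeout given in seconds."""
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    # The option takes its value in milliseconds.
    return ("das.failuredetectiontime", str(int(seconds * 1000)))


if __name__ == "__main__":
    option, value = failure_detection_option(60)
    print(option, "=", value)  # das.failuredetectiontime = 60000
```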

Now VMware HA will not declare a host failure or initiate an isolation response until the specified timeout has passed without heartbeats being received. This saves you, me, and the network team grief in the event of a network blip. More importantly for me, it keeps me from pulling out what little hair I have left: with the default, a 20-second network problem means I have to wait 2 to 5 minutes for everything that was working just fine on the host server to come back up, AND I have to explain why it happened to those in charge. Neither of these is something I like to do.
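
To make the 20-second example concrete, here is a deliberately simplified Python model of the rule described above. It assumes HA reacts only once the time without heartbeats exceeds the configured timeout; the real HA agent does more than this, and the function name is hypothetical.

```python
# Simplified model (an assumption, not VMware's actual logic):
# HA reacts only when the heartbeat outage exceeds the timeout.
def ha_would_react(outage_seconds, timeout_ms=15000):
    """Return True if an outage of this length exceeds the timeout."""
    return outage_seconds * 1000 > timeout_ms


if __name__ == "__main__":
    # A 20-second blip with the default 15000 ms timeout.
    print(ha_would_react(20))          # True
    # The same blip with the timeout raised to 60000 ms.
    print(ha_would_react(20, 60000))   # False
```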

1 Comment:

Pik Master said...

It may be important - I had to remove HA from the entire cluster and reapply it to get das.failuredetectiontime working. Before that, even though the log message said "cluster reconfiguration - Completed", it wasn't applied to any cluster nodes.
