Exchange 2010 DAG on VMs (Hyper-V and VMware) and Heartbeat Settings for Live Migration

 

As you might already know, starting with Exchange 2010 SP1 Microsoft supports virtualized Exchange Mailbox servers that are members of a DAG, including moving them with live migration.

Per the Exchange blog:

• The Unified Messaging server role is supported in a virtualized environment.
• Combining Exchange 2010 high availability solutions (database availability groups (DAGs)) with hypervisor-based clustering, high availability, or migration solutions that will move or automatically failover mailbox servers that are members of a DAG between clustered root servers, is now supported.

There are two best practice documents for running Exchange on virtual machines, one from Microsoft and one from VMware, and I highly recommend reading both.

One of the important points is that a DAG node is evicted from the cluster when it misses heartbeats for more than 5 seconds. If required, the heartbeat timeout can be increased to a maximum of 10 seconds. I want to stress this because, depending on the environment and systems, the cutover from one host to the other during a live migration sometimes takes longer than 5 seconds, and this causes the cluster to fail over the databases. To tune this timeout value, you need to evaluate your environment and requirements.

Per the Microsoft best practice document (p. 30):

If the server offline time exceeds five seconds, the DAG node will be evicted from the cluster. It is preferable to ensure that hypervisor and host-based clustering technology is able to migrate resources in less than five seconds. If this is not feasible, the cluster heartbeat timeout can be raised, although we don’t recommend raising it to more than 10 seconds.

To set the cluster heartbeat timeout to 10 seconds, follow these instructions. Note: this only applies to communication between DAG nodes that are on the same subnet.

1.    Open PowerShell on one of the Mailbox servers that is a member of the DAG. Note: this only needs to be done on one of the DAG members as the setting will affect the entire DAG.
2.    Type the following commands:
Import-Module FailoverClusters
(Get-Cluster).SameSubnetThreshold = 10
(Get-Cluster).SameSubnetDelay = 1000
3.    Close PowerShell.

To revert to the default settings, follow these instructions:

1.    Open PowerShell on one of the Mailbox servers that is a member of the DAG. Note: this only needs to be done on one of the DAG members as the setting will affect the entire DAG.
2.    Type the following commands:
Import-Module FailoverClusters
(Get-Cluster).SameSubnetThreshold = 5
(Get-Cluster).SameSubnetDelay = 1000
3.    Close PowerShell.

Per the VMware best practice document, section 5.2 (Cluster Heartbeat Settings):

To establish a baseline during the vMotion tests, no changes were made to the default configuration of the database availability group. Each virtual machine was configured with 24GB of memory, and with the change rate of memory pages, due to the load generation tool running, the application stun time was just enough to cause database failovers when jumbo frames were not used. However, the vMotion migrations completed 100% of the time and database replication continued uninterrupted. It was clear that the cluster heartbeat interval was too short to support no-impact vMotion migrations in our test environment.
To support vMotion migrations during heavy usage we adjusted the SameSubnetDelay parameter of the cluster from the default of 1000ms to 2000ms. This parameter controls how often cluster heartbeat communication is transmitted. The default threshold for missed packets is five, after which the cluster service determines that the node failed. By increasing the transmission period to two seconds and keeping the threshold at five intervals we were able to perform all of the vMotion tests (when jumbo frames were not used) with no database failovers.
Note: Microsoft recommends using a maximum value of 10 seconds for cluster heartbeat timeout. In our configuration the maximum recommended value is used by configuring a heartbeat interval of two seconds (2000 milliseconds) and a threshold of five (default).
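
The VMware text describes this change but does not list the commands. Assuming the same FailoverClusters property syntax as in the Microsoft steps above, the configuration they describe would look roughly like this (a sketch, not taken from the VMware document):

```powershell
Import-Module FailoverClusters
# Heartbeat interval: 2000 ms instead of the default 1000 ms
(Get-Cluster).SameSubnetDelay = 2000
# Missed-heartbeat threshold: left at the default of 5
(Get-Cluster).SameSubnetThreshold = 5
```

As with the Microsoft steps, these two properties only affect DAG nodes on the same subnet, and the change needs to be made on one DAG member only.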

I am not sure whether increasing the heartbeat interval (SameSubnetDelay) or the threshold (SameSubnetThreshold) is better.
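
One way to frame the comparison, as a rough sketch: going by the VMware note above, the time before a silent node is declared failed is approximately the heartbeat interval multiplied by the missed-heartbeat threshold. The helper below, Get-HeartbeatTimeoutSeconds, is a hypothetical name (it is not part of the FailoverClusters module) and only does that arithmetic:

```powershell
# Hypothetical helper, not a FailoverClusters cmdlet.
# Approximate timeout in seconds = delay (ms) x threshold (missed heartbeats) / 1000
function Get-HeartbeatTimeoutSeconds {
    param(
        [int]$DelayMs,     # value of SameSubnetDelay, in milliseconds
        [int]$Threshold    # value of SameSubnetThreshold
    )
    return ($DelayMs * $Threshold) / 1000
}

Get-HeartbeatTimeoutSeconds -DelayMs 1000 -Threshold 5     # defaults: 5
Get-HeartbeatTimeoutSeconds -DelayMs 1000 -Threshold 10    # Microsoft steps: 10
Get-HeartbeatTimeoutSeconds -DelayMs 2000 -Threshold 5     # VMware tests: 10
```

Both adjustments reach the same 10-second window. The difference is that a longer delay sends heartbeats less often, while a higher threshold keeps the one-second interval and tolerates more missed heartbeats. The arithmetic alone does not say which is preferable.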

To see the current values:

First, make sure the FailoverClusters module is available, then import it:

[PS] Get-Module -ListAvailable

ModuleType Name                      ExportedCommands
---------- ----                      ----------------
Manifest   ActiveDirectory           {}
Manifest   ADRMS                     {}
Manifest   AppLocker                 {}
Manifest   BestPractices             {}
Manifest   BitsTransfer              {}
Manifest   FailoverClusters          {}
Manifest   GroupPolicy               {}
Manifest   PSDiagnostics             {}
Manifest   ServerManager             {}
Manifest   TroubleshootingPack       {}
Manifest   WebAdministration         {}

[PS] Import-Module FailoverClusters

[PS] Get-Cluster | Select-Object SameSubnetThreshold, SameSubnetDelay, CrossSubnetThreshold, CrossSubnetDelay

SameSubnetThreshold SameSubnetDelay CrossSubnetThreshold CrossSubnetDelay
------------------- --------------- -------------------- ----------------
                  5            1000                    5             1000

 

The default values are 5 for SameSubnetThreshold and 1000 (milliseconds) for SameSubnetDelay, which together give the default 5-second timeout.


If you would like to use cluster.exe, the format is as follows:

cluster /cluster:<ClusterName> /prop SameSubnetDelay=<value>
cluster /cluster:<ClusterName> /prop SameSubnetThreshold=<value>
cluster /cluster:<ClusterName> /prop CrossSubnetDelay=<value>
cluster /cluster:<ClusterName> /prop CrossSubnetThreshold=<value>

To confirm the change, list the cluster properties:
cluster /cluster:<ClusterName> /prop

Bulent Tolu


Sr. Systems Engineer at VMware
Bulent is an IT professional with a Master's in MIS and 10 years of experience across a broad range of information technologies. He has worked on the engineering and architecture, implementation and integration, and administration of various highly available IT systems and infrastructure. He has a passion for continually researching, testing, and evaluating new technologies and for following industry best practices to secure and optimize IT systems. He currently lives in Istanbul and works as a Sr. Systems Engineer at VMware.