Exchange 2010 DAG on VMs (Hyper-V and VMware) and Heartbeat Settings for Live Migration
As you might already know, starting with Exchange 2010 SP1 Microsoft supports virtualized Exchange mailbox servers that are members of a DAG, including moving them with live migration:
Per Exchange Blog:
• The Unified Messaging server role is supported in a virtualized environment.
• Combining Exchange 2010 high availability solutions (database availability groups (DAGs)) with hypervisor-based clustering, high availability, or migration solutions that will move or automatically failover mailbox servers that are members of a DAG between clustered root servers, is now supported.
There are two best practice documents for running Exchange on virtual machines, which I highly recommend reading:
- Best Practices for Virtualizing Exchange Server 2010 with Windows Server® 2008 R2 Hyper-V: http://www.microsoft.com/en-us/download/details.aspx?id=2428
- Using VMware HA, DRS and vMotion with Exchange 2010 DAGs: http://www.vmware.com/files/pdf/using-vmware-HA-DRS-and-vmotion-with-exchange-2010-dags.pdf
One important point is that a DAG node is evicted from the cluster when its heartbeats are missed for more than 5 seconds. If required, the heartbeat timeout can be raised to a maximum of 10 seconds. I want to stress this because, depending on the environment and systems, the cutover from one host to the other during a live migration sometimes takes longer than 5 seconds, which causes the cluster to fail over the databases. Before tuning this timeout, evaluate your environment and requirements.
Per MS Best Practice document (p30):
If the server offline time exceeds five seconds, the DAG node will be evicted from the cluster. It is preferable to ensure that hypervisor and host-based clustering technology is able to migrate resources in less than five seconds. If this is not feasible, the cluster heartbeat timeout can be raised, although we don’t recommend raising it to more than 10 seconds.
To set the cluster heartbeat timeout to 10 seconds, follow these instructions. Note: this only applies to communication between DAG nodes that are on the same subnet.
1. Open PowerShell on one of the Mailbox servers that is a member of the DAG. Note: this only needs to be done on one of the DAG members as the setting will affect the entire DAG.
2. Type the following commands:
Import-module FailoverClusters
(Get-Cluster).SameSubnetThreshold=10
(Get-Cluster).SameSubnetDelay=1000
3. Close PowerShell.
To revert to the default settings, follow these instructions:
1. Open PowerShell on one of the Mailbox servers that is a member of the DAG. Note: this only needs to be done on one of the DAG members as the setting will affect the entire DAG.
2. Type the following commands:
Import-module FailoverClusters
(Get-Cluster).SameSubnetThreshold=5
(Get-Cluster).SameSubnetDelay=1000
3. Close PowerShell.
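Before changing anything, it may be worth recording the current values so that they can be restored exactly. The following is only a minimal sketch of the steps above, assuming it runs on a DAG member where the FailoverClusters module is available; the variable names ($cluster, $before) are my own and do not come from either best practice document.

```powershell
Import-Module FailoverClusters

# Record the current same-subnet heartbeat settings before touching them.
$cluster = Get-Cluster
$before  = $cluster | Select-Object Name, SameSubnetThreshold, SameSubnetDelay
$before | Format-List

# Raise the threshold to 10 missed heartbeats and keep the 1000 ms interval.
$cluster.SameSubnetThreshold = 10
$cluster.SameSubnetDelay     = 1000

# Read the values back from the cluster to confirm the change.
Get-Cluster | Select-Object Name, SameSubnetThreshold, SameSubnetDelay | Format-List

# To roll back to the recorded values (illustrative, left commented out):
# $cluster.SameSubnetThreshold = $before.SameSubnetThreshold
# $cluster.SameSubnetDelay     = $before.SameSubnetDelay
```

Because Select-Object copies the property values into a new object, $before keeps the old settings even after the cluster properties are changed.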
Per VMware Best Practice document, section 5.2 (Cluster Heartbeat Settings):
To establish a baseline during the vMotion tests no changes were made to the default configuration of the database availability group. Each virtual machine was configured with 24GB of memory, and with the change rate of memory pages, due to the load generation tool running, the application stun time was just enough to cause database failovers when jumbo frames were not used. However, the vMotion migrations completed 100% of the time and database replication continued uninterrupted. It was clear that the cluster heart beat interval was too short to support no-impact vMotion migrations in our test environment.
To support vMotion migrations during heavy usage we adjusted the SameSubnetDelay parameter of the cluster from the default of 1000ms to 2000ms. This parameter controls how often cluster heartbeat communication is transmitted. The default threshold for missed packets is five, after which the cluster service determines that the node failed. By increasing the transmission period to two seconds and keeping the threshold at five intervals we were able to perform all of the vMotion tests (when jumbo frames were not used) with no database failovers.
Note: Microsoft recommends using a maximum value of 10 seconds for cluster heartbeat timeout. In our configuration the maximum recommended value is used by configuring a heartbeat interval of two seconds (2000 milliseconds) and a threshold of five (default).
I am not sure whether increasing the heartbeat interval (SameSubnetDelay) or the threshold (SameSubnetThreshold) is better.
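To compare the two approaches, it helps to work out the timeout each one implies. The sketch below assumes the effective timeout is simply the threshold multiplied by the delay, which is how both documents describe it; the function name Get-HeartbeatTimeoutSeconds is a hypothetical helper of mine, not part of the FailoverClusters module.

```powershell
# Hypothetical helper: effective heartbeat timeout in seconds,
# assuming timeout = threshold x delay.
function Get-HeartbeatTimeoutSeconds {
    param(
        [int]$Threshold,   # missed heartbeats tolerated (SameSubnetThreshold)
        [int]$DelayMs      # interval between heartbeats in ms (SameSubnetDelay)
    )
    return ($Threshold * $DelayMs) / 1000
}

$default   = Get-HeartbeatTimeoutSeconds -Threshold 5  -DelayMs 1000
$microsoft = Get-HeartbeatTimeoutSeconds -Threshold 10 -DelayMs 1000
$vmware    = Get-HeartbeatTimeoutSeconds -Threshold 5  -DelayMs 2000

"Default: $default s, Microsoft example: $microsoft s, VMware example: $vmware s"
```

Under that assumption, the Microsoft example (threshold 10, delay 1000) and the VMware example (threshold 5, delay 2000) both tolerate 10 seconds of silence. The difference is that the first keeps sending a heartbeat every second, while the second sends one every two seconds.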
To see the current values:
First, check that the FailoverClusters module is available and import it:
[PS] Get-Module -ListAvailable
ModuleType Name                 ExportedCommands
---------- ----                 ----------------
Manifest   ActiveDirectory      {}
Manifest   ADRMS                {}
Manifest   AppLocker            {}
Manifest   BestPractices        {}
Manifest   BitsTransfer         {}
Manifest   FailoverClusters     {}
Manifest   GroupPolicy          {}
Manifest   PSDiagnostics        {}
Manifest   ServerManager        {}
Manifest   TroubleshootingPack  {}
Manifest   WebAdministration    {}
[PS] import-module failoverclusters
[PS] get-cluster | select SameSubnetThreshold, samesubnetdelay, crosssubnetthreshold, crosssubnetdelay
SameSubnetThreshold SameSubnetDelay CrossSubnetThreshold CrossSubnetDelay
------------------- --------------- -------------------- ----------------
                  5            1000                    5             1000
The default values are 5 for SameSubnetThreshold and 1000 (milliseconds) for SameSubnetDelay, which together give the 5-second heartbeat timeout mentioned above.
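If you would rather see the resulting timeout next to the raw values, a calculated property can do the multiplication for you. This is only a sketch, assuming the FailoverClusters module is available on the server; the column names SameSubnetTimeoutSec and CrossSubnetTimeoutSec are labels I made up, not cluster properties.

```powershell
Import-Module FailoverClusters

# Show each threshold/delay pair together with the timeout it implies
# (threshold x delay, converted from milliseconds to seconds).
Get-Cluster | Select-Object Name,
    SameSubnetThreshold, SameSubnetDelay,
    @{ Name = 'SameSubnetTimeoutSec';  Expression = { $_.SameSubnetThreshold  * $_.SameSubnetDelay  / 1000 } },
    CrossSubnetThreshold, CrossSubnetDelay,
    @{ Name = 'CrossSubnetTimeoutSec'; Expression = { $_.CrossSubnetThreshold * $_.CrossSubnetDelay / 1000 } } |
    Format-List
```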
If you would like to use cluster.exe, the format is as follows:
cluster /cluster:<ClusterName> /prop SameSubnetDelay=<value>
cluster /cluster:<ClusterName> /prop SameSubnetThreshold=<value>
cluster /cluster:<ClusterName> /prop CrossSubnetDelay=<value>
cluster /cluster:<ClusterName> /prop CrossSubnetThreshold=<value>
and to confirm:
cluster /cluster:<ClusterName> /prop