Recovering from a datacenter failure

May 23, 2014 Nathan OBryan

In previous posts (before I got all busy writing my sessions for IT connections), I promised to detail the process for recovering from a datacenter failure. For the purposes of this post, I’ll assume we’re talking about a two-site Exchange deployment with two Exchange servers at each site. All four Exchange servers are members of the same DAG, and all databases are replicated to all servers. I’m also assuming that the DAG in question is running in DAC mode.
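If you want to confirm that a DAG matches these assumptions before relying on the steps in this post, something like the following should show it. This is only a sketch: the DAG name DAG1 is a made-up placeholder, so substitute your own.

```powershell
# Hypothetical DAG name: DAG1
# Show the member servers, the DAC mode setting, and both witness servers.
# DatacenterActivationMode should read "DagOnly" when DAC mode is enabled.
Get-DatabaseAvailabilityGroup -Identity DAG1 |
    Format-List Name, Servers, DatacenterActivationMode, WitnessServer, AlternateWitnessServer

# If DAC mode is off, it can be enabled like this (left commented out on purpose):
# Set-DatabaseAvailabilityGroup -Identity DAG1 -DatacenterActivationMode DagOnly
```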

In the event of a primary datacenter failure with the above setup, the recovery is pretty simple.

1. Run Stop-DatabaseAvailabilityGroup at the failed site

Stop-DatabaseAvailabilityGroup is used to mark a single server, or an entire site, as failed. In this case, we want to mark the whole site as failed. This is mostly intended for the case of a network failure where the site could potentially come back online with little or no notice. In the event of a catastrophic failure that makes it impossible to run this command, don’t worry about it.
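As a rough sketch of step 1, assuming a DAG named DAG1 and a failed Active Directory site named PrimarySite (both names are placeholders I made up for illustration):

```powershell
# Hypothetical names: DAG1 (the DAG), PrimarySite (the failed AD site)
# Mark every DAG member in the failed site as stopped.
Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite PrimarySite

# When the failed servers can't be contacted at all, the cmdlet's
# -ConfigurationOnly switch updates Active Directory without trying to
# reach them (left commented out; use it instead of the line above):
# Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite PrimarySite -ConfigurationOnly
```

The ConfigurationOnly form is one way to still get the failed servers onto the stopped list from a surviving server when the failed site is completely dark.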

2. Run Restore-DatabaseAvailabilityGroup at the backup site

Restore-DatabaseAvailabilityGroup performs several operations that activate a backup site after a datacenter failure. First, it forcibly evicts from the DAG any members that are on the StoppedMailboxServers list. Then it configures the remaining servers in the DAG to use the Alternate Witness Server to reestablish quorum for the DAG. At this point, you are recovered and your mailboxes should be back online. It’s really that easy.
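Step 2 might look something like this. Again a sketch, not a runbook: DAG1, BackupSite, BACKUP-FS01, BACKUP-EX01, and the witness directory path are all hypothetical names, and your environment may need extra steps that aren't covered here.

```powershell
# Hypothetical names: DAG1, BackupSite, BACKUP-FS01, BACKUP-EX01
# Activate the surviving DAG members in the backup site.
Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite BackupSite

# If no alternate witness was preconfigured on the DAG, it can be supplied
# at restore time instead (left commented out; use in place of the line above):
# Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite BackupSite `
#     -AlternateWitnessServer BACKUP-FS01 -AlternateWitnessDirectory C:\DAGFileShareWitnesses\DAG1

# Check which members the DAG now considers started and stopped.
Get-DatabaseAvailabilityGroup -Identity DAG1 -Status |
    Format-List Name, StartedMailboxServers, StoppedMailboxServers

# Check the state of the database copies on one of the surviving servers.
Get-MailboxDatabaseCopyStatus -Server BACKUP-EX01
```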

3. Run Start-DatabaseAvailabilityGroup at the failed site once you are ready for those servers to rejoin the DAG.

Start-DatabaseAvailabilityGroup rejoins the failed DAG members to the cluster and gets things back to normal. There really is not a whole lot more to the process than that.
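For step 3, a minimal sketch using the same made-up names as before (DAG1 and PrimarySite are placeholders):

```powershell
# Hypothetical names: DAG1 (the DAG), PrimarySite (the recovered AD site)
# Bring the recovered site's servers back into the DAG.
Start-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite PrimarySite

# Re-running Set-DatabaseAvailabilityGroup with no other parameters is
# commonly done afterwards so the DAG re-evaluates its quorum and witness settings.
Set-DatabaseAvailabilityGroup -Identity DAG1

# Confirm that the stopped list is empty again.
Get-DatabaseAvailabilityGroup -Identity DAG1 -Status |
    Format-List Name, StartedMailboxServers, StoppedMailboxServers
```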

This is truly a simple process. The simplicity of this process is part of the reason why I am not such a big fan of the third-site witness. It is not that much harder to do a manual datacenter switchover, and the process is much more predictable. If you are using a third-site witness, the behavior of your DAG is going to be highly dependent on the features and configuration of your network. I prefer to keep Exchange’s behavior firmly in the hands of the Exchange admins.