Source mailbox already being moved errors while moving mailboxes

It's been a long time since I wrote a blog post, but this one comes up a lot so I figured I'd do a write-up on it.

When moving mailboxes to Exchange 2010, Exchange 2013, Exchange 2016, or Exchange Online (Office 365), a common error you can hit is SourceMailboxAlreadyBeingMovedTransientException.  What this appears to mean is that the source mailbox is currently locked for another move and you can't start another one.  In reality, it's a bit simpler and more complex at the same time.

When moving a mailbox, the Mailbox Replication Service (MRS) sets an "InTransitStatus" flag on the source mailbox to make sure other moves don't try to act on this source mailbox at the same time.  This flag is really just held in memory in the source Information Store (Store) process (Store.exe for 2010 and Microsoft.Exchange.Store.Worker.exe for 2013 and 2016).  This way, if a different MRS instance attempts to move the mailbox, it will receive an error until this "lock" is released.  The flag doesn't have to be reset explicitly; it is released when the Exchange RPC session between the on-premises MRS Proxy and the source mailbox Store is closed.

MRS Proxy (for those who aren't familiar) is a Windows Communication Foundation (WCF) web service exposed for moving mailboxes in Exchange 2010 and later.  It runs on-premises and is hosted at /EWS/MRSProxy.svc; however, it's actually not part of Exchange Web Services (EWS) at all.  In Exchange 2010, it is a WCF web service hosted in IIS by the ServiceAppPool w3wp.exe instance on the Client Access Servers in the environment.  In Exchange 2013 and 2016, it is a WCF web service hosted inside the Mailbox Replication Service (MSExchangeMailboxReplication.exe) on the Mailbox servers in the environment.  The 2013/2016 CAS servers simply proxy the HTTP traffic to this back-end web service.  This is why the MRSProxy config settings for 2010 are in the web.config for EWS on the CAS, but in 2013 and 2016 they are in the MRS configuration file (MSExchangeMailboxReplication.exe.config) on the Mailbox servers.

This session-based locking scheme all works fine until you hit an error.  Networks are full of traffic, and hitting transient errors is inevitable.  When the cloud MRS service hits a communication error with the on-premises MRS Proxy, it has to recreate the WCF reliable session it had previously established; however, the associated RPC session is left behind.  Since the source on-premises MRS Proxy may not even know that the cloud MRS encountered this error, the session sticks around and so does the InTransitStatus flag in the source Store.  The cloud MRS has retry logic for transient errors, so within a short period it will try to resume the move on a new WCF reliable session.  Since the old session hasn't expired on the source Store, you get a SourceMailboxAlreadyBeingMovedTransientException.  You can see this in the move report.

Because that flag is tied to the RPC session between the on-premises MRS Proxy and the Store, its lifetime depends on the normal keep-alive behavior of Windows TCP sessions (specifically KeepAliveTime).  By default in Windows, KeepAliveTime is set to 2 hours, but the cloud MRS is going to retry the move every few minutes.  This can accumulate a ton of SourceMailboxAlreadyBeingMovedTransientException errors before the move can resume, and it can slow down your move tremendously.

To recover from these errors faster, we recommend setting the KeepAliveTime on the server hosting the source mailbox to 5 minutes (instead of 2 hours).  This can be done in the registry of the source mailbox server but requires a reboot to take effect.
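
As a rough sketch of that registry change (not an official script): the TCP KeepAliveTime value lives under the Tcpip\Parameters key and is a DWORD expressed in milliseconds, so 5 minutes is 300000.  The variable names here are only for illustration, and this assumes an elevated PowerShell session on the server hosting the source mailbox.

```powershell
# Assumption: run from an elevated PowerShell session on the source mailbox server.
$tcpipParams = 'HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters'

# KeepAliveTime is stored in milliseconds: 5 minutes = 300000 ms
# (the Windows default of 2 hours corresponds to 7200000 ms).
$fiveMinutesMs = 5 * 60 * 1000

# Create the value, or overwrite it if it already exists (-Force).
New-ItemProperty -Path $tcpipParams -Name 'KeepAliveTime' -PropertyType DWord -Value $fiveMinutesMs -Force | Out-Null

# Read the value back to confirm; it only takes effect after a reboot.
(Get-ItemProperty -Path $tcpipParams -Name 'KeepAliveTime').KeepAliveTime
```

Keep in mind this is a machine-wide TCP setting, so it affects every TCP connection on that server and not just the MRS Proxy sessions.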

Important: If you're getting these errors frequently it probably means you're hitting other transient exceptions right before these errors start.  This error is never the actual problem, just a symptom of another problem that already happened.  A good way to see the cause for a string of these errors is to look at the failure right before it.  You could do something like this in Exchange Management Shell:

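# Replace <user> with the identity of the mailbox being moved
$stats = Get-MoveRequestStatistics -Identity '<user>' -IncludeReport
# Earliest "mailbox already being moved" failure in the move report
$firstMailboxLockedError = $stats.Report.Failures | Where-Object { $_.FailureType -eq 'SourceMailboxAlreadyBeingMovedTransientException' } | Sort-Object Timestamp | Select-Object -First 1
# Most recent failure logged before it, which is the likely trigger
$stats.Report.Failures | Where-Object { $_.Timestamp -lt $firstMailboxLockedError.Timestamp } | Sort-Object Timestamp -Descending | Select-Object -First 1 | Format-List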

Hope this helps you understand this error.

Update 2/21/2017: Created a new post for a common root cause of this error: It's always the Load Balancer