Enough With The Redirects Already

Here’s the scenario the customer gave us: They use Outlook’s MAPI and a single MAPI profile to connect to an admin mailbox, and then use CreateStoreEntryID and OpenMsgStore to open various work mailboxes to do their processing. All day they’re opening and closing these mailboxes. In some environments, after they run for some time, one of the OpenMsgStore calls will return MAPI_E_FAILONEPROVIDER. There were a number of variations of the issue, as we’ll discuss below, but what they all had in common was that they were rapidly logging in to mailboxes, at least 5 times over a 10-second span.

This was key to the issue. When we traced the process and debugged the reason for the failure, we saw the Exchange server was returning ecServerPaused when we got the error. Digging further back, we saw we were getting ecWrongServer on earlier, successful connections. If you read through the documentation on these errors, you’ll see this is a defense mechanism for Exchange. The ecWrongServer error, which never bubbles up to the caller, indicates that the mailbox requested is not on the server we’re talking to. This initiates a redirect conversation to find the right server. This negotiation is expensive, especially since the server conducting it shouldn’t have been contacted in the first place. So if the same client triggers too many redirects, Exchange returns ecServerPaused as a signal to the client that maybe it’s doing something wrong. You can find this error by calling GetLastError after OpenMsgStore fails. Outlook itself will use this information to trigger an update to the profile to ensure it’s talking to the right server.

So, armed with this information, we went looking at the customer’s code to see what they were doing wrong. They would connect to mailbox Homer on server Alpha, obtain the DNs for mailbox Bart and server Beta, and pass these in to CreateStoreEntryID. They would then use this entry ID, and still see the problem if they logged in rapidly. They did nothing wrong and still had a problem.

This is where things get complicated. First, let’s look at the output of CreateStoreEntryID (as parsed with MFCMAPI):

MAPI Message Store Entry ID:
abFlags = 0x00000000
Provider GUID = {10BBA138-E505-1A10-A1BB-08002B2A56C2} = muidStoreWrap
Version = 0x00 = MAPIMDB_VERSION
Flag = 0x00 = MAPIMDB_NORMAL
DLLFileName = EMSMDB.DLL
Wrapped Flags = 0x00000000
WrappedProviderUID = {20FA551B-66AA-CD11-9BC8-00AA002FC45A} = g_muidStorePrivate
WrappedType = 0x0000000C = OPENSTORE_HOME_LOGON | OPENSTORE_TAKE_OWNERSHIP
ServerShortname = Beta
MailboxDN = /o=First Organization/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Recipients/cn=Bart

Note that the entry ID contains the server name “Beta”, but not the full DN, even though the full DN was passed in to CreateStoreEntryID. This is because the entry ID format generated by CreateStoreEntryID does not support using the full server DN. So only the short name is kept, and that’s all emsmdb32 has to work with. To talk to the server, we need the full DN, so we have to manufacture a DN using the information we have at hand. We do this by grabbing the DN of a known good server (in this example, Alpha) and replacing the server name. We then have a dilemma: this new DN may point to a real server, or it may be total garbage. For instance, if the OUs of the servers are different, the made-up DN doesn’t exist. So instead of using this newly constructed DN to connect, we connect to the only server we know is functioning, using the server DN of the mailbox in the profile.
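To make the “replace the server name” step concrete, here is a minimal sketch. It is not the emsmdb32 code. GuessServerDN is a hypothetical helper, and it assumes the server DN ends in a lowercase /cn=<server> component.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not a MAPI or emsmdb32 function.
// Takes the DN of a known good server and swaps its final "/cn=" component
// for the short server name recovered from the store entry ID.
// Assumption: the DN ends in "/cn=<server>" (lowercase, as in this post).
std::string GuessServerDN(const std::string& knownGoodServerDN,
                          const std::string& serverShortName)
{
    const std::string marker = "/cn=";
    const std::string::size_type pos = knownGoodServerDN.rfind(marker);
    assert(pos != std::string::npos && "expected a DN ending in /cn=<server>");

    // Keep everything up to and including the last "/cn=", then append the
    // short name. Nothing here checks that the result names a real server.
    return knownGoodServerDN.substr(0, pos + marker.size()) + serverShortName;
}
```

If Alpha and Beta share everything but the last component, the guess is Beta’s real DN. If Beta lives under a different OU, the guess names a server that doesn’t exist, which is exactly the dilemma described above.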

Of course, this is wrong and we get redirected, and if we repeat this dance several times we’ll hit ecServerPaused. We recognized the potential to do better by implementing a cache of recently used DNs. If the DN we constructed was on this list, we’d go ahead and try it. So the first time we connect to Beta, we redirect through Alpha, but subsequent connections will use Beta’s DN since it’s in the cache. This fix first showed up in Outlook 2003 and 2007 via 929307 and 937949 respectively. So I had the customer check that their customers’ Outlook installations were up to date. Sure enough, one of their clients wasn’t, and updating it fixed the problem. This didn’t fix the others though, so we debugged again.

The next site we looked at had an updated Outlook, but the two servers were in different OUs. We’d build a (wrong) DN for Beta using Alpha’s DN. This DN wouldn’t be in our cache, so we’d connect to Alpha for a redirect. We’d then cache the (correct) DN for Beta. This same sequence of steps would repeat the next time we log in to Beta. As long as the two servers are in different OUs, we’ll always redirect. This customer was able to move mailboxes around and eliminate the problem.
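The interaction between the guessed DN and the cache is easier to see in a toy model. The sketch below is not how emsmdb32 is written. RedirectModel and everything in it are made up for illustration, and the “directory” map stands in for whatever the redirect conversation with Exchange returns.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Toy model of the behavior described above. All names are hypothetical.
struct RedirectModel {
    std::string profileServerDN;                   // server in the profile (Alpha)
    std::map<std::string, std::string> directory;  // short name -> real server DN
    std::set<std::string> recentDNs;               // cache of DNs learned from redirects
    int redirects = 0;

    // Same guess as before: swap the last "/cn=" component for the short name.
    static std::string Guess(const std::string& goodDN, const std::string& shortName) {
        const std::string::size_type pos = goodDN.rfind("/cn=");
        assert(pos != std::string::npos);
        return goodDN.substr(0, pos + 4) + shortName;
    }

    // Returns true if this logon had to go through a redirect.
    bool Logon(const std::string& shortName) {
        const std::string guess = Guess(profileServerDN, shortName);
        if (recentDNs.count(guess) != 0) {
            return false;  // guess matches a DN we have seen: connect directly
        }
        // Unknown guess: fall back to the profile's server and let it answer.
        const std::string realDN = directory.at(shortName);
        if (realDN == profileServerDN) {
            return false;  // the mailbox really is on the profile's server
        }
        ++redirects;               // ecWrongServer, then the redirect conversation
        recentDNs.insert(realDN);  // remember the correct DN for next time
        return true;
    }
};
```

When Beta’s real DN matches the guess, only the first logon redirects. When Beta is in a different OU, the cached DN never equals the guess, so every logon redirects no matter how warm the cache is.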

We then encountered a site with a single OU and an up-to-date Outlook where the problem still happened. A close look at a trace of the problem revealed that in this particular configuration, the customer’s code managed to log all the way out of MAPI in between mailbox logons. So the cache we went through all the trouble to maintain was torn down and rebuilt on every pass, making it totally ineffective. Keeping one mailbox open the entire time was enough to keep the cache alive.

We were able to get their code running in every environment except the ones where they couldn’t move their mailboxes and had to keep them in different OUs. For those, the only workaround they can use right now is to slow down a bit. Of course, things would be easier if there were a documented entry ID format that included the FQDN. Maybe one day there will be…
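For anyone wondering what “slow down a bit” might look like in code, here is one possible sketch: a sliding-window pacer that tells the caller how long to wait before the next logon. LogonPacer is a hypothetical helper, not part of MAPI. The limit you give it is your own guess, since all we observed was that trouble started at around 5 logons in 10 seconds, not the exact threshold Exchange applies.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>

// Hypothetical helper, not part of MAPI: allows at most maxLogons logons
// in any sliding window of the given length.
class LogonPacer {
public:
    using Clock = std::chrono::steady_clock;
    using Millis = std::chrono::milliseconds;

    LogonPacer(std::size_t maxLogons, Millis window)
        : maxLogons_(maxLogons), window_(window)
    {
        assert(maxLogons > 0);
    }

    // Reserves a slot for the next logon and returns how long the caller
    // should wait, measured from 'now', before actually logging on.
    Millis Reserve(Clock::time_point now)
    {
        // Forget logons that have aged out of the window.
        while (!slots_.empty() && now - slots_.front() >= window_) {
            slots_.pop_front();
        }
        if (slots_.size() < maxLogons_) {
            slots_.push_back(now);
            return Millis(0);  // room in the window: go right away
        }
        // Window is full: the next free slot opens when the oldest one expires.
        const Clock::time_point when = slots_.front() + window_;
        slots_.pop_front();
        slots_.push_back(when);
        return std::chrono::duration_cast<Millis>(when - now);
    }

private:
    std::size_t maxLogons_;
    Millis window_;
    std::deque<Clock::time_point> slots_;  // logon times, oldest first
};
```

The idea would be to call Reserve with the current time before each OpenMsgStore and sleep for whatever it returns. Constructing it as LogonPacer(4, std::chrono::seconds(10)) stays just under the rate we saw cause trouble, but that margin is an assumption, not a documented limit.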

Update: I just documented the v2 Store Entry ID format.