The case of the 32-bit program that tries to load a 64-bit DLL


One of our escalation engineers wrote up an epic case study of a particular problem that he was able to solve. I'll retell the story here, but redacted, and significantly abbreviated.

A customer reported that when their 32-bit program called IUpdateSearcher::BeginSearch, the resulting dialog box was empty. As in the dialog box had no controls on it.

Debugging revealed that the dialog box tried to load the common controls, but instead of loading the 32-bit common controls library, it was trying to load the 64-bit common controls library. Since 32-bit processes cannot load 64-bit DLLs, this attempt fails, and consequently, no dialog controls appear.

The escalation engineer reported that the customer originally had five clients with this problem. Over time, the number of clients with the problem decreased to one. But not because they managed to solve the problem. The number of clients with the problem went down to one because the other four clients canceled their contracts and went with a competitor.

The escalation engineer eventually got access to a system that reproduced the problem (a story within a story that I'm redacting for space), and he found that the behavior on the problem system was very different from that on a clean system. In particular, the test scenario took ten times as long to run on the problem system as it did on a clean system. On the problem system, there was a DLL that was furiously allocating and freeing memory. The customer insisted that the DLL was not the problem, however.

I'm redacting another part of the story where the customer claims to have a virtual machine that reproduces the problem, but it turns out that they don't.

Eventually, the customer admits that they just learned that the problem occurs only on systems with that mysterious DLL installed.

Okay, well, it took a long time to get there, but at least we have something that is common to all failures. The mysterious DLL is part of an anti-malware system that bills itself as something like "The most advanced anti-malware platform." Okay, mister advanced anti-malware, let's take a look at what you're up to.

Application compatibility guru extraordinaire Gov Maharaj suggested taking a look at Wow64DisableWow64FsRedirection and Wow64RevertWow64FsRedirection and making sure they were matched up. And they were. So much for that theory.

Another week of debugging passes (mixed in with various failed attempts to get a virtual machine that reproduces the problem), and Gov's suggestion to look at Wow64DisableWow64FsRedirection echoes like a flashback scene in a cheesy movie. File system redirection is disabled and re-enabled thousands of times during the running of the scenario. There were many instances of the code in the DLL, and reverse-engineering revealed that they all looked like this:

// disable redirection while we do something
void* previousState;
if (Wow64DisableWow64FsRedirection(&previousState)) {
   ... do something ...
   // All done. Return to the previous state.
   Wow64RevertWow64FsRedirection(previousState);
}

except that there was one case where the code was subtly different:

// The line marked WRONG below is the bug.
// disable redirection while we do something
void* previousState;
if (Wow64DisableWow64FsRedirection(&previousState)) {
   ... do something ...
   // All done. Return to the previous state.
   Wow64RevertWow64FsRedirection(&previousState); // WRONG: passes the address of the cookie, not the cookie
}

The anti-malware DLL fell into the trap set by the unfortunate choice of data type for the file system redirection cookie: the cookie is a void*, so passing its address (a void**) compiles without complaint. Furthermore, even though the code had a bug, the address of the redirection cookie was lucky enough to look enough like a genuine redirection cookie that it successfully restored the redirection state to what it should have been. But one time out of those thousands, its luck ran out, and it didn't restore the state properly.
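
To see why the compiler stays silent about this mistake, here is a minimal sketch. It does not call the real Windows functions: ToyRevertRedirection, g_receivedCookie, RevertReceivesSavedCookie, and CheckBothCalls are hypothetical names invented for illustration, and the stand-in only records the value it was handed.

```cpp
#include <cassert>
#include <type_traits>

// Any object pointer, including void**, converts implicitly to void*.
// That is why passing &previousState where previousState was meant
// compiles without a diagnostic.
static_assert(std::is_convertible<void**, void*>::value,
              "void** silently converts to void*");

// Hypothetical stand-in for the revert function (NOT the Windows API):
// it only remembers the cookie value it received.
void* g_receivedCookie = nullptr;

void ToyRevertRedirection(void* cookie) {
    g_receivedCookie = cookie;
}

// Returns true if the revert call received the same cookie value that
// was saved in previousState.
bool RevertReceivesSavedCookie(bool useBuggyCall) {
    void* previousState = nullptr;  // pretend a disable call stored a cookie here
    if (useBuggyCall) {
        ToyRevertRedirection(&previousState);  // bug: address of the cookie
    } else {
        ToyRevertRedirection(previousState);   // correct: the cookie itself
    }
    return g_receivedCookie == previousState;
}

// Usage: the correct call hands back the saved cookie, the buggy one does not.
void CheckBothCalls() {
    assert(RevertReceivesSavedCookie(false));
    assert(!RevertReceivesSavedCookie(true));
}
```

Both calls compile, but only the correct one gives the revert function the value that was saved. What the real function does with a wrong value is not modeled here.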

That one call left file system redirection disabled, and thereafter, all the correct code sequences which temporarily disabled redirection would restore the redirection to its previous state, which was now "disabled" rather than the "enabled" it was supposed to be.
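
The way a single bad revert poisons every later, correct sequence can be illustrated with a toy model. This is a sketch under loud assumptions: the per-thread state is reduced to a two-value enum, ToyDisable and ToyRevert are hypothetical stand-ins rather than the Windows functions, and the unlucky call is modeled simply as a revert that has no effect.

```cpp
#include <cassert>

enum class Redirection { Enabled, Disabled };

// Toy per-thread redirection state (an assumption of this model).
thread_local Redirection g_redirection = Redirection::Enabled;

// Hypothetical stand-in: disable redirection and return the previous state.
Redirection ToyDisable() {
    Redirection previous = g_redirection;
    g_redirection = Redirection::Disabled;
    return previous;
}

// Hypothetical stand-in: restore whatever state the caller saved.
void ToyRevert(Redirection previous) {
    g_redirection = previous;
}

// The correct pattern: save, work, restore.
void CorrectSequence() {
    Redirection previous = ToyDisable();
    assert(g_redirection == Redirection::Disabled);  // ... do something ...
    ToyRevert(previous);
}

// The one unlucky call: the revert was handed a garbage cookie, which
// this model treats as "nothing was restored".
void UnluckySequence() {
    ToyDisable();
    // ... do something ...
    // (revert with a bad cookie has no effect in this model)
}

// Runs correct sequences, optionally one unlucky sequence, then more
// correct sequences, and reports the final state.
Redirection RunScenario(int correctBefore, bool includeUnlucky, int correctAfter) {
    g_redirection = Redirection::Enabled;
    for (int i = 0; i < correctBefore; ++i) CorrectSequence();
    if (includeUnlucky) UnluckySequence();
    for (int i = 0; i < correctAfter; ++i) CorrectSequence();
    return g_redirection;
}
```

In this model, RunScenario(1000, false, 1000) ends with redirection enabled, while RunScenario(1000, true, 1000) ends with it disabled: every correct sequence after the unlucky one faithfully restores the wrong state.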

To verify that this needle-in-a-haystack situation was the root cause, the escalation engineer patched the DLL to fix the bug, and the test scenario ran to completion successfully.
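
The story does not show the patch itself. As a hedged sketch of one way to make this class of mistake harder to write, the disable/revert pair can be wrapped in a scope guard so that it is written once rather than at every call site. Everything here is hypothetical: ToyDisableRedirection, ToyRevertRedirection, RedirectionCookie, and ScopedRedirectionDisable are illustrative names over a stand-in state flag, not the Windows API, and unlike the real disable function the stand-in cannot fail.

```cpp
#include <cassert>

// Toy per-thread state flag (an assumption of this sketch).
thread_local bool g_redirectionDisabled = false;

// A distinct cookie type: unlike void*, a pointer to it does not
// implicitly convert to the cookie itself.
struct RedirectionCookie {
    bool wasDisabled;
};

// Hypothetical stand-in for the disable call.
RedirectionCookie ToyDisableRedirection() {
    RedirectionCookie cookie{g_redirectionDisabled};
    g_redirectionDisabled = true;
    return cookie;
}

// Hypothetical stand-in for the revert call.
void ToyRevertRedirection(RedirectionCookie cookie) {
    g_redirectionDisabled = cookie.wasDisabled;
}

// Scope guard: disables on construction, restores on destruction.
class ScopedRedirectionDisable {
public:
    ScopedRedirectionDisable() : cookie_(ToyDisableRedirection()) {}
    ~ScopedRedirectionDisable() { ToyRevertRedirection(cookie_); }

    ScopedRedirectionDisable(const ScopedRedirectionDisable&) = delete;
    ScopedRedirectionDisable& operator=(const ScopedRedirectionDisable&) = delete;

private:
    RedirectionCookie cookie_;
};

// Usage: redirection is off only inside the guard's scope.
bool DoWorkWithRedirectionDisabled() {
    ScopedRedirectionDisable guard;
    assert(g_redirectionDisabled);  // ... do something ...
    return g_redirectionDisabled;   // evaluated before the guard is destroyed
}
```

With the typed cookie, ToyRevertRedirection(&cookie) would be rejected by the compiler, and the guard keeps the narrow window that the pattern needs: off just before the work, restored immediately after.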

The customer went back to the vendor of the anti-malware software to get an updated version of the DLL, and that fixed the problem.

Comments (36)
  1. DWalker07 says:

    "File system redirection is disabled and re-enabled thousands of times during the running of the scenario."

    Is that normal? What does it do for performance?

    1. Joshua says:

      If you read the recommendations and must operate on x64 program files from x86 you too will do this. Disclaimer: I do not do this. I find ways to avoid it.

      1. SimonRev says:

        Raymond strongly implies that someones antivirus is the one changing this setting (which as he would put it is akin to a houseguest deciding to remodel your home without asking).

        At first I assumed those functions were process wide, but since they are only thread wide, I suppose it is safe to do as long as you clean up after yourself (and don't have a bug like the code in the example).

        1. Ken in NH says:

          The effect is not even process wide; it only affects the current thread. From MSDN:

          Note The Wow64DisableWow64FsRedirection function affects all file operations performed by the current thread, which can have unintended consequences if file system redirection is disabled for any length of time. For example, DLL loading depends on file system redirection, so disabling file system redirection will cause DLL loading to fail. Also, many feature implementations use delayed loading and will fail while redirection is disabled. The failure state of the initial delay-load operation is persisted, so any subsequent use of the delay-load function will fail even after file system redirection is re-enabled. To avoid these problems, disable file system redirection immediately before calls to specific file I/O functions (such as CreateFile) that must not be redirected, and re-enable file system redirection immediately afterward using Wow64RevertWow64FsRedirection.

          Given that information, my assumption is that the anti-malware DLL is injected into the customer's process and interposes itself between the process and some API call. So maybe some thread makes a call that takes a detour through this bug in the anti-malware and then has a failed delay-load of the 64-bit DLL which hoses it for the whole process lifetime.

          1. DWalker07 says:

            Wow, yes, that's probably what's happening. Thanks for the additional background.

      2. ender says:

        Filesystem redirection doesn't affect Program Files - only System32/SysWoW64.

        1. Tanveer Badar says:

          Then why is there a Program Files (x86) folder in the first place?

          1. DWalker07 says:

            Your favorite Web search tool will give the answer! Also, you can search Raymond's blog. "File system redirection".

  2. ZLB says:

    Just searched "The most advanced anti-malware platform". Turns out they all are!

    Sorry tale all around really.

    Should have wrapped those FS redirect calls up in a RAII class.

    1. SimonRev says:

      Yeah, frankly that shocked me -- why they would hand code the redirection stuff each time you needed it (and get it right every time) instead of having a class to do that for them (and only get it right once) tells you all you need to know about that particular antivirus vendor.

  3. mikeb says:

    Wow.
    Wow64!

  4. alegr1 says:

    The lesson here is: Whoever decided to come up with that redirection kludge, screw you. And whoever felt they need to enable and disable it like crazy, screw you, too. Windows\Sysnative exists for a reason (although the reason is that you need to get around that redirection kludge).

    Every time Windows does things behind your back, you're screwed.

    1. voo says:

      Soo your solution would've been to just not allow any 32bit applications under 64-bit Windows? That does not seem like a particular viable solution either.

      In practice the redirects are not a problem for the vast, vast majority of programs. If you need to do something special you'll need additional code and complexity - that's quite fair.

      1. Joshua says:

        System64.

        1. Ken in NH says:

          Wow. I'm so excited that we're moving to 64 bit OSes and processes! I'm going to release my program as 64 bit binaries.

          [Frobs the bitness drop down in IDE.]
          [Recompiles]
          [Execute aaaanndd...program goes down in flames.]

          Windows sucks.

          https://technet.microsoft.com/en-us/library/ff955767.aspx

          1. Neil says:

            I don't remember that being a problem during the 16→32 bit transition, although I'm sure this blog has an article that will prove me wrong. (Most of its articles seem to.)

          2. Ray Joyal says:

            So because some programmers can't do proper pointers and datatypes and expect their compiler to magically fix their stuff, every sysadmin is now stuck with:
            System32 on 32 bit = 32 bit
            System32 on 64 bit = 64 bit
            SysWow64 on 64 bit = 32 bit
            blech.
            I think the Program Files team did it better, but they also managed to make it confusing with Program Files (x86) because if you're going with a "user friendly" name like "Program Files" why add the technical x86 name instead of 32/64?

            I guess we could have had System32 being a junction to SystemX64 and System64 being a junction to SystemX86.

            Oh, to have one of Raymond's time machines....

          3. Ken in NH: What didn't work? Everything has the same names as on 32-bit (down to the "system32") exactly so that you can just flip your IDE to "Target=x64".

            Ray Joyal: If they were called "Program Files (32)" and "Program Files (64)", then what would you call the directories on the Alpha? Native binaries were 32-bit Alpha AXP. Emulated binaries were 32-bit x86. Would you call them "Program Files (32)" and "Program Files (32) (no serious)"?

          4. Yuhong Bao says:

            There is already GetSystemDirectory though. Batch files might be more of an issue however.

          5. Michael says:

            Raymond, I think Ken in NH was suggesting that if Joshua's suggestion of System64 for the 64bit binaries, then "program goes down in flames" would happen with a final result of "Windows sucks", but since it remained System32 for the "native" bitness, it worked, for bad programs that hardcoded System32, and for not-so-bad programs that had no choice but to.

          6. Joshua says:

            @Raymond: sizeof(ptrdiff_t) != sizeof(int) anymore

            This exploded for us in production when realloc() finally moved a buffer more than 2GB.

          7. Simon Kissane says:

            @Neil, I think the 16→32 bit transition was more seamless in that you could mix 16-bit and 32-bit code in the same process.

            Raymond argues https://blogs.msdn.microsoft.com/oldnewthing/20081020-00/?p=20523 that isn't feasible due to the different address space sizes.

            I'm unconvinced, because other vendors manage this. If you look at z/OS, you can mix 24-bit, 31-bit and 64-bit addressing in the same process. Of course, that is a different CPU architecture – so maybe this is easier to implement on z/Architecture than x86/x64, but I'm unconvinced it would be impossible on x86/x64.

            In fact, strictly speaking, all 32-bit processes on 64-bit Windows are mixed 32-bit/64-bit code because WoW64 is mixed 32-bit/64-bit code. It's just that Microsoft has decided not to support such mixing outside of the internal details of the WoW64 implementation. If WoW64 can do it internally, then why can't other folks do it too?

            Also, a lot of this filesystem redirection stuff could have been avoided if Microsoft adopted fat binaries like how Apple has. Then they could have shipped a single DLL including both 32-bit and 64-bit code, and both 32-bit and 64-bit software can load the same DLL path and get the version of the code they need.

    2. xcomcmdr says:

      > Every time Windows does things behind your back, you’re screwed.

      I don't think so.

      Every time I enable some compatibility shims on some crazy-old game so that Windows does stuff behind its back, the game starts to work on Windows 10.

      So, screw you ! :p

      1. alegr1 says:

        Oh, when *you* enable compatibility shims, it's not Windows doing things behind your back.

        When Windows decides by some heuristics that it needs compatibility shims for some app, no matter what you say, *this* is doing things behind your back. "Oh, the app you are debugging crashed; how about we tweak some knobs for it?"

        1. xcomcmdr says:

          You are really getting angry over nothing.

          I'm very grateful Windows has a very long list of known problematic games / apps and knows how to make them work instantly. It saves me the headache of finding the fix myself.

          Besides, the question Windows asks when an app/setup crashed does nothing if you ignore it. There's nothing done behind your back in this case.

  5. cheong00 says:

    [The number of clients with the problem went down to one because the other four clients canceled their contracts and went with a competitor. ]

    I wonder... if it's antivirus DLL's bug that cause a common control DLL not loading, how come the same behavior not exhibited on competitor's system? (It's not like antivirus is not mandatory these days) Or worse, affects other windows applications as well?

    If so, that should be a pretty visible symptom. I think that company is in real bad luck to have such a problem readily reproducible in their application only.

    1. Because the competitors didn't make the mistake of writing "Wow64RevertWow64FsRedirection(&previousState);"

    2. Alex Cohn says:

      Possibly the competitors had 64-bit app

      1. cheong00 says:

        @Alex Cohn: That could explain it. :)

        Btw, want to reply with Reply button and greeted with "'addComment' is undefined" javascript error. The address on address bar is updated though.

  6. cheong00 says:

    I think the faulty code is on the anti-malware DLL, not the DLL from the customer's application itself.

    1. Jamie says:

      Good sleuthing...
      "The customer went back to the vendor of the anti-malware software to get an updated version of the DLL, and that fixed the problem."

  7. Drak says:

    Any hint of when this happened (year, or year /month)? We had a mysterious period where 2 of our clients had the same type of problems, and after disabling their anti-virus it went away, Later on, re-enabling the anti-virus did not reproduce the problem anymore so we figured the anti-virus was updated.

    1. This happened in Q2 2016.

      1. Joshua says:

        Excellent use of unmasking exactly enough information. If it's a (key) collision he can reasonably check but nobody else can.

      2. Drak says:

        Thanks Raymond. It was possibly the same issue.

  8. Azarien says:

    So... it turns out that Wow64EnableWow64FsRedirection wasn't that bad after all. At least it always does what you really want.

Comments are closed.
