The case of the hung Explorer window


A Windows Insider reported that Explorer stopped responding whenever they opened their Downloads folder.

We were able to obtain a memory dump during the hang, and observed that most threads were waiting for the loader lock. The loader lock was being held by this thread:

ntdll!RtlpWaitOnCriticalSection
ntdll!RtlpEnterCriticalSectionContended
GdiPlus!GdiplusStartupCriticalSection::{ctor}
GdiPlus!GdiplusStartup
ShellExtension+...
ShellExtension+...
ShellExtension+...
ntdll!LdrpCallInitRoutine
ntdll!LdrpInitializeNode
ntdll!LdrpInitializeGraphRecurse
ntdll!LdrpInitializeGraph
ntdll!LdrpPrepareModuleForExecution
ntdll!LdrpLoadDllInternal
ntdll!LdrpLoadDll
ntdll!LdrLoadDll
KERNELBASE!LoadLibraryExW
[...]
combase!CoCreateInstanceEx
combase!CoCreateInstance
windows_storage!_SHCoCreateInstance
windows_storage!CRegFolder::_CreateCachedRegFolder
windows_storage!CRegFolder::_CreateCachedRegFolder
windows_storage!CRegFolder::_BindToItem
windows_storage!CRegFolder::BindToObject
windows_storage!CShellItem::_BindToHandlerLegacy
windows_storage!CShellItem::BindToHandler
[...]
explorerframe!CNscEnumTask::InternalResumeRT
explorerframe!CRunnableTask::Run

This thread was waiting on a GDI+ critical section, which was being held here:

KERNELBASE!WaitForSingleObjectEx
GdiPlus!BackgroundThreadShutdown
GdiPlus!InternalGdiplusShutdown
GdiPlus!GdiplusShutdown
shell32!CGraphicsInit::~CGraphicsInit
shell32!CImageFactory::{dtor}
shell32!CImageFactory::`scalar deleting destructor'
shell32!CImageFactory::Release
shell32!IsImageSizeSufficientForRequestedSize
shell32!_ExtactIconFromImage
shell32!_ExtractIconsFromImage
shell32!ExtractIconsUsingResourceManager
shell32!_ExtractIcons
shell32!SHDefExtractIconW
[...]
windows_storage!CLoadSystemIconTask::InternalResumeRT
windows_storage!CRunnableTask::Run
windows_storage!CShellTask::TT_Run
windows_storage!CShellTaskThread::ThreadProc
windows_storage!CShellTaskThread::s_ThreadProc

It should now be clear what the problem is.

On the second thread, GDI+ is shutting down because its last client decided to uninitialize it. (In this case, the last client was the system image list, which was extracting the icon for a Store app. Store app icons are PNG files, which is why GDI+ entered the picture.)

GDI+ is waiting for its worker thread to exit so it can finish cleaning up.

Just at this moment, the folder tree was populating itself on the first thread, and it found a third party shell extension. It dutifully loaded the third party shell extension (because that's what shell extensions are for), and that shell extension, as part of its DLL_PROCESS_ATTACH, tried to initialize GDI+.

Here comes the deadlock.

GDI+ was prepared for the possibility that somebody would try to initialize GDI+ while GDI+ was already in the process of shutting itself down. It solves this problem by making the shutdown run to completion (seeing as it already started), and then starting a new initialization pass.

That shutdown is waiting for a worker thread to finish up and exit. But the thread cannot exit until it sends out its DLL_THREAD_DETACH notifications. And since DLL notifications are serialized, the DLL_THREAD_DETACH cannot be sent until the DLL_PROCESS_ATTACH completes. But the DLL_PROCESS_ATTACH for the third party shell extension is waiting for GDI+. There's our deadlock.
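The chain of waits can be written down as a tiny wait-for graph. This is only a sketch of the reasoning, not anything the loader or GDI+ actually does: the node labels are informal names for the parties in this story, and `BuildHangGraph` and `FindCycle` are hypothetical helpers made up for the illustration.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// "key cannot make progress until value does". The labels are informal
// names for the parties in this story, not real API names.
using WaitForGraph = std::map<std::string, std::string>;

// Hypothetical helper: the three waits described in the analysis.
WaitForGraph BuildHangGraph()
{
    return {
        // DllMain is blocked on the GDI+ critical section held by the shutdown.
        {"extension DLL_PROCESS_ATTACH", "GdiplusShutdown"},
        // The shutdown is blocked until the worker thread has exited.
        {"GdiplusShutdown", "GDI+ worker thread exit"},
        // The worker's DLL_THREAD_DETACH is serialized behind the attach.
        {"GDI+ worker thread exit", "extension DLL_PROCESS_ATTACH"},
    };
}

// Follows the chain of waits from `start`. Returns the parties on the cycle
// if the chain leads back to `start`, or an empty vector if somebody on the
// chain is free to make progress (or the chain loops without reaching start).
std::vector<std::string> FindCycle(const WaitForGraph& graph, const std::string& start)
{
    std::vector<std::string> path;
    std::string current = start;
    for (;;) {
        path.push_back(current);
        auto it = graph.find(current);
        if (it == graph.end()) {
            return {};  // this party is not waiting on anyone
        }
        current = it->second;
        if (current == start) {
            return path;  // we are back where we began: deadlock
        }
        if (std::find(path.begin(), path.end(), current) != path.end()) {
            return {};  // a loop exists, but it does not include `start`
        }
    }
}
```

Starting from the extension's DLL_PROCESS_ATTACH, the walk visits all three parties and arrives back at the start. Remove the first edge (that is, don't touch GDI+ from DllMain) and the chain ends at a party that can run.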

The root cause for this is that the third party shell extension is initializing GDI+ inside its DLL_PROCESS_ATTACH. This is already highly suspect even without any special insight into GDI+, and the suspicions are confirmed in the documentation for GdiplusStartup:

Do not call GdiplusStartup or GdiplusShutdown in DllMain or in any function that is called by DllMain.
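One common way to honor that rule is to defer the initialization until the first exported function that actually needs it. What follows is a minimal, portable sketch of that pattern, not the vendor's code and not a GDI+ sample: `StartGraphicsLibrary` is a hypothetical stand-in for the real startup call, and `DrawThumbnail` is a made-up entry point. The counter exists only so the behavior can be observed.

```cpp
namespace {

int g_startupCalls = 0;  // instrumentation for this sketch only

// Hypothetical stand-in for the real initialization call
// (for example, a call to GdiplusStartup). Returns true on success.
bool StartGraphicsLibrary()
{
    ++g_startupCalls;
    return true;
}

}  // namespace

// Called at the top of every exported function that needs the library,
// and never from DllMain. The function-local static is initialized exactly
// once, and C++11 guarantees that the initialization is thread-safe.
bool EnsureGraphicsInitialized()
{
    static const bool initialized = StartGraphicsLibrary();
    return initialized;
}

// Hypothetical exported entry point. A real one would go on to draw.
int DrawThumbnail(int size)
{
    if (!EnsureGraphicsInitialized()) {
        return -1;
    }
    return size;  // placeholder for the real work
}

// Exposes the instrumentation counter.
int GraphicsStartupCallCount()
{
    return g_startupCalls;
}
```

The matching shutdown has the same constraint: it needs its own explicit call outside DllMain, which this sketch leaves out.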

My guess is that the vendor who wrote this shell extension thinks that the rule doesn't apply to them because they passed SuppressBackgroundThread = true, thinking that by removing the background thread, they successfully avoided any deadlocks with another thread. It didn't occur to them that the other thread might not be the GDI+ background thread.

It also didn't occur to them that GDI+ might already be initialized with a background thread. Furthermore, suppose the component that initialized GDI+ first (with a background thread) uninitialized GDI+ first. That call to GdiplusShutdown will not shut down GDI+ because there is still an outstanding client. And then when their DLL unloads, they call GdiplusShutdown, and that will cause a true shutdown of GDI+, which includes shutting down that background thread that they thought they had suppressed.¹
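Here is that scenario as a toy model. It assumes only what the paragraph above describes: initialization is counted per client, the background thread setting is decided by whoever initializes first, and only the last shutdown is a true shutdown. `GraphicsLibraryModel` is a hypothetical class invented for this sketch; it is not how GDI+ is implemented.

```cpp
#include <cassert>

// Toy model of a library whose startup and shutdown calls are counted.
class GraphicsLibraryModel {
public:
    // Only the first client's choice about the background thread matters
    // (an assumption of this model); later clients just add a reference.
    void Startup(bool suppressBackgroundThread)
    {
        if (clients_ == 0) {
            backgroundThreadRunning_ = !suppressBackgroundThread;
        }
        ++clients_;
    }

    // Returns true if this call was the true shutdown AND it had to stop
    // a background thread, which is the step that can wait on another thread.
    bool Shutdown()
    {
        assert(clients_ > 0);
        --clients_;
        if (clients_ > 0) {
            return false;  // somebody else is still using the library
        }
        const bool hadThread = backgroundThreadRunning_;
        backgroundThreadRunning_ = false;
        return hadThread;
    }

    bool BackgroundThreadRunning() const { return backgroundThreadRunning_; }

private:
    int clients_ = 0;
    bool backgroundThreadRunning_ = false;
};
```

If the first client starts up with a background thread and the extension starts up second with the thread suppressed, the first client's shutdown does nothing, and the extension's shutdown is the one that has to stop the thread.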

So basically it was a bad idea all around.

I transferred this issue to the application compatibility team for outreach to the vendor, who happens to be a major corporation, so hopefully they can spare some developers to fix the deadlock.

Bonus chatter: Identifying the vendor was a bit tricky because of the extremely vague DLL name.

Bonus chatter: When I originally composed the email with my analysis of the bug, I wrote application compatibility outrage instead of application compatibility outreach. Unfortunately, I caught the mistake before hitting Send.

¹Closer investigation shows that my guess was incorrect. The code that calls GdiplusStartup leaves the background thread enabled, so I have no idea how this ever worked in isolation. It "works" only because the calls to GdiplusStartup and GdiplusShutdown are no-ops because somebody else initialized GDI+ first, and is still using GDI+ at the time they unload.

Comments (30)
  1. Damien says:

    Well, sure, it’s a bad way to write their code. But what’s the alternative? Understand windows conventions? **Read** the documentation?

    1. guest says:

      Of course not. The best alternative is to use an undocumented hook to bend the OS so that your bad code actually works.

      1. Antonio Rodríguez says:

        And that is, of course, what Microsoft does all the time, which explains why Windows is so buggy and hangs or bluescreens every five minutes. On the other hand, Microsoft only allows the outsider developers access to a meager subset of the APIs, so their work is more difficult and their applications can’t compete with Microsoft’s ones; which justifies, for example, why a certain open source office suite is slower than Microsoft Office. Or, at least, that is what the conspiranoic theory says :-) .

        1. Mike Caron says:

          If only M$ allows us to use their secret RenderOfficeDocument platform API, then everyone else could be fast too!

        2. Joshua says:

          Old troll is old; this was once true but cleaned up around year 2000.

          1. Antonio Rodríguez says:

            Not so old. Recently, a Linux user approached me after a computer history talk, and explained that Windows was so vulnerable to malware because “it didn’t have file level security control”. More than a decade after the Win9X family was phased out. Those are the same people that say that “Android isn’t Linux”, because, you know, Linux can’t have viruses, right? In both cases, they forget that malware developers will always target the dominant platform, and that no matter how secure an OS is, you can always sneak through the app store’s validation and trick the user into downloading/installing your trojan.

          2. smf says:

            I’m not sure what relevance your Linux user anecdote has.

            What version of windows are you running that bluescreens or hangs every five minutes? Even trying to develop applications on Windows 95 wasn’t that bad.

        3. roeland says:

          Ha ha ha.

          I once tried to use “a certain open source office suite” on Linux to make a scatter plot of a large amount of points (IIRC few 100,000 of them). It totally locked up the entire desktop for a few *minutes* trying to render that graph. Then I started a windows VM and opened Excel, which rendered the same graph in half a second.

          I should have tried “a certain open source office suite” on the VM. Maybe it would have needed a few minutes to render that graph as well, but at least I don’t think it would have locked up the entire desktop.

          1. tremors08 says:

            The open source suite would have run out of GDI objects (this is a poor attempt at a joke based on my recent VB6 experience)

  2. Christian says:

    What does this background thread do? How does GDI+ interact with event loops?

  3. SimonRev says:

    One thing that has always bothered me about DLLs is the lack of some sort of convenient place where you can put all of your complex initialization code. Something like: Ok, you are attached to the process, the loader lock is released, go wild with your initialization. (I suspect someone can tell me why that is a bad idea, but it seems like it would solve most of these DllMain issues that come up over and over).

    It seems like the two main approaches are explicit secondary initialization (like GDI+) or some sort of lazy initialization that you insert at the start of all of the functions you export from the DLL.

    1. Peter says:

      In my experience, complex DLL initialization is hardly ever necessary. When designing your entry points, if you require two entry points to share complex, internal state, then force the user to pass it in for you. For example:

      state = CreateState();
      Func1(state, a, b, c);
      Func2(state, d, e, f);
      DestroyState(state);

      (And be sure to mimic my stellar naming conventions in all your production code.)

    2. Joshua says:

      1) Start a thread in DllMain DLL_PROCESS_ATTACH (the thread is immediately blocked by the loader lock…)
      2) After you exit from DllMain the loader will release the loader lock; your thread now runs.
      3) Have your normal API functions wait for the thread to finish. The “obvious” way is a manual reset event; however there are faster ways like flag-event.

      1. Peter says:

        What happens when your DLL is unloaded while the thread is still running?

        1. dpff says:

          That’s when API FreeLibraryAndExitThread should be used: https://blogs.msdn.microsoft.com/oldnewthing/20131105-00/?p=2733

      2. SimonRev says:

        Of course, MSDN has a handy list of things not to do in dllMain, one of which is “Call CreateThread”

        1. voo says:

          That seems like one of those times where you can violate the guidelines if you know what you’re doing I guess.

          Yes if you try to wait on the thread in your DLL_PROCESS_ATTACH, you just got yourself a deadlock, but if you don’t, this seems perfectly fine (with all the usual caveats for using threads of course).

          Personally I prefer the same solution as Peter does: Not only does it avoid having to do any complicated things in DLL_PROCESS_ATTACH, it also simplifies handling of concurrency in many cases.

    3. JoeWoodbury says:

      This sounds simple with one DLL, but what if you have several, all trying to do complex initialization with each other?

  4. fm says:

    It’s a tragedy that Windows doesn’t have a good way to name & shame when poorly coded third party libraries screw things up. One nice thing is that recent versions of Windows handle display driver crashes very well & show the name of the driver, so when your display blinks you get a message saying “Nvidia display driver stopped working”.

  5. Koro says:

    I’m surprised Store apps load their icons using GDI+. It was my understanding that it was being phased out in favor of WIC.

    1. Ray Koopa says:

      So far as I get it, the ListView is still a control drawn with GDI

  6. Adrian says:

    Running other people’s code in your process is always dangerous. At some point, I expect Explorer is going to have to come up with a process-isolation scheme for shell extensions, or extensions will have to be certified and signed the way drivers are. The web browsers have all made significant moves in either eliminating third-party plugins and/or relegating that code to separate processes. It’s a lot of work to do process isolation and reliable interprocess communication, but it goes a long way to making software more robust.

    1. The_D0lph1n says:

      I think Explorer has a limited form of extension process isolation using the COM Surrogate.
      There’s a post on this blog about the COM Surrogate from 2009: https://blogs.msdn.microsoft.com/oldnewthing/20090212-00/?p=19173/

  7. Josh B says:

    I’ve done nearly the same thing for the same reason, only it was happening occasionally when opening the open/save dialog boxes. I ran into a wall when I reached Nvidia’s shell extension. Since I can’t very well uninstall Nvidia’s drivers, and things stop working if you try to just unregister the extension, I just lived with the occasional app hang.

    Fortunately, all of my hanging Explorer problems just went away after moving to Win10. I don’t know if Win10’s Explorer is more robust, or Nvidia just put a little more into QA this time around, but it’s nice having a consistently functional system again.

  8. Andomar says:

    This brings back memories of being a Windows C++ programmer in the 1990s. I suppose no vendor can get shell extensions right. The complexity is beyond ludicrous.

    If you invite vendors to do multi-threaded programming, you should accept responsibility for what must inevitably follow.

    1. No matter what you do, someone will call you an idiot. “You’re an idiot for not making Explorer extensible.” “You’re an idiot for making Explorer extensible.” Or are you saying, “Explorer should be extensible, but in order to reduce the complexity for extension authors, Explorer should be a single-threaded program.”?

  9. Sven Eberhardt says:

    What puzzles me about all these loader lock problems: Why isn’t there a special flag to LoadLibrary which alters the loading behaviour to: “Do all your system initialization within the loader lock and then release the lock BEFORE calling DllMain”.

    The lock may be needed to ensure no other threads in the loading process call into the library before it’s fully initialized. But applications like Explorer, which potentially load a lot of 3rd party libraries, could use their own synchronization methods to make sure no calls into library happen before the DllMain function actually returned.

    Alternatively, there could be a second DllMain function that is called outside the loader lock but before LoadLibrary returns in the parent process. Dlls could put any initialization that requires locks there. Of course they’d need to check manually if the function finished before accepting any outside calls from other threads.

  10. alegr1 says:

    OK guys, some helpful advice for you.

    If you ever want to have your plug-in (or COM provider) DLL run a thread, do it as:

    1. Have your DLL link or load explicitly a second (private) DLL. The second DLL should actually contain the thread code. The second DllMain should start the thread (don’t use any API which waits for the thread to start, though). The second DLL gives the thread handle to the first DLL. Either the first or the second DLL should take an extra reference to the second DLL.

    2. When your plugin (first) DLL needs to unload, it should signal the thread to stop (by setting an event and/or a boolean flag), and then drop its reference to the second DLL, either through an explicit FreeLibrary or through the implicit unlink that happens on unload.

    3. When the thread ultimately needs to exit, it should call FreeLibraryAndExitThread on the DLL2.

  11. Eric Bouchard says:

    Thanks, this really helped me find out what was causing the Download folder to take too much time to populate (about 20 seconds). Turns out it was a third party icon overlay extension which I disabled in the registry.

    Funny thing is when I opened the download folder using the real path (c:\users\username\downloads) instead of the shortcut path (This PC > Downloads) the folder would be populated instantly.
    Only when I used the shortcut Downloads under This PC that was taking too much time to populate the list. I wonder why there was a difference ?

  12. James Curran says:

    I wonder if this explains why, when I try to display the Downloads folder, Explorer suddenly gets extremely slow. It doesn’t hang — I can click on a different folder and it works fine — but for Downloads, the file list displays slowly and the green progress bar displays across the top. Clicking a column header to sort does nothing. (The folder only has about 300 files)

