The case of the longjmp from nowhere trying to open a registry key


The crash telemetry team brought our attention to a bug a few weeks before the Creators Update was supposed to be released, and based on the high hit count of 3 million crashes in the past 30 days, the bug was marked "Blocking engineering sign-off".

Life is always exciting when you get a "Blocking engineering sign-off" bug.

Here's the relevant excerpt from the crashing stack. Let's see what we can make of it:

ntdll!RtlFailFast2
ntdll!RtlGuardCheckLongJumpTarget+0x72f9f
ntdll!RtlGuardRestoreContext+0x360
ntdll!RtlUnwindEx+0x767
0x00007ff7`26040a5a
0x00007ff7`25fcb0c6
0x00007ff7`25fc9bf8
0x00007ff7`25fc9d23
0x00007ff7`25fc9f27
0x00007ff7`25fe1d17
0x00007ff7`25fdfd7f
0x00007ff7`25fca5f1
0x00007ff7`25fca645
0x00007ff7`25fcae25
0x00007ff7`25fca870
0x00007ff7`25fc27be
0x00007ff7`25fc6aaf
0x00007ff7`25fcab3d
0x00007ff7`25fe11ef
0x00007ff7`25fca5f1
0x00007ff7`25fca645
0x00007ff7`25fcae25
0x00007ff7`25fca870
0x00007ff7`25fc27be
0x00007ff7`25fe949c
0x00007ff7`25fe93e6
0x00007ff7`25fe6709
0x00007ff7`25fe6200
ntdll!KiUserApcDispatch+0x2e
ntdll!ZwOpenKeyEx+0x14
KERNELBASE!Wow64pNtOpenKeyInternal+0x16
KERNELBASE!Wow64NtOpenKey+0x7c
KERNELBASE!LocalBaseRegOpenKey+0x1bc
KERNELBASE!RegOpenKeyExInternalW+0x13b
KERNELBASE!RegOpenKeyExW+0x19
windows_storage!SHGetMachineGUID+0x83
...

Working upward, the storage system is trying to read the machine GUID, so it's opening the HKEY_LOCAL_MACHINE\Software\Microsoft\Cryptography registry key. That RegOpenKeyExW call eventually reached the kernel at ZwOpenKeyEx, and then somehow a user-mode asynchronous procedure call (APC) got dispatched back into the user-mode thread. That user APC executed a lot of code that got injected into the process, not associated with a DLL. It reached a point where it encountered an exception, and the operating system is trying to unwind the exception, but something goes wrong with the unwind when it wants to check a long jump target: The jump target is not valid, so the process crashes with a fail-fast exception. Incorrect exception unwind information is not recoverable, and it may indicate the presence of malware.

There are multiple levels of mystery here. The first level of mystery is this chunk of code not associated with a DLL. How did it get into our process? This particular process was Runtime­Broker.exe, which isn't as promiscuous as explorer.exe with respect to shell extensions and other third-party extension points, nor is it a common target for code injection.

Second, why is opening a registry key dispatching user-mode APCs? This is not called out in the documentation, and it is not something expected in general. Dispatching user-mode APCs is not something you do just for fun. It creates a situation where code is running inside the context of unrelated code. If a critical section was held at the time RegOpenKeyExW was called, that critical section is still being held when the APC is run, and you are now in danger of creating a deadlock. This is why functions which process APCs usually make you opt into APCs explicitly: SleepEx, WaitForSingleObjectEx, and MsgWaitForMultipleObjectsEx don't process APCs unless you say you want them to, and the non-Ex versions never process APCs.

The third mystery is why the injected code is performing a longjmp. The exception being propagated is 0x80000026 which is STATUS_LONGJUMP: "A long jump has been executed."

The mystery code is trying to perform a longjmp. That's right, a longjmp. Apparently we are still running code written in 1970.
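For anyone who has managed to avoid longjmp so far, here is a minimal sketch of the usual pattern in portable C. It is not the mystery code, which we never got to see. The names g_recover, descend, and run_with_recovery are invented for this illustration. A recovery point is saved with setjmp, and code many frames deeper abandons its work by jumping straight back to it.

```c
#include <setjmp.h>

/* Saved execution context of the recovery point. */
static jmp_buf g_recover;

/*
 * Recurses 'depth' levels. At the bottom, bails out with longjmp
 * if 'fail' is nonzero; otherwise returns the depth reached.
 */
static int descend(int depth, int fail)
{
    if (depth == 0) {
        if (fail) {
            longjmp(g_recover, 1);   /* does not return */
        }
        return 0;
    }
    return 1 + descend(depth - 1, fail);
}

/*
 * Returns the recursion depth on success, or -1 if the recursion
 * was abandoned by a longjmp back to the recovery point.
 */
int run_with_recovery(int depth, int fail)
{
    if (setjmp(g_recover) != 0) {
        return -1;                   /* arrived here via longjmp */
    }
    return descend(depth, fail);
}
```

Calling run_with_recovery(3, 0) returns 3 the ordinary way, whereas run_with_recovery(3, 1) returns -1 without any of the intermediate frames returning normally. That skipping of frames is the part that requires the jump target to still be valid when the jump happens.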

We contacted the registry team for assistance, and they recognized this issue. They suspect that some third-party registry filter driver (perhaps a game's anti-cheat software) is monitoring attempts to access any registry keys under HKEY_LOCAL_MACHINE\Software\Microsoft\Cryptography and scheduling user-mode APCs as part of its processing. That APC then runs into a situation where it decides to try to longjmp out of its normal processing, but the longjmp buffer is either corrupted, or the long jump target has not been registered as a control flow guard jump target, so the exception dispatching code says, "No way, I'm not dispatching this exception."

The registry team noted that the vast majority of the crashes are coming from machines running the Anniversary Update, not the Creators Update, so this is likely not a case of the Creators Update making a change that exacerbated a pre-existing problem, and the "Blocking engineering sign-off" marking should be removed. Furthermore, even though there were three million hits over the past 30 days, the crashes were not uniformly distributed. Rather, the issue spiked and then died down within a week. This suggests that the third party recognized the issue and put out their own fix.

Comments (29)
  1. Yuri Khan says:

    Happy end! For once, you did not have to add a compatibility shim for a broken third-party piece of software.

  2. kantos says:

    Just curious, are there any viable uses for User Mode APCs other than IO operations? The only thing I can think of is to ensure an operation completes last, or is executed at low priority on a runloop thread as soon as it sleeps.

    1. SI says:

      I’ve used them to queue / trigger settings changes and events in a CPU intensive background thread without locking the UI thread.

      1. Clockwork-Muse says:

        …which would still be IO (one with a graphical output and random input from one or more hardware devices).

    2. Andre says:

      Better question: are there any viable uses for longjmps? That is just scary stuff.

      Unless you’re trying to roll your own exception handling (or is that implemented the other way round??), that sounds like the worst spaghetti code possible. I thought we left that behind last century.

      1. Zan Lynx' says:

        How else are you going to get out of a Unix signal handler for something like SIGSEGV aka segmentation fault? If you just return from the handler you are right back at the fault instruction.

        Yes, yes, I know that all of you are Windows programmers and believe the entire world uses SEH. But it doesn’t.

        There may be a case for saying that, on Windows, longjmp should be replaced with SEH, but that isn’t the same as claiming it is no use anywhere.

        1. Darran Rowe says:

          The Linux equivalent of TerminateProcess?
          Segfaults are an obvious sign of things not going well in the program. This is why the recommendation for access violations in Windows is to just let the process die.

          1. Joshua says:

            It turns out that most critical software will keep running if you longjump back to the top level loop and just leak all the memory. init does this as init cannot be allowed to die. Interactive shells also do this (non-interactive shells just die).

          2. Kevin says:

            Under *nix, the default signal handler for SIGSEGV terminates the process. If you want SIGSEGV to terminate the process, then you would not install an alternate handler in the first place. But then most reasonable applications have no business catching SIGSEGV anyway.

            My personal favorite use case for longjmp(3) is that, if you call abort(3), catch the SIGABRT, and then longjmp(3) out of the signal handler, the process is not terminated. This is, in fact, the only way to abort an abort(3). Also falls into the “WTF are you doing?” bucket, but in an entirely different way.

        2. Andre says:

          Why/how would I ever handle a SIGSEGV instead of crashing?
          Under Linux I can’t even catch an std::bad_alloc, because it rather overcommits and then OOM kills instead of honoring its API contracts.
          (Yes, I know that can be turned off. But I can’t tell my customers to tell their IT how to set up the OS.)

        3. Dave says:

          >How else are you going to get out of a Unix signal handler for something like SIGSEGV aka segmentation fault?

          If you’re ending up in a Unix SIGSEGV signal handler from a Windows machine then I think you may have longjmp’d a bit too far…

          1. Markus Schaber says:

            That one really made my day! :-)

        4. Yuri Khan says:

          > How else are you going to get out of a Unix signal handler for something like SIGSEGV aka segmentation fault?

          You don’t. You dump core, die and respawn (with the help of your supervisor daemon of choice).

      2. Joshua says:

        The original documentation gave one use; returning from the final case of a deeply nested recursive call. That use is still valid; but you might want to turn that code into a tail call and from there the while loop.

        1. voo says:

          I wouldn’t call that use case “valid” by any means. The idea here is to optimise something that is already incredibly cheap (function epilogues). But that ignores two things: a) available compilers already do this optimisation if possible (gcc doesn’t generate a call but a simple jump when the last action in a function is calling another function) and b) that optimisation is actually a pretty bad pessimisation for modern x86 CPUs since it corrupts the return address predictor stack (https://blogs.msdn.microsoft.com/oldnewthing/20041216-00/?p=36973 ).

          Just goes to show that such low-level optimisations are better left to the compiler.

      3. Medinoc says:

        I was once told gcc implements its C++ exception handling in terms of longjmp (I’ve experimented with this kind of thing too, in a “for fun” program). Apparently, for the Microsoft C Run-Time Library it’s “the other way around” (only with SEH instead of C++ exceptions). I wonder if MinGW uses the MS CRT’s longjmp.
        (as far as I know, Visual C++ implements C++ exception handling directly over SEH, which causes some weird interactions between the two, modifiable by the /EHxx compiler switch)

        1. Cesar says:

          The gcc port for Windows most people use (mingw) has three different exception handling options: sjlj (setjmp/longjmp), dwarf (used by gcc on most other operating systems), and seh (windows-only). See for instance https://wiki.qt.io/MinGW-64-bit for a discussion.

    3. Alex Guteniev says:

      I’m aware of a use of User Mode APCs other than IO operations.
      It’s NotifyServiceStatusChange callbacks.

      (They are even an example of User Mode APCs coming from another process. It is the Service Control Manager that knows that a service status has changed, not your process.)

      1. Harry Johnston says:

        That’s a form of IPC, so is arguably I/O from the perspective of the individual processes involved. :-)

    4. Mike Biddlecombe says:

      I first came across QueueUserAPC while tracking down a hang in some code that was being executed in Unity3D on Windows (C# running in Unity’s implementation of the Mono Runtime)

      https://github.com/Unity-Technologies/mono/blob/unity-staging/mono/mini/debugger-agent.c#L2168

      Library routines written in C++ and launched through interop by our ‘game code logic’ were calling Sleep, not SleepEx in their idle loop.

      When the code was run with the Mono debugger attached, the debugger’s UserAPC handler routine that is intended to “pause the thread by asking it to run a routine that will not return until we release it” was never getting processed.

      (Triggering a UserAPC call that doesn’t return right away but instead calls into the debugger code is how the Mono soft-debugger works …sort of? http://www.mono-project.com/docs/advanced/runtime/docs/soft-debugger)

      If the debugger was paused, either manually or as the result of hitting a breakpoint, and the child thread was idling in a loop that used a ‘non-Ex’ variation of a sleep/wait, the offending thread will never pause. The debugger, and the Unity Editor host, will live-lock waiting for all of its ‘managed Mono threads’ to respond to the pause signal.

      Workers were sad. Managers were angry. Unity was blamed.

      Switching to ‘Ex’ calls allowed us to debug what was broken and ultimately remove the offending libraries altogether.

      Threads launched by our ‘game code logic’ were calling Sleep, not SleepEx.

      When the code was run with the mono debugger attached, the debugger’s “notify signal the thread that we want it to run a UserAPC routine that will not return until we release it” was never processed. Triggering a UserAPC call that doesn’t return right away is how the mono soft-debugger works.

      If the debugger was paused, either manually or as the results of a breakpoint, and the child thread was idling in a loop that used a ‘non-Ex’ variation of a sleep/wait this thread would never paused, and the debugger will live-lock waiting for all ‘mono’ threads to respond.

      1. Mike Biddlecombe says:

        Sorry about the cut-and-paste-paste error ^

  3. Zarat says:

    The Lua scripting language uses longjmp to bail out of scripts in case of errors. If the scripts are JIT compiled it also explains not being associated with a DLL. A game’s anti-cheat software may not have been a bad guess considering how common Lua scripting is in game engines.

    1. Zarat says:

      (and yes this is just random speculation, it just fits the symptoms, but that doesn’t have to mean anything)

      1. Pietro Gagliardi (andlabs) says:

        But wouldn’t that mean the anti-cheat system was written in Lua too? I find this highly unlikely (it could potentially make the anti-cheat somewhat easier to break, and IIRC most anti-cheat systems are licensed from outside vendors to begin with)…

      2. Ian Yates says:

        Decent speculation. I guess the registry access hook probably discounts it but otherwise this would be a clever deduction in the context of games

  4. Last I checked (which admittedly was several years ago), libpng still used longjmp for error handling and required callers to set up and pass in a valid setjmp buffer to certain API calls.

  5. Henrik says:

    “… Dispatching user-model APCs is not something you do just for fun….”
    Yeah, they might run into a freak gasoline fight accident on the way to their destination. :-)

  6. cheong00 says:

    [perhaps a game’s anti-cheat software]

    Or perhaps a cryptowall malware detector. I think my company is installing something like that on all our client machines. It (is supposed to) intercept cryptographic functions on Windows, then check if it’s any of the cryptowall variants that it knows, and block it if it matches.

Comments are closed.
