The case of the file that won’t copy because of an Invalid Handle error message


A customer reported that they had a file that was "haunted" on their machine: Explorer was unable to copy the file. If you did a copy/paste, the copy dialog displayed an error.

1 Interrupted Action

Invalid file handle

⚿  Contract Proposal
Size: 110 KB
Date modified: 10/31/2013 7:00 AM

Okay, time to roll up your sleeves and get to work. This investigation took several hours, but you'll be able to read it in ten minutes because I'm deleting all the dead ends and red herrings, and because I'm skipping over a lot of horrible grunt work, like tracing a variable in memory backward in time to see where it came from.¹

The Invalid file handle error was most likely coming from the error code ERROR_INVALID_HANDLE. Some tracing of handle operations showed that a call to Get­File­Information­By­Handle was being passed INVALID_FILE_HANDLE as the file handle, and as you might expect, that results in the invalid handle error code.

Okay, but why was Explorer's file copying code getting confused and trying to get information from an invalid handle?

Code inspection showed that the handle in question is normally set to a valid handle during the file copying operation. So the new question is, "Why wasn't this variable set to a valid handle?"

Debugging why something didn't happen is harder than debugging why it did happen, because you can't set a breakpoint of the form "Break when X doesn't happen." Instead you have to set a breakpoint in the code that you're pretty sure is being executed, then trace forward to see where execution strays from the intended path.

The heavy lifting of the file copy is done by the Copy­File2 function. Explorer uses the Copy­File2­Progress­Routine callback to get information about the copy operation. In particular, it gets a handle to the destination file by making a duplicate of the hDestination­File in the COPY­FILE2_MESSAGE structure. The question is now, "Why wasn't Explorer told about the destination file that was the destination of the file copy?"

Tracing through the file copy operation showed that the file copy operation actually failed because the destination file already exists. The failure would normally be reported as ERROR_FILE_EXISTS, and the offending Get­File­Information­By­Handle would never have taken place. Somehow the file copy was being treated as having succeeded even though it failed. That's why we're using an invalid handle.

The Copy­File2 function goes roughly like this:

HRESULT CopyFile2()
{
 BOOL fSuccess = FALSE;
 HANDLE hSource = OpenTheSourceFile(); // calls SetLastError() on failure
 if (hSource != INVALID_HANDLE_VALUE)
 {
  HANDLE hDest = CreateTheDestinationFile(); // calls SetLastError() on failure
  if (m_hDest != INVALID_HANDLE_VALUE)
  {
   if (CopyTheStuff(hSource, hDest)) // calls SetLastError() on failure
   {
    fSuccess = TRUE;
   }
   CloseHandle(hDest);
  }
  CloseHandle(hSource);
 }
 return fSuccess ? S_OK : HRESULT_FROM_WIN32(GetLastError());
}

Note: This is not the actual code, so don't go whining about the coding style or the inefficiencies. But it gets the point across for the purpose of this story.

The Create­The­Destination­File function failed because the file already existed, and it called Set­Last­Error to set the error code to ERROR_FILE_EXISTS, expecting the error code to be picked up when it returned to the Copy­File2 function.

On the way out, the Copy­File2 function makes two calls to Close­Handle. Close­Handle on a valid handle is not supposed to modify the thread error state, but somehow stepping over the Close­Handle call showed that the error code set by Create­The­Destination­File was being reset back to ERROR_SUCCESS. (Mind you, this was a poor design on the part of the Copy­File2 function to leave the error code lying around for an extended period, since the error code is highly volatile, and you would be best served to get it while it's still there.)

Closer inspection showed that the Close­Handle function had been hooked by some random DLL that had been injected into Explorer.

The hook function was somewhat complicated (more time spent trying to reverse-engineer the hook function), but in simplified form, it went something like this:

BOOL Hook_CloseHandle(HANDLE h)
{
 HookState *state = (HookState*)TlsGetValue(g_tlsHookState);
 if (!state || !state->someCrazyFlag) {
  return Original_CloseHandle(h);
 }

 ... crazy code that runs if the flag is set ...
}

Whatever that crazy flag was for, it wasn't set on the current thread, so the intent of the hook was to have no effect in that case.

But it did have an effect.

The Tls­Get­Value function modifies the thread error state, even on success. Specifically, if it successfully retrieves the thread local storage, it sets the thread error state to ERROR_SUCCESS.

Okay, now you can put the pieces together.

  • The file copy failed because the destination already exists.
  • The Create­The­Destination­File function called Set­Last­Error(ERROR_FILE_EXISTS).

  • The file copy function did some cleaning up before retrieving the error code.

  • The cleanup functions are not expected to alter the thread error state.
  • But the cleanup function had been patched by a rogue DLL, and the hook function did alter the thread error state.

  • This alteration caused the file copy function to think that the file was successfully copied even though it wasn't.

  • In particular, the caller of the file copy function expects to have received a handle to the copy during one of the copy callbacks, but the callback never occurred because the file was never copied.

  • The variable that holds the handle therefore remains uninitialized.
  • This generates an invalid handle error when the code tries to use that handle.

  • This error is shown to the user.

An injected DLL that patched a system call resulted in Explorer looking like an idiot. (As Alex and Gaurav well know, Explorer is perfectly capable of looking like an idiot without any help.)

We were quite fortunate that the error manifested itself as a failure to copy the file. Imagine if Explorer didn't use Get­File­Information­By­Handle to get information about the file that was copied. The Copy­File2 function returns S_OK even though it actually failed and no file was copied. Explorer would have happily reported, "Congratulations, your file was copied successfully!"

Stop and think about that for a second.

A rogue DLL injected into Explorer patches a system call incorrectly and ends up causing all calls to Copy­File2 to report success even if they failed. The user then deletes the original, thinking that the file was safely at the destination, then later discovers that, oops, looks like the file was not copied after all. Sorry, it looks like that rogue DLL (which I'm sure had the best of intentions) had a subtle bug that caused you to lose all your data.

This is why, as a general rule, Windows considers DLL injection and API hooking to be unsupported. If you hook an API, you not only have to emulate all the documented behavior, you also have to emulate all the undocumented behavior that applications unwittingly rely on.

(Yes, we contacted the vendor of the rogue DLL. Ideally, they would get rid of their crazy DLL injection and API hooking because, y'know, unsupported. But my guess is that they are going to stick with it. At least we can try to get them to fix their bug.)

¹ To do this, you identify the variable and set a breakpoint when that variable is allocated. (This can be tricky if the variable belongs to a class with hundreds of instances; you have to set the breakpoint on the correct instance!) When that breakpoint is hit, you set a write breakpoint on the variable, then resume execution. Then you hope that the breakpoint gets hit. When it does, you can see who set the value. "Oh, the value was copied from that other variable." Now you repeat the exercise with that other variable, and so on. This is very time-consuming but largely uninteresting so I've skipped over it.

Comments (40)
  1. 12BitSlab says:

    Raymond, when Microsoft does contact a third party about bugs in their code, generally speaking, how receptive are to fixing things?

  2. Joshua says:

    To be fail, if I were the vendor I'd fix that bug very quickly by calling GetLastError and SetLastError and be looking at a patch release the next day.

  3. Joshua says:

    To be fair not to be fail. Stupid autocorrect

  4. jmac_the_man says:

    Raymond/Raymond's Microsoft Friends,

    A while ago, you posted a story that briefly mentioned Use Case documentation that Microsoft uses in system design has a standardized set of names of the actors, and the actor's first name implies their skill level and role. (So, for example, if you're reading a use case and you see the name "Bob," you know he's a mid-level user, whereas "Charlie" is the supervisor of the user with their hands on the keyboard of the program under study or whatever.)

    Is there any chance there's documentation available on this?

  5. SimonRev says:

    Ignoring the rogue DLL (which is poor form, I agree), to me the real bug still seems to be in CopyFile2, which is relying on CloseHandle not to mess with the last error on success, which does not seem to be contractually obligated. So hopefully that at least got logged as a bug to the appropriate team.

    [I agree that CopyFile2 could be more defensive, but on the other hand, the people who maintain CopyFile2 are also the people who maintain CloseHandle, so they would just be defending against themselves. If they change CloseHandle to mess with the last error code, then they can also change CopyFile2. -Raymond]
  6. Count Zero says:

    @12BitSlab – While you was asking Raymond, let me "say" my answer to that as a non-Microsoft developer. It strongly depends. There are project companies, which are asked and payed to produce some software or software component (usually by another development company which sells the finished product), and get no financial benefit from fixing errors in already finished (and paid) products. I have worked for some of those. Those companies are always overrun with work and deadlines therefore they would not even have resources to fix buggy code if they were intended to (and usually upper management definetly forbids such fixes unless they are paid for).

    The other type of software development firm typically develops only a few products and directly profits after the sales of those, so they have a financial interest in fixing bugs as soon as possible. I'm working for one of those nowadays. Theres is something more to the non-project companies: their connection to their software (source code). They know, like, and live in symbiosis with that. They are always up-to-date about one-or-another aspect of the programs, constantly testing and developing it so if it is buggy they happily fix it.

  7. Evan says:

    "When it does, you can see who set the value. "Oh, the value was copied from that other variable." Now you repeat the exercise with that other variable, and so on."

    We want time-travelling debuggers!

    When do we want them?

    Yesterday!

  8. JM says:

    Hooking CloseHandle, what a capital suggestion. Nothing could possibly go wrong with that. The only right hook here is the one the programmer's face deserves.

  9. Joshua says:

    @JM: I've seen it before when somebody hooked CreateFile to create a virtual file system from userspace. It's nearly a bad idea as it sounds.

  10. Ken Hagan says:

    @Evan: Given a time-travelling debugger, I'd rip out the debugger and replace it with a bomb before dispatching the package to whoever first dreamt up the "errno" idiom for reporting errors.

  11. Ken Hagan says:

    "At least we can try to get them to fix their bug."

    Perhaps they could fix it by hooking SetLastError() ?

  12. Joshua says:

    @Ken Hagan: I wrote a time traveling debugger when debugging a crash so bad it would bring the process down immediately because it was in so bad a state even the debugger couldn't function in the process anymore.

    Unfortunately the technique involved is read-only. Writing to the past didn't propagate.

  13. Ben Voigt says:

    I'm also of the opinion that the bug is in the CopyFile2 implementation.  They took a dependency on an undocumented (if it is documented that "Close­Handle on a valid handle is not supposed to modify the thread error state", I'd love to see it… but there is no such statement in CloseHandle's own MSDN page) implementation detail.  And GetLastError explicitly says:

    > Most functions that set the thread's last-error code set it when they fail. However, some functions also set the last-error code when they succeed. If the function is not documented to set the last-error code, the value returned by this function is simply the most recent last-error code to have been set; some functions set the last-error code to 0 on success and others do not.

    > Functions executed by the calling thread set this value by calling the SetLastError function. You should call the GetLastError function immediately when a function's return value indicates that such a call will return useful data. That is because some functions call SetLastError with a zero when they succeed, wiping out the error code set by the most recently failed function.

    When a hook conformed to the documented interface, the function relying on the implementation details started failing.  That shouldn't surprise any readers of this blog — Raymond is always writing about the superhuman efforts needed to bring implementation details into the contract to avoid breaking application software.  This is just the opposite perspective; if there are multiple implementers, making the contract more tolerant of users breaks existing implementers.

    I don't see that being the implementer of the shipped version of CloseHandle makes it ok for the CopyFile2 developer to ignore the actual rules documented for GetLastError and rely on details of one particular CloseHandle implementation.  Yes, I know both functions live in the same DLL, but it's not like Detours is some third-party approach.  Things would be different if they called a non-public function (probable name: ZwForInternalUseOnlyThisIsTheRealCloseHandle), of course, but they didn't.

    [And then people who detour Close­Handle will say, "Hey, when a file is copied, we never see the handle get closed! Now we have a huge memory leak." And then maybe they will find Zw­For­Internal­Use­Only­This­Is­The­Real­Close­Handle, and detour that, and you're back where you started. -Raymond]
  14. Mark Y says:

    Wait, so the end result is that the file wasn't haunted after all?  Every attempt to copy a file that would involve an overwrite would have had the same effect?  (On an unrelated note: I suppose that as a side effect of this adventure, the function has been changed to grab the error code sooner?)

  15. Bob says:

    How about a TlsGetValueEx function that doesn't do that weird SetLastError behavior?

  16. Thomas says:

    @Joshua – there are defendable reasons for doing so – for *individual apps* – but not for the whole system ;)

  17. Gabe says:

    Bob: Anybody aware enough to realize the value of calling TlsGetValueEx could just as easily save and reset the last error.

    BTW, I've been bitten by the TlsGetValue bug, too. Fortunately my bug was reasonably easy to spot once I remembered the behavior. It's easy to forget that TlsGetValue can fail, so it's easy to forget that it sets the last error.

  18. Thomas says:

    People who say that CopyFile2 should get the error code sooner are correct, but missing the point.  Consider if a rogue dll hooked CopyFile2 instead with code like this:

    HRESULT Hook_CopyFile2()

    {

    HRESULT hr = Original_CopyFile2();

    HookState *state = (HookState*)TlsGetValue(g_tlsHookState);

    if (state && state->someCrazyFlag) {

    .

    … crazy code that runs if the flag is set …

    }

    return hr;

    }

  19. Jonathan says:

    @Evan – Here you go: http://www.thewindowsclub.com/microsoft-time-travel-tracing-diagnostic

    I've used it before – not an easy tool to operate, but it does work.

  20. Tom says:

    It seems like the real problem here is relying on a global variable to communicate information across function calls.  Earth shattering news, Global Variables Are Bad.  If something is a return value from a function call, return it.

  21. Ben Voigt says:

    @Thomas: That hook is clearly broken, because it doesn't conform to the documented rules which I quoted directly from MSDN.

    The thing about the case Raymond presented is that the hook DOES conform to the documented contract.

  22. Crescens2k says:

    @Ben Voigt:

    The thing is, you are quoting rules about hooks and saying that CopyFile2 is broken because of it. But the first hook itself clearly doesn't follow all the quoted rules. Specifically

    "Most functions that set the thread's last-error code set it when they fail. However, some functions also set the last-error code when they succeed. If the function is not documented to set the last-error code, the value returned by this function is simply the most recent last-error code to have been set; some functions set the last-error code to 0 on success and others do not."

    Because it is not documented what happens on success, and it is clearly documented that the behaviour on success isn't pre-determined, the hook function is breaking the contract by not saving the error code. From the re'd code that Raymond posted, they don't even attempt to save the error code at all. This is obviously going to change the behaviour of the function it hooked.

    I also agree with Raymond on the "they will be defending against themselves" part. They control both sides here, and they right the implementation. So in this case, they are not relying on undocumented or implementation defined behaviour, they are relying on behaviour that they write themselves. I'm sure any large software implementation has this somewhere, but it is easy to bash Microsoft here because there is a concrete case of something going wrong, and it's Microsoft.

  23. jas88 says:

    At least this only gave an inaccurate error message – I was immediately reminded of a few hours of hair loss years ago when my userspace program was triggering a BugCheck – when I closed a file handle. It turned out the AV product I had installed at the time (not an MS one) was, for obvious reasons, hooking file access – and, somehow, screwed up the ZwClose/NtClose hook. It looked as if they were *usually* allocating a buffer when a file was opened, then *always* freeing it when that file was closed – I'd hit the corner case where it didn't allocate the buffer, but still tried to free it later, breaking the system in the process.

  24. Every time I read an article about a strange behavior in Windows components, the reason is always hooking/injection or plug-ins.

    Is it illogical to think Windows needs to outlaw hooking and develop a plug-in model for Explorer instead?

    [Um, Explorer has had a plug-in model since day one. Some people abuse this plug-in model in order to inject themselves in to Explorer so they can hook it. -Raymond]
  25. Joshua says:

    @Fleet Command: Considering the techniques currently used, documenting well how to do it is far more likely to bring real quality results. The only way to outlaw hooks is to outlaw debuggers.

  26. Nitpicker (Corner?) says:

    @Fleet Command – As soon as I saw the name of the post, I said (out loud) 'buggy shell extension'. I wasn't too far off.

  27. alegr1 says:

    If you want to save you some grief, don't assume that last error doesn't change across some function calls. If you'll need it, save it to the variable immediately after the call of interest. Because you may later add tracing calls, printfs, etc, and break your GetLastError() state, unless you've saved it.

    I suspect, such approach was the reason the defective component didn't cause catastrophic damage in most other cases of CloseHandle.

  28. Alois Kraus says:

    I guess you had a TT Trace which makes debugging back and forth in time much easier. Otherwise I would not think that you could do such an analysis within a few hours. How about giving this power to the official Windbg version?

  29. Radosław Cybulski says:

    As Alois pointed out – the real question is: when time travel will be available to everyone? Because the problem with CopyFile2 is easy once You dig into source code – it relies on not-visible-in-code dependency about CloseHandle not changing error code, when successed. Once it does (for any reason) the code breaks.

    It doesnt really matter, that both functions are managed by the same people. That sort of invisible dependency makes source code much, much harder to read and change. Just code defensively and read the error code into local variable. This is CopyFile2 bug, unless CloseHandle is defined as not changing error code, when successed.

    This paragraph about emulating undocumented behavior is simply wrong. As long as such a dependency is invisible, there's no difference between some happy coder in India writing a hook and also happy coder in microsoft, working in another division. They both wont see it coming, make the same error and in both cases sh*t will hit the fan later on. Either make the dependency visible (ie document it) or remove it from the code (ie copy error value to local variable right after failure).

    You can also it CopyFile2 from different perspective. It stores its success flag in two different places, fSuccess boolean flag (Hungarian notation is so 80…) and in global variable returned by GetLastError. Function makes decision based on first, but returns the second. Which caller makes decision on. You dont need to check any documentation about used calls to know this spells troubles. In such case there should be an assert on exit, making sure fSuccess and returned value both report either success or error. This would make Your debugging lasts maybe 3 seconds.

    Also global variables.

  30. EduardoS says:

    I think the bug is in GetLastError/SetLastError, in the sense, they exists at all, how many bugs already appeared because of those functions writing global variables?

  31. – "Um, Explorer has had a plug-in model since day one. Some people abuse this plug-in model in order to inject themselves in to Explorer so they can hook it. -Raymond"

    I mean a real deal: Plug-ins come in a package, there is a plug-in manager that can enable or disable them, etc. Like Chrome and Firefox. Current state is like IE: Individual components that each accomplish a non-atomic task are developed without any real tangible connection to form a bigger whole that remains abstract. So, overall, it does not have the properties of a model.

  32. Gabe says:

    EduardoS: What do you suggest should have taken the place of GetLastError/SetLastError back in 1983?

    Should every API have an out-parameter for the error? Or perhaps the return value from every function should be the return status and any returned value would have to be an out-parameter? Or maybe they should have invented structured exception handling?

  33. Remko says:

    That's why functions should return a boolean for succes or failure instead of calling getlasterror

  34. Ben Voigt says:

    @Crescens2k: Good job cutting out the other half of the quote, the half that is pretty clear that the default contract has no postconditions on the thread last-error value when a function succeeds.  If you define an interface as "behaves exactly like the shipped implementation in all cases", that's no interface at all.  A substitute implementation only has to abide by the documented contract; that's pretty fundamental to polymorphism.

    @Raymond: "maybe they will find Zw­For­Internal­Use­Only­This­Is­The­Real­Close­Handle, and detour that"  By non-public, I meant that it isn't even exported, and after optimization might not even exist as a separate entity.  Besides, there's no memory leak the way I envision it, because you close handles with Zw­For­Internal­Use­Only­This­Is­The­Real­Close­Handle only if you opened them with Zw­For­Internal­Use­Only­This­Is­The­Real­CreateFile (see: pairing of malloc=free, new=delete, new[]=delete[], CoTaskAlloc=CoTaskFree, etc), and the hook doesn't have any private data associated with the handle because it hasn't seen it.  It is purely an implementation detail of CopyFile2.  If hook developers get upset because hooks installed on the public interface don't catch all internal activities deep within kernel32.dll, too bad, they've misunderstood what hooking a function does.  Hooking CreateFile does not intercept all file access of any variety, anywhere, it intercepts calls to CreateFile.

    [Then "Why does CopyFile2 use undocumented APIs we cannot hook? We are hooking file I/O in order to implement our virtual file system." -Raymond]
  35. ErikF says:

    @Ben Voigt: If you're substituting a function that relies on global state, that global state is part of the contract too! I remember several TSRs that I used back in DOS days that didn't remember that and could crash the system in weird ways.

  36. Someone says:

    [Then "Why does CopyFile2 use undocumented APIs we cannot hook? We are hooking file I/O in order to implement our virtual file system." -Raymond]

    Anyone trying to do this should be forced to implement a real driver. Even and especially AV software should not hook into processes in some strange way, it should do it's real-time scanning by a filter driver, using the official interfaces.

    So, from my point of view, your reason does not exist / is not justified.

    ["But there are some things you can't do in kernel mode, like display UI." Another scenario might be "We are writing a diagnostic tool, and we want to log all I/O operations with stack traces and other metadata. Our traces do not include I/O due to file copying." -Raymond]
  37. Zack says:

    If I could go back to 1970 with a set of changes for Kernighan and Ritchie to implement in the very first version of C, along with mandatory prototyped function declarations, sane operator precedence, postfix unary *, and the extirpation of <code>gets<code>, I would seriously consider fixed-length, scalar-only multiple value return, so you could write

    <pre>(err, fd) = open("path", flags);</pre>

    At the ABI level this is no more difficult than accepting multiple arguments to a single function, it eliminates all the problems with errno/GetLastError, and it doesn't have the yuck factor associated with out-parameters for the result the programmer actually wants.

  38. EduardoS says:

    @Gabe, I think it would be nice if they invented structured exception handling in 1983, but 1983 doesn't matter, nobody cares about Windows 1.0, by the time 32 bit Windows come there was another option, HRESULT with an out parameter for the result, instead, MS made no effort to deprecate the old API.

  39. Gabe says:

    EduardoS: It sounds like you are suggesting that Microsoft should have created a whole new API for Win32 that is completely different from Win16. This would prevent people from easily porting their code to Win32, preventing the adoption of Win32.

    Or are you suggesting that old APIs still have the old error return mechanism while all new ones would use HRESULTs? That would result in lots of confusion among developers, making reliable code harder to write.

  40. EduardoS says:

    @Gabe, a mix of the two, there should be a easy way to port from Win16, but it is porting, 16 bitness had some weird things (Global*/Local*, cooperative to preemptive, etc) so some changes were required, it wouldn't be that bad if a bit more changes were required, and this porting helper (wich may not even be part of the actual API, rather, could be just a header) should not be the default way to write new applications, COM come about the same time Win32 come out creating a lot of confusion anyway, MS could have cleared a bit the confusion by saying THIS is the new way to write apps and, of course, providing the full API in this new way.

    BTW, COM had it's own problems and was a bit boring to use, but some things just needs time and effort to evolve wich half attempts never get.

Comments are closed.

Skip to main content