The ways people mess up IUnknown::QueryInterface, episode 4

One of the rules for IUnknown::Query­Interface is so obvious that nobody even bothers to state it explicitly as a rule: "If somebody asks you for an interface, and you return S_OK, then the pointer you return must point to the interface the caller requested." (This feels like the software version of dumb warning labels.)

During compatibility testing for Windows Vista, we found a shell extension that behaved rather strangely. Eventually, the problem was traced to a broken IUnknown::Query­Interface implementation which depended subtly on the order in which interfaces were queried.

The shell asked for the IExtract­IconA and IExtract­IconW interfaces in the following order:

// not the actual code but it gets the point across
IExtractIconA *pxia;
IExtractIconW *pxiw;
punk->QueryInterface(IID_IExtractIconA, &pxia);
punk->QueryInterface(IID_IExtractIconW, &pxiw);

One particular shell extension would return the same pointer to both queries; i.e., after the above code executed, pxia == pxiw even though neither interface derived from the other. The two interfaces are not binary-compatible, because IExtract­IconA::Get­Icon­Location operates on ANSI strings, whereas IExtract­IconW::Get­Icon­Location operates on Unicode strings.

The shell called pxiw->Get­Icon­Location, but the object interpreted the szIcon­File as an ANSI string buffer; as a result, when the shell went to look at it, it saw gibberish.

Further experimentation revealed that if the order of the two Query­Interface calls were reversed, then pxiw->Get­Icon­Location worked as expected. In other words, the first interface you requested "locked" the object into that interface, and asking for any other interface just returned a pointer to the locked interface. This struck me as very odd; coding up the object this way seems to be harder than doing it the right way!

// this code is wrong - see discussion above
class CShellExtension : public IExtractIcon
  HRESULT CShellExtension::QueryInterface(REFIID riid, void **ppv)
   *ppv = NULL;
   if (riid == IID_IUnknown) *ppv = this;
   else if (riid == IID_IExtractIconA) {
    if (m_mode == MODE_UNKNOWN) m_mode = MODE_ANSI;
    *ppv = this;
   } else if (riid == IID_IExtractIconW) {
    if (m_mode == MODE_UNKNOWN) m_mode = MODE_UNICODE;
    *ppv = this;
   if (*ppv) AddRef();
   return *ppv ? S_OK : E_NOINTERFACE;
  ... AddRef / Release ...

  HRESULT GetIconLocation(UINT uFlags, LPTSTR szIconFile, UINT cchMax,
                          int *piIndex, UINT *pwFlags)
   if (m_mode == MODE_ANSI) lstrcpynA((char*)szIconFile, "foo", cchMax);
   else lstrcpynW((WCHAR*)szIconFile, L"foo", cchMax);

Instead of implementing both IExtract­IconA and IExtract­IconW, my guess is that they implemented just one of the interfaces and made it alter its behavior based on which interface it thinks it needs to pretend to be. It never occurred to them that the single interface might need to pretend to be two different things at the same time.

The right way of supporting two interfaces is to actually implement two interfaces and not write a single morphing interface.

class CShellExtension : public IExtractIconA, public IExtractIconW
  HRESULT CShellExtension::QueryInterface(REFIID riid, void **ppv)
   *ppv = NULL;
   if (riid == IID_IUnknown ||
       riid == IID_IExtractIconA) {
    *ppv = static_cast<IExtractIconA*>(this);
   } else if (riid == IID_IExtractIconW) {
    *ppv = static_cast<IExtractIconW*>(this);
   if (*ppv) AddRef();
   return *ppv ? S_OK : E_NOINTERFACE;
  ... AddRef / Release ...

  HRESULT GetIconLocation(UINT uFlags, LPSTR szIconFile, UINT cchMax,
                          int *piIndex, UINT *pwFlags)
   lstrcpynA(szIconFile, "foo", cchMax);
   return GetIconLocationCommon(uFlags, piIndex, pwFlags);

  HRESULT GetIconLocation(UINT uFlags, LPWSTR szIconFile, UINT cchMax,
                          int *piIndex, UINT *pwFlags)
   lstrcpynW(szIconFile, L"foo", cchMax);
   return GetIconLocationCommon(uFlags, piIndex, pwFlags);

We worked around this in the shell by simply changing the order in which we perform the calls to IUnknown::Query­Interface and adding a comment explaining why the order of the calls is important.

(This is another example of how the cost of a compatibility fix is small potatoes. The cost of deciding whether or not to apply the fix far exceeds the cost of just doing it for everybody.)

A different shell extension had a compatibility problem that also was traced back to a dependency on the order in which the shell asked for interfaces. The shell extension registered as a context menu extension, but when the shell tried to create it, it got E_NO­INTERFACE back:

CoCreateInstance(CLSID_YourAwesomeExtension, NULL,
                 CLSCTX_INPROC_SERVER, IID_IContextMenu, &pcm);
// returns E_NOINTERFACE?

This was kind of bizarre. I mean, the shell extension went to the effort of registering itself as a context menu extension, but when the shell said, "Okay, it's show time, let's do the context menu dance!" it replied, "Sorry, I don't do that."

The vendor explained that the shell extension relies on the order in which the shell asked for interfaces. The shell used to create and initialize the extension like this:

// error checking and other random bookkeeping removed
IShellExtInit *psei;
IContextMenu *pcm;

CoCreateInstance(CLSID_YourAwesomeExtension, NULL,
                 CLSCTX_INPROC_SERVER, IID_IShellExtInit, &psei);
psei->QueryInterface(IID_IContextMenu, &pcm);
// use pcm

We changed the order in a manner that you would think should be equivalent:

CoCreateInstance(CLSID_YourAwesomeExtension, NULL,
                 CLSCTX_INPROC_SERVER, IID_IContextMenu, &pcm);
pcm->QueryInterface(IID_IShellExtInit, &psei);

(Of course, it's not written in so many words in the code; the various parts are spread out over different components and helper functions, but this is the sequence of calls the shell extension sees.)

The vendor explained that their shell extension will not respond to any shell extension interfaces (aside from IShell­Ext­Init) until it has been initialized, because it is at that point that they decide which extensions they want to support. Unfortunately, this violates the first of the four explicit rules for IUnknown::Query­Interface, namely that the set of interfaces must be static. (It's okay to have an object expose different interfaces conditionally, as long as it understands that once it says yes or no to a particular interface, it is committed to answering the same way for the lifetime of the object.)

Comments (27)
  1. Anthony says:

    What would have been the correct solution to the last vendor's problem? Should they have made the decision about supporting each interface during the first call QueryInterface() rather than in Initialize()?

    [What's the point of registering an object as a context menu extension if you're going to say, "Nah, I don't support IContextMenu"? The decision to support IContextMenu should have been made when you registered the object! -Raymond]
  2. Anthony says:

    Ah, that's the complaint. Not that they couldn't dynamically decide later what the REST of their interface set looked like, but that they should always support for the one they registered as in the first place.

  3. jader3rd says:

    So the Windows shell team is held hostage by a bad implementation of some third party extension? I think if somebody's extension stopped working, users would stop using that extension.

    [And if that extension is essential to your business? -Raymond]
  4. Peter says:

    That's quite the tale.  It makes we want to run away and never program again.

  5. zonk says:

    You are violating your own contract, "This is the first method that the Shell calls after it creates an instance of a property sheet extension, shortcut menu extension, or drag-and-drop handler." QueryInterface must be the second.

    [The spec was not written by language lawyers but by people who assume you can read sentences in context. The context here is among the various shell-specific interfaces and methods. After all, you can also get an IMarshal request as part of CoCreateInstance. -Raymond]
  6. Joshua Ganes says:

    I have to assume that these extensions were around for a while in previous Windows versions, but were causing problems for new Vista development. What prompted the reordering of calls in Vista?

  7. Joshua says:

    This stuff is about as bad as the vendor that did:

    bool Driver::HasDirectXCapability(GUID cap) { return true; }

    This stuff must be frustrating for development. The more I read about the weird games people play in shell extensions the more I want to re-implement the open & save dialogs just so that I don't load any weird DLLs into my memory space.

  8. Mason Wheeler says:

    @Peter: Before you do that, try Delphi.  It makes COM much, much simpler.  Microsoft may have invented COM, but it was the Delphi team that truly understood it.  A lot of the "COM horror stories" I see on here and elsewhere would either not happen at all or simply be harder to get wrong than to get right in Delphi.

  9. Ken Hagan says:

    @Mason: I think you are rather missing the point that Raymond made in the article: even in raw COM, these mistakes required more effort on the part of the programmer than simply getting it right. These shell extensions would have been broken in Delphi too, but it would have taken the idiots longer to figure out how to work around Delphi's support, so they used raw COM instead. :(

    [Yeah, it's like these people went out of their way to do something weird. I mean, really, adding weird logic to your QI? -Raymond]
  10. 640k says:

    If you can't handle COM's complexity – use .net, it's often dumbed down enough for everyone to grasp.

  11. Danny says:

    It's called KISS principle dude. COM got "kissed" by .net :D:D.

  12. Alex Grigoriev says:

    Again, I call for the list of shame. Windows should keep (and dynamically update) a list of components with problems, and if an user installs one, there should be a message with the description of the vendor's sin. This would put a heat on the ISVs.

  13. Wayne says:

    I've been here long enough to answer that one, Alex.  What if the vendor doesn't exist anymore?  What if the problem is in version 2000 of their product and they've sold 3 new versions since then?  And so on…

  14. Anonymous Person says:

    If Microsoft does the "right thing" and breaks code that is broken, people blame Windows when things stop working. If Microsoft does the "right thing" and fixes other people's broken code, people won't stop complaining about "bloat" and people won't fix their code. And even if they do everything perfectly, people expect them to use a time machine to port new code back to when the 286 was a shiny new and powerful processor.

    There are times I'm glad I don't work for Microsoft.

    However, there is one obvious interim workaround. The language lawyers need to stand in front of the regular lawyers, when the revolution comes.

  15. Worf says:

    So what would happen if two popular shell extensions required mutually exclusive fixes? I don't think the app shims work at the DLL level…

    [How could you get into that situation if you assume that both shell extensions worked at the time they were released? -Raymond]
  16. Leo Davidson says:

    I agree that it is good to re-order COM calls to avoid breaking stuff that user depend on (even if they are poorly written).

    However, since this isn't documented anywhere it makes it a nightmare for anyone else trying to host the same objects.

    ActiveX control hosting is a great example. Unless you painstakingly work out which order IE queries & initialises interfaces, through trial & error and writing debug/test objects that you feed to IE to work out what it does, you'll run into all sorts of problems like the one described.

    Sometimes the order of calls required to make one component work breaks another, too. There are some bits in the registry which certain objects are flagged with to deal with that, but I suspect some of it is also hardcoded within the hosting components.

    Since COM objects only tend to be tested against one host (IE for ActiveX, Explorer for shell extensions, etc.), it's easy for them to accidentally rely on non-contractual behaviour of those hosts. And since new versions of those hosts are then tested against those buggy COM objects to avoid breaking them, those behaviour quirks will never change. But nobody else knows what those behaviours are. They're effectively an undocumented part of the API.

    Of course, very few people write their own hosts and it absolutely makes sense to push the complexity towards the host rather than the hosted objects. I'm not arguing against that. I just wish all this voodoo was documented somewhere.

    It doesn't have to be formal documentation or part of an API contract. Unofficial guidance would be enough, although if it is something that is never going to change because it would break things then it might as well be part of the API contract. We might not want it to be, but it is.

  17. cheong00 says:

    @Anonymous Person: That's why it's a good idea to push companies register the "Software Publisher's Digital ID for Authenticode" of your company to Windows Error Reporting.

    If there's crash related to an extension that vendor information could be found, at least Microsoft might be able to contact the vendor directly.

  18. grumpy says:

    @640k: yeah, but you've gotta realize that you can't handle it before you decide tp switch. How many bad coders are actually aware that they're bad coders…? I've met none.

  19. Neil says:

    I'm going to assume that these gotchas eventually make it into documentation updates, but I was wondering if there are programs that can test your components to see if they depend on these sorts of undocumented behaviours.

    [The rules for IUnknown are already spelled out in the existing documentation. It doesn't seem necessary to list every possible way of violating them. -Raymond]
  20. jader3rd says:

    "[And if that extension is essential to your business? -Raymond]"

    It's okay for people to not update to the latest and greatest version of Windows. If a company looks at the next version of Windows and finds out that their favorite shell extension doesn't work with it they can A) ask the vendor for an upgrade, B) not upgrade the OS. At this point there's a cost/benefit decision. Do the new features of the OS justify an upgrade? Are they worth more than the shell extension? Does something else provide the functionality that we need in the shell extension, even if it's not as nice?

    While I love the work that Windows does to be backwards compatible, this is one of those situations where I feel it would be a benefit to Windows, in the long run, if they didn't do this. Because then, people will start to expect greater things from new versions of Windows, instead of thinking a new version of Windows is lipstick put on another layer of old bloat.

    I love how a well written program written decades ago can run on modern Windows. I don't like how a poorly written program written decades ago can run on modern Windows.

    [And what if the vendor is out of business, or the vendor wants to charge money, or (as is common with LOB applications) you are the vendor and there's nobody left at your company who understands the code (or you lost the source code). It seems that you have forgotten that Windows exists in order to make customers happy and productive, not to piss off its customers in pursuit of the world's most architecturally pure and beautiful operating system. -Raymond]
  21. Gabe says:

    Jader3rd: Are you aware that, even nearly 10 years after its release and 5 years after being superceded, Windows XP is still the most popular PC OS on the planet? And that's with MS doing everything humanly possible to make it compatible with old apps!

    It turns out that the answer to "Do the new features of the OS justify an upgrade?" is "No" at least 99% of the time. If MS took the "It's okay for people to not update to the latest and greatest version of Windows" attitude, Windows 7 might be 10% awesomer, but that wouldn't help you as a developer because 90% of your customers wouldn't be using it.

  22. cheong00 says:

    @Gabe: Actaully there's more problem than that.

    For some industries (such as Banking/Stock trading related ones) there is security requirement that the company should not use any of the OSs that no longer has security updates. When WinXP goes out the extended support period in 2014, while some of the larger companies can still pay for extension beyond the extended support period, most other companies that have software that can't run on Vista or above will face difficult decision: whether to risk system rewrite of old components or allow the client OS continue to run without further security fixes, hence risk having a virus friendly network environment inside the company. (Note that if people write malicious software that take advantage of OS vulnerabilities to spread, antivirus softwares may not be able to stop them effectively.

    So for the benefit of most OS users, people will want Microsoft to continue make those software "depending on undocumented implementation details" to run on newer versions.

  23. skSdnW says:

    @Cesar: Exactly what I was thinking. This thing was probably designed back in the 9x/NT split and 9x would only use the A version. I'm sure NT4&5 would ask for the W version first and if it failed, it would fall back to A. Why this was changed in Vista makes little sense to me since you want to use the W version if the object implements it. (Even if it was rewritten from scratch it would still make sense to ask for W first) Why Vista wants both versions is a even bigger question…

  24. Cesar says:

    I can imagine how the weird behavior in the first example happens.

    The class probably has an instance variable like:

    IUnknown *pExtractIconImpl;

    Which is initialized to NULL in the constructor.

    When receiving a QueryInterface for IID_IExtractIconA, if that pointer is NULL, it creates an IExtractIconA and stores it there. The same for IExtractIconW. If the pointer is already set, it is returned directly. Basically, simple lazy initialization of the part of the object which deals with whatever IExtractIcon* does. A perfectly legitimate thing to do, and not that complex.

    The only mistake in this code is to use a single pointer for both interfaces, perhaps due to a mistaken assumption that Unicode hosts will only want to create the IExtractIconW, and that ANSI hosts will only want to create the IExtractIconA. The fix is simply to use two instance variables instead of one.

  25. Leo Davidson says:

    @WndSks: My guess is that the code in Explorer was changed to gets all the interfaces it might want to use at once, so all that code is in one place and separated from the logic of actually using the object.

    That's a perfectly valid thing to do. QueryInterface on an in-process object is virtually free, so if you can simplify the code by doing that then it makes sense to do so.

    This obviously depends how the object is being used, but there's nothing that weird about getting an interface on an object and then deciding not to use it in the end.

  26. Alfred E. Neumann says:

    @640k: If you can't handle COM's complexity – use .net, …

    You just replaced one problem with two others!

  27. Yuhong Bao says:

    [Yeah, it's like these people went out of their way to do something weird. I mean, really, adding weird logic to your QI? -Raymond]

    Probably because they were clueless that C++ had such a feature.

Comments are closed.