Adding flags to APIs to work around driver bugs doesn’t scale


Some people suggested, as a solution to the network interoperability compatibility problem, adding a flag to IShellFolder::EnumObjects to indicate whether the caller wanted to use fast or slow enumeration.

Adding a flag to work around a driver bug doesn't actually solve anything in the long term.

Considering all the video driver bugs that Windows has had to work around in the past, if the decision had been made to surface all those bugs and their workarounds to applications, then functions like ExtTextOut would have several dozen flags to control various optimizations that work on all drivers except one. A call to ExtTextOut would turn into something like this:

ExtTextOut(hdc, x, y, ETO_OPAQUE |
           ETO_DRIVER_REPORTS_NATIVE_FONTS_CORRECTLY |
           ETO_DRIVER_WILL_NOT_DITHER_TEXT_DURING_BLT |
           ETO_DRIVER_DOES_NOT_LIE_ABOUT_LOCAL_TRANSFORMS |
           ETO_DRIVER_DOES_NOT_CRASH_WITH_STOCK_BRUSHES,
           &rcOpaque, lpsz, cch, NULL);

where each of those strange flags tells GDI that you want the performance benefit it unlocks, because you know that you aren't running on a version of the video driver that has the particular bug the flag was created to protect against.

And then (still talking hypothetically) with Windows Vista, you find that your program runs slower than on Windows XP. Suppose a bug is found in a video driver where strings longer than 1024 characters come out garbled. Windows Vista therefore contains code to break all strings up into 1024-character chunks, but as an optimization you can pass the ETO_PASS_LONG_STRINGS_TO_DRIVER flag to tell GDI not to use this workaround. Your Windows XP program doesn't use this flag, so it now runs slower on Windows Vista. You'll have to ship an update to your program just to get back to where you were.
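To make the hypothetical concrete, here is a minimal sketch in portable C of what such a chunking workaround with an opt-out might look like. Everything in it is invented for illustration: `MAX_DRIVER_RUN`, `draw_run_fn`, `draw_text`, and `record_longest_run` are hypothetical names, and the `pass_long_strings` parameter stands in for the imaginary ETO_PASS_LONG_STRINGS_TO_DRIVER flag. It is not how GDI works.

```c
#include <stddef.h>

/* Hypothetical limit: the imaginary buggy driver garbles longer runs. */
#define MAX_DRIVER_RUN 1024

/* Hypothetical stand-in for the driver's text-output entry point. */
typedef void (*draw_run_fn)(const char *text, size_t len, void *ctx);

/* Sends the string to the driver. By default it is split into runs of
   at most MAX_DRIVER_RUN characters; a caller that opts out gets a
   single call. Returns the number of driver calls made. */
size_t draw_text(const char *text, size_t len, int pass_long_strings,
                 draw_run_fn draw_run, void *ctx)
{
    size_t calls = 0;

    if (pass_long_strings) {
        draw_run(text, len, ctx);   /* caller vouches for the driver */
        return 1;
    }
    while (len > 0) {
        size_t run = len < MAX_DRIVER_RUN ? len : MAX_DRIVER_RUN;
        draw_run(text, run, ctx);
        text += run;
        len -= run;
        calls++;
    }
    return calls;
}

/* Tiny fake driver: remembers the longest run it was handed.
   ctx must point to a size_t. */
void record_longest_run(const char *text, size_t len, void *ctx)
{
    size_t *longest = ctx;
    (void)text;
    if (len > *longest)
        *longest = len;
}
```

A program written before the flag existed always takes the slow path: a 2500-character string costs three driver calls instead of one, and only a recompiled program that passes the opt-out gets the old behavior back.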

It's not limited to flags either. By this philosophy of "Don't try to cover up for driver bugs and just make applications deal with them", you would have had the following strange paragraph in the FindNextFile documentation:

If the FindNextFile function returns FALSE and sets the error code to ERROR_NO_MORE_FILES, then there were no more matching files. Some very old Lan Manager servers (circa 1994) report this error condition prematurely. If you are enumerating files from an old Lan Manager server and the FindNextFile function indicates that there are no more files, call the function a second time to confirm that there really are no more files.
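What that paragraph would demand of every caller can be sketched in portable C. The enumerator below is a simulation, not the Win32 API: `FakeEnum`, `fake_find_next`, `count_files_naive`, and `count_files_with_retry` are hypothetical names, and the fake server is assumed to give its premature "no more files" answer exactly once.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an enumeration against a remote server. */
typedef struct {
    const char **names;  /* files the server really has */
    size_t count;        /* number of entries in names */
    size_t next;         /* index of the next entry to report */
    size_t lie_at;       /* index at which the server falsely reports the end */
    bool lied;           /* set once the false report has been given */
} FakeEnum;

/* Mimics the shape of FindNextFile: returns true and stores a name,
   or returns false to claim that there are no more files. */
bool fake_find_next(FakeEnum *e, const char **name)
{
    if (e->next == e->lie_at && !e->lied) {
        e->lied = true;
        return false;            /* premature "no more files" */
    }
    if (e->next >= e->count)
        return false;            /* genuine end of the listing */
    *name = e->names[e->next++];
    return true;
}

/* The loop applications write today: the first "no more files" is trusted. */
size_t count_files_naive(FakeEnum *e)
{
    size_t found = 0;
    const char *name;
    while (fake_find_next(e, &name))
        found++;
    return found;
}

/* The loop the hypothetical documentation would require: "no more
   files" is believed only when a second call repeats it. */
size_t count_files_with_retry(FakeEnum *e)
{
    size_t found = 0;
    const char *name;
    for (;;) {
        if (fake_find_next(e, &name)) { found++; continue; }
        if (fake_find_next(e, &name)) { found++; continue; }
        return found;
    }
}
```

Every program that enumerates files would need the second loop, and every program that forgot it would silently see a truncated listing.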

Perhaps it's just me, but I don't believe that workarounds for driver issues should become contractual. I would think that one of the goals of an operating system would be to smooth out these bumps and present a uniform programming model to applications. Applications have enough trouble dealing with their own bugs; you don't want them to have to deal with driver bugs, too.

Comments (20)
  1. AC says:

    Yes, indeed! The "driver" or whatever under the API surface should recognize the problem with "old server", that’s not the application’s business. That’s why I don’t understand why Explorer was mentioned in the original story anyway, why isn’t this all to be solved somewhere "lower"? If I’m only using API from the application, I certainly don’t want to know if it’s "old server" or not, and if the particular server "responds incorrectly to the fast api and correctly to the older one".

  2. dave says:

    Another way to look at such flags is that they represent a complete violation of the software layering.

    The average Win32 app isn’t supposed to even know that video drivers exist, much less be concerned with the capabilities and/or defects that a particular version of a particular driver might have.

  3. Brad Corbin says:

    A related question:

    If the "fix" has to occur at the API call level, and not the application level, does that mean that we are prevented from interacting with the user?

    The original discussion had some options like "notify the user that this list might be incomplete." Would this have to be an application-level communication (Explorer), or could this message come from the IShellFolder::EnumObjects directly?

    It’s a bit of a violation of best practices, but if the user dialog could come directly from the API, it would bypass the need for the application itself to be aware of and throw the dialog. Remember, we’re not just talking about Windows Explorer here, we’re talking about every application that happens to call IShellFolder::EnumObjects.

  4. Phylyp says:

    I don’t believe that workarounds for driver issues should become contractual

    That’s a neat summary.

  5. Joe Dietz says:

    The solution lies in the CIFS redirector file system code (kernel) and lanmanserver code, NOT in user-space.  Applications should simply never need to know any of this, not even NT subsystems should need to know any of this.  Simply add yet-another-compatibility attribute to the vista/w2ksp3 lanmanserver that clearly indicates that this variant of lanmanserver is hip with fast enumeration.  So the redirector looks for this, if present all is well, if not revert to slow enumeration.  SAMBA will catch up in time and implement this new attribute and when it does, its fixed fast enumeration code will work.  Then as various vendors finally roll out newer versions of SAMBA that contain both the fix and the new attribute they will just start working.

  6. Mike Dimmick says:

    Brad: there are many, many applications which use the FindFirstFile/FindNextFile API, and not all of them will have UI. You really, really should not throw UI in the middle of a data processing API, and certainly not in a legacy API!

    Besides, which window would the hypothetical dialog be parented to? You can’t use the desktop window because there’s a good chance other applications will obscure it. A thread may manage multiple top-level windows, so if you just pick one, you may end up picking the wrong one. If you make it always-on-top parented to the desktop, you’ll annoy the user because now it’s obscuring some other application.

    This is completely avoiding the fact that the UI is unlikely to be of any use to the end user. You’ll just leave the user with a strong sense of unease, but they’ll probably just dismiss the dialog without reading it. When quizzed by a more experienced/technically proficient user who observes it they’ll probably just say, ‘oh, it always says that’ without actually having reported it to anyone. I’ve experienced this.

  7.  

    In Raymond’s post today (Adding flags to APIs to work around driver bugs doesn’t scale), Raymond…

  8. Gabe says:

    Adrian is right about raising the bar on vendors. The problem is that his example is printers, which are somewhat unique. There’s no automated test you can run to verify that the output you get on paper is what you are supposed to get.

  9. Tim says:

    "The problem is that his example is printers, which are somewhat unique. There’s no automated test you can run to verify that the output you get on paper is what you are supposed to get."

    No, but you can do a glance test :)

    I once worked on the printing system for a product – never, ever accept that job, btw; the printer driver bugs will give you nightmares.

    A particular HP deskjet printer would sometimes go crazy ape bonkers and produce a page that looked like a toddler had scribbled all over it if you dared to use the Win32 GDI bezier path facility.

    That stupidity was topped only by the fantastic PostScript printer driver that shipped with Windows 95 – if you used the GDI beziers to draw curves, the resulting PostScript output was full of flattened line segments…d’oh!

    I cursed MS that day…then checked the credits to find that the driver was supplied by Adobe…oops!

    Not to mention the bug in NT4 GDI that would crash the kernel if you asked it to draw a particular bezier curve…luckily I wasn’t the poor person who had to spend 3 days rebooting their machine while they tracked down exactly what curve would make NT4 freeze and die.

    We disabled GDI paths in the end – they just weren’t ready for production, sadly.

    One thing that puzzles me about the original network enumeration problem, and this has only just occurred to me –  is, if even XP doesn’t try ‘fast queries’, then…what MS product does?  What I mean is, presumably the Samba guys reverse engineered this functionality – but from what?  Does 2003 Server support these fast queries?  If so, what was the point, if no clients ever used them?  I’m curious. :)

  10. Mark Steward says:

    Tim – as https://bugzilla.samba.org/show_bug.cgi?id=3526 says, NtQueryDirectoryFile with the level FileIdBothDirectoryInformation exposes it.  The FindXxxFile calls don’t use it yet, but they presumably will on Vista, and SMB2 might make it more important (although I know next to nothing about SMB2).  However, any program can run into the bug if it uses NtQueryDirectoryFile (the post was about rewriting Cygwin to use it).  Perhaps it’s used for DFS or offline files?

  11. Eric Newton says:

    This is why I program in the .NET VM… All of these silly problems are abstracted away.

    When I write a program, I’m simply going to call into a system function, and if there’s any OS/Driver oddity, then I assume that the VM will handle it…

    Unless of course I really need the optimization, then obviously I’ll probably extern call into them myself and acknowledge the fact that it’ll be OS-dependent.

  12. Adrian says:

    Sounds like the Windows team does a lot for driver compatibility.  

    But that doesn’t scale either.

    Printer drivers, for example.  There are many, many buggy drivers out there whose quirks the OS doesn’t manage to shield from the application.

    One product I worked on shipped with a database of printer drivers and lists of which compatibility hacks had to be turned on.  Nearly every driver we ever tested required at least one workaround.  Printing was a huge tech support cost.  We built a lab with three hundred printer models and three full-time QA testers for compatibility testing.

    Of course, there were always new drivers and printers coming out, so we had a way for the user to turn on the workarounds (as directed by tech support).

    The *only* scalable solution to all of these compatibility nightmares is to raise the bar on driver, application, and OS vendors to correctly support their sides of the API contracts.

  13. steveg says:

    Eric: said ".NET == no silly problems"

    Oh, how you young innocent types make me laugh.

    Your current .Net code is probably going to be jumping through silly hoops in the future when it’s running under .NET 2015.

    Read this article about .NET 2.0 backwards compatibility (read the two sidebars, they’re very interesting real-life examples).

    http://msdn.microsoft.com/msdnmag/issues/06/03/CLRInsideOut/default.aspx

    Quoting the article, this was a proposed (and rejected) fix to String.StartsWith() to work around a compatibility problem:

    if (result == true && input.Length == 6 && this.Length == 19 &&
        this.Equals("configProtectedData") &&
        MethodInfo.GetCurrentMethod().Name.Equals("VerifySectionName"))
    {
        return false;
    }

    Nothing’s immune from silliness.

  14. Mark Steward says:

    I certainly agree with "I don’t believe that workarounds for driver issues should become contractual" – in fact, the whole post is spot on – but various programs *do* need to know about the underlying hardware’s capabilities (especially games, although they’re a bit special).  There are loads of places where you can know enough of your system to decide between two types of reliability, just as you can choose between UDP and TCP.

    The redirector doesn’t (yet) guarantee an accurate file listing, and FindXxxFile and IEnumIDList::Next are already contracted to pass on error codes.  However, users demand an accurate list, so I feel a flag  RETRY_ON_TRANSITORY_ERROR would be useful, removing the work of determining whether a restart would work from the programmer.  But it’s all a matter of layer and utility – compare/contrast with http://blogs.msdn.com/489807.aspx , especially Explorer’s behaviour.

    Not that I’m suggesting any of this for the current bug, as that appears to be fixable at the redirector level, or by simply not using fast mode (although I took that as a given).

    Mark

  15. Gabe says:

    Tim, I was pointing out the fact that the OS cannot automatically hide printer driver flaws from your application because it does not have any way to test for flaws.

    I can write some code to verify if my disk driver is writing the correct data to the disk or if my graphics driver is writing the correct data to the frame buffer, but I cannot write anything that will tell me if my printer’s driver is writing the correct data to paper.

  16. grg says:

    The check should be

    if (this.Length == 19 && input.Length == 6 && result == true && this.Equals("configProtectedData") &&
        MethodInfo.GetCurrentMethod().Name.Equals("VerifySectionName"))
    {
        return false;
    }

    the check for length == 19 is the easiest way for the if() to be proven false

  17. Anon says:

    > The check should be
    >
    > if (this.Length == 19 && input.Length == 6 && result == true && this.Equals("configProtectedData") &&
    >     MethodInfo.GetCurrentMethod().Name.Equals("VerifySectionName"))
    > {
    >     return false;
    > }
    >
    > the check for length == 19 is the easiest way for the if() to be proven false

    Perhaps… But depending on the String.Length implementation, result==true is almost certainly the cheapest check to do, and input.Length==6 may very well be faster than this.Length==19. But of course, there’s no way to show any of this without knowing all the fine details of the inner workings of the string class and the .NET optimizer.

  18. peterchen says:

    > result==true is almost certainly the cheapest check to do

    but it is not very decisive.

    Anyway, such a fix would be a dangerous precedent.

  19. FYI: This is the fix that tridge suggested to Raymond. No reason why this shouldn’t work – and be completely transparent to application code.

    No reason to bother userspace code about it, no reason for GUI changes or looking for specific versions or detecting Samba as opposed to any other server, no need to keep things in "slow" mode now the bug is fixed.

    Jeremy

    From tridge:

    "If we had run across the error you described (INVALID_LEVEL from a continue) then we would have added a bit flag on the current connection structure to mark this connection so it won’t use that level in future, then repeat the search using a different level. That means you would get one useless search on the network with each connection to a buggy server, but no impact against non-buggy servers and no user observable effects. The denial of service attack you mention with this type of fix doesn’t happen as the extra bit is per-connection, not long lived (trying to remember long lived info about specific servers is a losing game)."

  20. Another example of changing the rules after the game is over.
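
The per-connection fallback quoted in comment 19 can be sketched in portable C. This is a simulation under stated assumptions, not Samba's or Windows' actual code: `Connection`, `InfoLevel`, `SearchStatus`, `send_search`, and `enumerate_directory` are hypothetical names, and the fake server rejects the fast level on the search itself rather than on a continuation, which simplifies the scenario in the quote.

```c
#include <stdbool.h>

/* Hypothetical status codes and information levels, for illustration only. */
typedef enum { SEARCH_OK, SEARCH_INVALID_LEVEL } SearchStatus;
typedef enum { LEVEL_FAST, LEVEL_SLOW } InfoLevel;

/* Hypothetical per-connection state. The flag lives and dies with the
   connection, so nothing about the server is remembered long term. */
typedef struct {
    bool server_is_buggy;   /* simulation only: how the fake server behaves */
    bool avoid_fast_level;  /* set after the first INVALID_LEVEL reply */
    int searches_sent;      /* simulation only: counts round trips */
} Connection;

/* Simulated server: a buggy one rejects the fast level. */
static SearchStatus send_search(Connection *c, InfoLevel level)
{
    c->searches_sent++;
    if (c->server_is_buggy && level == LEVEL_FAST)
        return SEARCH_INVALID_LEVEL;
    return SEARCH_OK;
}

/* Try the fast level unless this connection is already marked. On
   INVALID_LEVEL, mark the connection and repeat with the slow level. */
SearchStatus enumerate_directory(Connection *c)
{
    if (!c->avoid_fast_level) {
        SearchStatus status = send_search(c, LEVEL_FAST);
        if (status != SEARCH_INVALID_LEVEL)
            return status;
        c->avoid_fast_level = true;  /* one wasted search per connection */
    }
    return send_search(c, LEVEL_SLOW);
}
```

In this sketch the first enumeration against a buggy server costs two searches and every later one on the same connection costs one, while a well-behaved server never sees the slow level at all.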

Comments are closed.