Be very careful if you decide to change the rules after the game has ended


One suggestion for addressing the network compatibility problem was returning an error code like ERROR_PLEASE_RESTART, which means “Um, the server ran into a problem. Please start over.” This is basically the same as the “do nothing” option, because the server is already returning an error code specifically for this problem, namely, STATUS_INVALID_LEVEL. Now, sure, that error doesn’t actually mean “Please try again.” It actually means “Sorry, I can’t do that.” This is the error code that is supposed to come back if you ask a server to go into fast mode and it doesn’t support fast mode.

But the effect from a coding standpoint is the same. “If FindNextFile returns the error xyz, then the server ran into a problem and you should start over.” Call xyz “ERROR_PLEASE_RESTART” or “STATUS_INVALID_LEVEL” or “PURPLE_LILACS”. No matter what you pick, the net effect is the same: Existing code must be changed to specifically check for this new error code and react accordingly. Programs that aren’t updated will behave strangely.
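To make the difference between old and updated callers concrete, here is a minimal sketch in portable C. Everything in it is hypothetical: the status codes, the `fake_server` structure, and the `fake_find_first`/`fake_find_next` helpers are stand-ins invented for illustration and are not the real FindFirstFile/FindNextFile API. The sketch also assumes the simulated server fails only once and then stays in slow mode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch; not real Win32 values. */
enum status { ST_OK, ST_NO_MORE_FILES, ST_PLEASE_RESTART };

/* Simulated server: holds `total` entries and, while fast mode is on,
   fails once with ST_PLEASE_RESTART when asked for entry `fail_at`. */
struct fake_server {
    int total;
    int fail_at;
    int fast_mode;
    int next;
};

static void fake_find_first(struct fake_server *s) { s->next = 0; }

static enum status fake_find_next(struct fake_server *s, int *entry)
{
    assert(s != NULL && entry != NULL);
    if (s->fast_mode && s->next == s->fail_at) {
        s->fast_mode = 0;   /* assumption: server stays in slow mode afterwards */
        return ST_PLEASE_RESTART;
    }
    if (s->next >= s->total)
        return ST_NO_MORE_FILES;
    *entry = s->next++;
    return ST_OK;
}

/* Old client, written before the new code existed. Any status other
   than ST_OK ends the loop, so the listing is silently truncated. */
static int old_client_count(struct fake_server *s)
{
    int entry, count = 0;
    fake_find_first(s);
    while (fake_find_next(s, &entry) == ST_OK)
        count++;
    return count;
}

/* Updated client: knows the new rule, discards its partial results,
   and starts the enumeration over. */
static int new_client_count(struct fake_server *s)
{
    int entry, count = 0;
    fake_find_first(s);
    for (;;) {
        enum status st = fake_find_next(s, &entry);
        if (st == ST_OK) {
            count++;
        } else if (st == ST_PLEASE_RESTART) {
            count = 0;
            fake_find_first(s);
        } else {
            break;          /* ST_NO_MORE_FILES: normal end of listing */
        }
    }
    return count;
}
```

With a simulated directory of 250 entries that fails at entry 100, the old client reports 100 entries and the updated client reports 250. The old client is not buggy by the rules it was written to; it simply has no way of knowing that this particular failure means "start over."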

And that’s the issue faced by today’s topic: When do you decide that a problem requires you to change the rules of the game after it has ended?

Programs out there were written to one of many sets of rules. Most of them were written to Windows XP’s rules. Some were written to Windows 2000’s rules. Even older programs may have been written to Windows 95’s or Windows 3.1’s rules. One aspect of backwards compatibility is accommodating programs that broke the rules and got away with it. But here, the issue is not fixing broken programs; it’s keeping correct programs correct.

If you introduce a new error code and specify an unusual recommended action (i.e., something other than “fail the operation”), then all programs written prior to the introduction of this rule have suddenly become “wrong” through no fault of their own. Depending on how “wrong” they are, the severity of the problem can range from inconvenient to fatal. In the Explorer case, the directory comes up wrong the first time but fixes itself if you refresh. But if a .NET object’s enumerator suddenly threw a new ServerFailedMustRestartEnumeration exception, you’re probably going to see lots of programs crash with unhandled exception failures.

At this point, the usual suspects come to the surface: How will users get updated programs that conform to the new rules? The original program’s author may no longer be alive. The source code may have been lost. Or the knowledge necessary to understand the source code may have been lost. (“This program was written by an outside contractor five years ago. We have the source code but nobody here can make heads nor tails of it.”) Or the program’s author may simply not consider updating that program to Windows Vista to be a priority. (After all, why bother updating version 1.0 of a program when version 2.0 is available?)

Mind you, Microsoft does change the rules from time to time. Pre-emptive multi-tasking changed many rules. The new power management policies in Windows Vista certainly changed the rules for a lot of programs. But even when the rules change, an effort is usually made to continue emulating the old rules for old programs. Because those programs are following a different set of rules, and it’s not nice to change the rules after the game has ended.

Comments (35)
  1. Rosyna says:

    Isn’t this what unexpected error handlers are for?

    Is every single error case that windows returns documented well and only one level deep? (As in a function will never return an error returned by a function it calls?)

  2. e.thermal says:

    consider me dumb, but i’m just curious how this might affect me or the programs I support.  For example if a program is written that uses built-in .net runtime directory and file enumerating functions, aside from the odd exception would this be visible?  Or are these programs/problems for people who are writing lower level functions?   On another note, do the built-in functions even change how they operate when operating over a network?  Is there a broker that kicks in and translates local style requests into bandwidth/latency friendly remote calls?

  3. Adam Gates says:

    Blah why not just have the error point to an online archive of your blog?

  4. Toma Bussarov says:

    Yes, I fully understand your point of view, Raymond and my opinion is the same. Many people just don’t think enough before they end up with solution.

    That’s why my solution to the network compatibility is different:

    http://blogs.msdn.com/oldnewthing/archive/2006/03/31/565878.aspx#567123

  5. PatriotB says:

    "Most [programs] were written to Windows XP’s rules."

    I’d say that statement is a bit optimistic.  The huge mass of programs that fail under limited users all were designed for 95’s rules; even NT and 2000’s rules would have had them working under LUA.

  6. oldnewthing says:

    Toma: In other words, you want to make a driver workaround contractual. I discussed this last week.

  7. gher5 says:

    "At this point, the usual suspects come to the surface: How will users get updated programs that conform to the new rules? The original program’s author may no longer be alive. The source code may have been lost. Or the knowledge necessary to understand the source code may have been lost. ("This program was written by an outside contractor five years ago. We have the source code but nobody here can make heads nor tails of it.") Or the program’s author may simply not consider updating that program to Windows Vista to be a priority. (After all, why bother updating version 1.0 of a program when version 2.0 is available?)"

    This isn’t a problem with Free software.  The user can just edit the source code and recompile!

  8. Vorn says:

    You forgot one in your list, and it’s the one that this problem we’ve been talking about all week has:

    The program cannot be patched at all (despite a fix being available) because it is burned into something that cannot be changed.

    Every version of every program that has ever come out on CD or on write-protected floppies, as must every program ever burned into a device Windows might have to talk to.  It sucks, yes, and it means that improving Windows is approximately the hardest programming job on the face of the earth.

    Vorn

  9. sfb says:

    """consider me dumb, but i’m just curious how this might effect me or the programs I support.  For example if a program is written that uses built-in .net runtime direcoty and file enumerating functions, aside from the odd exception would this be visible?"""

    If the call returned a new error, such as "Error_Oldnewthing_Bug_Found", and the .Net runtime turned it into a new exception, and you were running a program that was not written to handle unexpected errors, you would get this kind of popup:

    http://www.devcity.net/net/files/articles/practical_oop_1_5.jpg

    and the program would just close.

    Good programs would be written to catch unexpected errors, and either close down cleanly or prompt you to retry whatever you were doing. This could just be annoying.

  10. Gabe says:

    "PURPLE_LILACS"? That sounds like something a girl would write. Raymond, I’m afraid you xyz like a girl.

  11. BryanK says:

    sfb: The popup would be sort of similar to that, but it wouldn’t be exactly the same.  That’s the popup that you get when debugging the program under VS.net 2003 at least; end-users get one of two other dialogs, depending on what caught the exception.

    If the exception gets caught by some code somewhere "down" in the call stack from the Application.Run(Form) method (but "up" in the stack from your form’s WndProc, if you override that), then the user gets a dialog with buttons for "Quit" (which terminates the process), "Continue" (which restarts the window loop from the beginning, ignoring anything you were doing at the time of the exception), and "Details", which shows the type and stack backtrace of the exception.  (This is a ThreadExceptionDialog, I think.)

    If the exception got all the way back to the WinMain of the runtime (because it happened outside the event loop or because it happened in the "Application.ThreadException" handler that you can register), then the user gets a dialog saying only something about "an unhandled exception has occurred" with an OK and Cancel buttons.  (Cancel tries (and fails) to launch a debugger, OK exits the process.)

    Yes, the user gets a dialog.  But it isn’t that one.  ;-)

  12. Nawak says:

    @gher5

    Yes, let’s tell users to recompile and flash a new firmware for their network attached storages.

    Their Vista Install experiences will be great and they will tell all their friends how great it is!

  13. orcmid says:

    It’s even more exciting when the breaking change is against something like Microsoft Office (and WordPerfect and …) by a group that didn’t vet their downlevel compatibility when making a level 2.0 interface that monkeyed with some of the level 1.0 APIs: http://ODMA.info/faq/2000/07/Q000702c.htm was produced because I just found a case of this in production software.

    Unfortunately, treating every unexpected result code as a hard failure is particularly messy when some of the new codes are simple warnings or advisories of other kinds.  Also, some software in the case I’m dealing with apparently treats unexpected results as silently-obeyed "cancels" so the end-user sees nothing and has no idea why something like "File | Open …" quit working.  So users will think that it is that @#(? Office software that is broken.

  14. Jules says:

    I still think this is the best option.  Return an error code; existing applications should display an error message (or do something else useful) when they see it; it must have always been possible for an IO error to occur at this point (what if the network went down in the middle of a slow mode enumeration?), so they should have been written to account for this.  Update all the apps included in Windows to recognise this error and start the enumeration again.

    Include a registry entry to switch off fast mode.  Document it in the MSDN entries for FindFirstFile/FindNextFile and a knowledge base article.  Have explorer record an event the first time in each session it encounters the problem so that alert admins can spot it; the message should give brief details and a reference to the KB article. (Actually, I guess it should be possible for the SMB FS driver to record the event; this might be better, as explorer isn’t necessarily going to encounter the area.)

  15. Cooney says:

    Given the new information that raymond has provided, I don’t really care if embedded SAMBA is stuck in slow mode, as the primary use case is small office environments. Default to slow mode – if a server has more than 100 files in a dir, then test it for fast compatibility and act accordingly.

    Error messages are the wrong solution, because they are, as raymond says, changing the rules after the game ends. SAMBA worked fine with the old interface – you have no business breaking them. If you’re interested in future compatibility, then publish the specs and preannounce changes like this.

  16. memodude says:

    How about restarting automatically in slowmode when ERROR_INVALID_LEVEL is received unless a flag disabling that functionality is specified in the call to FindNextFile?

  17. memodude says:

    What I mean in my previous post is:

    FindFirst <– 128 results are returned in fastmode

    FindNext <– fails, retries in slow mode, ignores the first 128 answers, and returns the second 128

    FindNext <– fastmode was blacklisted in the previous FindNext call and therefore it requests the next 128 files in slowmode
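The retry scheme in this comment can be sketched in portable C as follows. This is only an illustration under stated assumptions: `struct server`, `server_open`, `server_read`, and `next_batch` are hypothetical names, the "buggy server" is simulated by failing every fast-mode read after the first batch, and the sketch assumes the directory contents and order do not change between the fast-mode attempt and the slow-mode restart.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH 128

/* Simulated server holding entries 0..total-1 (hypothetical stand-in). */
struct server { int total; int cursor; int fast; };

static void server_open(struct server *s, int fast)
{
    s->cursor = 0;
    s->fast = fast;
}

/* Read up to BATCH entries. Returns the number read, or -1 when the
   simulated bug strikes: fast mode breaks after the first batch. */
static int server_read(struct server *s, int *out)
{
    int n = 0;
    if (s->fast && s->cursor > 0)
        return -1;
    while (n < BATCH && s->cursor < s->total)
        out[n++] = s->cursor++;
    return n;
}

struct enumerator { struct server *srv; int delivered; };

/* Return the next batch to the caller, hiding the fast-mode failure:
   restart in slow mode, discard what was already delivered, continue. */
static int next_batch(struct enumerator *e, int *out)
{
    int n;
    assert(e != NULL && out != NULL);
    n = server_read(e->srv, out);
    if (n < 0) {
        int skipped = 0;
        server_open(e->srv, 0);            /* fast mode is now blacklisted */
        while (skipped < e->delivered) {   /* ignore already-delivered entries */
            n = server_read(e->srv, out);
            if (n <= 0)
                return 0;                  /* directory shrank; give up quietly */
            skipped += n;
        }
        n = server_read(e->srv, out);
    }
    e->delivered += n;
    return n;
}
```

For a simulated directory of 300 entries, three calls to `next_batch` return 128, 128, and 44 entries with no gap or duplicate, and the caller never sees an error. The cost is that the first batch is fetched twice, and the result is only correct if the listing is stable across the restart.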

  18. Nawak says:

    I wrongly supposed that EAGAIN was a possible (and already existing) return code, then.

    Now that I think about it, the occasion on which I had to retry on EAGAIN (and it was *normal*) were not that numerous… It is strange this errno stuck in my mind…

  19. Random Reader says:

    Re "unexpected error" handlers: wtf?

    The only safe thing a program can do on an unexpected error is quit.  If the creator wasn’t expecting an error, he can’t possibly know how to handle it.  And um, quitting is just as fatal as crashing.

    How does this help anyone?

  20. AC says:

    "This isn’t a problem with Free software.  The user can just edit the source code and recompile!"

    See point #2: " Or the knowledge necessary to understand the source code may have been lost." – I don’t see it as being any less applicable to OSS

  21. steveg says:

    "This isn’t a problem with Free software.  The user can just edit the source code and recompile! "

    Free doesn’t mean ‘source-code included’, nor does it mean ‘compiler included’.  eg: i can run Win3.1 apps on XP, but I can’t compile ‘em in Visual Studio 2003. So even if I had the source code to Tetris.exe or Reversi.exe (3.1 apps I still use) it wouldn’t do me any good. Same applies to other free apps in other o/s.

    (I can’t even load the 3.1 SDK anymore because I no longer have a 3.5" floppy drive).

  22. Toma Bussarov says:

    In reply to Raymond’s post:

    I agree with you. ‘Fast mode 2′ would become contractual problem if it is exposed as flag in the API. I didn’t mean that because it will be ‘change of the rules’ in some way.

    ‘Fast mode 2′ must be managed by the redirector (at SMB level) and not by the application itself. This way the change will be transparent to the apps.

    Such change might harm apps that already use ‘Fast mode’ because they would not get it until server is upgraded.But if there are such apps they already experience the Samba bug.

  23. Aaargh! says:

    Returning an error is obviously wrong, applications shouldn’t know anything about SMB or the errors it can create, that’s the responsibility of the filesystem abstraction layer (I suppose windows has one of those). That’s where the problem should be fixed.

    You can either retry the request in slow mode after failure and then continue giving the calling application the correct data or test if fast mode is available when accessing a server. It doesn’t really matter as long as it is hidden from the application, that’s the whole point of having a FS abstraction layer.

  24. Meikel says:

    > ‘Fast mode 2′ must be managed by the

    > redirector (at SMB level) and not by the

    > application itself. This way the change will

    > be transparent to the apps.

    Just as transparent as it goes from "Slow Mode" to "Fast Mode" with Vista’s current implementation. As I understood that problem, the API in question has no problem on XP as it’s not using "Fast Mode" internally.

    You create a new feature, you find it’s broken (even if it’s not your fault), you might need to implement it again.

  25. Dean Earley says:

    "Restart over" may also (however unlikely) cause potential infinite loops:

    1) enumerate 100 files

    2) next file causes "restart" error, goto 1
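One common guard against the loop described in this comment is to cap the number of restarts. The sketch below is a hypothetical illustration in portable C: `struct flaky_dir`, `dir_next`, and the cap of three restarts are all invented for this example, and a negative `failures_left` is used to simulate a server that fails at the same entry forever.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_RESTARTS = 3 };   /* arbitrary cap chosen for this sketch */

/* Simulated directory that reports "restart" at entry `fail_at`.
   failures_left > 0: fail that many times; < 0: fail forever. */
struct flaky_dir { int total; int fail_at; int failures_left; int next; };

/* Returns 1 and sets *entry on success, 0 at the end, -1 for "restart". */
static int dir_next(struct flaky_dir *d, int *entry)
{
    if (d->next == d->fail_at && d->failures_left != 0) {
        if (d->failures_left > 0)
            d->failures_left--;
        return -1;
    }
    if (d->next >= d->total)
        return 0;
    *entry = d->next++;
    return 1;
}

/* Count the entries, restarting on request but never more than
   MAX_RESTARTS times. Returns -1 if the cap is exceeded. */
static int count_with_bounded_restart(struct flaky_dir *d)
{
    int restarts = 0, count = 0, entry, r;
    assert(d != NULL);
    d->next = 0;
    while ((r = dir_next(d, &entry)) != 0) {
        if (r == 1) {
            count++;
            continue;
        }
        if (++restarts > MAX_RESTARTS)
            return -1;       /* give up instead of looping forever */
        d->next = 0;         /* start over and discard partial results */
        count = 0;
    }
    return count;
}
```

A directory that fails once is counted correctly after one restart; a directory that fails every time produces -1 after the cap is reached. The cap turns an infinite loop into an ordinary failure, but it does not make the enumeration succeed.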

  26. Adam says:

    BryanK: Offline I’ll say "Open Source". Online I tend to point people at the list of software freedoms as defined by a certain foundation, and try to explain the difference between Free Software and freeware.

    Being vague to avoid the mod-bot,

    Adam

  27. BryanK says:

    Adam and gher5: This is why I’ve stopped referring to the software I use every day as "free".  It is free, in both senses of the word, but because there *are* two senses, too many people get confused when I say "free".

    The term "open source" does not have this issue, and that’s pretty much the only reason I use that term.

  28. Norman Diamond says:

    I wonder about changing the rules after the game has ended.  These are different games played by Windows APIs, but do metarules allow applying the same rules to these games?

    As is almost well known, GetMessage returns a BOOL whose value can be TRUE, FALSE, or OTHER.  The maker of Visual Studio 2005 deliberately decided to continue promoting bugs involving OTHER, but sufficiently informed users can alter the generated code to handle it.  It looks to me like old rules only had TRUE or FALSE, and after the typesetting game was over, rules were changed to add OTHER.

    EnumProcesses returns a BOOL whose value can be TRUE or FALSE.  But the fact is there’s an OTHER possibility too.  MSDN recommends this:

    > There is no indication given when the buffer

    > is too small to store all process

    > identifiers. Therefore, if pBytesReturned

    > equals cb, consider retrying the call with a

    > larger array.

    Are there any possible values of TRUE that haven’t been used yet?  For example if -1 hasn’t been used yet, then it could be used now for OTHER.  Old apps would still work right because they would detect this nonzero value and interpret it as success, but if *pBytesReturned equals cb then they will take a slow fallback route whether it’s necessary or not, growing the array and retrying.  New apps would check for this particular value of OTHER and would know that in this case they have to grow the array and retry, but for other successful returns they would not have to grow and retry even if *pBytesReturned equals cb.

    ‘course if they do have to grow then there’s still no way to avoid a loop.  If the API would store a value larger than cb into *pBytesReturned, in an attempt to say how big the array has to grow, then some old apps might be corrupted by it.  (Under the old rules they could just use that returned value because it couldn’t ever be larger than cb.)
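The grow-and-retry pattern that the quoted MSDN text recommends can be sketched in portable C. This is a simulation only: `fake_enum_ids` is a hypothetical stand-in that mimics the described behavior (fill as much of the buffer as fits and never report truncation), and `enum_all` is an invented helper name. It is not the real EnumProcesses signature.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the API: writes as many IDs as fit in cb bytes and
   always reports success, with no indication of truncation. */
static int fake_enum_ids(unsigned *ids, size_t cb, size_t *bytes_returned,
                         size_t actual_count)
{
    size_t room = cb / sizeof(unsigned);
    size_t n = actual_count < room ? actual_count : room;
    size_t i;
    for (i = 0; i < n; i++)
        ids[i] = (unsigned)(i + 1);
    *bytes_returned = n * sizeof(unsigned);
    return 1;
}

/* Grow the buffer until the call leaves some of it unused, which is
   the only evidence that nothing was cut off. Caller frees the result.
   Returns NULL on allocation or API failure. */
static unsigned *enum_all(size_t actual_count, size_t *count)
{
    size_t cap = 4, got = 0;
    unsigned *ids = NULL;
    assert(count != NULL);
    for (;;) {
        unsigned *p = realloc(ids, cap * sizeof *p);
        if (p == NULL) {
            free(ids);
            return NULL;
        }
        ids = p;
        if (!fake_enum_ids(ids, cap * sizeof *ids, &got, actual_count)) {
            free(ids);
            return NULL;
        }
        if (got < cap * sizeof *ids)
            break;           /* buffer not full, so the list is complete */
        cap *= 2;            /* full buffer: may be truncated, so retry */
    }
    *count = got / sizeof *ids;
    return ids;
}
```

Note the wasted retry when the count happens to equal the capacity exactly: with 4 IDs and an initial capacity of 4, the loop still grows the buffer and calls again, because a full buffer is indistinguishable from a truncated one. That is the "slow fallback route whether it’s necessary or not" described above.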

  29. GregM says:

    > For example if -1 hasn’t been used yet,

    >then it could be used now for OTHER.  Old

    >apps would still work right because they

    >would detect this nonzero value and

    >interpret it as success,

    Only if they did

    if(FALSE == nResult)

       handle_failure();

    and not

    if(TRUE == nResult)

       handle_success();

  30. Norman Diamond says:

    Thursday, April 13, 2006 4:45 PM by GregM

    > Only if they did

    > if(FALSE == nResult)

    >    handle_failure();

    > and not

    > if(TRUE == nResult)

    >    handle_success();

    They had to.  If an application used BOOLs as integers then it sure did have to compare nResult to FALSE rather than comparing it to TRUE, because any non-zero value meant TRUE.  C defined Boolean operations that way on integers (and other types) and 99% of the time Microsoft has preserved it that way with BOOLs.  Look at MSDN pages that mention a zero return vs. a nonzero return.

    If an application used BOOLs as Booleans then the syntax would be more like this:

     if (!Result) {

       handle_failure();

     }

    or:

     if (Result) {

       handle_success();

     }

    By the way here’s another rule change I encountered today.  Trying to save a page from CodeGuru, Internet Explorer saved a bunch of files and then failed on one.  In order to change the rules consistently, it deleted everything it had saved up to that point and displayed an error message saying it couldn’t save the page.  (By the way a meta-bytheway:  a workaround for this problem starts by surfing while Administrator…)

  31. GregM says:

    Norman, remember the subject matter.  "had to" != "did", and if the API only ever returned FALSE or TRUE, then the program worked.  Adding a new return value, even if it is "allowed" by the docs, doesn’t mean that it’s not going to break anything.

    I keep finding

    if(TRUE == …)

    or

    if(true == …)

    in the code I work on, and it makes me cringe every time.  It works as long as the function only returns 0 or 1, but blows up when other values are returned.
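The two comparison styles discussed in this exchange can be put side by side in a small portable C sketch. The `BOOL`, `TRUE`, and `FALSE` definitions below mirror the usual integer-based convention so the example compiles without Windows headers; `api_old`, `api_new`, and the two caller functions are hypothetical, and the -1 "success" value is the proposal from the comments above, not documented behavior of any real API.

```c
#include <assert.h>
#include <stddef.h>

typedef int BOOL;            /* integer-based, as in the Win32 headers */
#define FALSE 0
#define TRUE  1

/* Old behavior: success is always exactly TRUE. */
static BOOL api_old(void) { return TRUE; }

/* Hypothetical new behavior: a different nonzero value for success. */
static BOOL api_new(void) { return -1; }

/* Caller that tests for equality with TRUE. */
static int caller_compares_to_true(BOOL (*api)(void))
{
    assert(api != NULL);
    return TRUE == api();
}

/* Caller that treats any nonzero value as success. */
static int caller_compares_to_false(BOOL (*api)(void))
{
    assert(api != NULL);
    return FALSE != api();
}
```

Both callers report success against `api_old`. Against `api_new`, the caller that compares to FALSE still reports success, while the caller that compares to TRUE reports failure, even though it worked correctly for as long as the function returned only 0 or 1.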

  32. Norman Diamond says:

    Friday, April 14, 2006 11:28 AM by GregM

    > I keep finding

    >  if(TRUE == …)

    > or

    >  if(true == …)

    > in the code I work on, and it makes me

    > cringe every time.

    Gads.  I’ve seen stuff like that, but I didn’t know it was so widespread.  Now wondering how long it will be until I get fired or censured by some employer for trying to fix that kind of code…

  33. Another example of changing the rules after the game is over.

  34. Is Backward Compatibility Holding Microsoft Back
