How would you solve this compatibility problem: Network interoperability


Okay, everybody, here's your chance to solve a compatibility problem. There is no answer yet; I'm looking to see how you folks would attack it. This is a real bug in the Windows Vista bug database.

A beta tester reported that Explorer fails to show more than about a hundred files per directory from file servers running a particular brand of the file server software. The shell and networking teams investigated the problem together and tracked it down to the server incorrectly handling certain types of directory queries. Although the server claims to support both slow and fast queries, if you try a fast query, it returns only the first hundred or so files and then gives up with a strange error code. On the other hand, if Explorer switches to the slow query, then everything works fine. (Windows XP always used the slow query.) Additional data: An update to the server software was released earlier this year which claims to fix the bug. However (as of this writing), all of the vendor's distributors continue to ship the buggy version of the driver.

What should we do? Here are some options. Choose one of the below or make up your own!

Do nothing

Make no accommodation for this particular buggy protocol implementation. People who are running that particular implementation will get incomplete directory listings. Publish a Knowledge Base article describing the problem and directing customers to contact the vendor for an updated driver.

Advantages:

  • Operating system remains "pure", unsullied by compatibility hacks.

Disadvantages:

  • Customers with this problem may not even realize that they have it.
  • Even if customers notice something wrong, they won't necessarily know to search for the vendor's name (as opposed to the distributor's name) in the Knowledge Base to see if there are any known interoperability problems with it.
  • And even if the customer finds the Knowledge Base article, they will have to bypass their distributor and get the driver directly from the vendor. This may invalidate their support contract with the distributor.
  • If the file server software is running on network attached storage, the user likely doesn't even know what driver is running inside the sealed plastic case. Upgrading the server software will have to wait for the distributor to issue a firmware upgrade. Until then, the user will experience temporary data loss. (Those files beyond the first hundred are invisible.)
  • If the customer does not own the file server, the best they can do is ask the file server's administrator to upgrade their driver and hope the administrator agrees to do so.
  • Since Windows XP didn't use fast queries, it didn't have this problem. Users will interpret it as a bug in Windows Vista.

Auto-detect the buggy driver and put up a warning dialog

Explorer should recognize the strange error code and display an error message to the user saying, "The server \\servername appears to be running an old version of the XYZ driver that does not report the contents of large directories properly. Not all items in the directory are shown here. Please contact the administrator of the machine \\servername to have the driver upgraded." (Possibly with a "Don't show this dialog again" check-box.)

Advantages:

  • Users are told why they are getting incomplete results.

Disadvantages:

  • There's not much the user can do about the incomplete results. It looks like a "Ha ha, you lose" dialog.
  • Users often don't know who the administrators of a file server are, so telling them to contact the administrator merely leads to a frustrated, "And who is that, huh?", or even worse, "That's me! And I have no idea what this dialog box is telling me to do." (Consider the network attached storage device.)
  • The administrator of that machine might have his/her reasons for not upgrading the driver (for example, because it voids the support contract), but they will keep getting pestered by users thanks to this new dialog.
  • Since Windows XP didn't use fast queries, it didn't have this problem. Users will interpret it as a bug in Windows Vista.

Auto-detect the buggy driver and work around it next time

Explorer should recognize the strange error code and say, "Oh, this server must have the buggy driver. It's too late to do anything about the current directory information, but I'll remember that I should do things the slow way in the future when talking to this server."

To avoid denial-of-service attacks, remember only the last 16 (say) servers that exhibit the problem. (If the list of "known bad" servers were unbounded, then an attacker could consume all the memory on your computer by creating a server that responded to a billion different names and using HTTP redirects to get you to visit all of those servers in turn.)
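A minimal sketch of such a bounded list, assuming a least-recently-used eviction policy and case-insensitive server names. The class name `KnownBadServers` and its methods are hypothetical and are not part of Explorer or the network client.

```python
from collections import OrderedDict


class KnownBadServers:
    """Bounded set of servers whose fast queries are known to fail.

    Hypothetical helper: the capacity of 16 follows the figure suggested
    above; the eviction policy (least recently used) is an assumption.
    """

    def __init__(self, capacity=16):
        self._capacity = capacity
        self._servers = OrderedDict()

    def mark_bad(self, server):
        """Record a server that returned the strange error code."""
        key = server.lower()
        self._servers[key] = True
        self._servers.move_to_end(key)
        # Evict the oldest entries so memory use stays bounded.
        while len(self._servers) > self._capacity:
            self._servers.popitem(last=False)

    def is_bad(self, server):
        """Return True if the server should be queried the slow way."""
        key = server.lower()
        if key in self._servers:
            self._servers.move_to_end(key)  # refresh its position
            return True
        return False
```

Because the list never holds more than `capacity` entries, a hostile server answering to a billion names can only push legitimate entries out of the list; it cannot exhaust memory.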

Advantages:

  • Windows auto-detects the problem and works around it.

Disadvantages:

  • The first directory listing of a large directory from a buggy server will be incorrect. If that first directory listing is for something that has a long lifetime (for example, Explorer's folder tree), then the incorrect data will persist for a long time.
  • If you regularly visit more than 16 (say) buggy servers, then when you visit the seventeenth, the first one falls out of the cache and will return incorrect data the first time you visit a large directory.
  • May also have to develop and test a mechanism so that network administrators can deploy a "known bad list" of servers to all the computers on their network. In this way, servers on the "known bad list" won't have the "first directory listing is bad" problem.
  • Since Windows XP didn't use fast queries, it didn't have this problem. Users will interpret it as a bug in Windows Vista.

Have a configuration setting to put the network client into "slow mode"

Add a configuration setting to the Windows network client to tell it "If somebody asks whether a server supports fast queries, always say No, even if the server says Yes." In this manner, no program will attempt to use fast queries; they will all use slow queries. Directory queries will run slower, but at least they will work.
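As a rough sketch, the override sits between the server's own claim and whatever the client tells programs. The names `ServerInfo`, `NetworkClient` and `force_slow_mode` below are made up for illustration and do not correspond to any real Windows setting.

```python
class ServerInfo:
    """What a server reports about itself (hypothetical structure)."""

    def __init__(self, name, claims_fast_queries):
        self.name = name
        self.claims_fast_queries = claims_fast_queries


class NetworkClient:
    """Client-side view of server capabilities, with a global override."""

    def __init__(self, force_slow_mode=False):
        # Hypothetical configuration setting described in the text above.
        self.force_slow_mode = force_slow_mode

    def supports_fast_queries(self, server):
        # With the override on, always answer "No", even if the server
        # says "Yes".
        if self.force_slow_mode:
            return False
        return server.claims_fast_queries
```

Every program that asks the client before choosing a query type picks up the setting automatically, which is why this option covers more than Explorer.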

Advantages:

  • With the setting set to "slow mode", you never get any incomplete directory listings.

Disadvantages:

  • Since the detection is not automatic, you have many of the same problems as "Do nothing". Customers have to know that they have a problem and know what to search for before they can find the configuration setting in the Knowledge Base. Until then, the behavior looks like a bug in Windows Vista.
  • This punishes file servers that are not buggy by making them use slow queries even though they support fast queries.

Have a configuration setting to put Explorer into "slow mode"

Add a configuration setting to Explorer to tell it "Always issue slow queries; never issue fast queries." Directory queries will run slower, but at least they will work. But this affects only Explorer; other programs which ask the server "Do you support fast queries?" will receive an affirmative response and attempt to use fast queries, only to rediscover the problem that Explorer worked around.

Advantages:

  • With the setting set to "slow mode", you never get any incomplete directory listings.

Disadvantages:

  • Every program that uses fast queries must have their own setting for disabling fast queries and running in "slow mode".
  • Plus all the same disadvantages as putting the setting in the network client.

Disable "fast mode" by default

Stop supporting "fast mode" in the network client since it is unreliable; there are some servers that don't handle "fast mode" correctly. This forces all programs to use "slow mode". Optionally, have a configuration setting to re-enable "fast mode".

Advantages:

  • All directory listings are complete. Everything just works.

Disadvantages:

  • The "fast mode" feature may as well never have been created: It's off by default and nobody will bother turning it on since everything works "well enough".
  • People will accuse Microsoft of unfair business practices since the client will run in "slow mode" even if the server says it supports "fast mode". "Obviously, Microsoft did this in order to boost sales of its competing product which doesn't have this artificial and gratuitous speed limiter."

Something else

Be creative. Make sure to list both advantages and disadvantages of your proposal.

Comments (200)
  1. NBC says:

    something to add to tweakui :)

  2. Jimbo says:

    Disable "fast mode" by default

    Since Vista is a new product, the last thing it needs is to arrive with "problems".

    Let people turn fast mode on and get an error message if the server has a problem.  That way, the user is in control.

  3. Scott says:

    Couldn’t you auto detect the problem and refresh the list using slow mode?  This way you don’t have to keep a list.  The downside would be that the file list would be retrieved even slower than slow mode.

  4. Automatically fix results by rerunning query in slow mode for explorer.  Log an event in the system or application event log once per whatever indicating the problem and linking to the knowledgebase article.

    Advantages:

    Everything works

    Administrators who monitor event logs will become aware of the problem and can fix if desired.

    Similar behaviour to windows XP.

    Disadvantages:

    Difficult to implement?

    Hack there just for compatibility

  5. David says:

    Can’t "Auto-detect the buggy driver and work around it next time" be modified such that when Explorer detects the error code it immediately asks for the directory listing using the slow mode again, therefore avoiding the display of any truncated list ever? Something like:

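    string[] GetDirList()
    {
      // Try the fast query first; the out parameter reports the strange error.
      bool weirdErrorHappened = false;
      string[] fileList = GetFileListUsingFastMode(out weirdErrorHappened);
      if (weirdErrorHappened)
      {
         // The fast listing is truncated, so discard it and requery the slow way.
         fileList = GetFileListUsingSlowMode();
      }
      return fileList;
    }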

  6. tony says:

    have a filter in the network stack that can override slow/fast for all apps!

  7. David says:

    Ok, I guess you need to let us know, at what point you get the error code that allows you to detect the bad server. I take it now that this only happens AFTER some app might have received the first parts of the file list during enumeration?

  8. Tim says:

    Perhaps this sounds a little harsh, but i’m a fan of the Do Nothing method.

    I personally feel (and feel free to tell me I’m wrong) like it’s problems like these that lead to the Windows codebase being so hard to maintain for backwards compatibility.

    Where do these "hacks" to the windows code stop? Why isn’t the responsibility put on the 3rd party vendors to make their software work properly?

    I think the fact that you’ve recognized the problem, found out that there IS a solution (the update from the vendor) and could put out a KB article describing the problem and solution in detail is PLENTY for Microsoft to be responsible for. And in fact, I’d argue it’s probably a lot more than other vendors might do…

  9. David, Will, Scott: David’s suspicion is correct. The shell has already returned a partial file list to its caller by the time the error is detected. An application calls IShellFolder::EnumObjects and the shell issues a fast query. Each time the application calls IEnumIDList::Next, the next result is returned. After returning about 100 items, oops, it turns out that the server is one of those bad servers whose fast query is broken. Now what? It can’t go back in time and “un-return” all those items that it had returned up until now in response to IEnumIDList::Next…

    Sorry I didn’t explain why you can’t “refresh”. I didn’t think it was important.
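The situation in this reply can be modelled with a small sketch, assuming a streaming enumerator that fails partway through. The names `FastQueryError`, `fast_query` and `consume` are invented for illustration; the real interfaces are IShellFolder::EnumObjects and IEnumIDList::Next.

```python
class FastQueryError(Exception):
    """Stands in for the strange error code from the buggy server."""


def fast_query(names, limit=100):
    """Yield names one at a time, failing after `limit` items like the buggy server."""
    for index, name in enumerate(names):
        if index >= limit:
            raise FastQueryError("server gave up partway through")
        yield name


def consume(enumerator):
    """A caller that acts on each item as soon as it is returned."""
    seen = []
    try:
        for name in enumerator:
            seen.append(name)  # already handed over; cannot be un-returned
    except FastQueryError:
        return seen, False  # partial listing
    return seen, True  # complete listing
```

By the time the error surfaces, the caller already holds the first hundred or so items, so quietly starting over would hand it those same items a second time.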

  10. Moi says:

    Considering Vista apparently won’t ship this decade, why are you worrying about it?

    If you must create bugs in your own code, do what David suggests, otherwise tell the vendor to fix their distribution. Of course, instead of putting the message in a message box you could always put it in a log which would be monitored by sysadmins. Nah…

  11. David says:

    Also, I don’t like any of the options that show specific UI for that to the user (a la "contact your admin"). But what would be nice is something in the event log. So that at least in well managed environments admins would get tipped off (without involvement of the user).

  12. Steve Thresher says:

    Why not contact the vendor if you know they have a problem?

  13. BryanK says:

    Depends on what exactly the other server is.

    If it’s Samba, for instance, then report it as a bug with Samba (or fix it yourself and give them the code).  Users will eventually upgrade, and the problem will eventually go away.  But this is only an option if the server is open-source.

    If the server is not open-source, then there’s really nothing you can do but maintain the back-compat hack database that none of the commenters here seem to like.  ;-)

    As for "an update to the server software was released that claims to fix the bug":  Does it actually fix the bug?  Presumably you have a testcase that shows the issue; have your testers tried applying the update to see if the problem is still there?  If it does work, then whether or not people have applied the update right now, they will eventually.  And once they do, the back-compat hack is useless.

    (And an open-source program like Samba will get the fixed version rolled out to more places more quickly, as well — not least because the average Linux distro seems to update packages much more frequently than the average closed-source OS.  For instance, Debian updates (on average) 5-6 packages every day: they will eventually pick up this fix, if it works as advertised.  Users who have auto-update mechanisms in place (which is most of them) will then get it installed.  Yes, NAS boxes are still an issue — maybe contacting the known vendors of NAS boxes using this server would be an option.  Not sure on that though.)

  14. Mike says:

    If the vendor is in Europe, refer it to the European Commission for advice.

  15. DrPizza says:

    The first question that must be answered is, just how much difference is there between fast mode and slow mode?

    If it’s actually worth having fast mode, then create a new return code from the find file API for "server gave up and needs you to start again".  The network driver need then only remember the error occurred for that session (and mark the session as "slow"), and Explorer will know to requery the directory.  Maybe log to the Event Log at the same time, and maybe have a mechanism to globally force the driver to slow mode.

    Advantages:

    It uses fast mode if it can

    It uses slow mode if it must

    It doesn’t need any persistent memory

    Disadvantages

    Needs adding a new error code which not all apps will deal with; "legacy" apps which don’t know they need to resubmit will just be left with a broken find, especially as they probably don’t check error codes properly

    Will slow down the first directory listing of a broken server
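A rough sketch of this proposal, assuming a per-session flag and a new error that tells the caller to start over. `RestartEnumeration`, `Session` and `list_all` are hypothetical names, not existing Windows APIs.

```python
class FastQueryError(Exception):
    """Stands in for the strange error code from the buggy server."""


class RestartEnumeration(Exception):
    """Hypothetical new error: the listing is incomplete, start again."""


class Session:
    """One connection to a server; remembers the failure for its lifetime only."""

    def __init__(self, fast_query, slow_query):
        self._fast_query = fast_query
        self._slow_query = slow_query
        self.slow_only = False  # nothing is persisted beyond the session

    def list_directory(self):
        if self.slow_only:
            yield from self._slow_query()
            return
        try:
            yield from self._fast_query()
        except FastQueryError:
            # Mark the session as slow and ask the caller to requery.
            self.slow_only = True
            raise RestartEnumeration() from None


def list_all(session):
    """A caller that understands the new error and requeries once."""
    try:
        return list(session.list_directory())
    except RestartEnumeration:
        return list(session.list_directory())
```

A legacy caller that does not catch `RestartEnumeration` is left with a failed listing, which is the first disadvantage listed above.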

  16. Adam Gates says:

    DO NOTHING!!!

    Changing your product to fix someone else’s problem? You are just asking to open another bug or security hole.

    AND… Why not work with the offending vendor and offer to pay them (for their time) to fix their product?

  17. DrPizza says:

    (and if it’s a problem with the client-side driver, simply refuse to let the user load it, saying it’s unsigned or incompatible or some other similar complaint)

  18. Bart says:

    Scrap ‘fast’ mode and add an autodetectable ‘faster’ mode and fallback to ‘slow’ mode when it isn’t available.

  19. David says:

    Ok, so clients call IEnumIDList::Next numerous times to fetch all the names of the files. During one of those calls you guys realise "damn, we can’t list the complete dir, because we got the strange error back". Isn’t that just VERY similar to say the network connection being dropped in the middle of the enumeration? So, why not return some error from IEnumIDList::Next that would have the meaning "Sorry, can’t retrieve all items, due to network problem". Then, if an app wants to try again, you could use your known bad server list to use slow mode the next time.

  20. Anonymous says:

    Tremendous hack: Do the fast query, remember the files returned (up to 100, or the maximum number of files that a bugged fast query can return). If it fails, do the slow one, and don’t pass the files already reported.

    If the server is known to work well, don’t remember the returned files anymore for that server. If the server is known not to work, use slow mode always for that server.
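One way to picture this hack is the sketch below, which assumes files are identified by name and that both queries can be restarted from the beginning. `enumerate_with_fallback` and `FastQueryError` are hypothetical names.

```python
class FastQueryError(Exception):
    """Stands in for the strange error code from the buggy server."""


def enumerate_with_fallback(fast_query, slow_query):
    """Yield each name once, switching to the slow query if the fast one fails."""
    reported = set()  # names the caller has already received
    try:
        for name in fast_query():
            reported.add(name)
            yield name
        return  # the fast query completed; nothing more to do
    except FastQueryError:
        pass
    # Restart with the slow query and skip what was already reported.
    for name in slow_query():
        if name not in reported:
            yield name
```

This sketch remembers every reported name, whereas the comment suggests capping that memory and dropping it for servers known to be good. It also does nothing about a file that was reported by the fast query and then deleted before the slow query ran.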

  21. Adam Gates says:

    So Do nothing BUT engage the vendor to a get real solution!

    Example of failure: Remember the issue with Write Back Caching and Exchange 5.5?

    Compaq and Microsoft could not agree on a standard on how to report the way memory was written to the disk from cache on the controller card.

    Exchange kept expecting to find the data in the last place it wrote the data (the cache) but the data had already gone to disk.

    If Microsoft had engaged Compaq just a little more this could have been fixed by Compaq.

    Microsoft needs an escalation process to work with the outside Vendors to fix their issues.

  22. Tim Farley says:

    I second the motion on the modified "work around it next time" scenario that David proposed in the fifth comment.

  23. Kristoffer says:

    Hacking just explorer would be my recommended option; leave the poor network driver unsullied.

    First we’ll need a new option in explorer which has three states:

     1. Use slow queries

     2. Use fast queries

     3. Not set yet (this is the default)

    Dropdown only needs 2 entries and should be blank if it has not been set yet.

    If the option has not been set yet explorer auto detects a truncated listing and automatically changes the option to use slow mode and re-issues the query in slow mode. All queries from now on will be done in slow mode.

    If the users change this option to using fast mode then they will see truncated results on buggy servers and a dialog box warning them of such. The dialog box is acceptable in this case since the user had to manually turn on fast queries (if the server claims to support them of course). Explorer can now note this as a bad server and retain X number of bad servers where it should use slow queries instead of fast ones.

    Online help for this option should explain why this option may have been set to use slow queries by the system.

    Disclaimer: This is all theoretical. I don’t know if it’s possible for explorer to re-issue a query without causing issues internally which would invalidate part of my solution.

  24. AC says:

    Idea 1:

    Isn’t it possible to detect the version of the server, and afterwards to produce slow queries to servers which are known to make errors?

    Since it’s something like "get first/get next" you are supposed to exchange more messages with the server so you can start the session by first asking the version.

    Another way around, if you can’t detect "old bad" can you detect "new fast" servers? Then only those should be asked fast.

    Idea 2:

    If previous is not possible (why? most of the servers identify their version, or their version can be easily recognized) then even if you have some "get first/get next" you can still "revert to slow" after the error. There’s a chance that the "list" changed, but if the first 100 entries still exist, you can still deliver the whole list without the error probably much more often than in any other scenario.

  25. Riva says:

    In my opinion, it also depends on how common the particular server is. If it’s going to cause a problem for say (arbitrary low ‘insignificant’ number goes here) of Vista users it’s different than something that affects half your users.

    I’d also vote for "Do nothing except issue a KB" and try to persuade the company/organization to issue their equivalent of a KB to let their users know they need the fix. Less code paths, less testing, more time available to test your own product rather than fix a problem that isn’t your fault.

    Other than that I would go for "Auto-detect the buggy driver and work around it next time" (+ logging in the event log) except there should be some kind of feedback for the user that lets them know the current operation failed.

    Raymond: what keeps IEnumIDList::Next from returning say E_FAIL which causes Explorer to clear the items it received so far and pop up a dialog box saying the operation failed? No results are better than inaccurate ones.

  26. Brad Corbin says:

    Great real-world question, Raymond.

    The fix / workaround is going to depend pretty highly on the technical details of how things work under the covers.

    My two initial ideas (both mentioned by others above) were:

    1) Before attempting a Fast Query, poll the server OS to see if it falls in a list of versions with this known issue. If it does, use Slow Query instead. Maybe store this on a temporary "good" and "bad" list so later calls to a known server can skip the check.

    Advantages:

    It just works

    Disadvantages:

    Is this even technically possible? Is there a way to identify this system besides actually getting the error itself?

    Does this extra call make the Fast Query call SLOWER than just using a Slow Query to begin with?

    2) If you get the "weird error" using a fast query, retry immediately with a slow query. Remember that server on a "slow list" so retries use the slow query from now on.

    Advantages:

    It just works

    Disadvantages:

    Is this even possible?

    From your response to the comments above, it sounds like this doesn’t really work: that an application doesn’t do a "getFolderContents" (or whatever) call, but instead does a recursive call to FindNextFile. So any fix along this line would really have to be on the application level. You could fix Windows Explorer, of course, but you have no control over the implementation details of all the millions of other apps out there.

    What happens if you do a slow query FindNextFile immediately after receiving an error? Does it "reset" and give you the first file in the directory again? If it does, then you’re screwed. If it really does give you the correct next file, then idea #2 above should work.

    3) Put the above fix into Windows Explorer, and let FindNextFile calls by other applications always use the slow query.

    Advantages:

    Makes Explorer faster than competitors

    It just works

    Disadvantages:

    Heh. Don’t let anyone see you do this! :)

    Actually, that gives me another idea:

    4) Always do a slow query when FindNextFile is called. Make a new API called FindNextFileFast that does the same thing in Fast Query mode, BUT also returns a new error code if this particular problem (or some other similar one) is encountered. Retool the Explorer application with the proper logic to use the fast version when possible, but fail over to the slow version if the error is encountered. Let the application worry about issues like keeping a list of bad servers, etc.

    Advantages:

    Pushes the logic up a layer to the application

    Old apps will just work the way they did in XP (using slow query mode)

    Newer apps can be tooled with the proper logic to failover correctly, and will work faster.

    Disadvantages:

    Probably the most difficult technically

    Requires rewriting some deep, important parts of Explorer

    The Windows API for Vista is probably already locked, this may not be a valid option.

    Please keep us posted on this one, Raymond!!!

  27. Me personally, I like "Have a configuration setting to put the network client into ‘slow mode.’"  The idea of auto-detecting this error makes me cringe.  What if you auto-detect it wrong, then what happens?  What if in the future the error code you’re looking for ends up being a valid code, now your "workaround" may cause problems.

    You could go with "Disable ‘fast mode’ by default" but I don’t like the idea of dumbing it down to the lowest common denominator just because a minority have a problem.  I’d rather see the minority use a workaround rather than the majority use a workaround to enable the full product capabilities.

  28. derek says:

    "The shell has already returned a partial file list to its caller by the time the error is detected. An application calls IShellFolder::EnumObjects and the shell issues a fast query."

    Why not, if the error that the server returns is obvious enough, then issue the slow query and don’t return the items that have already been sent to the app?

  29. Andy says:

    Raymond,

    is it possible to modify IEnumIDList::Next? I know you shouldn’t handle such a problem on that level, but you make the method behave normally until you get the error after file #100. Instead of returning the error, you’d recreate the file list using IShellFolder::EnumObjects (but in slow mode), run ::Next a hundred times and return item #101.

    Advantages

    – Works on all clients

    – Doesn’t bother the user with bugs/contact some administrator

    – Doesn’t cause problems on other file servers, so they can take full advantage of the new fast query.

    Disadvantages

    – Ugly hack on the "wrong" level in the API

    – Will probably cause a performance hit when querying the problem servers, since the first 100 files will be queried twice (in fast and slow mode).

    I don’t think the last disadvantage is such a problem: most people probably won’t notice, and you can still post a mskb article about the correct drivers to "improve performance". Bad performance is still better than missing files.

  30. Mike Swaim says:

    I’d take several approaches. I’d default to fast, but have a configuration setting to set the client to always use slow mode.

    FindFirstFile has an unused parameter. (Yes!) Use it to specify whether to use fast or slow mode. (Default to slow mode if the parameter is set to 0.)

    Create a new error code to say that the server can’t handle fast mode. If FindNextFile encounters this bug, it returns the error code, and sticks an event in the event log.

  32. vsz says:

    The "Do nothing" approach would do a big favour to the world and to MSFT itself, especially in the long run. Let those vendors and developers fix their own stuff.

    If this cannot work, I’d choose this:

    "Have a configuration setting to put the network client into slow mode"

    + Put an Event Log entry from Explorer if the bug is detected.

    And finally: Document your protocols, so that 3rd parties could possibly avoid these kinds of bugs in the first place.

  33. Antgiant says:

    I personally am a fan of the "Auto-detect the buggy driver and put up a warning dialog" method although I would suggest also creating an event log message each time the error occurs.

  34. Gene says:

    Obviously all the alternatives suck, otherwise you would have picked one and not asked!

    I can’t see why there’s two different "speed" queries in the first place. Is the "fast" query really that much faster? Why? What does it short-circuit? Does the "slow" query just do "FOR I IN 1 TO 100; NEXT I" after each disk read? :)

    Seriously though, what sort of performance hit are we seeing here? I guess it must be significant, otherwise you’d just not use fast queries and be done with it.

    It seems nothing really uses the fast query anyway, since the issue is showing up so late in the game.

    I guess the best action would be "disable fast mode by default" since it’s just plain buggy and nobody seems to care to beat enough ass to get it fixed… otherwise you’d pop up a window saying "get yer buggy drivers fixed"

  35. Brian Stanton says:

    Would MS ever consider moving towards a ‘certified works on Vista’ model?  Only drivers/code/apps/components etc that pass MS tests get the certification. This buggy driver would obviously not get certified.  Perhaps if an uncertified component was installed, Vista could display that info to the user in an obvious but unobtrusive manner.

    Getting back to this specific instance, I would think most users would favor correctness over performance.

  36. Joe Butler says:

    A complete directory listing is desirable over an incomplete listing.  An error message should not be presented to the user.  Windows should handle the problem transparently so that users get what they expect in terms of the data – forget the speed issue.  If speed is a problem for the user, they’ll find the kb article detailing the issue and apply the appropriate patch or force Windows to override the compatibility fix.  Just remember, those missing files could represent mission critical data where limited testing did not test with more than the threshold number of files.

    So, why can’t Windows internally issue a new slow request and merge the slow results with the incomplete results when it receives the fast error message – the app waiting for the next enumeration may stall for a short while as the new results are returned and merged (but only after file c. 100).  At this particular point, all files identified in the internal slow request as not appearing in the incomplete fast list are appended to the, as yet unterminated, original query.

    FindFirst… (1)

    FindNext… (2)

    FindNext… (3)

    ..

    ..

    FindNext…

    [internally, error returned from server to Windows] (don’t tell the caller – issue an internal slow request now)

    ..

    [internal (got that one)]

    [internal (got that one)]

    [internal (a new one)] FindNext(101)

    [internal (got that one)] FindNext(102)

    [internal (got that one)]

    ..

    ..

    [internal (a new one)] FindNext(n)

    [internal (no more files)] FindNext (no more files)

    Benefit.

    XP users are used to slow network listings and can continue to work without issue when using Vista.

    Scheme seems to be backwards compatible for all applications using the internally-patched FindFirst.. FindNext… APIs.

    Users don’t appear to ‘mislay’ data.

    Microsoft don’t get reports of ‘Vista can’t see my files’.

    Downside.

    Vista is not as fast as the developers know it could be.  

    Competing OSes that don’t care about the end user will win in directory listing benchmarks.

  37. Justin Bowler says:

    I would do exactly what Andy suggests.

    Admittedly it is an ugly hack, but it’s already in an error handling routine so that’s not that big an issue. In general ugliness is acceptable when you are already in the "Something bad happened" state.

    The requery might have to be a little smarter than just "loop 100", since the returned list may have changed between the two calls.

    This is really the best fix, as it allows fast mode to be enabled, and allows it for others. It also provides a nice "forward compatibility" so that when the file server is fixed things start working fine.

  38. PaulM says:

    Am I the only one who is breaking into a cold sweat reading the majority of these suggestions?

  39. Christian says:

    ‘ Have a configuration setting to put the network client into "slow mode" ‘

    is the right thing to do!

    When we upgraded to XP Sp1 (hope I remember correctly) our Samba server made problems: Clients could not join the domain anymore.

    We had to prepare the image with the "RequireSignOrSeal" registry key.

    Overall this was not a big problem.

    Especially consider that people will roll out Vista (while the old NAS still is there) and can prepare to include that fix in it!

  40. Coleman says:

    Auto-detect the buggy driver version and put-up a warning dialog with the option to re-run the query in slow-mode and the ability to always do that in this particular scenario ("Always perform this action/Don’t show again" setting).   Default to fast mode.  

    Note though that the average user won’t know what this means anyway.

  41. Jesse says:

    Just out of curiosity, why did Windows XP always use the slow mode?

  42. Marcel says:

    If Vista works around this problem transparently, then how could you force the buggy driver developer to fix the driver? What is the point of having this fast query if it can’t be used?

  43. Anonymous Coward says:

    The only right solution is to tell the developer and not change your code.  The developer then has several months to address this with their customer base.  If they don’t do a satisfactory job by the time Vista ships, also create a knowledgebase article saying they are broken and point to them for the solution.

    Any other route will make things a lot worse.  Any form of workaround will end up with you deciding to go in slow mode even after they have fixed the problem.  At that point it won’t be possible to tell if Microsoft is using the slow mode as an anti-competitive practise or for "compatibility".

  44. Martin says:

    I would do a combination of

    ‘Have a configuration setting to put the network client into "slow mode"’ and ‘Auto-detect the buggy driver and put up a warning dialog’

    The dialog should say something like this: The server you are accessing only returns the first 100 files when using fast queries. Do you want to turn fast queries off? Y/N.

  45. Anon says:
    1. Always perform the fast query if it is significantly faster, don’t change the OS, and if you know that not all data came back, write a warning to the log linking to the knowledge base article. Sometime in the future the bug will be fixed… but if not, the OS is performing to its capability.

    2. Make the mode (fast/slow) user-changeable, and if the data is incomplete after the first pull, force the mode to slow and inform the user of the change and the reason, with a link to the article.

  46. Michael B says:

    Detect the buggy driver, put up a warning dialog asking them to ask the administrator to upgrade the driver.

    Give the user the option to ignore the error and continue with buggy behavior, or retry in degraded mode.  Explain what degraded mode is.

    RECORD THE ISSUE TO THE EVENT LOG, ALSO REPORT WHAT THE USER CHOSE, AND RECORD HOW TO UNDO THE CHOICE.

  47. Paul says:

    The first time you get the weird error code for a particular server, let the user choose what to do: either try again (once) in Slow Mode, turn off Fast Mode for this server (i.e. add it to the local Slow Mode cache), turn off Fast Mode for all servers, or Cancel (i.e. live with the reduced list, and ignore any redirects that may allow a DoS attack).

  48. Joe Dietz says:

    I’m going to take the not-so-wild guess that we are talking about Samba here.  The Samba folks themselves are cool and are always interested in fixing problems, and it sounds like they have.  However vendors that sound like ‘DeadRat’ aren’t so cool and aren’t so interested in fixing problems.  Given how the DeadRat (and friends) community works socially, I’ll be reading on /. how this was a MSFT conspiracy to break compatibility with everybody with DeadRat installed.  Also the number of embedded systems with Samba installed is scary, some of which have very old versions of Samba that the Samba team would probably rather went away as well.

    Perhaps some sort of additional attribute could be added to the server code.  If this attribute is present (one that, say, identifies the server as W2K3ServerR2SP1), use the fast query mode; if not, act like XP.  Someday the Samba folks will implement this new attribute as well, and then all ‘distributors’ of Samba, when they pick up the fixed Samba bits, will automatically start using the fast query mode.
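
    A minimal sketch of that opt-in attribute idea, in Python. The flag names and `choose_query_mode` are hypothetical and do not correspond to real SMB capability bits; the point is only that fast mode is used when the server explicitly advertises the new attribute.

```python
from enum import Flag


class ServerCaps(Flag):
    NONE = 0
    LEGACY_QUERIES = 1       # what every existing server already advertises
    FAST_QUERY_VERIFIED = 2  # hypothetical new attribute from the comment


def choose_query_mode(caps: ServerCaps) -> str:
    """Use fast mode only when the server advertises the new attribute."""
    if caps & ServerCaps.FAST_QUERY_VERIFIED:
        return "fast"
    return "slow"  # act like XP


old_server = ServerCaps.LEGACY_QUERIES
fixed_server = ServerCaps.LEGACY_QUERIES | ServerCaps.FAST_QUERY_VERIFIED
```

    Under this scheme `choose_query_mode(old_server)` stays on the slow path, and a server picks up fast mode only after its vendor ships bits that set the new flag.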

  49. Jacobo says:

    (joke) Exploit the bug and make the server crash. Then only 100 items will have appeared because the server crashed :-)

    Seriously, what I’d expect Windows to do is to detect the error, repeat the query in slow mode, fast forward the 100-items-or-so behind the scenes, and replace the iterator under the program’s nose :-)

  50. Ilya Birman says:

    (Sorry, haven’t read all the comments above.)

    Make the network client, not Explorer, automatically re-query dir list in slow mode when the problem is detected. This should be 100% transparent to every program including Explorer itself.

    AND

    Store last "16" (I would suggest more) buggy servers not to issue the first fast query at all, to make it all work as fast as you can.

    Advantages:

     – everything just works

     – it works faster than XP

     – user is not asked questions he has no idea about

    Disadvantages:

     – buggy servers, if "17" or more, work even slower than they would if slow mode had always been enabled (hey, but it’s a buggy server, but it still works – cool!)

     – you need to use couple of bytes of windows registry to store the "16"-server list

    As far as I can see, advantages beat out the disadvantages completely.

    Oh, and I’m not sure about implementation issues regarding my solution, just have no idea :-)
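
    For what it’s worth, the "last 16 buggy servers" list could look something like this sketch. `SlowServerCache` and its methods are made-up names, server names are assumed to compare case-insensitively, and persistence to the registry is left out.

```python
from collections import OrderedDict


class SlowServerCache:
    """Remember the most recently seen buggy servers, up to a fixed limit."""

    def __init__(self, capacity: int = 16) -> None:
        self.capacity = capacity
        self._servers: "OrderedDict[str, None]" = OrderedDict()

    def mark_buggy(self, server: str) -> None:
        key = server.lower()  # assumption: names are case-insensitive
        self._servers[key] = None
        self._servers.move_to_end(key)  # most recently seen goes last
        if len(self._servers) > self.capacity:
            self._servers.popitem(last=False)  # forget the oldest entry

    def should_skip_fast(self, server: str) -> bool:
        return server.lower() in self._servers


cache = SlowServerCache(capacity=16)
for n in range(17):
    cache.mark_buggy(f"NAS{n:02d}")
```

    After the seventeenth server is recorded, the first one drops off the list and would go through the fast-then-slow retry again, which is the "17 or more" disadvantage listed above.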

  51. Adam says:

    How would the HTTP redirect DOS attack work?

    HTTP doesn’t have a "fast mode", so being redirected through a billion web pages can’t cause this bug to be tickled. It’s a different protocol to the one that we’re talking about here, isn’t it??

    Besides, if an app/the HTTP stack doesn’t have an HTTP redirect limit that stops you being redirected a billion times anyway, you’ve already got a DOS vulnerability on your hands.

  52. MSDN Archive says:

    My vote:

    Explorer should recognize the strange error code and display an error message to the user saying, "The server \\servername returned an unexpected error.  Contact the administrator of \\servername."

    Then, of course, have a KB with THAT EXACT ERROR TEXT in it that people could use to find the root cause of the problem and get the fix.

  53. Michael B says:

    Yeah, let me underscore that regardless of which camp wins (user’s blissful ignorance camp vs. punish for bad drivers camp), full logging of this condition is key.  It would be of enormous help to the admin who is going to get a vague enough problem report from the user one day if they check the event log and it has useful information about the condition.

  54. Anders Munch says:

    My choice would be to auto-detect, then pop up an error dialogue explaining there’s a bug in the server.  In the same dialogue, propose to work around it by switching a global setting to "slow mode".

    Advantages:

    – Simple to implement.

    – Doesn’t hide the bug, and clearly assigns responsibility.

    – Provides a practical workaround when the problem appears.

    Disadvantages:

    – A one-time nuisance dialogue.

    – Doesn’t allow for the use of fast mode with good servers, if the user has met just one bad one.

    – Users will tend to forget the setting is changed and never use fast mode again, even if the server is fixed.

    About those clever schemes people are suggesting, where you try fast mode first, then automagically fall back on slow mode?  Well, it just ain’t worth the complexity. Remember that such workarounds tend to stick around longer than the original bug.

  55. James W. says:

    Solution below is way too complicated for reality but maybe it will inspire something else.

    Default to slow mode to maintain accuracy of the Directory Listing. That 101st file that is inaccessible is unacceptable!

    If an error dialog and Event Logging are used as well as disabling Fast mode per Network drive maybe the default can be fast mode. (Probably usability test the difference.) If fast mode is the default the user is more likely to see this situation.

    Then allow the user to configure the setting similar to "Optimize for Removal" that is available for USB drives. This can be available on the "General" Tab of network drive Properties. The wording would be similar to the text used on the "Troubleshoot" tab of Advanced Display Properties. "Disabled all accelerations. Use this setting only if your computer (…) has (…) severe problems". "All accelerations are enabled. Use this setting if your computer has no problems. (…)"

    Otherwise maybe another feature in TweakUI is more appropriate.

    The first time the error value is received then fast mode is disabled with a message written to the Eventlog. The user should also get an "OK" style dialog because that 101st file may not be visible. Explorer should then Refresh the directory list using slow mode.

    Disadvantage:

    This really is a full compatibility hack.

    Multiple Code paths complicate testing.

    The setting goes stale if a drive letter gets assigned to a different drive.

    Advantage:

    Data integrity.

    User confidence. They will be protected as much as possible from data loss.

    Low annoyance. (Especially defaulting to slow mode.)

    Don’t write to the eventlog for every occurrence because the option is disabled automatically.

  56. Christian says:

    Raymond, can you please explain a little bit more what this fast mode is?

    Can I as developer also use it?

    Is there anything magical about the Win32 FindFirst, findnext things?

    Or the .NET  Directory.GetFiles?

  57. Peter Ritchie says:

    I would have to agree with the comments suggesting/agreeing-with "Do nothing".

    It’s noble to want to ensure a reliable user experience; but, when you’re dealing with third-party software, at what point do you "draw the line".  By adding something to compensate for buggy third-party software not only do you increase the code base with code that (hopefully) will eventually be entirely unrequired, you also bloat the code base with code that not all customers will need.

    You’re also setting a policing precedent suggesting that Explorer will work with all drivers even if they are buggy.  Do you really want that weight on your shoulders?  What about third-party software bugs you can’t compensate for?  Should a customer be expected to understand that you tweaked explorer for one bug and not another?

    I would suggest the "Do nothing" approach, continuing to detect and inform the user of errors.  If a user encounters the error they can find a KB article that describes the error and what to do about it.  Then, have the ability to force explorer into slow mode via some configuration option.

    What about situations where the error occurs and has nothing to do with fast-mode?

    Compensating for a third-party bug is just a nightmare: design a fix, create unit tests, create test cases, test the fix, perform full regression testing with the fix in non-failure situations to make sure the fix didn’t affect anything else, manage the fix, create a task to re-address the fix at a later date when it’s deemed all third-party software is stable, remove the fix, perform full regression test (again) to make sure removal didn’t break anything, etc. etc. etc.

    "Give me the strength to accept the things I cannot change and change the things I can and give me the wisdom to know the difference."

  58. Adam says:

    Depends.

    If the buggy server is common (say, has >10% marketshare of servers speaking this protocol in same price range?) then make slow mode default with a configuration to turn fast mode on. It’s the third rule of optimisation – Make sure what you have is correct. Then make it fast, if making it fast doesn’t make it less correct.

    If the buggy server is not common, do nothing.

    Of the remaining options, Autodetect + warning dialog + syslog message + auto-work-around next time would be next, because it’s not that bad an option.

    Making the user dig around to make things work properly just sucks. It should work properly by itself, and if it notices it’s not working properly then it should notice and tell you and ask you there and then (which is the previous option)

    Doing that per-application is the worst, as the user now has to dig around all over the place to get all their apps to do the right thing, and some apps won’t even have that option, and those that do might have a buggy "go slow" switch, etc, etc, etc…

  59. Dave Oldcorn says:

    I would suggest the last option; disable fast mode by default, the logic being that since that’s how it was in XP, nobody knows any different. It’s not a good solution, but it’s the most reliable one, most people will be none the wiser, and anyone who knows and cares enough to complain will be able to fix it themselves.

    There is no chance that any of the ‘break things’ options (including the ‘break things, but allow expert users who can use the registry or search the KB to fix it’) will be acceptable if the software in question is likely to be in use by any but the smallest number of average consumers, particularly if it is inside unpatchable hardware.

  60. diegocg says:

    Blacklist the specific version of software which does this and automatically set it to "slow-mode" – and output a warning to the event log. Being a bit slower is not a critical failure; a popup warning is overkill for this kind of bug.

    But you can prepare your software to work against a buggy version using a blacklist and still get it wrong: other buggy servers may be released in the future, so autodetecting the buggy server and reconnecting/relisting the directory automatically – and adding a warning to the event log – is a must as well.

    Storing a list of recent buggy servers is stupid. What if the administrator realizes that the server is broken and fixes it? The client would get it wrong. Either do a blacklist of broken servers or detect and fix it somehow dynamically.

    And hey, the "future buggy server" may be one from Microsoft or a broken security upgrade, so you really want to autodetect broken servers regardless of the availability of broken servers anyway. It’s about robustness. (Also add registry keys to tweak all this, i.e. keys to allow clients to autodisconnect from buggy servers.)
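
    The combination described here (a static blacklist, dynamic detection, and a log entry in both cases) might be sketched like this. Everything named below is hypothetical: the `("ExampleNAS", "2.1")` entry is invented, Python’s `logging` module stands in for the event log, and `FastQueryError` stands in for the server’s error code.

```python
import logging
from typing import Callable, Iterator, List, Set, Tuple

log = logging.getLogger("netclient")  # stand-in for the event log


class FastQueryError(Exception):
    """Stands in for the server's unexpected error code."""


# Hypothetical (product, version) pairs known to mishandle fast queries.
KNOWN_BAD: Set[Tuple[str, str]] = {("ExampleNAS", "2.1")}


def list_directory(
    product: str,
    version: str,
    fast: Callable[[], Iterator[str]],
    slow: Callable[[], Iterator[str]],
) -> List[str]:
    if (product, version) in KNOWN_BAD:
        log.warning("%s %s is blacklisted; using slow mode", product, version)
        return list(slow())
    try:
        # Collect everything before returning, so partial results never leak.
        return list(fast())
    except FastQueryError:
        # Not on the list, but it failed anyway: relist and record it.
        log.warning("%s %s failed a fast query; relisting in slow mode",
                    product, version)
        return list(slow())


FILES = [f"doc{i}.txt" for i in range(150)]


def truncating_fast() -> Iterator[str]:
    for i, name in enumerate(FILES):
        if i == 100:
            raise FastQueryError("unexpected status")
        yield name


def slow_query() -> Iterator[str]:
    return iter(FILES)


blacklisted = list_directory("ExampleNAS", "2.1", truncating_fast, slow_query)
unknown_bad = list_directory("OtherNAS", "9.9", truncating_fast, slow_query)
```

    Because the whole listing is gathered before anything is returned, this version gives up the incremental FindFirst/FindNext behavior in exchange for never showing a truncated list.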

  61. Darren Stone says:

    Easy.

    If the server returns an error code, report the error.

    If you recognize the error code, provide a link to the knowledge base article that describes the problem.

    In the knowledge base article, describe possible solutions (e.g. how to disable fast mode in Explorer.)

    Case closed.

    —-

    Even non-technical users try to track down problems like this on a regular basis.  The worst thing to do is hide the fact that there is a problem.  Inform people, provide a possible workaround, and put the onus on the 3rd party to fix their software.

  62. J says:

    "About those clever schemes people are suggesting, where you try fast mode first, then automagically fall back on slow mode?  Well, it just ain’t worth the complexity."

    I don’t know whether the resulting race condition applies to this problem, but I’m not sure people suggesting this approach have considered it.  If you take 2 different snapshots at 2 different times and then combine the snapshots, you may get an impossible combination of data items.  This would happen if you make a fast query, detect the error, make a slow query, and just don’t return the items that you’ve already returned.  What if between the fast and slow query, the directory snapshot has changed?

    If your directory listing snapshot becomes invalid the microsecond after you take it, that’s fine.  But if your snapshot shows that File A and File B are existing at the same time, you may run into problems.  I don’t know enough about the problem domain to know whether this is a big deal or not.
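
    A small worked example of that concern, in Python. The file names are invented and `merge_listings` is a hypothetical stand-in for the "just don’t return the items you’ve already returned" approach.

```python
def merge_listings(fast_partial, slow_full):
    """Naive merge: keep what fast returned, append unseen names from slow."""
    seen = set(fast_partial)
    return list(fast_partial) + [n for n in slow_full if n not in seen]


# Directory contents at the time of the (truncated) fast query.
snapshot_t1 = ["draft.tmp", "notes.txt", "photo.jpg"]
fast_partial = snapshot_t1[:2]  # the server gave up after two entries

# Before the slow retry, another client renames draft.tmp to draft.doc.
snapshot_t2 = ["draft.doc", "notes.txt", "photo.jpg"]

merged = merge_listings(fast_partial, snapshot_t2)
# merged lists both draft.tmp and draft.doc, although the two names
# never existed in the directory at the same moment.
```

    The merged result matches neither snapshot, which is exactly the "impossible combination" described above.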

  63. Mark Sowul says:

    Unfortunately I have to concur with the suggestions which state to cache the first 100 and re-run the query on the 101st if necessary, in conjunction with a list of "known bad" servers.

    It should be unnecessary but the advantages definitely outweigh the downsides:

    – You get fast mode when you can

    – Only the stupid servers are punished

    – Stupid servers are not punished as badly when found in the cache

    – You’ve had to do this kind of crap before, what’s one more hack?  (e.g. the lists of known-bad optical drives, usb devices, etc that I’ve found in the registry)

    Again, it is important that it be written to the event log.

    It’s unfortunate that this crap is necessary but it’s a necessary evil if you want to play with the retarded kids in the sandbox.

  64. Mike says:

    If you get the weird error code log it in the event log (being careful not to spam the event log too much of course).  Popping up UI that nobody understands (and they won’t) sucks.

    Having done that doing nothing (else) would be best if you can get away with it.

    If not then starting fast and falling back when the error is detected is good.  However you said that was problematic because you’ve already returned partial results.  The solution to that would be to just ask for 100 items up front regardless of what the app actually asked for and then return results back from this cache.
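
    One way to picture the prefetch idea, as a Python sketch. `buffered_listing` and `FastQueryError` are hypothetical, and the `prefetch` size of 128 is an assumption: the scheme only helps if the server’s failure shows up inside the prefetch window.

```python
from itertools import islice
from typing import Callable, Iterator, List


class FastQueryError(Exception):
    """Stands in for the server's unexpected error code."""


def buffered_listing(
    fast: Callable[[], Iterator[str]],
    slow: Callable[[], Iterator[str]],
    prefetch: int = 128,
) -> Iterator[str]:
    """Read up to `prefetch` entries before releasing any to the caller."""
    source = fast()
    try:
        buffer: List[str] = list(islice(source, prefetch))
    except FastQueryError:
        # Nothing has been handed out yet, so a clean restart is possible.
        yield from slow()
        return
    yield from buffer
    yield from source  # errors past the prefetch window still reach the caller


FILES = [f"item{i}" for i in range(250)]


def truncating_fast() -> Iterator[str]:
    for i, name in enumerate(FILES):
        if i == 100:
            raise FastQueryError("unexpected status")
        yield name


def slow_query() -> Iterator[str]:
    return iter(FILES)


recovered = list(buffered_listing(truncating_fast, slow_query))
healthy = list(buffered_listing(slow_query, slow_query))
```

    Since the restart happens before the caller has seen a single entry, there is no merging of two snapshots to worry about.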

  65. B.Y. says:

    My first choice: auto-detect the buggy driver and put up a warning dialog.

    My second choice: do nothing.

    But in either case, I’d also disable fast mode in explorer (only) by default.

  66. I think the proponents of "do nothing, it’s the file server’s fault" and "show an error message to the user" do not fully appreciate the way users view Windows.  If there’s a silent data loss, it WILL be considered by most to be a bug in Windows, and if users are exposed to an error dialog they don’t understand and probably can’t fix, it WILL (rightfully) be considered an abusive interface design, regardless of the fact that both are "correct" from a strictly developer’s point of view.  I was initially thinking along the lines of "start with a fast query, and if you get the error silently switch to the slower query" but now I see that this presents insurmountable issues with the possibility that the directory has changed.

    I think the only option is to modify the API.  Calls to IShellFolder::EnumObjects using the current syntax (meaning legacy programs) should continue to use the slow queries.  As Brad Corbin suggested above, you should make a new function that allows the application to explicitly request a fast query and that explicitly defines an error for "directory listing truncated" or something (I’m not sure whether the error should be specific to this case or should describe any case where the listing can’t be completed).  It’s possible that you could accomplish this with an extra parameter to the existing IShellFolder::EnumObjects function, but this might be too much of a change in what is rightfully considered a stable API by developers.

    Advantages:

    – Does not break any existing applications

    – Does not require complex caching, more user-level configuration, abusive error messages, or re-querying of directories

    – Allows new applications to leverage the fast queries, but only on the condition that they are able to handle the possible side effects

    Disadvantages:

    – Legacy programs will not benefit from the possible speed increases in Vista (although they will not run any slower) – this could be perceived as bias on the part of Microsoft

    – Expands the API
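
    The shape of such an opt-in API might look like the following Python sketch. It is not the real IShellFolder::EnumObjects signature; `QueryMode`, `ListingTruncated`, `FakeServer`, and `enum_objects` are all hypothetical names used to show the contract: legacy callers get slow mode by default, and only callers that ask for fast mode can ever see the truncation error.

```python
from enum import Enum
from typing import List, Optional


class QueryMode(Enum):
    SLOW = "slow"
    FAST = "fast"


class ListingTruncated(Exception):
    """Raised only to callers that explicitly opted in to fast queries."""

    def __init__(self, partial: List[str]) -> None:
        super().__init__(f"listing truncated after {len(partial)} entries")
        self.partial = partial


class FakeServer:
    """Hypothetical server; fast_limit models the buggy entry cap."""

    def __init__(self, files: List[str], fast_limit: Optional[int] = None) -> None:
        self.files = list(files)
        self.fast_limit = fast_limit


def enum_objects(server: FakeServer, mode: QueryMode = QueryMode.SLOW) -> List[str]:
    if mode is QueryMode.SLOW:
        return list(server.files)  # legacy callers land here by default
    if server.fast_limit is not None and len(server.files) > server.fast_limit:
        raise ListingTruncated(server.files[: server.fast_limit])
    return list(server.files)


def list_directory_new_app(server: FakeServer) -> List[str]:
    """A new application that opts in and handles the documented error."""
    try:
        return enum_objects(server, QueryMode.FAST)
    except ListingTruncated:
        return enum_objects(server, QueryMode.SLOW)


buggy = FakeServer([f"f{i}" for i in range(150)], fast_limit=100)
legacy_result = enum_objects(buggy)
new_app_result = list_directory_new_app(buggy)
```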

  67. Anthony says:

    How about adding a parameter or overloaded version of IShellFolder::EnumObjects to implement a fast mode query, have an error code that describes the bug, and let whomever implements this version deal with the bug?

    Advantages:

    Moves the problem to the application layer, which would need updated anyway.

    Less ugly hackery.

    No worries for current code.

    Disadvantages:

    Is it possible?  (My gut says that it wouldn’t make any sense putting the option up at this layer.)

    Apps need updated to use fast queries.

    Makes one vendor’s buggy driver everyone’s problem.

  68. Richard Kuo says:

    Some basic tenets:

    Do what’s expected.

    Work by default.

    Clearly fall back to slow mode always is a nice floor for this, so we can use that.

    One question is how much "better" fast mode is than slow mode.  Obviously there must be some reason to use one mode or the other.  Is a user or application ever going to notice that slow mode is actually slow?

    If there’s no appreciable difference and fast mode vs slow mode was a design fubar, then "just stop using fast mode" is a possible solution.

    Now let’s say that isn’t an option.  Why can’t you just restart the enumeration if the fast mode operation fails?  Are you displaying the results on the fly and that’s why the enumeration can’t be restarted?  Perhaps you could buffer the results if that is the case, to allow a restart in slow mode to be possible if a specific error code is detected.  I don’t think merging the results is a good idea due to possible race conditions that the enumerator might be accounting for.

    Another idea is to use slow mode all the time initially, and probe when an opportunity presents itself for fast mode compatibility.  This would ensure compatibility and allow the speed up to occur pretty quickly.

    Can you probe the server via OS prior to any enumerations?  That might help.

    I do not believe enabling or toggling slow/fast mode via the registry is appropriate.  This requires manual user intervention and will still break certain people by default.  It’s our job as programmers and people shipping actual product to take work away from our users, not give it to them.

    In all cases, you should notify the vendor and request a fix to be made.  And add a regression test if possible.

    Richard Kuo

    http://www.rkuo.com

  69. Mike M says:

    If this software has any reasonable market share, then IMO it is absolutely not acceptable to "do nothing."  Vista would be introducing a regression.  It doesn’t matter whose fault it is, the directory listing would work with XP and not work with Vista.  That is an unacceptable result from a performance optimization.

    Now, if servers running this software are very rare, or you could work with the distributors to get the fix to customers preemptively, then it might be worth the tradeoff to "do nothing."

    Also, several people supporting "do nothing" suggested something along the lines of "get the vendor to fix their product."  Apparently they missed the fact that the bug has already been fixed, but distributors are still shipping the broken version to customers.  Getting the developer of the server software to fix the bug doesn’t magically make that fix available to every customer, for reasons mentioned in the article.

  70. Darren Stone says:

    I don’t like the way this problem has been approached.  There are really only two possible outcomes:

    Option A: The user requested a directory listing; give them their files.  This is just a simple matter of coding.

    Option B: Fail.

    The thing is, this is a largely non-technical decision that no one here is qualified to make.  It depends on all sorts of factors we know nothing about: how common the problematic server is, where it’s used, is it worth the risk to add this special case in Explorer, etc.

    Now let’s say we want to discuss option B, how to fail.  Well, this really becomes a generic discussion, doesn’t it?  The bottom line is that this is a problem for institutions that have a diverse computer network, and the solution is to make it as straightforward as possible for IT departments to diagnose and solve this network problem.  For them, that would be either disabling Fast mode on their users’ machines, or updating the problematic servers.

  71. Xander says:

    If the other software contains a bug, the onus should be on that vendor to fix their software.

    Incorporating any kind of workaround into the system means there’s more code for the system to carry forward, more code in the system that could itself contain bugs, and more behaviours to be honoured when implementing future changes.

    Ignoring CS-level OS research, the two most important developments in deployed operating systems over the last 20 years have been the move to abstraction (don’t expose the implementation, so it can be changed later) and minimising side effects (an API to do X should only do X, not X/Y/Z).

    Both Microsoft and Apple started off at one end of this spectrum, but over time both are moving to the other.

    It’s a fact of life that you can’t resolve every interoperability bug, and there will always be situations where buggy software that’s already shipped has to be broken.

    You will obtain a far more reliable system in the long run if the default policy is that broken software can be allowed to break rather than supporting it by any means necessary (there are obviously exceptions, but those should be reserved for "affects 95% of our customers" issues).

    So I would do nothing, and treat this as any other kind of error (log it so that it can be noticed, but if the server is faulty then the server is faulty).

    If you add extra code to make this particular case work, you’re just deferring the problem – the real question is what would you do if, after shipping the workaround, you discover the workaround causes a conflict with something else? Add another workaround?

    If Vista is going to be thrown away again in 5 years and rebuilt, that approach is fine. However, if it’s intended to be around for a decade or longer, bugs need to be fixed.

    You need a solid foundation for any project, and in software terms that means short term pain ("your new OS doesn’t work with my old server") for long term gain (fast mode becomes standard, and slow mode can be retired).

  72. TK says:

    The solution should be:

    -Report the error to the user when it occurs. Simply tell them the result set is incomplete and give a KB #.

    -Supply a reg key or something to use the slow (call it high-compatibility) mode. Document this in the KB or even the error message.

    -Notify the vendor(s) that you will not be fixing it in Vista.

  73. Scott S. says:

    You could always send this to your marketing department for a solution.

    I predict they’d recommend having Vista servers identify themselves as such, and allow clients to use fast querying.  Anything else is stuck in slow mode.  One more bullet point for the feature list.

    Seriously, I assume we’re talking about a Linux/Unix system that is causing the problem based on your use of the term "distributors".  

    Unfortunately you cannot be the "Good guy" in this scenario and fix it for them and get all that nice publicity for fixing a hole for them since the distributors would need to integrate your fix.

    If this is an open source system we’re discussing, you could always analyze the code and find if there’s any way to detect this particular system based on other abnormalities that may be more in line with your existing process for handling queries.  Hard to say if that would be worthwhile though.  You guys may just end up spinning your wheels taking the time to do that.

  74. whinger says:

    Are we talking about Samba here?

    If so, then let’s not forget that you guys don’t allow the Samba team useful access to the SMB documentation. Microsoft clearly don’t want to interoperate with Samba, so isn’t this more of an opportunity than a problem? The most consistent action would be to detect and blacklist any such servers, refusing to talk to them at all.

  75. guillermov says:

    I know no specifics, but let’s assume the following:

    1. Error can be detected while in slow mode.

    2. There’s no significant performance difference between the two modes while querying directories with a limited number of files.

    If that’s so, why not try the other way around? Start in slow mode, and keep using it until a query is met that could trigger the error. If the query gives no errors, mark the server as fast-mode-enabled and use fast mode in subsequent queries. And if the list can have no arbitrary limit in length, even better.

    But most probably I have no clue and I’m just speaking nonsense.
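
    Taking the two assumptions above at face value, the "start slow, then promote" idea could be sketched as follows. `ModeSelector`, `FastQueryError`, and the threshold of 100 are hypothetical, and the probe here is simply one extra fast query on a directory large enough to trigger the bug.

```python
from typing import Callable, Iterator, List, Set


class FastQueryError(Exception):
    """Stands in for the server's unexpected error code."""


class ModeSelector:
    """Start every server in slow mode; promote it after one clean fast probe."""

    def __init__(self, probe_threshold: int = 100) -> None:
        self.probe_threshold = probe_threshold
        self.fast_ok: Set[str] = set()
        self.fast_bad: Set[str] = set()

    def list_directory(
        self,
        server: str,
        slow: Callable[[], Iterator[str]],
        fast: Callable[[], Iterator[str]],
    ) -> List[str]:
        if server in self.fast_ok:
            return list(fast())
        names = list(slow())  # the caller's answer comes from the mode known to work
        untested = server not in self.fast_bad
        if untested and len(names) > self.probe_threshold:
            # This directory is big enough to trigger the bug, so probe fast mode.
            try:
                for _ in fast():
                    pass
                self.fast_ok.add(server)
            except FastQueryError:
                self.fast_bad.add(server)
        return names


FILES = [f"f{i}" for i in range(150)]


def slow_query() -> Iterator[str]:
    return iter(FILES)


def good_fast() -> Iterator[str]:
    return iter(FILES)


def bad_fast() -> Iterator[str]:
    for i, name in enumerate(FILES):
        if i == 100:
            raise FastQueryError("unexpected status")
        yield name


selector = ModeSelector()
first_good = selector.list_directory("goodsrv", slow_query, good_fast)
first_bad = selector.list_directory("badsrv", slow_query, bad_fast)
```

    The cost is one duplicated listing per server, paid the first time a large directory is seen; after that, good servers run in fast mode and bad ones are never probed again.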

  76. Brandon Bloom says:

    I think this could absolutely be done transparently (save for a mention in the event log).

    If the fast iterator fails with this weird error, replace it with a slow iterator and simply refresh the view. It seems unnecessary to record the bad server.

    Assuming fast queries really are faster, performance will be faster for small queries and the hurt for >100 file results will be unnoticeable.

  77. kbiel says:

    Raymond: I think a lot of the commenters are confused about the scenario.  Some believe the problem is with a local driver on Vista supplied by the vendor and most do not know what protocol is being used.

    Presumptions:

    The server is from another company, likely a non-OSS competitor such as Novell.  They are emulating SMB for MS clients.  SMB natively includes slow and fast modes but the competitor’s server has a bug in its SMB driver so there is nothing MS can do on the client side to fix or work around the actual bug.

    Suggestions:

    If a server reports an error to the client, display the error in a dialog.  In this dialog, include the text "The server has reported an error that may be resolved by using a slower access method.  Would you like to attempt to access the server using the slower method?" with buttons "Yes", "No" and "Cancel".  If they click "Yes", switch to slow mode, store the name of the server for future reference (but only if the server share is mounted to a drive letter; otherwise slow mode is a session only setting) and requery.

    Pros:

    -This resolves the problem for any server from any vendor that may have a bug in their fast query implementation.

    -The user is now aware of the problem and where the problem occurred (i.e. not in Vista).

    -A knowledgeable user/administrator can now contact their vendor with a complaint and the exact error text.

    -Windows has given the user an option that may fix the problem NOW from their perspective.

    Cons:

    -Nobody likes getting dialogs reporting an error.

    -Nontechnical users are more likely to press cancel initially (but will probably try yes eventually and their experience will improve).

    -On some other buggy implementation of SMB, the change in setting may not improve the user experience.

  78. Jen Kilmer says:

    This reminds me of one of the Key Differences between the NT & Windows teams.

    The NT team would be more likely to stay "pure": if they got the odd error, they probably wouldn’t return any data. Instead, an error to the extent of "Your server is incorrectly reporting the files available. Please contact the vendor for an updated version" or some such would be displayed. Or NT would have refused to work with the server anyway.

    The Windows team would take the "if it worked before the customer installed Vista it should keep working" and would likely disable "fast mode" by default.

    Have I mentioned how much less stressful life is now that I’m not on the shell team? :D

  79. vsz says:

    Let’s not overlook the fact that this problem is not something to be fixed (detected: maybe) on the app level (Explorer), as any other (future) applications will be able to use fast directory lookup, introducing the same problem for their end-users (or keeping away developers from using this feature).

    Windows/Kernel API level falls out too, because such a single vendor/version related 3rd party bug is not something to be handled by a stable API.

    So if Windows OS would like to provide a solution for this, IMO it should be solved on a lower level. Network client code looks like a good candidate here, or maybe a filter driver for those who need it. The latter can even be implemented outside Microsoft.

    …"Do Nothing" still looks the most tempting ;)

  80. Josh says:

    Make Explorer requery in slow mode if fast mode gives an error, and turn fast mode off by default for all other applications. Make an application that wants to use fast mode explicitly recognize this problem.

    Result: Explorer, the most ubiquitous file-related application, can still benefit from fast mode. Other applications that don’t know about fast mode will be just as fast (or slow) as they were before. New applications that do know about fast mode can take advantage of it.

  81. Advertise this bug all over the internet, making sure all vendors, distributors, admins and support people know about it. I bet that under those conditions distributors would switch to the new fixed version. Given that Vista won’t be released for about a year, that’s a huge amount of time.

    In short, learn how the Open Source community deals with these issues by making the problem public.

  82. Evil Otto says:

    I don’t think any scenario where the user sees an incomplete list is ever acceptable.  That is important.

    What is M.S.’s stance on that?

  83. Evil Otto says:

    My comment got eaten, I’ll try again.

    Fix:  Don’t make fast the default.  Make it something that can be enabled, by install option, SMS, whatever.  Document the hell out of the problem.  Use something akin to the HCL?

    Good:  Least surprising behavior for the user. Works the way they expect.  They NEVER see incorrect data/listings.

    Managed environments, where the "admins" have a clue about the services on the network, or "power users" who are on their own but know enough to get this stuff, can enable it on their own terms.  Also the act of enabling it can present a disclaimer of known issues with vendors/versions.

    Bad: someone has to take a bullet point off their powerpoint.

  84. Teman Clark-Lindh says:

    variant of do nothing.

    If the bad driver is already signed, blacklist it in the driver installation process so it never gets installed (or generates lots of additional warnings).

    If it isn’t signed, well, then it won’t run on Vista x64. (And I hope x32 will complain bitterly)

    Make sure this test case is added to any certification process (so any future drivers don’t get signed by MS without testing).

    For Vista to work well, we need to raise the bar on what we expect vendors to provide.

  85. Philip Beber says:

    Is the server in question running Windows? If so perhaps a patch issued through WindowsUpdate would allow windows to detect the dodgy driver and always run in slow mode.

    Advantages:

    – Everything works.

    – For 99% cases Vista is faster than XP.

    Disadvantages:

    – Assumes server is running windows.

    – Assumes server is updated regularly.

    My second choice would be for fast mode to be disabled by default. Correctness is much much much more important than performance. Remember those blue-screen-of-death t-shirts? Ever seen a t-shirt that says "My Linux server is 1.5% faster than my Windows server"?

  86. boxmonkey says:

    The problem with adding a bunch of code to handle this special situation is that the code will likely end up hanging around as long as Microsoft is still in the business of making Operating Systems.

    I don’t know what the best solution is, but the one that sounds good to me is to keep fast on by default, make it user configurable, detect the error and display a message about how to resolve it. As someone pointed out, it doesn’t have to be a modal dialogue (how annoying!), but it should be present. When Vista becomes more popular, there will be more pressure on vendors or distributors to start using the fixed version of their software. For everyone who can’t get updates for whatever reason, slow mode still works.

  87. Scot Boyd says:

    If fast query returns an error, populate the first 100 with an arrow-hourglass cursor and then do a slow query to populate the rest of the list.  Then reset the cursor. Cache the name (is 256 any more of a security risk than 16 names?) and slow query from then on.  Write the KB article showing where the server list is cached (large networks might want to prepopulate it on Vista rollouts).

    And rejoice, because for once there’s an error code that you don’t have to bubble up to the user.
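
    A rough sketch of the bounded cache of known-bad server names mentioned above, written in Python for brevity. The class name `BadServerCache` and its methods are made up for illustration; the default limit of 16 is the figure quoted in the comment, and evicting the oldest entry when the cache is full is an assumption, not something the thread specifies.

```python
from collections import OrderedDict


class BadServerCache:
    """Remembers at most `capacity` servers that failed a fast query."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self._names = OrderedDict()

    def mark_bad(self, server):
        # Server names are compared case-insensitively (an assumption).
        key = server.lower()
        self._names[key] = True
        self._names.move_to_end(key)
        # Keep the cache bounded so a hostile network cannot grow it forever.
        while len(self._names) > self.capacity:
            self._names.popitem(last=False)  # evict the oldest entry

    def is_bad(self, server):
        return server.lower() in self._names
```

    A server that falls out of the cache is simply forgotten: the next fast query against it would hit the error again and put it back.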

  88. Ilya Birman says:

    To the ones who think Microsoft should not try to work around bugs in third-party software and punish those bugmakers this way.

    You see, Microsoft’s business is not punishing someone, but making software as good as it can, for users (well, as well-sellable as it can, of course).

    Reporting wrong file listings in fast mode is a bug, no doubt. But can Microsoft do anything for user happiness? Finally, users pay money for Windows. This bug is not such a big deal Microsoft can’t work around. Why not make Windows work even with buggy servers?

    I understand that my computer is running tons of useless code, which checks for this and that to ensure everything works right. Maybe 0,1% of this code really does something for me. But for you, another 0,1% does something.

    You will be the first to shout at Microsoft when you find that Windows doesn’t work with whatsoever, even if it’s buggy.

    I, personally, support this practice of working around others’ bugs. Especially, if every hack is thought-out as thoroughly as this one.

  89. Fox Cutter says:

    Disable Fast mode by default. It’s a pain for a few people, you need extra documentation but the option will be there. It might not be worth having the fast mode because of it, but it will at least be there.

    Also keep in mind you can’t assume the 100 file rule will always be 100 files. What if the server fixes the bug, but it happens again at 400 files? If you’re looking for it to happen only with the 100th file you will be out of luck (and the user will be missing data). You have to assume if it can happen it can happen at any time, at least until it has happened and you can mark it as bad. This also means you can never know if a server is actually good or not, it’s either bad or not known to be bad, never good.

    And remember, popping up a dialog box for the user for this is a very bad thing. You get the combination of an unexpected message that means nothing to them, so the default is to cancel it (probably without reading it).

    The only really workable solution is to disable it by default.

  90. Colin Jeanne says:

    I agree with what has been said before: providing an incomplete file listing is not an option. If it is possible to get all the files to the user by requerying (even if it is a lot of work) then do so. It doesn’t sound like that is possible so the only options I see are to either return an error to the user or to not use fast querying by default.

    I prefer using slow querying by default with the option to use fast querying as a configuration setting. In future versions of Windows when servers have better support for fast querying then turn fast querying on by default.

    I believe this solution is similar to what has happened <a href="http://blogs.msdn.com/oldnewthing/archive/2006/03/21/556505.aspx">with Windows File Protection</a>. Plan for the better solution to be implemented in the future but for now implement the one that is best for today.

  91. Brad Corbin says:

    After thinking about it a bit more and reading the additional comments here, this turns out to be a pretty tricky problem.

    Up to now, I was leaning toward the option that I mentioned above, and a few other people have discussed:

    4) Always use slow mode for "standard" calls, and introduce a new API for "New Fast Mode", that is aware of this particular issue, and throws a new error message.

    Advantages:

    Old apps just work

    New apps can be coded to use fast, but failover to slow when necessary

    User isn’t faced with the issue

    The more I think about it, though, the more I think that this is a pretty ugly hack. Imagine what the programmer will be thinking 3 years from now when he’s trying to code his network-aware application:

    "Ok, I can do a standard query in slow mode, or I can add a flag for Fast Mode (from the original API), but because of one weird corner case, that perfectly good API flag doesn’t work! So to REALLY do fast mode I have to use this new flag called PleasePrettyPleaseUseFastModeIReallyMeanItThisTime?? What the heck is this??? MS sucks!!!!"

    Ah, I think I got it. In Raymond’s original article he uses the quote:

    "If somebody asks whether a server supports fast queries, always say No, even if the server says Yes."

    Ok, does that mean that the proper way to use the current API is to always ask first if the server supports fast mode?? Then it’s easy! Just change the CanIUseFastMode routine with additional checks that look for this particular problem! If there is an easy way to query the exact build number of this particular server OS, then that should be easy!

    If that kind of version query is not possible, then you’d probably be stuck. :(  With just a server name, you couldn’t test it experimentally (by reading a big folder and actually see if you hit this error), because everyone sets up their fileshares differently!

    You could test it experimentally once the application passes in the network path to query, but that sounds like a massive amount of overhead (read 101 files first to see if we hit the error, then decide whether to use the current results or re-query in slow mode).

    I have newfound respect for the work you guys do. Good luck with this one.

  92. VickM says:

    Thanks for a great interview question :)

  93. dc says:

    Dunno if anyone’s posted this yet, but I would do something like:

    – Use fast mode by default

    – If you get the weird error code, add the server to the list of bad servers and remember what the last file returned was

    – Then, re-try with the slow mode, and give the application the rest of the file listing (starting at the file after the last one returned with the fast list)

    Is that unfeasible for some reason?

    I see Wesha just posted this solution, in code nonetheless! I like it the best. But I’d also earmark it to be removed from the next version of Windows after Vista (circa, what, 2012?), cause no one likes cruft, and by 2012 the problem of dealing with one buggy version of an obscure decade-old third-party server implementation will be beyond worrying about.
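
    The three steps in this comment could be sketched roughly as follows, in Python. Everything here is hypothetical: `FastQueryError` stands in for the strange error code, and `fast_query` / `slow_query` are caller-supplied functions that return an iterable of file names. The sketch also assumes the slow query returns names in the same order as the fast one, which the thread does not confirm; if the last file seen has vanished by the time of the retry, this version would return nothing further.

```python
class FastQueryError(Exception):
    """Stand-in for the strange error code the buggy server returns."""


def list_directory(fast_query, slow_query, server, bad_servers):
    """Yield file names, resuming in slow mode if the fast query breaks."""
    if server in bad_servers:
        yield from slow_query()
        return

    last = None
    try:
        for name in fast_query():
            last = name
            yield name
        return  # fast query completed normally
    except FastQueryError:
        bad_servers.add(server)  # remember the server for next time

    # Re-run in slow mode and skip everything up to the last name
    # the caller has already received.
    skipping = last is not None
    for name in slow_query():
        if skipping:
            if name == last:
                skipping = False
            continue
        yield name
```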

  94. MarkL says:

    Do Nothing

    I don’t believe you should taint the OS just because somebody else didn’t do their job properly.  And if you cover it up for them they are a lot less likely to fix the error or even try to avoid making such errors in the future.

  95. Jerry Pisk says:

    To me this is missing the larger issue – the logic should not be based on a single weird code specific to this case. If explorer gets an error, any error, during directory listing it needs to handle it appropriately, not simply list what it received so far.

    So my take on this is simple – try fast mode, if it fails, for whatever reason, make an attempt at slow mode and if that one fails as well just show an error message and go on with an unavailable directory. There’s no point in building a special logic to handle one specific case.

  96. Stefan Kuhr says:

    Dunno if this is possible: If SMB has a version number, make a certain version number on the server side a requirement for fast queries. Then ship a patch for Windows Server 2003 and 2000 that bumps up this version number, so Vista Clients can use fast queries if they see such an SMB server with high version number.

    Lots of people here suggest that a FindFirstFile query should always start in fast mode and query again if it fails with this weird error and match the queries’ results against each other and only issue the diff between the two to the client. Is this at all feasible with regards to the memory that is required to keep at least 100x_MAX_PATHx2 Bytes for each FindFirstFile call in memory on the client side? I guess FindFirstFile is nothing but a function that returns an RPC context handle. So if the client does the same job as the server (keep the state) and then update its state with state that might be completely different after the second call, this would be a perversion of the idea behind context handles and it would need lots more of memory than necessary.
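
    For a sense of scale, the memory figure above can be worked out directly. This sketch reads "100x_MAX_PATHx2" as 100 names of up to MAX_PATH characters at 2 bytes per UTF-16 character, which is one interpretation of the comment; MAX_PATH is 260 on Windows. The function name is made up for illustration.

```python
MAX_PATH = 260        # Windows MAX_PATH, in characters
BYTES_PER_CHAR = 2    # one UTF-16 code unit
ENTRIES_KEPT = 100    # roughly how many names arrive before the error


def buffered_bytes(open_enumerations):
    """Worst-case bytes held for names across concurrent enumerations."""
    return open_enumerations * ENTRIES_KEPT * MAX_PATH * BYTES_PER_CHAR


print(buffered_bytes(1))     # 52000
print(buffered_bytes(1000))  # 52000000
```

    So the worst case is about 52 KB per open enumeration for names alone, before any other per-file data.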

  97. John Doty says:

    I recommend different things for different layers of the software. As near as I can tell, you are going to have to work around this problem, as any failure to do so will be a regression in Vista. (By definition: a change was made that breaks existing functionality.) The question is what layer should work around the problem to make everybody happy.

    1. Default to "slow mode" below IShellFolder::EnumObjects, i.e., do not change the default behavior of the object from the way it behaved in XP. This allows existing clients to continue to work as written.

    2. Provide a "fast mode" switch on the object that implements IShellFolder. This mode issues fast queries. It is otherwise no different. Specifically, it does not attempt to remember whether or not fast mode worked in the past.

    3. Allow explorer specifically to use fast mode until fast mode breaks. Teach explorer that sometimes fast mode breaks against certain servers, and it needs to fall back to slow mode in those cases. Presumably explorer is in a position to refresh any data that it has when it encounters this problem. Use caches as required to keep from making too many redundant requests.

    4. (Optionally) Teach the same tricks to as many clients of IShellFolder in Vista as you can. Despite this, I would not make a public object model or API set that would automatically work around this particular glitch

    5. (Optionally) Document the problem in a KB article somewhere, and possibly in the documentation for the new "go fast mode" API in MSDN.

  98. Here is a wild idea.

    Sell two versions of Windows….it’s not like Microsoft does not sell enough versions.

    Version 1: Hack free. No buggy third party compatibility crap compiled in. I am sure you could do something in the source like

    #ifdef _VENDOR_HACK

    buggy_workaround();

    #endif

    You guys would then push the crap out of this to the corps that have all of these buggy custom written in house applications. Sell this at a reduced cost via the OpenLicense program you guys have to try to get more businesses to fix the broken apps. Of course some systems would still need the buggy backwards compat version so you have…..

    Version 2: Hacked version

    This would still have all of the compatibility tweaks for the end users’ Backyard Baseball type games. In fact let me take it a step further. You could make this a simple compatibility pack if you did not want to invent all new SKUs for another version.

  99. Adam says:

    Dialogs, switches, and remembering buggy servers are too complex.  There’s a simple solution.

    Always try the fast method.  If it returns any error that might possibly be fixed by using the slow method, try that.  In that case it always works fast for servers which aren’t buggy.  Buggy servers are a little slower, but still correct.  Perhaps log the problem for sysadmins who are wondering why their servers are slow.

    Buggy servers will be slower than they would be if you bothered to remember to skip the fast method, but it’s much simpler to implement.  Also, my guess is that most people will blame the slowness on the server (if they realize it’s a different machine).  People don’t generally blame slow websites on their own machines.
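
    A minimal sketch of this stateless fallback, in Python. The function and parameter names are invented for illustration, and `OSError` is only a stand-in for "any error that might be fixed by the slow method". The full listing is collected before anything is returned, so the caller never sees a partial result; if the slow query fails as well, the error propagates to the caller.

```python
import logging

log = logging.getLogger("dirlist")


def list_with_fallback(fast_query, slow_query, server):
    """Try the fast query; on any OSError, redo the listing slowly."""
    try:
        return list(fast_query())
    except OSError as err:
        # Leave a trace for sysadmins wondering why this server is slow.
        log.warning("fast query to %s failed (%s); retrying in slow mode",
                    server, err)
        return list(slow_query())
```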

  100. Regan Heath says:

    Assumptions:

    1. You have a complete list of known vendors/drivers that suffer from this fault.

    2. At some point in the future you will be distributing a fixed driver for the vendor. Possibly via some automatic update system.

    Solution:

    1. Distribute Vista with ‘the list’.

    2. Whenever a driver is installed, check it against the list, if it’s a known broken driver, set it to slow mode, otherwise set it to fast mode.

    (this implies a fast/slow mode option per driver)

    3. Allow users to enable/disable fast mode at will.

    4. If in fast mode, and you get the error, popup a dialog telling them the problem, disable fast mode, and tell them to close and restart the application.

    Pros:

    1. The ‘workaround’ code runs only during driver install/upgrade and when detecting the error. It doesn’t run on every list/mount and it doesn’t need extra memory/processing etc.

    2. Users don’t need to know about fast/slow mode, they will automatically get the best possible mode at all times.

  101. Eric says:

    No errors. No new APIs.

    Consequently… run in slow mode unless you can identify a fast mode server prior to requesting data.

  102. 112 comments wow.

    Anyway if you didn’t see in my original comment I said ‘in explorer’.  I propose fixing only explorer, not the api’s themselves.  Explorer can certainly detect the error and refresh the directory listing.  Maybe other MS programs need to do this too, those that don’t use the standard file open dialogs and the like.

    As for the api’s used by other apps, that is up to the application developer to fix.  But what about the old applications no longer supported?    They wouldn’t be using fast mode anyway, correct?  But say they did, maybe then there would need to be a compatibility option for running the program that forced slow mode.

  103. microbe says:
    1. Do Nothing is the best amongst the above choices.

    2. If you have to keep compatibility, the first time you connect to a new server that claims to support fast query and might be a candidate for having this bug, prompt the user "Fast query is a new feature of Windows Vista to make directory listing faster over network. Server xxxx claims to support fast query but there is a chance it might have the bug described in KBxxxxyy. Do you want Windows to verify it? [Yes] [No] [Never prompt again]"

    There you go.

  104. JI says:

    Funny, seems that nobody has thought about this the other way around.

    For each connected server, initially set a registry value SupportsFastMode=2 (uncertain).

    Then run in slow mode until you happen to see more than 100 files returned.

    If that happens, immediately re-run the same query in fast mode and check for The Error.

    If you get >100 files set SupportsFastMode=1. If you get The Error set SupportsFastMode=0. If you get <=100 files then presumably you lost a race, and will need to try again next time.

    Advantages:

    – Don’t have to waste time searching for a valid test, you wait until the test finds you.

    – Performance will always be as good or better than XP.

    – Users never experience a fault, thus do not need to be notified.

    – Power users/Admins can set the key manually, if they know that their server is good/bad.

    – Relatively simple implementation.

    Disadvantages:

    – Performance will be sub-optimal if user never browses to folder with more than 100 files (but still as good as XP).
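
    The scheme above might look something like this in Python. The values 2/1/0 for `SupportsFastMode` come from the comment; the dictionary `state` stands in for the registry, and `FastQueryError`, `fast_query` and `slow_query` are hypothetical names. What to do if a server already marked good later fails is not covered by the comment, so this sketch just lets that error propagate.

```python
UNCERTAIN, GOOD, BAD = 2, 1, 0   # SupportsFastMode values from the comment
THRESHOLD = 100                  # the "more than 100 files" trigger


class FastQueryError(Exception):
    """Stand-in for 'The Error' returned by a buggy server."""


def list_directory(server, path, fast_query, slow_query, state):
    mode = state.get(server, UNCERTAIN)
    if mode == GOOD:
        return list(fast_query(path))

    # Uncertain or bad: the slow query is always the one the caller sees.
    names = list(slow_query(path))

    if mode == UNCERTAIN and len(names) > THRESHOLD:
        # A big enough directory turned up, so test fast mode on it.
        try:
            probe = list(fast_query(path))
        except FastQueryError:
            state[server] = BAD
        else:
            if len(probe) > THRESHOLD:
                state[server] = GOOD
            # Otherwise the directory shrank underneath us (the lost
            # race); stay uncertain and try again another time.
    return names
```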

  105. Arnt Witteveen says:

    More than 100 comments, wow. Everyone wants to solve a windows bug (is this a commercial for OSS or what! ;-)

    Generally: make it work. I’m with Richard Kuo here, when he says "Some basic tenets: Do what’s expected. Work by default." As a user, I don’t want to go upgrade stuff so it’ll work with Vista, I want Vista to work with my stuff. I could not imagine anyone who has ‘managed’ more than one computer thinking differently, in practice. (I really don’t want to waste my time and frustration on such a problem.) I see that I’m totally wrong here based on the ‘do nothing’ comments.

    I was thinking like everyone else: switch to slow mode (impossible, you then specified), or return the rest of the list using slow mode, by keeping the first 100 or by re-querying for them, if a fast query always returns the same 100 files.

    Another option, and I saw this one above, but only once: how about whitelisting, instead of blacklisting? Always start with the slow query for a machine not seen before. Evaluate if the conditions for testing this bug are fulfilled (e.g. slow query returns >100 files). Send a message to some system component asking it to evaluate this machine for fast access, giving it a path this testing can be based on (the one you just found more than 100 files in).

    Continue using slow mode, until the system component discovers that fast mode is ok.

    (Variation: do this testing in the same thread that does the slow mode accesses, at the same time, if fast mode is really so fast, it won’t make a dent…)

    Optionally, optimize by whitelisting ‘known’ systems (MS OS, Novell OS, Good samba versions if you can detect them, whatever…) automatically.

    Advantages:

    – always works. No weird, non-reproducible behaviour (I swear it returned only 100 files last time! wait I’ll restart my program and show you.. now it does return them all! <Next day>Turn on computer, there it is again, only 100 files. Go get admin, doesn’t happen again, … repeat ad infinitum).

    – still uses fast mode most of the time

    – not much slow down at all, not on the first query (except that it uses slow mode, but it doesn’t have to restart, doesn’t have to do several queries, doesn’t have to try a fast query, wait for the error (or not), then do an appropriate query which actually returns results to the user, or anything like that that would slow down the first query)

    – Vista user knows nothing about it

    Disadvantages

    – very ugly hack

    – ungodly amount of code for such a thing, probably

    – ‘unexplainable’ directory accesses seen on target machine (and the Vista machine too), possibly by a strange user (if not using impersonation for the test) which you will have to explain after Mark Russinovich fires up filemon and has an hour to kill looking for strange accesses ;-). Then again, this blog entry explains it already!

    – possibly ‘brittle’ (is that even a correct english word?), i.e. this may easily go wrong. then again, it’s fail safe: if it does go wrong, you’re in slow mode, and everything still just works.

    – Vista user knows nothing about it (=> this will remain unpatched with all the distributors)

  106. Arnt says:

    And I see JI has thought of the same thing. (His post wasn’t there when I started on mine.)

  107. My opinion is that off-the-shelf distributed code should be as low on vendor specific hacks and fixes as possible.

    I’ve had trouble, from time to time, with files being deleted/truncated instead of overwritten when saving them to smb shares. This as a result of the smb connection being reset at an awkward time during the update. In most of these cases I’ve been able to track it to some hardware failure (misbehaving firewall), and in others it’s been caused by some random VPN client. Annoying? Yes, of course, but hardly something one expects the OS to handle ("connection dropped? never fear, we’ll generate random files for you to browse!"). The same goes for this matter, IMHO.

    Leave the fast query on by default, and provide a registry setting to switch it off. Perhaps even let users switch it off for specific IPs or NETBIOS names.

    Show an alert dialog if the server returns the weird error code you describe, with a link to the proper KB article. This is after all an error, so it should be noted as such. If necessary, drop another registry setting in there to disable specific error messages/codes. I’m guessing other errors can/will be shown like that, so being able to disable them would be a peach for the advanced users.

    The mentioned KB article should of course either contain the description of how to resolve this by using the registry settings, or maybe even have a patch applying some other solution described somewhere else on this page.

    Disadvantages:

    – The user will see the error popup, if he’s as unlucky to browse the flawed servers, and might be puzzled for a minute or two

    Advantages:

    – The user will know why it happened, and how to fix it

    – Patches may be distributed by network administrators if it’s a company wide problem

    – By being told what happens, the (advanced) users can patch or upgrade their file server software to some decent version

    – Most users and file server providers will benefit from the fast query mode

  108. Mike Fried says:

    So you can’t test the server for its version or for some kind of flag saying that it has been certified for compatibility with Vista?

    In that case, make the API behave identically to the Windows XP case, and introduce an EX version of the API with a flag to enable the faster query. Document the bug in a KB article, and put the link to the KB article in the MSDN documentation of the EX API.

    Existing code won’t suddenly change behavior. New code (such as the Vista Shell) can work around the issue if it encounters it. The end user doesn’t need to be bothered with these details. I know it sucks that you can’t get perf gain for backwards compat with existing hardware in this case, but it only works if it’s well tested. That’s why we have the WHQL.

    Better safe than sorry. You never know – if you don’t fix this bug, you may end up with a QFE request later.

  109. Jack Mathews says:

    Raymond: Does the error code happen every time a directory listing is gotten, or just when a directory listing is too long?

    If it happens every time, then do a quick directory listing of a root directory when the server is mounted.  If you get the error or don’t have permissions to do a directory listing on any of the roots, silently enter slow mode.  Otherwise, you can safely use fast mode.

    Advantages:

    * Everything just works.

    Disadvantage:

    * Slightly slower mounting time.
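
    A sketch of this mount-time probe, in Python, valid only under the comment's own premise that the error shows up on every fast listing (other comments suggest it appears only on long ones). All names are hypothetical: `fast_query(root)` is assumed to return an iterable of names, and `FastQueryError` stands in for the strange error code.

```python
class FastQueryError(Exception):
    """Stand-in for the strange error code from the buggy server."""


def choose_mode_at_mount(roots, fast_query):
    """Return "fast" only if a fast listing of some root succeeds."""
    for root in roots:
        try:
            for _ in fast_query(root):
                pass  # drain the listing; only success or failure matters
        except PermissionError:
            continue  # cannot probe this root; try the next one
        except FastQueryError:
            return "slow"
        return "fast"
    # No root could be listed at all: silently stay in slow mode.
    return "slow"
```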

  110. DmitryKo says:

    Fixing other people’s errors is just wrong, unless there is virtually no chance of the error being fixed, so I’d choose "Auto-detect the buggy driver and put up a warning dialog"

    Also write an event log message stating that the remote server is faulty and needs to be updated; create a new error number for this exact issue. As a workaround, suggest reverting to the slow mode by default (of course, there should be the registry setting to allow that).

    However, knowing about multiple various hacks employed by Microsoft in order to fix wrong code, I would suggest the workaround already proposed by Andy, Joe Butler, Michael Cook and probably others.

    That is, keep the last 100 recent entries in the internal list (or maybe in the file cache) for network requests, and re-query the server in slow mode upon discovery of the error. Then compare the new listing to the saved one so only changed items will be submitted in the next batch. Write the error in the event log, as mentioned above, and provide the means to disable this workaround through the registry.

    Advantages: applications get the complete directory listing and get no error messages

    Disadvantages: OS needs to cache at least 100 items for every directory listing request to every network server, which impacts performance

    (the cache could probably be discarded after the first successful 100 items, but still needs to be recreated on every new request)

  111. Neil says:

    Anyway, my suggestion is to release an optional update that makes Windows XP run in fast mode. Add a note to the update that there are more details in the KB article as to how the administrator could test their third-party devices for compatibility. Since people will want to switch to fast mode hopefully people will complain to brand X about their incompatible server.

    > (If the list of "known bad" servers were unbounded, then an attacker could consume all the memory on your computer by creating a server that responded to a billion different names and using HTTP redirects to get you to visit all of those servers in turn.)

    How is this supposed to work? Redirecting from http:// to file:// perhaps? Does the security team know about this feature?

  112. DrPizza says:

    (If it happened for every listing, not just "long" ones, then you could just make the network driver do the listing itself; this would have the bonus of not breaking anything, at the expense of *slightly* worse mount performance, and would be far and away the best fix.  But I don’t think that’s what you’ve described.)

  113. How about this. I can see that it has "issues" but…

    When you run the query in Fast Mode, keep a hold of all the items you’ve been given until you’ve received enough that you can be confident that the server isn’t one that has this bug.

    If you get the error message that indicates it *is* the buggy version, rerun the query in slow mode. For each result that you get from the slow mode, check to see whether it’s in the list that you already received and filter it out of the results if it is. Return any results that *weren’t* in that list as if they were just a continuation of the result set of the original query.

    The biggest issue is, of course, what happens if the directory contents actually change in the time between the fast request and the slow one? You could end up with a listing that didn’t accurately represent the contents at any real point in time. Whether that’s a serious problem will depend on the nature of the API and what it’s typically used for.

    The other problem is that it introduces significantly more overhead even beyond what the "slow mode" would normally be – doing all the filtering over intermediate results etc. IMHO that’s fine – "this server performs excessively slowly when serving to Vista clients" is probably a more compelling upgrade reason for users of the buggy server than any number of theoretical arguments about bugginess.

    Finally, there’s the issue that you need to keep track of the results for *every* fast mode query until you’re sure the server is okay. Again, the nature of the API will determine whether that’s a problem or not.
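
    The filtering idea in this comment can be sketched in Python as below. The names are invented for illustration, with `FastQueryError` standing in for the error that marks the buggy server. For simplicity the sketch remembers every name for the whole fast query rather than stopping once it is "confident", since the thread gives no reliable threshold for that.

```python
class FastQueryError(Exception):
    """Stand-in for the error that identifies the buggy server."""


def list_directory(fast_query, slow_query):
    """Yield names from the fast query, topping up from a slow re-query."""
    seen = set()
    try:
        for name in fast_query():
            seen.add(name)
            yield name
        return  # fast query finished cleanly
    except FastQueryError:
        pass

    # Re-run slowly and pass through only names not already delivered,
    # as if they continued the original result set.
    for name in slow_query():
        if name not in seen:
            yield name
```

    Unlike resuming after the last name returned, this does not depend on the two queries using the same order, but it shares the weakness noted above: a file deleted between the two queries will still have been reported.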

  114. Amos Houndsbreath. says:

    Wow, I’ve scrolled through a huge number of suggestions, and most of them would move Windows’s usability firmly towards that of lesser (and cheaper;-) OSs. Hope you ignore them.

    I think the code has to be reliable with any possible server, including ones with the bug. That’s an absolute must. It shouldn’t bother the user with information they most likely won’t understand either.

    If possible, check the server capabilities when you connect – maybe SMB has a way to get OS version/Driver version or something, so you could use that to work out if Fast Mode was usable. Then you’d only use Fast Mode if you’d verified the server supported it, and use Slow Mode otherwise.

    If not, and there isn’t a way to recover from Fast Mode fails on the fly without telling the application, I’d vote for only using Slow Mode.

    Or there’s another possibility. You could start off using Slow Mode. When you find (via a Slow Mode query) a directory big enough to benefit from Fast Mode, do a check to see if Fast Mode gives the same answers.  If it does, use Fast Mode from then on. But don’t return bad data, and don’t tell the user to upgrade his server software. People that want to work with that kind of OS won’t use Windows anyway.

  115. Nekto2 says:

    /*joke*/

    Every time a new file server is found, you just create a TEST directory there with >100 files and read it. If it fails – mark it as slow :)))

  116. kokorozashi says:

    It seems obvious to me that the right thing to do is test the version of the server or driver and use slow mode if it’s old and fast mode if it’s new. This approach is almost all upside. The only downside is that Windows contains a hack which recognizes a particular version of a particular server. I’ve worked in operating systems before and committed many much worse compatibility hacks than this.

    This is so obvious to me that I assume RC would have listed it as a possibility if it were one. That must mean it’s impossible to test such a version.

    Barring a version check, I’d run the query slowly, write a KB entry, and contact the developer to tell them what happened. Tell them to add a way to query the version of the server so you can use fast mode when appropriate. Otherwise, their most important client will ignore all the hard work they did adding fast mode. Tell them their competitors run X% faster  because their fast mode doesn’t have the bug. I’m not sure there is a downside here because this will motivate them to distribute the fix and you can roll the version hack into a service pack you were going to ship anyway.

  117. kokorozashi says:

    And if you get lucky they will have distributed the version capability before Vista freezes so you don’t have to deal with the service pack.

  118. Adam Ligas says:

    Solution:

    Build in a setting to enable/disable Fast Mode.  Ship with the setting disabled. Include the code to detect the problem when Fast Mode is enabled.  If the problem is detected, tell the user to disable "Fast Mode".  

    Since they enabled it, they’ll know where to go to turn it off.

    Followup:

    If the status of the issue changes (i.e. distributors update the code from the vendor and problem goes away), enable Fast Mode in a future update.

    Result:

    Windows Vista ships working. If the situation changes, then people get a "free" speed increase in a later Service Pack.  In an above-average environment – say a large corporation or knowledgeable person – they can enable the feature in their installation if they know it will not cause any problems.

  119. blah says:

    We’re all looking at the problem from the wrong end. We really should be looking at the machine on which the file server runs.

    What about a "Compatibility Catalogue" of apps / drivers on that machine which haven’t passed compatibility testing for some reason? That, combined with a Microsoft-side Compatibility News/Update webcenter would be the real way to get Administrators to pick up on the issue and deal with it *as soon as* there is a patch out.

    With an official-enough framework to deal with the issue, perhaps you might even find that the Compatibility webcenter becomes a Hall-of-Shame for apps…so much so that they feel the pressure to fix their issues, and do so.

    Nice post!

  120. I like guillermov’s auto-whitelist idea. It combines many of the advantages of the other approaches. To reiterate: The difference between the fast and slow methods probably isn’t significant for small numbers of files. Always try the slow method until a query returns more than 100 results (using the slow mode). Then (even asynchronously) execute the same query using fast mode.

    The directory changing between the two queries isn’t a concern because the results of the second query aren’t actually used. If the fast query returns fewer than 100 results, because the directory changed in the interim, do nothing. If it returns 100 or more results, mark the server (using some of the schemes discussed above) as fast-capable, and use the fast method for all subsequent queries. If the test query returns the error, mark the server as being slow and don’t attempt to test it again for some number of days.

    I also don’t see why the length of the server list has to be so restrictive — it only has to be consulted at connect time, so why not store it on the filesystem (or in the registry) and consult the list once, when connecting? The cost of doing that would be lost in the noise.

    You’d still need some limit for the size of the server list, but it could be much higher than 16.
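    A minimal sketch of the probe-then-promote idea above, assuming a hypothetical `server` object that exposes `slow_query` and `fast_query` and raises a made-up `FastQueryError` for the strange error code. The threshold of 100 and the one-week retest interval are illustrative guesses, and the probe runs synchronously here rather than in the background.

```python
import time
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    UNKNOWN = "unknown"
    FAST = "fast"
    SLOW = "slow"


class FastQueryError(Exception):
    """Hypothetical stand-in for the strange error code from a buggy server."""


PROBE_THRESHOLD = 100                  # assumed: only big directories expose the bug
RETEST_AFTER_SECONDS = 7 * 24 * 3600   # assumed: retest a slow server after a week


@dataclass
class ServerState:
    mode: Mode = Mode.UNKNOWN
    marked_slow_at: float = 0.0


def list_directory(server, path, state, now=None):
    """Return the listing; results always come from a mode already known to work."""
    now = time.time() if now is None else now
    # A server marked slow gets another chance once the retest interval passes.
    if state.mode is Mode.SLOW and now - state.marked_slow_at >= RETEST_AFTER_SECONDS:
        state.mode = Mode.UNKNOWN
    if state.mode is Mode.FAST:
        return server.fast_query(path)
    entries = server.slow_query(path)
    if state.mode is Mode.UNKNOWN and len(entries) > PROBE_THRESHOLD:
        try:
            probe = server.fast_query(path)  # probe results are never returned
        except FastQueryError:
            state.mode = Mode.SLOW
            state.marked_slow_at = now
        else:
            # A short probe result may just mean the directory changed; do nothing.
            if len(probe) >= PROBE_THRESHOLD:
                state.mode = Mode.FAST
    return entries
```

    Because the caller only ever sees slow-mode results until the server has passed the probe, a failed probe costs one extra request and nothing else.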

  121. Michael Cook says:

    OK, here are my ideas:

    1. When a directory list is required, do it with the fast method. If you don’t get the error message from the server (let’s say there are only 60 files, or there are 200 and the server doesn’t have the problem) you are done.

    If you DO get the error message, you know there are more files and the server is buggy. So as soon as you get the error message you re-query the server using the slow method. The user will already have a list of files (about 100) to look through while the rest are retrieved.

    Pros:

    * When people don’t run the buggy server, you use fast mode thus things are faster and there is no penalty (like using slow mode for everything)

    * If a user is just browsing in Windows Explorer, they’d probably never notice you were doing this

    * Uses fast mode if there are only a handful of files

    Cons:

    * It is a double hit on the server when this issue is there. That could be a big problem depending on how often people do this. To fix this, I would propose a registry setting to simply force slow mode that could be documented in a Knowledge Base article for those who need it.

    * More complex than just always using one mode or the other

    2. Keep an internal cache of server names and whether you can use fast or slow mode. After a certain period of time (haven’t accessed the server in a week, perhaps) remove the entry of that server (to prevent the cache from getting too big). You can also re-check the server once a month or something to see if it has been updated (or something like that).

    When you first connect, you do like in solution #1 to find out if the server suffers from the issue. You save the determination so that next time you can just use fast or slow mode as appropriate and skip the check.

    Pros:

    * Again, will use the fastest mode possible

    * Avoids double-hitting the server every time if it suffers from the issue

    Cons:

    * Requires a cache, not as elegant (code wise) as just always using one mode

    If you like either of my ideas, I’d love to hear about it. My e-mail address is on my website.

    — Michael Cook
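    Ideas 1 and 2 above could be combined roughly as follows. This is a sketch under assumptions: `fast_query` and `slow_query` are hypothetical callables, `FastQueryError` is an invented stand-in for the server's error, and the one-month recheck interval is just the figure floated above. For simplicity it returns only the complete slow listing after a failure instead of showing the partial fast results first.

```python
import time


class FastQueryError(Exception):
    """Hypothetical stand-in for the strange error code from a buggy server."""


class SlowServerCache:
    """Remembers which servers needed slow mode, forgetting them after a while."""

    def __init__(self, ttl_seconds=30 * 24 * 3600):  # assumed: recheck monthly
        self.ttl_seconds = ttl_seconds
        self._marked_at = {}  # server name -> time it was marked slow

    def needs_slow(self, server, now):
        marked = self._marked_at.get(server)
        if marked is None:
            return False
        if now - marked >= self.ttl_seconds:
            del self._marked_at[server]  # expired: give fast mode another chance
            return False
        return True

    def mark_slow(self, server, now):
        self._marked_at[server] = now


def list_directory(server, fast_query, slow_query, cache, now=None):
    """Try fast mode unless the server is cached as slow; fall back on error."""
    now = time.time() if now is None else now
    if cache.needs_slow(server, now):
        return slow_query()
    try:
        return fast_query()
    except FastQueryError:
        cache.mark_slow(server, now)
        return slow_query()  # restart the whole listing in slow mode
```

    The double hit on the server happens only on the first failure and once per expiry period; every other query against a known-buggy server goes straight to slow mode.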

  122. BryanK says:

    PaulM:

    > Am I the only one who is breaking into a cold sweat reading the majority of these suggestions?

    No, you’re not the only one.

    Speaking as a part-time ECMAscript developer, many of these suggestions (all the ones that involve blacklisting or whitelisting certain software versions) absolutely SUCK.

    The canonical example of this kind of thing going wrong is with DOM detection in JS — if the JS code does its "what feature set can I use?" detection by looking at the user-agent string (in this case, that’d be the server name and version), then it may work most of the time.  But if that same user-agent string (or server name/version) starts working properly in the future, your code suddenly needs to be updated.

    Whereas just checking for DOM features that you need before using them, and including a fallback if necessary, is a *much* cleaner solution.  Yes, the code is a bit bigger, but often the differences can be abstracted away in a wrapper function.  (See the myriad attachEvent / addEventListener JS libraries that are available, for instance.)

    In this case, the only two decent options (given the frequency of OS updates from Microsoft) are do nothing, and detect the error *WHEN* *IT* *HAPPENS*.  Microsoft can’t blacklist or whitelist certain server software, because if those servers get fixed in the future, the blacklists will need to be fixed too.  Whereas if it just detects the error (or does nothing), it won’t care about future updates.

    After seeing that you’ve hit this bug, it doesn’t matter (much) what you do.  But you *can’t* write code that expects to see the bug with certain combinations of software without actually seeing it, because that check will be WRONG in the future.

    (Now, blacklisting certain server *DNS names* (or NetBIOS names, either way) isn’t so bad, as long as the entries get maintained.  A server that gets fixed can’t stay in the list too long.)

  123. orcmid says:

    Well, I’m in favor of "do the right thing" and "don’t give the user a problem they can’t do anything about."

    1. I think Patrick Correia’s thinking, along with Mike Fried, about versioning the API is interesting.  It doesn’t come at the real issue, but it is an interesting idea just the same.  It holds a clean interface contract in terms of previous behavior. I do like that.

    2. Along with Andy, Joe Butler, Jstin Bowler, Michael Cook, Mike, Richard Kuo, and Arnt Witteveen, I’m interested in seeing the fast case, if used, deliver correct results whenever possible.  

    2.1 For me, I would relegate this to a pacing and dataloss problem (even though it’s really a bug) and solve it as a pacing problem.  Caching a flag against a failing fast-responder is fair to avoid incurring restart overheads, just like fixing a transmission window size to curb packet losses.  

    2.2 For splicing in the tail of a slow query with the lead of a fast one, I’m assuming that query responses aren’t atomic and are also immediately stale, so having a potentially inaccurate return is not that frightening in terms of minor discrepancies.  There’s an explosion of new failure modes, though, and that makes me nervous, because they become rarely-seen and easily screwed-up cases.

    2.3 I am concerned about how the user experiences what happens.  Why is the fact of premature ending of the return not visible to the user now?  If a connection is lost or a distant server fails in the middle of a query return (directory enumeration?), what happens now?

    2.4 I think I would want to warn a user when a retry is in progress (with some non-modal indicator) so they don’t think the situation is exactly hung.  I’d want to provide some indication that results may be inaccurate/incomplete if the strange error code occurs and recovery was not attempted or was unsuccessful.  I suspect that this depends on what can get through an enumerator interface and what an application does with it.  Not encouraging.

    3. I keep coming back to wondering what happens now when an enumeration is truncated for any reason.  How does that get dealt with, and how is it perceived by users?  If it is application specific (don’t know why not), how does application code learn what it is there is an opportunity to be application-specific about?

    If the enumeration is not informative enough, maybe revving the API is ideal for that reason too.  You could have a refresh/retry case there, perhaps.

    4. OK, I can’t resolve myself about this, so no pros or cons.  Thanks for challenging us with this.

  124. mikeb says:

    I am amazed at the number of people who advocate doing nothing special and letting the directory enumerations fail when querying the buggy servers.

    Everyone should (but apparently doesn’t) realize that Microsoft (and by extension, Raymond) doesn’t really care about the producer/vendor/whatever of the server software.  If that were the only entity negatively impacted by the failure, then the answer would be easy – force the buggy server implementation to be fixed in order to get correct results.

    However, Microsoft does care about the people who are currently *successfully using* this buggy server implementation.  Many, maybe even most, of those people don’t even know they’re using it and have no control over how, when, or if the server gets updated.  So even if the buggy server implementation gets fixed by the vendor, that doesn’t mean the deployed buggy implementations will get fixed – those users are still screwed.

    If the directory queries fail, that leaves these people with 2 options:

    1) don’t upgrade to Vista (that’s if they even realize there’s a problem)

    2) lose data

    Neither of which is an acceptable solution for Microsoft, even if it makes Vista’s directory query implementation impure.

    So, please stop suggesting that these directory queries just fail (particularly silently).

  125. Riva says:

    I was just reminded of the way XP deals with UDMA errors. If a device acts up a set number of times, XP will degrade performance by dropping down to PIO from then on.

    With fast enabled by default, at some point an enumeration might fail, and from then on only slow is used for everything. As someone else pointed out: why not deal with it in a similar way as a sudden network outage? It wouldn’t be terribly accurate, but it would simply cause the user to retry the action in almost all cases, and the next time around it succeeds so they will shrug and be none the wiser as to what happened behind the scenes.

    Advantage:

    * no confusing error messages and no user interaction required at any point

    * fair tradeoff between using the fast mode and workaround (which is no slower than what’s in XP today)

    Disadvantage:

    * the user will still encounter a problem once (but only once)

    * even if the server gets patched, the user might not know or remember to enable fast queries

    I would say that the first disadvantage is acceptable. After all, the server is defective in a way and as long as the failure doesn’t happen silently, no harm is done.

    For the second disadvantage I think the UDMA behaviour is a valid precedent. Arguably using PIO when a device supports UDMA is a terrible performance decision, but if it acted up in the past, you have little choice until the user fixes the problem and tells the OS to revert to the default – faster – setting.
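    The UDMA-style degrade could look something like this. It is only a sketch: the class, the exception names, and the threshold of one failure are assumptions made for illustration, and `fast_query` / `slow_query` are hypothetical callables.

```python
class FastQueryError(Exception):
    """Hypothetical stand-in for the strange error code from a buggy server."""


class EnumerationFailed(Exception):
    """Reported to the caller the same way a sudden network outage would be."""


class DirectoryClient:
    def __init__(self, fast_query, slow_query, max_fast_failures=1):
        self.fast_query = fast_query
        self.slow_query = slow_query
        self.max_fast_failures = max_fast_failures  # assumed: degrade after one failure
        self.fast_failures = 0

    @property
    def degraded(self):
        return self.fast_failures >= self.max_fast_failures

    def list_directory(self, path):
        if self.degraded:
            return self.slow_query(path)
        try:
            return self.fast_query(path)
        except FastQueryError as exc:
            # Never hand back a truncated list: count the failure and report it.
            self.fast_failures += 1
            raise EnumerationFailed(path) from exc

    def reset_to_fast(self):
        """What a user or admin would trigger once the server has been patched."""
        self.fast_failures = 0
```

    The first listing against a buggy server fails visibly, the retry succeeds in slow mode, and nothing switches back to fast until someone calls `reset_to_fast`.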

  126. Vince P says:

    My solution is to run a diagnostic at driver installation time.  Test for a failure situation after the relevant driver is installed.  If there’s a failure set that box to do slow mode. If it passes, set the box to be fast.  

    Simplicity :)

  127. cola says:

    Oh boy, a chance to give advice!

    It might or might not be a good idea to leave out the list of buggy servers, and just try every single call in fast mode.  (If you get the error, retry in slow mode, as explained in detail by everyone else.)  Someone should find out how often directories have more than 100 files.

    Also, put this bug fix all in one file, except for the calls to the functions in that file.  That way the developers won’t forget about it.  "What’s in this file?" is a more common question than, like, "What happens for this error code in this function in this 30K file?" :P

  128. It is unacceptable to nag the user regarding some driver issue which they do not care about.  Not their problem.  Don’t make it theirs.  So, no $#%* modal user-dialog.  I mean it!  Don’t make me come over there!  

    Nothing wrong with an Event Log entry though, so that a tech-savvy user can investigate.

    Next, what to do about the error.  Again, we don’t want to bother the user.  Not their problem, don’t make it theirs.  Detect the problem (this is not a hack, it is just making robust code), and react.  Switch to slow mode if necessary.  Display a (non modal!) message if you like, just keep it unobtrusive and clear.  "Switched to slow network mode.  See this url for help."

    Think outside the box.  Don’t tell me anything about "well half the result is already processed and um…".  Deal with it.  

    If only we lived in a perfect world!  Just do what you can in the time you have.  But don’t kid yourself into thinking your problem is a buggy 3rd party driver, because that’s just a symptom.  Your problem is your design’s inability to handle protocol errors in a transparent and well-defined way.  

    And while I’m ranting, what’s with the whole "shell and networking teams cooperated to find the problem" thing.  If some stupid component ate an error, then *that’s* a problem.  Has anyone even suggested addressing that?

    I’m in a bad mood.  Has anyone got any chocolate?

  129. Mike says:

    If the error displays after a mere 100 (or so) entries, why not simply cache them (if doing a "fast query"), and on failure do a non-"fast query" and simply cull the "fast query" results already returned to the "client"? I mean, it’s not like Explorer is the hallmark of lean-and-mean anyway; adding something like (up to) 25-30KB for each "live" Explorer dodah wouldn’t even be measurable in the megabytes after megabytes it consumes.

    I do however wonder what a "fast query" is. At first I thought you were talking about SMB, but then you added HTTP to the discussion. I suggest you stop the riddles and simply tell us what protocol we’re dealing with here. Trying to find a solution for a half-specified problem in an unspecified domain is about as easy as lifting yourself by the hair.

  130. Phil Bevan says:

    I would suggest the following:

    1. Use whatever method is reliable to detect the buggy driver.

    2. If buggy driver detected inform the current user that buggy driver detected and that they should notify the system admin, with a note to check event log.

    3. Write a detailed event log message that the buggy driver was detected along with a link to a MSKB article on the problem with links to the fix.

    4. Disable "fast mode"

  131. Stephen says:

    It appears the program in question is indeed Samba, and the problem is described at:

    https://bugzilla.samba.org/show_bug.cgi?id=3526

    This bug certainly sounds like the one Raymond is talking about, at least.  It was apparently fixed in Samba 3.0.21c, released on 25 February 2006.

  132. A few people seemed to have missed Raymond’s comment that requerying isn’t really possible (at least not without a lot of work).  There are a few principles at work here:

    a) The user should *never* get an incomplete list of files from the server without any indication.  This is bad.  I don’t think there needs to be any special error handling here — there’s probably an existing error code that can be returned to the caller to handle a similar error condition.  In addition, adding something more informative to the event log is a good idea.

    b) The user should have some way to enable "slow mode" so if upgrading the server isn’t possible they can still run Vista.  Include a knowledge base article about it.  If a server administrator encounters this problem during testing they will be able to figure it out (either by contacting their vendor — who will surely know the fix — or by consulting the MS knowledge base).

    Since there’s no good way to reasonably detect the problem and retry it — this seems like the best answer.  The driver will get upgraded and this problem will eventually be long forgotten — no need to hack some complicated solution to it in the short term.

  133. Jorge Coelho says:

    I think the solution here is a combination of possible methods, which have basically all been described here.

    First: it’s not the user’s fault. He can’t do anything about it, so don’t bother him. Put an entry in the event log, if you want, but leave it at that.

    Second: It’s obvious the vendor doesn’t want his software to be classified as ‘slow’, so they will fix the problem ASAP. Unfortunately, it will still take time for the new version to filter through to the distributors and finally to the customers. This is not the vendor’s fault, so there is no need to punish him by displaying an error message either.

    The solution to the problem, under the above circumstances, has already been described here:

    Use fast mode to query, maintaining a list of the first 100 entries. When you get to the 101st entry and receive an error, switch to slow mode and re-query. Match the slow mode entries with the ones previously obtained in fast mode and return only those that do not match. This is an enumeration, btw, so, as with all enumerations, things might change while you are doing it – but that’s a different problem altogether.

    Now add that server to your ‘last 16 buggy servers’ list. Next time you need to query that server, you will not have the overhead of doing fast mode first and then repeating in slow mode.

    Time stamp each entry in the ‘bad servers’ list so that each buggy server is removed from the list after one month or so. Why? Because the server software might have been updated in the meantime. Who cares if once per month *one* query will be a little slower than usual?
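    The splice-and-remember scheme above might be sketched like this. Everything named here is an assumption for illustration: `fast_iter` and `slow_iter` are hypothetical callables returning iterators of file names, `FastQueryError` stands in for the server's error, and the sketch takes for granted that a fresh slow query can be issued after the fast one fails.

```python
from collections import OrderedDict


class FastQueryError(Exception):
    """Hypothetical stand-in for the strange error code from a buggy server."""


MAX_TRACKED = 16                    # the 'last 16 buggy servers' list
EXPIRY_SECONDS = 30 * 24 * 3600     # assumed: 'one month or so'


class BuggyServerList:
    def __init__(self):
        self._stamps = OrderedDict()  # server name -> time it was added

    def add(self, server, now):
        self._stamps.pop(server, None)
        self._stamps[server] = now
        while len(self._stamps) > MAX_TRACKED:
            self._stamps.popitem(last=False)  # drop the oldest entry

    def contains(self, server, now):
        stamp = self._stamps.get(server)
        if stamp is None:
            return False
        if now - stamp >= EXPIRY_SECONDS:
            del self._stamps[server]  # expired: the server may have been updated
            return False
        return True


def enumerate_directory(server, fast_iter, slow_iter, buggy, now):
    """Yield names; on a fast-mode error, continue with unseen slow-mode names."""
    if buggy.contains(server, now):
        yield from slow_iter()
        return
    seen = set()
    try:
        for name in fast_iter():
            seen.add(name)
            yield name
    except FastQueryError:
        buggy.add(server, now)
        for name in slow_iter():
            if name not in seen:  # skip entries the caller already received
                yield name
```

    The `seen` set grows with every name returned in fast mode, which is the memory cost of being able to splice; it is discarded as soon as the enumeration ends.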

  134. JamesW says:

    Problem:

    Third party software implements a Microsoft protocol incorrectly.

    Solution:

    Microsoft fully documents the protocol.

    Advantages:

    Third party software doesn’t have to be developed in the dark.

    Users get to live in a blissful world of system interoperability.

    Happy users won’t be moaning that Vista doesn’t work come 20??.

    Might get the EU off Microsoft’s back.

    Disadvantages:

    Users can use non-Microsoft solutions with less fear, uncertainty and doubt in their minds.

    People will want NTFS documented next!

  135. John C. Kirk says:

    I’d favour a combination of the suggestions from some previous posters.

    1. Document this compatibility problem in the Microsoft Knowledge Base.

    2. Have a setting for "always use slow mode", which is turned off by default. This is controlled through Group Policy (on/off/not configured).

    3. If "always use slow mode" is on, then do what it says on the tin, and there’s no error.

    4. If "always use slow mode" is off, and an error occurs, don’t display an error message to the user, but do write an error in the system log, which includes a link to the aforementioned KB article.

    a) If Group Policy for this setting says "not configured", then set "always use slow mode" to on (writing another entry to the system log), and repeat the query.

    b) If Group Policy says "stay off", then do what it says, i.e. stay off and leave the user with an incomplete list of files.
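    Steps 3 and 4 above amount to a small decision table, sketched here. The `Policy` enum, the `Client` class, and the `partial` attribute on the invented `FastQueryError` are all assumptions for illustration; a plain list stands in for the system log.

```python
from enum import Enum


class Policy(Enum):
    """Group Policy state of the 'always use slow mode' setting."""
    NOT_CONFIGURED = "not configured"
    ON = "on"
    OFF = "off"


class FastQueryError(Exception):
    """Hypothetical error carrying whatever the fast query managed to return."""

    def __init__(self, partial):
        super().__init__("fast query failed")
        self.partial = list(partial)


class Client:
    def __init__(self, policy, fast_query, slow_query):
        self.policy = policy
        self.always_slow = policy is Policy.ON  # the setting is off by default
        self.fast_query = fast_query
        self.slow_query = slow_query
        self.log = []  # stand-in for the system log

    def list_directory(self, path):
        if self.always_slow:
            return self.slow_query(path)  # step 3: slow mode, no error
        try:
            return self.fast_query(path)
        except FastQueryError as exc:
            # Step 4: no message to the user, only a log entry.
            self.log.append(f"fast query failed for {path}; see KB article")
            if self.policy is Policy.NOT_CONFIGURED:
                self.always_slow = True  # step 4a: turn the setting on and repeat
                self.log.append("'always use slow mode' turned on automatically")
                return self.slow_query(path)
            return exc.partial  # step 4b: policy says stay off, list is incomplete
```

    Only an explicit "off" from the administrator can leave a user with a truncated listing; the unconfigured default recovers by itself after one failed query.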

    Advantages:

    * The user interface doesn’t get cluttered with extra options.

    * A home user can use Group Policy to control their local machine.

    * A corporate admin can avoid schlepping round to 100 machines making this change; they can turn on slow mode when they know about the problem (which avoids other PCs on the network having to repeat the test), then turn it off when they know that the NAS box has been patched.

    * If people aren’t affected by this bug, then they get an immediate speed improvement, without having to go out and buy one of those magazines with dubious "tips and tricks to boost your PC!"

    * Anyone who knows what they’re doing will notice the problem in Event Viewer. Anyone who doesn’t know what they’re doing will (hopefully!) not have been screwing around with this setting in the first place, so they’ll get the default behaviour of silently succeeding after a small delay.

    * This doesn’t delay a roll-out of Vista for anyone who is affected by this bug.

    Disadvantages:

    * It is possible for someone to sabotage their own (or someone else’s) machine, if they have admin privileges. This could be done on their behalf by malware. In this situation, they will probably blame Microsoft.

    * Vista has to be sullied by code to work around someone else’s bodge job, and this may lead to accusations of "bloatware".

    Personally, I would like to go for the option of "do nothing". However, as a customer, I wouldn’t do a rollout of Vista if I knew that it was incompatible with one of my servers. So, I’d push for a bug-fix from the server vendor/manufacturer/whoever, but in the meantime I would delay taking up Vista.

  136. Mark Sowul says:

    Wow, what an ugly bug.  You can apparently throw out all the suggestions about retrying the operation, because they would fail too:

    The next NtQueryDirectoryFile [fast] call after getting the first 128 directory entries returns STATUS_INVALID_LEVEL.

    If we switch to using the FileBothDirectoryInformation [slow] after getting this error, that call returns STATUS_NO_MORE_FILES.  This does not happen when using the FileBothDirectoryInformation [slow] level right from the start.
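    To make the consequence concrete, here is a toy model of the behaviour quoted above. It is not the real NT API: `BuggyHandle` and its `query` method are invented for illustration, and only the two failure status names and the figure of 128 entries come from the quoted report. It shows why a fallback has to start a fresh enumeration rather than switch levels halfway through.

```python
class BuggyHandle:
    """Toy model of one open directory enumeration on the buggy server."""

    FAST_LIMIT = 128  # entries returned before the fast level breaks

    def __init__(self, names):
        self._names = list(names)
        self._pos = 0
        self._poisoned = False

    def query(self, level, count):
        if self._poisoned:
            # After the failure, even the slow level claims the listing is done.
            return "STATUS_NO_MORE_FILES", []
        if level == "fast" and self._pos >= self.FAST_LIMIT:
            self._poisoned = True
            return "STATUS_INVALID_LEVEL", []
        if self._pos >= len(self._names):
            return "STATUS_NO_MORE_FILES", []
        end = self._pos + count
        if level == "fast":
            end = min(end, self.FAST_LIMIT)
        batch = self._names[self._pos:end]
        self._pos += len(batch)
        return "STATUS_SUCCESS", batch


def read_all(handle, level, batch=64):
    """Drain a handle at one level; return the final status and the entries."""
    entries = []
    while True:
        status, chunk = handle.query(level, batch)
        if status != "STATUS_SUCCESS":
            return status, entries
        entries.extend(chunk)


def list_with_restart(open_handle):
    """Fall back by opening a brand-new enumeration in slow mode."""
    status, entries = read_all(open_handle(), "fast")
    if status == "STATUS_INVALID_LEVEL":
        status, entries = read_all(open_handle(), "slow")
    return entries
```

    In this model, switching to the slow level on the same handle yields nothing further, while `list_with_restart` recovers the full listing at the cost of reading the first 128 entries twice.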

  137. BillK says:

    This may have been said, but it seems to me that any solution that solves

    "…Users will interpret it as a bug in Windows Vista."

    will involve extra code or workarounds that will be a liability and eventually become a bug in another version or two of Windows.

    (then again I love a day where I manage to end up with a negative net line count)

  138. Wesha says:

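    /* oversimplified for clarity; the Read* helpers and WEIRD_ERROR_CODE are placeholders */
    struct dir_item;
    int ReadUsingFastMode(struct dir_item **items, int count);
    int ReadUsingSlowMode(struct dir_item **items, int count);
    enum { MODE_FAST, MODE_SLOW };
    static int currentPos;
    static int mode = MODE_FAST;
    static struct dir_item *buffer[101];
    static struct dir_item *nextItem;

    /* Prefetch 101 entries in fast mode; a buggy server should fail during this read. */
    void MoveFirst(void) {
      int res = ReadUsingFastMode(buffer, 101);
      if (res == WEIRD_ERROR_CODE) mode = MODE_SLOW;
      currentPos = 0;
    }

    struct dir_item *MoveNext(void) {
      struct dir_item *res;
      if (mode == MODE_FAST) {
        /* Serve the first 101 entries from the prefetched buffer. */
        if (currentPos <= 100) res = buffer[currentPos];
        else {
          ReadUsingFastMode(&nextItem, 1);
          res = nextItem;
        }
      }
      else {
        ReadUsingSlowMode(&nextItem, 1);
        res = nextItem;
      }
      currentPos++;
      return res;
    }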

     

  139. Anonymous Coward says:

    How about simply crash? Either explorer, or the calling app, or some dummy process that doesn’t do anything other than die for the purposes of sending in a crash report. Then the user gets to send in an error report, and you can point them to a response page telling them to upgrade the offending driver. You can also set a regkey to use slow mode for the rest of the session, in case the user tries again. Clear the regkey once every few weeks so that people get reminded to upgrade their stuff eventually.

    Benefits:

    – users that don’t get to deal with buggy servers always get the fast behavior

    – users of buggy servers know what the problem is and maybe will fix it (and aren’t totally hosed in the meantime)

    – kernel stays (relatively) clean

    – no one sees incomplete results

    Drawbacks:

    – Vista crashes sometimes :(

    – Users that hit a buggy server sometimes will live in slow mode for a few weeks (can it be that bad if that’s all that exists in XP though?)

    – Vista crashes a lot if there’s a bug in the code that identifies buggy servers

  140. makoto says:

    Whatever you do, please don’t "do nothing."

    It might be old samba, which my cheap NAS hdd box might use.

    Even if so, I don’t think the vendor cares about that.

    I think, if it can be used by XP, it should be used by Vista.

  141. John C. says:

    My creative (ahem) solution: Shame the vendor into solving the problem.

    You’re absolutely right that users will interpret this as a Windows Vista bug. Not solving it leaves a bad impression, even if it’s not your fault. But as you describe it, this fundamentally seems like a social problem rather than a technology problem. The desirable outcome is to have the vendor fix the problem rather than to implement some nasty technical workaround.

    It’s as if the tires on a car were substandard; the car manufacturer could come up with all kinds of workarounds (limit speed to 65 mph if the temperature is over 70F, except when you turn on the hazard blinkers to indicate an emergency), but the real solution is to get the supplier to fix the underlying problem. Of course, the relationship is rather different here: this isn’t a supplier over whom you wield direct economic power. So what other kinds of power could you bring to bear? Legal is unlikely, but social might be possible. My social engineering skills are somewhat weak so I’m certainly the wrong guy to suggest tactics, but I bet you could find people who could figure out how to apply the right kind of pressure to make this problem go away without writing code on your end.

    Thinking in purely mercenary terms, perhaps you could even pay the vendor to fix the problem on their end? Yes, you’d have to rely on the installed base actually installing patches and whatnot, which might be totally unrealistic; still, I can’t help but think that there might be ways to resolve this that are not fundamentally technical hacks on Microsoft’s part.

  142. Joe Butler says:

    I don’t think it’s viable to offer a ‘classic’ directory listing and a new improved ‘rapid’ directory listing for apps to choose.  All that will happen if this option is given is a lot of application developers will simply choose the ‘fast’ method without realising the implications.  This, again, puts the end user at the mercy of sloppy developers.  Haven’t we learnt anything from Raymond’s posts – give the developer the ability to do the wrong thing, and they will jump right in – just look at how many people here are talking about ‘buggy drivers’.

    < :-) > An alternative fix would be to release Vista SP2.  This would really be XP SP2 with a Vista theme.  99% of Vista users would not notice that they had been regressed. </ :-) >

  143. Jeff G says:

    Upon first interaction with a server, run a query to test if they have the fixed code.  If no such test exists, write one into the next version of the server software and consider all current versions as failing.  Cache the return value in a short list* (both positive and negative results) and use that to guide the call to the directory query functions.  Expire the cache between connection resets or at known times the server software could be changed (I assume connections have to be reset to upgrade/downgrade the file server software).  If your cache is full, do not add an entry.  If the cache value is missing during a directory operation, use slow mode.  

    Advantages:

    No user interaction

    Everything works always (assuming the connection stuff i mentioned)

    Fast mode preserved in common** use case

    Disadvantages:

    Requires potential change to server software***

    Slow mode dominant if you regularly interact w/ large (> cache size) numbers of servers.

    Overhead on establishing a connection.

    *Expose the cache size param via tools/articles where interaction with a large # of untrusted file servers is expected, maybe the ‘Optimizing Bittorrent for Windows Vista Development Journal’.

    **Common given my understanding of how people use file servers, if this isn’t the case this solution is suspect.

    ***If necessary, this might be unreasonable to ask of a group you may/may not have political clout with and would kill this solution.

    Oh, and kudos to your test/shell/networking teams for finding and identifying this issue.  Perhaps step 0 of this solution is giving whoever decided ‘we really need to test against X version of the file system software’ a promotion.

  144. Miral says:

    My pick: do "Auto-detect the buggy driver and work around it next time", but requery *immediately*, not next time.

    Remember the server is "bad", but periodically try fast mode on it again anyway, just in case they’ve upgraded.

    While the requery is going on in the background, pop up a warning dialog saying that a buggy driver was detected and that they should upgrade their server, pointing them at the relevant KB article.

  145. Mark Sowul says:

    I admit I usually side with MS on most things, and I also don’t know whether this is a result of MS not fully opening up its SMB information, but this seems like a case of reaping what you sow.  If MS had opened its network protocols in the first place, Samba wouldn’t have had to reverse-engineer them.  I feel like a Slashdot troll.  It’s a very unpleasant feeling.

  146. malachi says:

    Here are the ideas and assumptions that shape my point of view.

    1. Breaking something by default in a new OS that worked in the prior version is unacceptable.

    This means "default to fast and do nothing" is out.  This option is seductive. It seems like "doing the right thing", but the theory has to be carefully balanced against reality.  Doing nothing might be fine, but only if you can get away with it.  I don’t think Vista can get away with it.

    2. Showing the users a technical error message in response to what would be perceived as an arbitrary condition and when things work fine 99.9% of the time is not acceptable.  (If it would confuse my father and cause him to call me, it is not user friendly enough.)  It’s even worse if they have to make a decision.  Always remember: USERS CAN’T READ.

    This means that any message box that the end user sees is out.

    3. Assuming that the faulty number of files will be constant is a bad idea.  Sure, today it might be 100, but there is no way to know if the code might be modified at some point in the future to break in the same way at 50 files or 500 files etc.  

    This means the "cache the first 100…" and the "default to slow until you come across a directory with more than 100 files…" plans are both out.

    4. Black or white listing server versions is a bad idea.  While you might be able to build a comprehensive list of all faulty implementations in the wild today, it is possible that tomorrow a new faulty implementation will be released and it can not be assumed that the user will ever update the OS to receive any additions or subtractions to the blacklist/whitelist.  It is fairly irksome when I stumble upon a web page that doesn’t recognize my browser version and tells me to "upgrade" to a prior version because it only knows about browser versions x and y and I have upgraded to z.

    So, the static black/white lists are out.

    5. It can not be assumed that the user can do anything to fix the faulty server version.  Expecting the user to fix the server is bad in so many ways, it is hard to know where to start.  First of all, many people use computers in environments where they have no control over the server(s) they use.  For that matter, some sys admins do not have any control over the server version for various reasons ranging from auditing paranoia to black box systems with recalcitrant vendors or out of business vendors.  Not to mention cases where the server is under the control of another party that the user cannot force change upon, even if that change is possible.

    This means suggesting upgrading of the server as the only possible workaround or solution is out.

    6. It cannot be assumed that the broken versions will eventually die out or be upgraded in a useful time frame.  If this is OSS code then it is possible that a company has a branch of the broken implementation that they might not fix, for whatever reason, but which they may incorporate into a future product.  

    This means we may be stuck with this hack for a long, long time.  Not to mention the fact that if the OS silently works around the problem, not only is there no incentive to fix it, but other products may someday intentionally implement the bad behavior for whatever silly reason.  If you don’t think this might be true, go read all of the old entries on this site.

    7. If performance is really a concern, blacklisting 16 bad servers is not a great plan for a couple of reasons.  If the user never accesses more than 16 bad servers, and the servers are fixed later then they will be stuck in slow mode forever.  Also, if the problem only exists for lists that are over 100 (well, n, because I don’t believe that 100 is a hard limit) files long, then having one instance on a server of 101 files could forever relegate that server to slow mode even if there is never another instance of a directory with more than 100 files on that server.  Even if there is some automatic method to clear out this cache, that will just have a negative performance impact on the people that frequently have to access large directories on a server that is not upgraded.

    This takes out the option of keeping a black/white list of servers that are known to be good/bad.

    8. Logging the error to the event log is acceptable, but putting a url or KB number in the text is not desirable.  Describe the error and the problem as well as possible, but don’t assume that the KB number will be useful several years from now.  From my point of view, this would be hard coding something that should not be hard coded.  There is a possible future where the same exception is raised and logged for a different reason, making the hard coded KB number very misleading.  It is also no fun to end up viewing KB404.

    This doesn’t rule out messages in the event log, but it rules out being overly specific about the assumed cause.

    9. It is not acceptable to produce a file list that is not reflective of the directory at one single point in time.  

    This rules out the "cache the first n records and requery merging the new results…" plans.  I must admit that this is a somewhat attractive option in that it would make the issue invisible, but the problems outweigh the benefits.

    So, after all of that (and I’m sure I forgot something along the way), I would have to go with slow mode by default with a configuration option to enable fast mode.  Depending on how it is implemented, it may be accompanied by a warning that results may not always be reliable.  (This is acceptable, by the way, because it is presented in a context in which it is more likely to make sense to the user and in response to a direct user action that does not have the historical expectation of "just working".)  It would also be good to log the error to the event log.

    This would allow sys admins in controlled environments to enable fast mode as well as "power users" (even if it isn’t a good idea, these types of users are more tolerant of broken behavior because they tend to cause it) who want the "speed boost".  This keeps normal users from being exposed to strange and confusing behavior in almost all cases, and it limits the number of people who will assume that Vista is at fault.

    This approach reminds me of DMA settings for IDE devices in Windows 98.  Since you’ve bothered to read this far, I’ll punish you with a reminder of an irritating bug in Windows 98.  Every time I checked the DMA box for a hard drive for the first time, I had to go back into the properties and set it again for the setting to "stick".  I’m not really sure how I ever figured that out.

    Advantages

    – No failures by default ("it just works")

    – Very few failures total

    – Allows for fast mode, but it is a conscious decision that someone has to make

    – Doesn’t require a gross hack

    Disadvantages

    – Slow by default

    – Malware could, in theory, tweak the setting and cause problems, intentionally or unintentionally (but I see this as a problem with almost all approaches, and there are bigger malware fish to fry anyway)

    – Fast mode is less useful because it will be rarely used

    – I think the anticompetitive disadvantage is a straw man, as long as everything is treated equally.  I think it only really becomes a problem when only some specific servers start working slowly for some ineffable reason.

    I apologize for any spelling or grammar mistakes.

  147. John McCormick says:

    In no case is a silently incomplete file list acceptable. The shell must either fail with an error or return the entire list in some way.

    Is there no place in IEnumIDList::Next()’s contract where an error can be indicated?

    If so, applications that ignore the possibility of the shell returning an error are already doing so at their own risk, and you can obviously update Explorer to respond to it properly.

    If not, your contract *requires* a full list, so you must guarantee the existence of such a list before you allow someone to start iterating through results, which may defeat the purpose of a fast query…. (Also, in that case, why does it use the same interface?)

    Like everyone else here, I’d love to know more about the problem before making wild guesses about what solution is best.

  148. peterchen says:

    You’ve got us cornered here!

    Whatever you do, you must make it trackable to the respective vendor + update. Event log + KB article + link in error message (if any)

    Fix automatically only if you can fix it perfectly (by detecting the driver version, or executing a requery – which I still don’t understand why this isn’t possible)

    Try to find a way to tell "good" from "bad" servers.

    If nothing works, pass through the error in the API. Maybe use a more friendly way than a message box to handle this in explorer (like the "popup blocked" in IE?) but LINK TO THE KB ARTICLE!

    MHO for the other options:

    "Do Nothing" camp:

    In two years there will be equally many shouting "Windows sucks because it still doesn’t use fast mode which has been fixed ages ago"

    The problem is not the vendor fixing it, but distribution. An Error message does promote distribution, but moreover it is a "It worked in XP, it broke in Vista". Customers don’t care who is guilty. They care about features, and the software industry getting their act together.

    "Explorer Option"

    Suitable only if this is a major product. I guess if MS always took this way, we’d have 65032 explorer options by now.

    One possibility would be a single checkbox "Compatibility Mode", aggregating multiple fixes, with help linking to a KB article that describes which things it fixes in detail

    "Certification"

    Please no. Development for Windows is getting harder and harder for small shops. Required certification may be the death for many applications written with passion, and move us toward 9-5 drone software. Voluntary certification (as it is NOW with drivers) will not stop people installing crap and blaming MS – but maybe that’s what it takes.

  149. peterchen says:

    Coax the vendor into a fix that allows you to detect which version is running.

    If you can’t detect, run in slow mode.

  150. Cheong says:

    Having read the thread, here’s my vote:

    I’d rather have Windows detect the buggy driver, then popup bubble to alert users about this… much like what you would get when you plug a USB 2.0 device to a USB 1.1 port.

    I see no problem for this. Just make sure the warning message contains specific string that’ll make it easily searchable in search engines.(With KB article number, maybe)

  151. Jonathan Wilson says:

    One idea is that when you detect the error, re-enumerate. Then, return the item they asked for (i.e. keep a record of what item you would have returned if the error hadn’t been detected and return that from IEnumIDList::Next after you re-enumerate)

    Only slowdown then is once when the error happens and the directory has to be enumerated again with slow mode.
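    A minimal sketch of this "re-enumerate and pick up where you left off" idea, written as a Python generator rather than against IEnumIDList. The exception type and the function name are hypothetical, and the sketch assumes the slow enumeration returns entries in the same order as the fast one and that the directory does not change in between, neither of which is guaranteed.

```python
class FastQueryFailed(Exception):
    """Stand-in for the error a buggy server raises partway through a fast query."""


def enumerate_with_fallback(fast_items, slow_items_factory):
    """Yield directory entries, switching to a fresh slow enumeration on failure.

    fast_items: iterator over the fast query's results.
    slow_items_factory: callable that starts a brand-new slow enumeration.
    """
    returned = 0  # how many entries the caller has already been handed
    try:
        for item in fast_items:
            returned += 1
            yield item
    except FastQueryFailed:
        # Start over in slow mode and skip what was already returned.
        for index, item in enumerate(slow_items_factory()):
            if index >= returned:
                yield item
```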

  152. Radeldudel says:

    I’d say disable fast mode for the initial release of Vista and have an option to enable it. Put the story why it is disabled in the help for the option.

    Enable it later on (SP1?) so it will not be blamed on Vista itself but on some service pack.

  153. ::Wendy:: says:

    apologies if I’ve misunderstood the problem and for not reading the 153 responses.

    The problem is temporary based on a 3rd-party bug.  Prioritise designing for the ideal – no 3rd party bug.  Then prioritise letting the user know that they are not getting standard results in a manner that acts as a pressure for the 3rd party to improve its business by removing the bug/problem and moving towards the ideal.

  154. Yuhong Bao says:

    I recommend that you:

    * Include this issue in the Release Notes or a Read Me file and a KB article

    * Add a check box somewhere that disables fast mode.

    Advantages:

    * People can read the Release Notes or a read me file or search in the KB for it and find the issue.

    * People can disable fast mode, if necessary.

    Disadvantages:

    * Not everyone reads the Release Notes or is aware of the KB.

  155. Steve says:

    A lot of noisy people here. I’ve seen probably a total of 5 largely different solutions, and those are the original ones Raymond put up.

    Lay off the vendor. Raymond said in his second paragraph that the vendor fixed the problem and the distributors haven’t picked it up yet. If you’re not going to read the comments before replying, at least read the original post!

  156. Paul de Vrieze says:

    There are a number of things that must be remembered:

    – For users it must just work

    – The problem should be limited as much as possible:

      – Make a difference between new stuff and old stuff.

      – If you automagically work around issues such that they don’t have a backlash, no-one will fix them.

    – While things should just work, they can be slow. Users can also be given unobtrusive warnings (in the statusbar, not a dialogbox)

    If it would be possible to detect the broken servers that would be preferred. If not, I’m in favour of (if possible) having two APIs. The compatible one for existing code that is slow, but works always. The new API should return an error if the fast access fails, that suggests the application to retry. (I suppose that dll’s support versioned symbols).

  157. 8 says:

    Wow, lots of comments on this one. I just browsed through most of them.

    Anonymous: "Tremendous hack: Do the fast query, remember the files returned (up to 100, or the maximum number of files that a bugged fast query can return). If it fails, do the slow one, and don’t pass the files already reported."

    That could cause non-obvious strange behaviour years from now, I can see the tont blog entry coming in the far future ("When drivers don’t support something they claim to")

    AC: "Isn’t it possible to detect the version of the server, and afterwards to produce slow queries to servers which are known to make errors?

    Since it’s something like "get first/get next" you are supposed to exchange more messages with the server so you can start the session by first asking the version.

    Another way around, if you can’t detect "old bad" can you detect "new fast" servers? Then only those should be asked fast."

    Almost exactly what I thought. Try to find a signature to detect the broken server.

    Advantage: It’ll "just work"

    Disadvantages: Ugly BC hack in the code, slightly slower query time

    Brad Corbin: number 4 entered my mind in the first part of your solution too, and indeed having a new API is a tremendous bummer, especially now that it’s been partially delayed again.

    Joe Butler: interesting solution, let’s hope a security bug doesn’t crawl in

    To those that say the vendor should be contacted or the driver should be fixed and BryanK, read the friendly article, it was already fixed but there’s still need for BC. (How on earth would you track down who uses or is going to use the buggy version?)

    BTW adding any sort of UI, be it an error or configuration setting is costly (requiring a help text, a translation, a translation of the help, possibly a KB article to explain it, etc.), especially now that Vista is in a late stage of development. And this one is especially funny:

    Martin: "The dialog should say something like this: The server you are accessing only returns the first 100 files when using fast queries. Do you want to turn fast queries off? Y/N."

    Some users aren’t gonna read or understand it and just press No, Raymond has pointed that out already: http://blogs.msdn.com/oldnewthing/archive/2004/04/26/120193.aspx

    And I agree with JamesW, although bugs can happen while there’s proper documentation provided too.

  158. Robert B. Anonymous says:

    Do "almost nothing."

    Create a new field in installs that allow the installer to specify who the mystical entities (such as "Network Administrator") are, complete with E-mail or some other contact information.  Have this also be built into the framework for Active Directory/whatever you want to call the Vista Server implementation, so enterprises can customize this (and update it) site side, in a single place.  This way, the end user will know who to contact in a corporation.  At home, they’ll know who to contact as well.  If they don’t, I’m going to make the rash assumption that they probably don’t need this server sitting in home.

    If you do much else, what are you harboring?  If I knew that I could make an *almost working* game/driver/program, and publish it, having Microsoft more or less "complete" the code for me, think of all the man hours I’d save.

  159. AC says:

    Everybody, take a look at the posts from Mark Sowul.

    He specifies which calls are in question:

    NtQueryDirectoryFile returns STATUS_INVALID_LEVEL after first 128 entries.

    After STATUS_INVALID_LEVEL is received, calling FileBothDirectoryInformation returns STATUS_NO_MORE_FILES.

    Now both are NTDLL calls, and both are (well of course, why do you ask :) ) undocumented.

    So it looks obvious that such bug should not be "fixed" just in Explorer.

    Moreover Explorer should not even use these calls, but use "FindFirstFile" that’s the only thing that anybody else can use. Anything else would be cheating. And unfair advantage compared to any other third party application which can only use the Win32 API.

  160. Chris Becke says:

    Skip the error messages or dialogs.

    Add a configuration option to the Network Client to "Allow Fast Mode". Default it to Off, but put it there. Obviously XP’s slow mode wasn’t all that slow, so defaulting people to slow mode isn’t a problem.

    In the mean time, work with the vendor to upgrade their server. At some point in the future, simply change the default for the network client from Off to On for this option – once it’s deemed that the risk of users hitting old servers is minimized.

    Cons:

    Everyone is punished and forced to use slow mode – unless they happen to be aware of fast mode and how to enable it.

    Why I don’t care about the con:

    XP’s exclusive use of slow mode means that slow mode isn’t THAT slow.

    Pro:

    The ability (and intention) to enable fast mode means the vendors can get fast mode compliant drivers tested and out there.

  161. Asd says:

    I vote for retrying and continuing using the slow request when you catch this error. There have been some comments saying this is complex and nasty but I would think it is a lot simpler and cleaner to implement than a specific error message.

    And I think version checking is almost always the wrong way to do things. Check for functionality not a specific version.

  162. Laura T. says:

    I think almost everything has been said on this topic.

    Anyone remember the infamous OPLocks? They were enabled by default, and what did we get? The Fast Mode (whatever it is) seems something similar.

    They both have/had compatibility problems. The root causes are different (in most cases) but the point is, the new, a lot faster system was enabled by default. It was not a success.

    This and other reasonings bring me to say "Disable "fast mode" by default", at least on the business editions. Home editions might not.

    But give me an easy way to switch it on or off with the correct advice ("This option might…"). The system might even advise it automatically after a while (like performance center does) if no problems are encountered. But that’s another problem.

    And lastly, because this seems to be only the FIRST appearance of a problem with this new type of query, it might, or might not, be more pervasive than it seems now. User data preservation is the first priority of an operating system. If I cannot find files, it’s not so secure, whoever is to blame.

    Laura

  163. dave says:

    As a developer in this particular problem space, the solution I prefer for the good of mankind is "do nothing and let the buggy file server get fixed".  There are enough hacks and workarounds associated with SMB already.

    For bonus points, have Someone Important get on the phone right now to the CEO of Electronic Notworking Moving Control Appliances, or whoever they are, right now, and point out how embarrassed they’re going to be when Vista comes out.

    Care to describe what ‘fast mode’ and ‘slow mode’ really are in SMB terms? I’m curious.

  164. Christoph Richter says:

    would use Centaur’s solution.

    i know the reason, why microsoft is doing so much "compatibility" code, but for business, where the company is still alive and has a workaround, it should not be "hidden silently"

  165. 8 says:

    BTW, I do think it’s really cool that Microsoft is actually catering for a bug in Samba. I’d expect MS to do everything possible to undermine Samba, and Samba to do everything possible to stay compatible.

    This is a major leap forward in that regard, thanks mr Chen! This is good for MS.

  166. Igor Shmukler says:

    Do a slow mode by default, but try to see whether the fast mode can be properly supported.

    Keep bounded LRU splay tree with servers that support fast mode

    Do a lookup and if your server is not in the tree, do a slow mode.

    This should yield same practical results, as fast mode by default.
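    The bounded lookup can be sketched like this. Note the swap: the sketch uses Python's collections.OrderedDict as the LRU structure instead of a splay tree, and the class name, method names, and capacity of 64 are hypothetical.

```python
from collections import OrderedDict


class FastServerLRU:
    """Bounded set of servers known to handle fast mode, least recently used first."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self._servers = OrderedDict()

    def mark_fast(self, server):
        """Record that `server` completed a fast query correctly."""
        self._servers[server] = True
        self._servers.move_to_end(server)
        while len(self._servers) > self.capacity:
            self._servers.popitem(last=False)  # evict the least recently used

    def mode_for(self, server):
        """Fast mode only for servers in the set; everything else gets slow mode."""
        if server in self._servers:
            self._servers.move_to_end(server)  # refresh recency on lookup
            return "fast"
        return "slow"
```

    How a server earns its way into the set (a probe, or a verified fast listing) is not specified in the comment and is left open here.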

  167. David Walker says:

    For the comments that say the following:  "You can apparently throw out all the suggestions about retrying the operation, because they would fail too:

    The next NtQueryDirectoryFile [fast] call after getting the first 128 directory entries returns STATUS_INVALID_LEVEL.

    If we switch to using the FileBothDirectoryInformation [slow] after getting this error, that call returns STATUS_NO_MORE_FILES.  This does not happen when using the FileBothDirectoryInformation [slow] level right from the start. "

    Right, if you SWITCH to using the slow version after getting the STATUS_INVALID_LEVEL error.  

    How about closing the connection, and then reopening it in slow mode? (Starting completely over after seeing the error.)  

    If "closing the connection" isn’t the right term, then start over and do whatever you would have done if fast mode didn’t exist.

    Why hasn’t anyone mentioned that possibility?

    David W

  168. Some of you are just plain killing me.  It appears you don’t understand the economics of writing software.

    James

  169. macbirdie says:

    My take is: Do Nothing.

    All NAS devices are upgradable and all un*x server admins can upgrade their software as well. Can’t see why there should be another backcompat/bugcompat hack in Windows.

  170. Centaur says:

    So…

    * There is a buggy server that claims to support fast mode but does not.

    * The bug is fixed in the server software.

    * Distributors are distributing the old version.

    * Windows Vista is going to use fast mode.

    * Some time will pass until Windows Vista ships.

    Therefore,

    * Contact distributors, urging them to update to the new version.

    * Give them a sensible time frame.

    * After that period passes, issue a public advisory (or cause a public advisory to be issued), probably to a security-related mailing list such as Bugtraq (since the symptoms look like the condition of denial of service to the user who wants the 101st file). State that the bug is fixed in version X.YY.

    * When the error condition is detected, log an event with a descriptive details text that administrators and advanced users can google for.

  171. Dewi Morgan says:

    [Summary: I suggest a mix of "autodetect with dialog", plus "config setting for SMB client", plus "knowledgebase article", plus "syslog" plus…]

    Allowing developers to turn on (or off) fast mode in their code is important, but won’t fix the problem, so isn’t a relevant response.

    A "go slow" setting should be provided for the SMB client. Won’t fix the problem when it happens, but allows a fix for the general class of "problems from fast mode", rather than the specific class of "problems from samba returning a bizarre error code".

    This does not address the user experience, though.

    Requiring apps to deal with this will still make people say "Vista bug!", which it is. All my apps break except those upgraded by Vista? Vista bug.

    Anyway, you can’t code for every incompatibility caused by fast mode, whether you put your workaround in explorer or the SMB client.

    You also can’t rely on the server being fixed unless you tell people it needs fixing.

    So the answer is probably found in the answer to the question "how does SMB deal with the network being dropped before a findnextfile"? Users are used to networking issues. They don’t (generally) blame networking issues on MS. Do what you normally do then, but with a different message and (imperative) a pointer to the KB article. Log it in the syslog (also imperative).

    The KB article should list how to turn fast mode off systemwide, how to upgrade the server so they don’t have to, and importantly, how to remove the files you can see in order to see the others.

    And as people pointed out earlier, open up and document the SMB standard now that there are clear financial and development costs to having it a closed standard.

    Advantages:

    * Operating system remains "pure", unsullied by specific compatibility hacks.

    * SMB client now has the ability to deal with all unexpected error codes and direct people to a relevant KB article, which gives a generic answer, and directs them to a specific subpage for each known unexpected error (this samba bug being the only one, for now).

    * Customers with this problem will know that they have it.

    * Customers are given a specific place to look for the solution.

    * Customers can turn off fast mode clientside, so support contracts are unaffected and the problem can be resolved even without firmware upgrades or talking to an admin.

    * The customer is told how to reduce the number of files in the folder and recover all data even without resolving the problem on either end, even if they’ve no access to the SMB client settings on their own machine.

    * A clear enough message causes users not to interpret it as a bug in Windows Vista.

    * Administrators can choose not to upgrade, and instead tell all users to configure their clients for slow mode, or just to not store more than N files per dir.

    * None of the disadvantages of the autodetect/blacklist methods.

    Disadvantages:

    * Programming and testing required, and it’s very late in the day for Vista.

    * It doesn’t silently "just work".

    * Makes all servers use slow queries if there’s a single buggy server (unless the client can be given something like a "slowservers.ini", which accepts wildcards – a manual whitelist).

    * Users have to read an error dialog, understand it, and follow it.
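    The "slowservers.ini" idea with wildcards could look something like the sketch below. The file format (one pattern per line, "#" comments) and both function names are invented for illustration; the only real API used is fnmatch.fnmatchcase from the Python standard library.

```python
from fnmatch import fnmatchcase


def load_patterns(text):
    """Parse a hypothetical slowservers.ini: one wildcard pattern per line."""
    patterns = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            patterns.append(line.lower())
    return patterns


def use_slow_mode(server, patterns):
    """True if the server name matches any listed pattern (case-insensitive)."""
    name = server.lower()
    return any(fnmatchcase(name, pattern) for pattern in patterns)
```

    With a list like this, only the named servers fall back to slow queries and every other server keeps fast mode.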

  172. Stefan Kanthak says:

    Do the same as described in MSKB article 896427!

    Stefan

  173. Jules says:

    A number of people have been saying something like the following:

    There is no real option.  Explorer must either show a list of all the files all of the time, or an error message.  Everything else is just details.

    These people are right.  Showing an incomplete list without explaining to the user that something is wrong is broken behaviour, and must be fixed.  Further, the order of preference is clearly "show a complete list" followed by "show an error message", because a working tool is better than one that tells you it isn’t working.

    Version checking doesn’t work.  Comments have made it clear that the software at fault is samba; samba has a feature that allows the user to customise the system name and version number returned, hence there are probably thousands of different cases here, many of them indistinguishable from working ones (e.g. I’ve seen samba servers installed that claim to be W2K servers).  At the very least, every single NAS appliance manufacturer who uses it will have their own custom system name and version number scheme.

    So, for me there are only two possible solutions:

    1.  Make it work.  The best way of doing this depends on the precise nature of the problem, but it seems clear from Raymond’s comment above that the current API is not adequate for the purposes.  This doesn’t make it impossible to make the file listing work by switching to slow mode on receiving the error: it just makes it harder.

    Deprecate the current API and make the current version behave identically to XP (i.e. it should use a slow query if that is what XP does); introduce a new version with a new contract that allows fast queries but requires the user to check and reissue a slow query if the fast one fails.  Update Explorer to use this new API.

    This is by far the best solution possible.  Advantages: Explorer works and is fast in most cases, other new apps work and are fast in most cases, old apps continue to work if they currently do.

    Disadvantages: Old apps don’t benefit from the speed increase.  C’est la vie.  Requires inclusion and maintenance of legacy code (a deprecated API).  Windows doesn’t have a strategy for removing deprecated APIs and seems unlikely to introduce one in the foreseeable future, so this is a cost that will be ongoing indefinitely.  It’s likely only a small cost: the old API can probably be implemented as a special case of the new API with only a few lines difference.  What’s the cost of (e.g.) 50 lines of code?  Not a lot, all things considered.

    2. The other option is display an error message.  This is Raymond’s second option, and by far the best he presents.  It can be combined with a config option that makes either Explorer or the system degrade to slow queries, and this option can be mentioned in the text of the error message.

    Other things that should be considered: should Explorer show some UI to indicate what has happened if option 1 is implemented?  I think yes; perhaps something in the status bar like "139 objects (using slow mode – click here for details)".  Another alternative is logging the failure in the event log.  My opinion is that at least one of these should be done, although I wouldn’t say it is incorrect to do neither.  Doing nothing about the whole problem, though, or fixing it only by adding an option to work around it but being silent about its existence, *is* incorrect.
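    Option 1 above (keep the old entry point on slow queries, add a new one whose contract makes the caller check for failure) might be sketched as follows. Every name here is hypothetical; this is not the shell's actual interface, just an illustration of the two contracts side by side.

```python
from enum import Enum


class ListingStatus(Enum):
    COMPLETE = "complete"
    FAST_QUERY_FAILED = "fast query failed"


class FastQueryFailed(Exception):
    """Stand-in for the error a buggy server returns partway through a fast query."""


def list_directory_v1(path, slow_query):
    """Deprecated entry point: same behaviour as XP, always the slow query."""
    return list(slow_query(path))


def list_directory_v2(path, fast_query):
    """New entry point: fast query only; the caller must check the status."""
    names = []
    try:
        for name in fast_query(path):
            names.append(name)
    except FastQueryFailed:
        return ListingStatus.FAST_QUERY_FAILED, names  # names is incomplete
    return ListingStatus.COMPLETE, names


def explorer_listing(path, fast_query, slow_query):
    """How an updated caller would honour the new contract."""
    status, names = list_directory_v2(path, fast_query)
    if status is ListingStatus.FAST_QUERY_FAILED:
        names = list_directory_v1(path, slow_query)  # reissue as a slow query
    return names
```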

  174. developer says:

    use try catch & see the error thrown

  175. This is a bug someone reported to us 19th feb 2006. I fixed it the same day (it was an error in my code, missing a couple of entries in a switch statement). The bug – here :

    https://bugzilla.samba.org/show_bug.cgi?id=3526

    was fixed in 3.0.21c. By the time Vista ships I expect most vendors to have moved to at least this version. If the Microsoft engineers came to the CIFS conference along with all the other CIFS engineers this problem would have been found and fixed in earlier versions of Samba. I urge Microsoft’s engineers to communicate directly with the Samba Team when they find problems like this. We have good relations with all our vendors and have the ability to push expedited bugfixes to people who are shipping Samba code.

    To quote Steve McQueen, This is simply a failure to communicate :-). Let’s hope we all do a better job in future.

    Jeremy Allison,

    Samba Team.

    PS. I forgot to mention. The bug was open in our bug db for a grand total of *three* minutes before I had the fix committed into the SVN tree. I don’t think we could have reacted faster in getting a fix done than that.

    As I said, if it’s causing a problem with Vista deployments let us know and we’ll poke our vendors with a stick to make sure the fix gets widely updated. It’s two extra lines in a switch statement so I don’t think it’s a problem for people to review it for correctness :-).

    Jeremy Allison,

    Samba Team.

  176. Steve Loughran says:

    As an aside, the Apache Axis SOAP Stack has some patches to deal with .NET 1.0’s SOAP stack, which doesn’t handle all forms of XML. They went in, although they’re slightly less efficient in terms of bandwidth, because the Apache team knew they’d get the support calls for the interop problem. Saying ".NET SOAP doesn’t handle all XML" may be true, but it doesn’t meet customer needs of having things talk to each other.

    Pragmatism beats ideology when it comes to network interoperability.

  177. Hayden says:

    It seems to me that yet more "state" is a bad idea. Blacklisting buggy SMB servers won’t really scale.

    The user needs to know that he’s not looking at the whole list, when the server fails on a "fast" list. All these extra error dialogs and such are just a cop-out – "help, it’s gone funny, not my problem any more".

    So:

    1) Only Explorer uses "fast" mode, as it has the wit to work round it. Other programs should be (silently?) made to use slow mode.

    2) When the "weird error" happens, the hourglass changes to "pointer-plus-hourglass" – in other words, you can select and work with any files shown, but you get the idea that things aren’t quite done yet.

    3) Meanwhile, Explorer re-queries using slow mode. When the number of returned entries exceeds the number displayed (or there are no more entries) the Explorer view refreshes. Which would un-select any current selection, of course, but then you had the idea that things weren’t quite done. This simple refresh idea removes any inconsistencies caused by the directory contents changing in between queries.

    So, what you see is: list, pause, list refreshes bigger.
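
    The list, pause, refresh flow above can be sketched roughly as follows. This is only an illustration: `FakeServer`, `WeirdError`, and the `show` callback are hypothetical stand-ins for the real network client and the Explorer view, neither of which is described in this thread.

```python
class WeirdError(Exception):
    """Hypothetical stand-in for the server's strange error code."""


class FakeServer:
    """Simulated buggy server: fast queries give up after `limit` entries."""

    def __init__(self, names, limit=100):
        self.names = list(names)
        self.limit = limit

    def fast_query(self):
        for i, name in enumerate(self.names):
            if i >= self.limit:
                raise WeirdError("fast query gave up")
            yield name

    def slow_query(self):
        yield from self.names


def list_directory(server, show):
    """Try the fast query; on the weird error, show the partial list,
    then replace it with the result of a complete slow query."""
    shown = []
    try:
        for name in server.fast_query():
            shown.append(name)
        show(shown, complete=True)
        return shown
    except WeirdError:
        # Step 2: partial list is usable, but flagged as not finished yet.
        show(shown, complete=False)
    # Step 3: re-query from scratch in slow mode and refresh the whole view.
    full = list(server.slow_query())
    show(full, complete=True)
    return full
```

    Because the slow result replaces the view wholesale, nothing has to be merged with the partial fast result, which is where the consistency argument in step 3 comes from.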

  178. silkio says:

    I’d honestly be surprised if there wasn’t a better way to diagnose this problem. That’s what I’d spend time looking into. Especially if the Samba team is happy to help, I’m sure some sort of solution can be realised so that you can do a version check and solve all the problems.

    Clearly you can’t "do nothing", and the worst, but most likely, result would be to force fast mode off always.

    *shrug*

    I do find it hard to believe you can’t work with the Samba team to figure out a way to do a version check.

  179. How about…

    1) Default ‘off’ fast mode in Vista, perhaps with a GUI setting to turn it on.

    2) Make fast mode have a big bold link "WARNING: This may cause problems with…" that leads to a KB article shaming the vendor(s) involved for their bad code.

    3) Work with third-party vendors to get fixed code out and give the broken code some time to exit the marketplace (read: be updated or have its parent device die).

    4) Enable fast mode by default when a higher percentage of the marketplace runs non-buggy code, perhaps in the next release.

    5) Issue an event log entry (with a cap of once a day or so) that indicates if a buggy server is found while fast mode is on.

    Advantages:

       * Vista "just works".

       * Vista is at the same speed/compat by default as XP.

       * If the speed differential is significant, some users will complain to their buggy NAS/software vendors to force them to upgrade if it doesn’t work with the new Vista feature.

       * Users who run into the problem will likely have an idea of the cause and be more able to fix it.

       * Not as much of a compat hack as re-issuing the query and legitimately adds to the flexibility of the product.

       * Allows users to adjust a setting if OTHER products have similar problems with the new feature.

    Disadvantages:

       * Vista doesn’t use its full speed potential.

       * No proactive way of handling broken servers — would probably require an administrative change to restore compatibility after fast mode was turned on.
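
    The capped event log entry from point 5 could look something like the sketch below. The class name, the message text, and the choice to apply the cap per server are all assumptions made for illustration; the comment only says "once a day or so".

```python
import time


class RateLimitedLog:
    """Records at most one entry per server per `min_interval` seconds."""

    def __init__(self, min_interval=24 * 60 * 60, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock          # injectable so the cap can be tested
        self.last_logged = {}       # server name -> time of last entry
        self.entries = []           # stand-in for the real event log

    def report(self, server):
        """Log a buggy-server event; return True if an entry was written."""
        now = self.clock()
        last = self.last_logged.get(server)
        if last is not None and now - last < self.min_interval:
            return False            # still inside the cap window, stay quiet
        self.last_logged[server] = now
        self.entries.append(f"Server {server} failed a fast directory query")
        return True
```

    Keying the cap on the server name means a second buggy server still gets its own entry on the same day, while a server that is browsed constantly does not flood the log.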

  180. DmitryKo says:

    I find it funny how people propose that Explorer should use undisclosed API functions… weren’t suspected OS-specific "ties" considered a gravely bad thing and a reason for major criticisms of Microsoft products?

    Look here http://blogs.msdn.com/larryosterman/archive/2004/08/12/213681.aspx for just one single example…

  181. Morrog says:

    It was noted that there is some sort of enumeration going on here, which prevents a total refresh of the data already sent to the querying function, so that supposedly ruled out the possibility of switching to fast mode. But how about this.

    If this strange error is encountered, at the 100th entry (or 101), why not switch to slow mode, query up and throw away the first 100 results, and then continue on returning results from the slow query?

    Rough code:

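    NextItem()
    {
        QueryServer();
        if (ThereWasThatWeirdError())
        {
            // Start over on a fresh connection, this time in slow mode.
            Disconnect();
            Reconnect();
            StartSlowQuery();

            // Throw away the entries the caller has already been given.
            for (i = 0; i < results_so_far; i++)
                QueryServer();

            // Fetch the first entry the caller has not seen yet.
            QueryServer();
        }
        return Result;
    }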
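
    For anyone who wants to try the skip-and-resume idea, here is a small runnable sketch of it. The query functions only simulate a server and are hypothetical, and the sketch assumes that both query modes return entries in the same order and that the directory does not change between the two queries, which a real server does not guarantee.

```python
class WeirdError(Exception):
    """Hypothetical stand-in for the server's strange error code."""


def fast_query(names, limit=100):
    """Simulated buggy fast query: fails once `limit` entries were returned."""
    for i, name in enumerate(names):
        if i >= limit:
            raise WeirdError("fast query gave up")
        yield name


def slow_query(names):
    """Simulated slow query: always returns everything."""
    yield from names


def enumerate_with_fallback(names, limit=100):
    """Yield entries from the fast query; on the weird error, restart in
    slow mode, discard what was already yielded, and carry on."""
    returned = 0
    try:
        for name in fast_query(names, limit):
            returned += 1
            yield name
        return
    except WeirdError:
        pass
    slow = slow_query(names)
    for _ in range(returned):
        next(slow, None)    # throw away entries the caller already has
    yield from slow
```

    The caller sees one uninterrupted enumeration, so nothing already handed out needs to be refreshed.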

  182. Sorry, I haven’t read all the comments.

    Firstly, did you discuss it with the Samba guys? If it was a misunderstanding of the protocol arising from lack of documentation, I think Windows should work properly with Samba.

    But if the Samba guys agree it’s a bug in their server, I’m strongly in favor of Do Nothing (or, if that’s not acceptable, Put up a dialog). Windows should not accumulate (even more) cruft to compensate for bugs in third-party apps.

  183. ms speciality says:

    Do what M$ is best at: Sue ’em!

  184. Jeff says:

    Disable fast mode by default.  Inform the vendor that this is the default setting in Windows and that fast mode can and should be enabled by the vendor’s software installer when a newer (and functional) version of the driver is installed.

  185. Sjoerd Verweij says:

    Slow mode is default.

    Network driver detects server version. (If there is no way to do this, add it to server code, and e-mail the Samba people how to do it as well).

    If server version is bad (this should be in Explorer of course, because here’s where we go into kludge territory), add it to the "Bad server" list (keep 256 or so) so you don’t have to repeat your inquiries, and warn the user (once): "Retrieving file listings is not as fast as it could be. Please upgrade the server you are connecting to or contact your administrator". Show the warning only in business versions; this is exactly the kind of message that could freak a home user out. Log in the event log (ALWAYS) and stay in slow mode.

    If server version is good, add to "Good server" list (keep 256 or so) so you don’t have to repeat your inquiries and kick to fast mode.

    Advantages:

    – Transparent to user

    – Never any missing files

    – No problem if a server

    Disadvantages:

    – Slight perf hit for the detection.

    – If there is no way to detect the server version, all will be slow until the newer, blessed versions of servers roll out.

    – Clients stay in slow mode even with a server upgrade (although you can work around that by making the Bad Server list entries time out every month or so).
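
    A rough sketch of the good/bad server bookkeeping described above, including the monthly timeout for bad entries. As a simplification it keeps one bounded cache instead of two separate lists, and every name in it (`ServerVerdictCache`, `choose_mode`, `detect_is_good`) is hypothetical; how the version check itself would work is not covered here.

```python
from collections import OrderedDict


class ServerVerdictCache:
    """Bounded cache of server verdicts; 'bad' verdicts expire after bad_ttl."""

    def __init__(self, capacity=256, bad_ttl=30 * 24 * 60 * 60):
        self.capacity = capacity
        self.bad_ttl = bad_ttl
        self._entries = OrderedDict()   # server -> (is_good, time_recorded)

    def record(self, server, is_good, now):
        self._entries[server] = (is_good, now)
        self._entries.move_to_end(server)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # evict least recently used

    def lookup(self, server, now):
        """Return True/False for a known server, or None if it must be probed."""
        entry = self._entries.get(server)
        if entry is None:
            return None
        is_good, recorded = entry
        if not is_good and now - recorded >= self.bad_ttl:
            del self._entries[server]           # give upgraded servers a retry
            return None
        self._entries.move_to_end(server)
        return is_good


def choose_mode(cache, server, detect_is_good, now):
    """Pick 'fast' or 'slow', probing the server only when the cache can't say."""
    verdict = cache.lookup(server, now)
    if verdict is None:
        verdict = detect_is_good(server)
        cache.record(server, verdict, now)
    return "fast" if verdict else "slow"
```

    A server that falls out of the cache is simply probed again the next time it is used.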

  186. Sjoerd Verweij says:

    – No problem if a server falls out of the cache. It will be redetected and added to the good or bad list.

  187. MajorGlory says:

    Keeping things transparent to the user is all well and good, but if you mask the issue then how will the administrators know to perform the upgrade unless they happen to read the client event logs?

    The user needs to know about the issue.  Raise a warning which says contact your admin and links to a KB article.  Use some sort of reg key to configure it and have an option to set this via group policy for the rest of the clients on the network.

    Then I’d follow this advice:

    Sunday, April 02, 2006 8:54 PM by Morrog

    It was noted that there is some sort of enumeration going on here, which prevents a total refresh of the data already sent to the querying function, so that supposedly ruled out the possibility of switching to fast mode. But how about this.

    If this strange error is encountered, at the 100th entry (or 101), why not switch to slow mode, query up and throw away the first 100 results, and then continue on returning results from the slow query?

    Rough code:

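    NextItem()
    {
        QueryServer();
        if (ThereWasThatWeirdError())
        {
            // Start over on a fresh connection, this time in slow mode.
            Disconnect();
            Reconnect();
            StartSlowQuery();

            // Throw away the entries the caller has already been given.
            for (i = 0; i < results_so_far; i++)
                QueryServer();

            // Fetch the first entry the caller has not seen yet.
            QueryServer();
        }
        return Result;
    }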

  188. How many flags do you want?

  189. Norman Diamond says:

    The base note lacks some important information, but I gather from several replies that APIs such as FindNextFile are affected as well as user-visible applications such as Windows Explorer.  When the API sees that an error occurred, I agree with several replies that the API must inform the caller that the error occurred AND the API must log the event.

    If Windows Explorer gets an error return from the API in retrieving information that it is supposed to display, then Windows Explorer should display a message box to inform the user instead of displaying incorrect results.  (More on this later.)  If other applications get an error return from the API, their coders can decide what to do, such as getting the error text from the system and displaying the error text, or deciding how they want to do a retry.

    It should be moderately easy for the user to configure what to do.

    I agree with another poster that the fallback from DMA/UDMA to PIO is a partial precedent.  The obstacles that users encounter in discovering that they’re suddenly running PIO, and finding why, and finding how to undo it, are not good precedents.  Good logging is needed.

    I agree with another poster that the choice of turning write caching on or off is another precedent.

    A third precedent would be digital audio from CD-ROM drives.  By default it’s off, the user can turn it on, and the user can turn it back off if it doesn’t work.

    A fourth precedent would be the control panel applet that lets users set and store some usernames and passwords for access to various network servers.

    I think that in principle the above precedents do not limit the system to remembering 16 most recently used devices or servers.  If the insufficiently described problem in this case is Samba, then a limit of 16 will likely be too small.  There should not be a limit.

    Now which should be the default, fast queries or slow queries, I cannot really say.  The above precedents include good precedents for both possible decisions about the default.

    Now, here are some precedents which should not be copied.

    In the command-line version of FTP, if an mget command is used, it stops after retrieving 511 files.  Suppose the server’s directory had 520 files including some starting with a capital Z and suppose the server’s sort ordering puts capital Z in the middle of the list.  The user will look at the result, see Z-filenames at the bottom, and will not notice that some of the lower-case z-filenames and y-filenames weren’t copied because they came after the 511 point.  If the FTP command would display a warning (copying stopped after 511 files) then the user would know to look for the missing files and do a retry, but no, who wants to let command line users have an easy time of it.  Windows Explorer should not copy this precedent.  When it can’t retrieve the entire list, it should tell the user.

    If Windows Explorer can’t expand a folder in the tree view in the left-hand pane, it simply deletes the "[+]" box and doesn’t display an error message.  The user doesn’t guess that they needed to input a user name and password in order to view the network resource.  This precedent should not be copied.  When Windows Explorer can’t display the contents, it should tell the user.

    In Windows XP prior to SP2, and Windows Server 2003 prior to SP1, if the user connected or disconnected a USB hard drive or DVD drive then sometimes Windows Explorer did not update its display of drive letters.  This precedent should not be copied.  In a few cases (not enough) Windows Explorer overlays the drive’s icon with a red question mark or something like that.  When a drive letter is in use, Windows Explorer should show the drive letter, and if Windows Explorer has encountered a problem with that drive letter then it should show the fact.

    When Virtual PC 2004 is installed, with Windows XP SP2 installed on both the host and the guests, and a directory on the host is shared on a guest using VM Additions, Windows Explorer on the guest often gets a corrupted view of the share.  Sometimes the corruption has resulted in virtual BSODs in a guest but usually it’s invisibility of some files or corruption in their contents.  Either Samba is not involved or Virtual PC 2004 uses Samba.  Either way this precedent should not be copied.

    The reason why users would tend to blame Vista for bugs is that we’ve seen enough cases where other Windows versions (up to and including XP SP2, 2003 SP1, XP x64 SP1, and Vista betas) have lost files due to their bugs.  Part of the way to stop users from blaming Vista for bugs which Microsoft knows to be non-Microsoft bugs is for Vista to log the errors when they occur, and the source of the error.  For example maybe "Files could not be listed properly because \\someserver\somepartition returned invalid error code 993."  Another part of the way is for Windows Explorer to display the error so that the user who was expecting to look at a list of files will know not to believe the list, instead of getting a rude surprise months later.

  190. luser says:

    Users don’t read error messages; that has been proven time after time. If the directory contents are wrong, MS will be blamed.

  191. Norman Diamond says:

    Thursday, April 06, 2006 1:42 PM by luser

    > Users don’t read errormessages, that has

    > been proven time after time.

    True.  But presenting an incorrect list without an error message is unconscionable.  Present an error message that 99% of users won’t read.  If the error message is short and to the point ("Files could not be listed properly because \\someserver\somepartition returned invalid error code 993.") then you’ve done your job.

    Of course if the error message is 18 pages of gobbledegook then you haven’t done your job.  If you don’t say "files could not be listed" then you’re actively deceiving your user when you present a fake list.

  192. Lorenzo says:

    I’d go for a solution like "Auto-detect the buggy driver and work around it next time", but without the "next time". Is an error code returned after a fast query? Then: 1) (Optional: make the user aware, especially if this procedure can lead to a wait, with a "do not display again" checkbox.) 2) REDO the query in slow mode. Do this every time: no registry, nothing. It will eventually force admins to reconsider upgrading Samba, it will work, it will not poison the registry, it works like XP, and Vista uses the new features.

  193. Irimi says:

    Have explorer use fast mode, if fast mode fails for some "unusual" reason, then have it try slow mode.

    Advantages:

    * it will work. Drivers/versions that aren’t buggy will still work in fast mode.

    Disadvantages:

    * queries on buggy drivers will take a little longer due to the failed fast mode attempt.

  194. Ian Boyd says:

    > Have a configuration setting to put the network client into "slow mode"

    It’s not a bug with Windows, it’s not a bug with most file servers, and it’s not a bug with fast mode. It’s a bug with *some* file servers that lie about what they can do.

    If the error happens to pop up, then the user will get an error message. They google the error, find the KB article, add the registry key to disable fast mode everywhere, and all is well.

    Otherwise, they get to *SEE* that a device they have purchased is behaving badly, and get to bitch out the real culprit, demanding updates, or sulking – all while having a temporary workaround.

    At best they see their Linux box needs to be updated because it’s buggy, at worst they know that the device they bought is buggy and they’ll know it’s inferior and want to get rid of it.

    Either way, i would rather Windows didn’t keep working around everything else. Go ahead, break it. If you hide it, i don’t get to know if my stuff is broken. If you work around it, things get slower. i don’t want things slower, i want them faster.

  195. Too many incompatible devices.
