Why does the NT redirector close file handles when the network connection breaks?


Yesterday, Raymond posted an article about power suspend and its behavior in XP and Vista.  It got almost as many comments as a post on the IE blog does :).

I want to write about one of the comments made to the article (by “Not Amused Again”):

James Schend: Nice one. Do you know why the damn programs have to ask before allowing things to go into suspend mode? It’s because MS’s freakin networking code is freakin *broken*. Open files are supposed to be reconnected after a suspend, and *they are not*, leading to losses in any open files. (Not that saving the files then allowing the suspend to continue works either, as the wonderful opportunistic file locking junk seems to predictably barf after suspends.)

 

A long time ago, in a building far, far away, I worked on the first version of the NT network filesystem (a different version was released with Windows 2000).  So I know a fair amount about this particular issue.

The answer to Not Amused Again's complaint is: "Because the alternative is worse".

Unlike some other network architectures (think NFS), CIFS attempts to provide a reliable model for client/server networking.  On a CIFS network, the behavior of network files is as close to the behavior of local files as possible.

That is a good thing, because it means that an application doesn't have to realize that files are opened over the network.  All the filesystem primitives that work locally also work over the network transparently.  That means that the local file sharing and locking rules are applied to files on the network.

The problem is that networks are inherently unreliable.  When someone trips over the connector to the key router between your client and the server, the connection between the two is going to be lost.  The client can re-establish the connection to the network share, but what should be done about the files that were opened over the network?

There are a couple of criteria that any solution to this problem must meet:

First off, the server is OBLIGATED to close the file when the connection with the client is disconnected.  It has no ability to keep the file open for the client.  So any strategy that involves the server keeping the client’s state around is a non-starter (otherwise you have a DoS scenario associated with the client). Any recovery strategy has to be done entirely on the client. 

Secondly, it is utterly unacceptable to introduce the possibility of data corruption.  If reopening the file could result in data corruption, then that scenario can't be allowed.

So let’s see if we can figure out the rules for re-opening the file:

First off, what happens if you can't reopen the file?  Maybe you had the file opened in exclusive mode and, once the connection was lost, someone else got in and opened it exclusively.  How are you going to tell the client that the file open failed?  What happens if someone deleted the file on the share once it was closed?  You can't return file not found, since the file was already opened.

The thing is, it turns out that failing to re-open the file is actually the BEST option you have.  The alternatives are even worse than that scenario.

 

Let’s say that you succeed in re-opening the file.  Let’s consider some other scenarios:

What happens if you had locks on the file?  Obviously you need to re-apply the locks; that's a no-brainer.  But what happens if they can't be applied?  The other thing to consider about locks is that a client that holds a lock on a region of the file assumes that no other client can write to that region of the file (remember: network files look just like local files).  So it assumes that nobody else has changed that region.  But what happens if someone else does change that region?  Now you've just introduced a data corruption error by re-opening the file.

This scenario is NOT far-fetched.  It's actually the usage pattern used by most file-based database applications (R:Base, D-Base, Microsoft Access, etc).  Modern client/server databases just keep their files open all the time, but non-client/server database apps let multiple clients open a single database file and use record locking to ensure that the database integrity is preserved (the applications lock a region of the file, alter it, then unlock it).  Since the server closed the file when the connection was lost, other applications could have come in, locked a region of the file, modified it, then unlocked it.  But YOUR client doesn't know this happened.  It thinks it still has the lock on the region of the file, so it owns the contents of that region.
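To make that pattern concrete, here is a minimal sketch (not taken from any real database product; the UNC path and record size are made up) of the lock/alter/unlock cycle using the Win32 byte-range locking APIs:

```c
#include <windows.h>

#define RECORD_SIZE 128   /* hypothetical fixed record size */

/* Lock one record's byte range, overwrite it, then release the range so
   other clients on the network can update their own records. */
BOOL UpdateRecord(HANDLE hFile, DWORD recordIndex, const BYTE *newData)
{
    DWORD offset = recordIndex * RECORD_SIZE;
    DWORD written = 0;
    BOOL ok = FALSE;

    /* Keep every other client out of this record while we modify it. */
    if (!LockFile(hFile, offset, 0, RECORD_SIZE, 0))
        return FALSE;

    if (SetFilePointer(hFile, offset, NULL, FILE_BEGIN) != INVALID_SET_FILE_POINTER)
        ok = WriteFile(hFile, newData, RECORD_SIZE, &written, NULL) &&
             written == RECORD_SIZE;

    /* Release the range; other clients can now lock and alter it. */
    UnlockFile(hFile, offset, 0, RECORD_SIZE, 0);
    return ok;
}

int main(void)
{
    /* Open the shared database file the way a file-based database would:
       full read/write sharing, with byte-range locks preserving integrity. */
    HANDLE hFile = CreateFileA("\\\\server\\share\\app.db",
                               GENERIC_READ | GENERIC_WRITE,
                               FILE_SHARE_READ | FILE_SHARE_WRITE,
                               NULL, OPEN_EXISTING, 0, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
        return 1;

    BYTE record[RECORD_SIZE] = { 0 };
    UpdateRecord(hFile, 3, record);

    CloseHandle(hFile);
    return 0;
}
```

If the connection drops between the LockFile and the UnlockFile calls, this is exactly the situation described above: once the server closes the handle, another client is free to lock and modify the same region, and the original client has no way of knowing.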

Ok, so you decide that if the client has a lock on the file, we won’t allow them to re-open the file.  Not that huge a restriction, but it means we won’t re-open database files over the network.  You just pissed off a bunch of customers who wanted to put their shared database on the server.

 

Next, what happens if the client had the file opened exclusively?  That means that it knows that nobody else in the world has the file open, so it can assume that the file's not been modified by anyone else.  But once the server closed the handle, someone else could have opened and modified the file, so the client can't re-open the file if it was opened in exclusive mode.

Next let’s consider the case where the file’s not opened exclusively.  There are four cases of interest, involving two share modes and two access modes: FILE_SHARE_READ and FILE_SHARE_WRITE (FILE_SHARE_DELETE isn’t very interesting), and FILE_READ_DATA and FILE_WRITE_DATA.

There are four interesting combinations (the cases that allow more than one writer collapse into the FILE_SHARE_WRITE case), laid out below:

FILE_READ_DATA + FILE_SHARE_READ: This is effectively the same as exclusive mode – nobody else can write to the file, and the client is only reading the file, so it may cache the contents of the file.

FILE_READ_DATA + FILE_SHARE_WRITE: The client is only reading data, and it isn’t caching the data being read (because others can write to the file).

FILE_WRITE_DATA + FILE_SHARE_READ: The client can write to the file and nobody else can write to it, so it can cache the contents of the file.

FILE_WRITE_DATA + FILE_SHARE_WRITE: The client is only writing data, and it can’t be caching (because others can write to the file).

For FILE_SHARE_READ, others can read the file but nobody else can write to it, so the client can and will cache the contents of the file.  For FILE_SHARE_WRITE, no assumptions can be made by the client, so the client can have no information cached about the file.

So this means that the ONLY circumstance in which it’s reliable to re-open the file is when the file has never had any locks taken on it and when it has been opened with FILE_SHARE_WRITE.
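For concreteness, here is roughly what that one "safe" shape of handle looks like in Win32 terms (a sketch, not the redirector's actual logic; the UNC path is hypothetical): the share mode includes FILE_SHARE_WRITE, so the client caches nothing, and the application never takes byte-range locks on the handle.

```c
#include <windows.h>

/* The only "safe to re-open" shape of handle, per the combinations above:
   shared for read and write (so nothing is cached client-side), and the
   application never calls LockFile on it.  The path is hypothetical. */
HANDLE OpenUncachedSharedFile(void)
{
    return CreateFileA("\\\\server\\share\\shared.dat",
                       FILE_READ_DATA | FILE_WRITE_DATA,
                       FILE_SHARE_READ | FILE_SHARE_WRITE,
                       NULL,                   /* default security */
                       OPEN_EXISTING,
                       FILE_ATTRIBUTE_NORMAL,
                       NULL);
}
```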

 

So the number of scenarios where it’s safe to re-open the file is pretty slim.  We spent a long time discussing this back in the NT 3.1 days and eventually decided that it wasn’t worth the effort to fix this.

Since we can’t re-open the files, the only option is to close the file.

As a point of information, the Lan Manager 2.0 redirector for OS/2 did have such a feature, but we decided that we shouldn’t implement it for NT 3.1.  The main reason for this was that the majority of files opened on OS/2 were opened for share_write access (it was the default), but on NT the default is to open files in exclusive mode, so the majority of files can’t be reopened.

 

Comments (33)

  1. Dave says:

    Considering the evolution of Windows–from standalone PCs with local files to networks with remote files and the complication of sleep/suspend–it sure seemed right for the OS to paper over the differences as much as possible. Long term, though, it seems to make work harder for app developers.

    I guess that’s why the Internet is largely a stateless place, or the state is explicitly held by the clients. There’s more work up front, you have to face the issues immediately in the design. With transparent access at the app level, it’s easy to ignore those problem scenarios because they’re relatively rare.

  2. Good point, Dave.  NFS is also stateless, which has its own set of horrible issues (try writing a reliable database that runs over NFS; it can’t be done).  

    For grins, search for "IMAP NFS Crispin" to see some of Mark’s comments about people who try to store their IMAP data stores on NFS volumes – without reliable file locking, it’s impossible to do a remote database without corruption.

    That’s why client/server is such a powerful paradigm, because it allows you to take a networked problem and turn it into a local problem.

    And I disagree that it pushes the problem to the app developers.  If you don’t make F&P network access seamless to the app author, it dramatically reduces the value of the network.  I would have a significantly harder time if I couldn’t run programs over the network, or copy files from the shell/command prompt.

  3. Ben Cooke says:

    Of course, this practice of pretending that remote files are local files has drawbacks of its own.

    The most annoying one is that applications with few exceptions assume that file operations will complete quickly and so complete them in the UI thread. Windows Explorer even does this frequently. As we all know, I/O operations on network-mounted filesystems in normal circumstances can block for quite a while if a connection needs to be re-established, and if the remote server has gone away completely that read operation might well block for ten seconds or more.

    This is most annoying in applications which, when they get focus, interrogate the active document on disk to see if another application has changed it in the mean time. If the filesystem has since gone away, the call blocks and the application appears to hang. I see this at least once a month at work when the admin reboots the file server to apply patches.

    There is an important distinction between local file operations and remote file operations, and while in ideal cases it’s nice to pretend they’re the same thing, in practice it just seems to lead to trouble. The lesson for all developers reading this is that you should perform all I/O in a separate, non-UI thread, even if you think it’s "just" file I/O! It’ll make life happier for people using slow USB Mass Storage devices and floppity disks, too.
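    A minimal sketch of that advice (the path and buffer handling are hypothetical, and error reporting is omitted): move the potentially slow read onto a worker thread so a dead share can't freeze the UI.

```c
#include <windows.h>

/* Hypothetical sketch: do the possibly-remote read on a worker thread so a
   slow or vanished network share can't hang the UI thread. */
static DWORD WINAPI ReadWorker(LPVOID param)
{
    const char *path = (const char *)param;
    HANDLE hFile = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ,
                               NULL, OPEN_EXISTING, 0, NULL);
    if (hFile != INVALID_HANDLE_VALUE) {
        char buffer[4096];
        DWORD read = 0;
        /* This may block for a long time if the server has gone away;
           since we're off the UI thread, the UI stays responsive. */
        ReadFile(hFile, buffer, sizeof(buffer), &read, NULL);
        CloseHandle(hFile);
        /* Marshal the result back to the UI thread here (omitted). */
    }
    return 0;
}

void StartBackgroundRead(void)
{
    /* The UNC path is a made-up example. */
    HANDLE hThread = CreateThread(NULL, 0, ReadWorker,
                                  (LPVOID)"\\\\server\\share\\document.txt",
                                  0, NULL);
    if (hThread)
        CloseHandle(hThread);
}
```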

  4. Aryeh says:

    Most of the cases you described are problematic because files may change while disconnected.  If the server stored an ID that changed when a file is modified, and the client stored the ID that applied to the file before the disconnect, then the client could detect if it’s safe to silently reconnect.

    There are scenarios when the client would still break, e.g. if the client is using a lock on a file to coordinate something else, so a temporarily unlocked file would be a problem even if the file is unmodified, but those programs will probably break anyway.

  5. BryanK says:

    > and use record locking to ensure that the database integrity is preserved (the applications lock a region of the file, alter it, then unlock it).

    So what happens if the client sends 4 SMB packets (lock, write, write, unlock — the client did 2 write operations to update the record), and the connection dies between the two "write" packets?  Unless SMB is transactional *and* treats the entire lock-write-write-unlock sequence as a single transaction (I have no idea if it does or not; I doubt it though), then not only will the file not be in the assumed state, it won’t even be *valid* anymore.  Even if the server closes the file and thereby invalidates all the client locks, the file will still have corrupt data in it, because the first write succeeded but the second failed.

    And telling the client about this does no good either, because the client can’t fix the file’s corrupt data (the connection is gone).

    So much for "making network file access look exactly like local file access".  You can’t have your file closed on you in between two writes (especially if it’s locked) when it’s local.

    (This also probably explains part of why Access uses .ldb files, actually.  If the .ldb file is there but has no locks on it, then the database is considered corrupt and needs to be "repaired".  Normally the last Access process to have the .mdb open will delete the .ldb file when the .mdb file gets closed, or at least remove the last lock on the file.)

  6. Ben, you’re not pretending that remote files are local files.  You’re asserting that an application can make the same assumptions about reliability for remote files that it can for local files.  So an app that is developed against a local file will continue to work remotely.

    Any application that assumes that I/O to a local file will complete quickly is making an invalid assumption – you can’t assume the local media is quick (think removable media (floppy or cdrom drives)).  

    Aryeh, you’re sort-of right, but that would require modifications across the entire stack, from the filesystems up through the server (what happens when a local user changes the file contents?).  Also, how do you ensure that the value is kept in sync?  I guess you could use a change # (or write count, or last write time), but what would cause the change # to be updated?  If the modified page writer flushes a previously modified page to disk, does that change the write count?

    The other problem is that this value must be persisted (maybe the reason for the failure was a server reboot), but some filesystems don’t permit the persistence of arbitrary metadata (it might be a legacy filesystem like CDFS or FAT, for example).  So now the set of cases where it’s possible to re-open the file is even further reduced.

    Solving this problem correctly is VERY hard, and the cost of not solving it correctly is a really subtle data corruption bug, which isn’t good.

  7. Ben Cooke says:

    Larry,

    That was my point, essentially. Any program that assumes file I/O is "quick" (for some value of quick) and does it in a UI thread is going to suck on anything but hard disks. Including Explorer.

    Clearly application developers don’t get it, so another alternative is to force them to care by making them acknowledge the problem somehow. One possibility that springs to mind is to force the apps to use async I/O, but that would be a pain for anyone who knows how to write a multithreaded app and wants to handle the asynchrony themselves.

  8. Not Amused Again says:

    > Secondly, it is utterly unacceptable to introduce the possibility of data corruption.

    Larry,

    Thanks for covering this (I’m the NotAmused who started the topic); I do see your point. My own apps now simply blanket-respond to the WM_POWER… messages with a "no way".

    Care to cover ISAM OpLocks problems…?

  9. NAA, I’m not sure why FS oplocks are broken after suspend/resume.  I never worked on the local filesystems, so I’m sorry, I can’t 🙁

  10. Gabe says:

    I understand that there is no clear-cut answer in many of these cases, but haven’t they already been handled?

    When you open a file which is cached for off-line access and then reconnect to the server, how is that any different in terms of file integrity?

    I would really like to see this fixed in harmless cases (read-only files, opened with write sharing, etc.)

  11. Gabe, it’s not.  But the rules for CSC are pretty strict: the newer copy wins.  And because there’s a manual sync step going on with CSC, it’s possible to detect conflicts, resolve them, etc.

    When you’re dealing with a live file, you don’t have the luxury of resolving conflicts, you need to make a real-time binary decision: re-open or no.

    Oh, and read-only files are NOT a harmless case.  What happens if the file’s marked read/write, modified, then marked read-only during the interim?  Remember, the cost of getting this wrong is data corruption.  Realistically, you could enable an optional behavior to allow reopening of read-only files but it would have to be off by default.

  12. Marc Brooks says:

    It seems that you ignore what SHOULD BE the most usual (and the only no-changes needed recoverable) case, simply because it was the only one… not because it was hard, or dangerous.  This irks me because this IS the most common case for networked files (read only, deny none).

    Even the presence of locks could have been allowed if the server kept a generation counter on every open-for-write, lock grant, or write request processed at the server. When the client received the initial lock success, it should remember that generation count. When reconnecting and reapplying locks, it would fail the reopen if the generation count coming back from the RElock request was not exactly what it expected (meaning other locks were granted or writes occurred, so the server cannot guarantee that state is good).  The generation count would not matter until a reopen or relock request.  Some optimizations on reopen can be done based on the original open mode (e.g. the server doesn’t need to bump the generation count for writes on a SHARE_DENY_WRITE file, since only the client with it open could be doing the writes). Finally, the lock count doesn’t even have to be on-disk, since no reconnects should be legal if the server went down with open client handles.
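    A rough sketch of how that generation-counter check might look (purely hypothetical; nothing like this exists in the actual CIFS/SMB protocol, and all names are made up):

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical server-side state: one counter per open file, bumped on
   every write or lock grant. */
typedef struct {
    uint64_t generation;
} server_file_state;

/* Hypothetical client-side state: the counter value remembered when our
   lock was originally granted. */
typedef struct {
    uint64_t generation_at_lock;
} client_lock_state;

/* Server side: called when a disconnected client asks to re-open/re-lock. */
bool server_allow_reopen(const server_file_state *srv,
                         uint64_t client_generation)
{
    /* If anyone wrote to the file or was granted a lock since this client's
       lock was granted, the client's cached view may be stale. */
    return srv->generation == client_generation;
}

/* Client side: reconnect attempt after the transport dropped. */
bool client_try_reconnect(const client_lock_state *cli,
                          const server_file_state *srv)
{
    if (!server_allow_reopen(srv, cli->generation_at_lock))
        return false;   /* fail the reopen; the app sees the handle as closed */
    return true;        /* safe: no intervening writes or lock grants */
}
```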

    Simply put, this approach would allow MANY other single-writer access recoveries to happen, which is the most common situation.

    If we failed a reconnect (and relock), the client will be notified the next time it talks to the file in ANY form of unsuccessful reconnect; but this is NO DIFFERENT than what you have to do if the server goes down.

  13. Marc,

     When the cost of failure is data corruption, you can’t design features around the most common case; you need to design around the most reliable case.  Yes, it would be possible to kludge something together that worked most of the time.  But not something that was reliable to the point that people would put their mission-critical data on the server.

     There ARE some solutions for the read-only/deny-write case (which happens to be the "reading a program from a network share" case), and we strongly considered adding one.

     The lock problem is intractable (you can’t solve it in the server, because you have to deal with local access to the file, which doesn’t go through the server), but the other problems can be solved with sufficient plumbing (they require protocol changes, server changes, filesystem changes, etc, but they are possible).

     I made a conscious decision NOT to cover all the potential issues and solutions, figuring that what I’d posted was sufficient, but trust me, this is a harder problem than even what I’ve described.

     At the end of the day, we decided that the cost associated with such a feature wasn’t worth the benefit it would bring – the reality is that applications need to deal with failures of file access on local files, so they’re already going to have code to handle the case of a read failing; the only difference is where the failure comes from.

     By pushing the problem to the application, we ensure that the application will discard any cached data, which means that their data integrity will be maintained – we can’t know what data they’ve cached, but the app does.
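     As a sketch of what that looks like from the application's side (the specific error codes checked here are an illustrative assumption, not a complete or authoritative list):

```c
#include <windows.h>

/* Sketch: if a read on a network handle fails because the connection (and
   the server-side open) went away, discard anything we had cached and force
   a re-open/re-read.  The error codes checked are an assumption made for
   illustration. */
BOOL ReadOrInvalidate(HANDLE hFile, void *buffer, DWORD size, BOOL *pStale)
{
    DWORD read = 0;
    *pStale = FALSE;

    if (ReadFile(hFile, buffer, size, &read, NULL))
        return TRUE;

    switch (GetLastError()) {
    case ERROR_NETNAME_DELETED:     /* connection to the share was lost */
    case ERROR_UNEXP_NET_ERR:
    case ERROR_BAD_NETPATH:
        *pStale = TRUE;             /* throw away cached data; handle is dead */
        break;
    default:
        break;
    }
    return FALSE;
}
```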

     There’s a different team that owns the NT redirector, and it’s been 15+ years since that design decision was made; it’s possible they might consider making a different decision, but maybe not.

  14. > You just pissed off a bunch of customers who wanted to put their shared database on the server.

    Based on the years of problems I’ve had to fight as a contract IT support tech with lousy applications that use shared-file "databases", I really wish you guys had pissed off some Customers…

    ("Now we have to ‘re-index’ the ‘database’ because it is ‘corrupt’…"   "Do you have the right version of VREDIR.VXD?"  "Have you disabled OpLocks on your NT server?"  *sigh*)

  15. Cheong says:

    I feel (just a feeling, no deep thinking performed) that maybe we can use the approach of a version control system.

    Once a file is opened through the network, a copy of the file is saved locally on the client and any read/write is directed to it.  The file content is synced only when the file is closed or flushed.

    On the remote side, the file is set read-only (i.e. the client has exclusive write access), and a timestamp is set on both sides to mark the expiration time.  If the expiration time comes and the file is not closed, it’s forcibly closed.

    On the local side, the timestamp is checked on each write.  If it has expired, the write fails and the file is marked closed (as if some external program forced the file handle closed).  If the file closes this way, a last attempt to sync the file is made (the file on disk only; content not yet flushed is not synced).  If the remote side can’t be updated, the content is rolled back to the last synced state.

    With buffered read/write and a "checkout" system, it seems that a workable system that can reopen files over the network could be made.  Of course the overhead on the disk cache can be large, but the size of the temp-disk cache could be set by users, just like the virtual memory pagefile.

    The other thought is: what if we go the other way and make "all programs think they are working on network files"?

  16. vredir.vxd???  What’s that?  Sounds like some windows 95 thingy…

  17. > vredir.vxd???  What’s that?  Sounds like some windows 95 thingy…

    Yeah– the Windows 9X redirector for Microsoft File n’ Print sharing. I spent many a day troubleshooting problems w/ cruddy "shared file database" applications "corrupting" files due to bugs in various versions of VREDIR.VXD. The trademark of these crappy apps was the instruction to add a value to the LanManServer parameters to disable oplocks.

    You’re talking about the NT redirector, of course, but the repressed memories of fighting with these applications that were too low-rent to bother with real client/server database systems came welling back up. The problems are fewer now, since we’ve got the NT redirector on the desktops today, but I’m still horrified fairly regularly when I find new applications that continue to use such a mediocre and inefficient way to handle storing and sharing databases.

  18. Mike says:

    Funny (not "ha-ha", but curious), that *nix systems of all kinds have for ages been able to run completely off NFS (with no, absolutely zero, local disk needed), yet Windows has never been able to do this, and later incarnations needs both one and two *GIGA*byte of local storage to even start (I won’t even mention how horrible the RAM requirements of Windows have become).

  19. Mike,

     Talk to Mark Crispin sometime about trying to run a real-world server off NFS volumes.  It simply can’t be done (no locking semantics, no reliable write-through semantics).

     The problem is that there are a huge number of times when it’s CRITICAL that a client be able to confirm that a data write has been committed to the hard disk – without it, you can’t do database transactions (if you can’t be sure the write hit the hard disk, you can’t commit the transaction).  NFS doesn’t provide that support.  
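     As an illustration of that requirement in Win32 terms (a sketch; the UNC path and record format are hypothetical), a client that needs durable writes opens with write-through and flushes before treating the data as committed:

```c
#include <windows.h>

/* Sketch of the "confirm the write hit the disk" requirement described
   above, using Win32 write-through semantics.  Path is hypothetical. */
BOOL CommitRecord(const void *data, DWORD size)
{
    HANDLE hLog = CreateFileA("\\\\server\\share\\txn.log",
                              GENERIC_WRITE,
                              0,                       /* exclusive */
                              NULL, OPEN_ALWAYS,
                              FILE_FLAG_WRITE_THROUGH, /* don't lazy-write */
                              NULL);
    if (hLog == INVALID_HANDLE_VALUE)
        return FALSE;

    DWORD written = 0;
    BOOL ok = WriteFile(hLog, data, size, &written, NULL) && written == size;

    /* Belt and suspenders: ask the filesystem (and, over the network, the
       server) to flush buffered data before treating the write as durable. */
    if (ok)
        ok = FlushFileBuffers(hLog);

    CloseHandle(hLog);
    return ok;
}
```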

    Try asking Oracle if they’ll support storing the data of an Oracle database on an NFS drive.  They’ll laugh you out of the office.

    And I have no idea where you got the idea that Win2K3 requires 2 gigabytes of local storage for F&P.

     

  20. Jarle Nygård says:

    Well, my Windows folder on a W2k3 server takes up 1.6 GB of disk space… That probably includes a bunch of cab files (drivers etc.) and backups for patches etc., but it still requires > 1 GB of storage… 🙂

    BUT: who cares? I mean, the cost of storage is so low today that it makes virtually no difference if an OS takes 1, 2 or even 10 GB of hard disk space.

  21. dan drake says:

    I’m not sure I understand the statement that corrupting customers’ data is to be avoided at all costs when oplocks are enabled by default on nt/w2k/xp… servers.  We had a flaky switch which caused innumerable "delayed write failures", i.e. data corruption, since the redirector sends a one-byte write to reserve space and sends the bulk of the data at some point in the future.  Once the one-byte write completes, the write request is returned with a success error code.

    I agree in principle that network storage should be handled the same as local storage, but the fact is that it’s not the same, because it’s much less reliable.  If you are writing an application that depends on network storage you have to take this into consideration and write code to handle it.  You can’t depend on the redirector to do the right thing.  Unfortunately, very few dev groups do this.

    I believe the current behaviour of the redirector will change.  It will have to, in order to be useful in mission-critical areas where large data sets need to be stored in a shared area.  Companies that are in this category have lots of money to spend.

    Also, I seem to remember seeing on MSDN that Access databases on network storage are not a supported configuration.  Doesn’t stop everyone from doing it though.

    What does iscsi do when the network goes down?

  22. Phylyp says:

    Larry, did you know the phrase "Because the alternative is worse" is copyright Raymond? 🙂

    [1] http://blogs.msdn.com/oldnewthing/archive/2005/12/20/505887.aspx

    [2] http://blogs.msdn.com/oldnewthing/archive/2004/09/09/227339.aspx

    [3]… too many links!

  23. Michiel says:

    Why do you need a generation counter if NTFS has a last-modified field? Sure, that doesn’t solve the FAT32-server "problem". I don’t care. A file server is typically set up as such, and if the documentation clearly says: "To support SMB reconnections, use NTFS", FAT32 is a non-issue.

    The "we can’t solve it allways, so we never solve it" attitude means 90% of the data loss was preventable.

    I’m not yet convinced a big protocol change is needed.  Disconnections are pretty rare.  You can take some time to figure it out afterwards; it may involve more than a few extra messages to resync server and client.  Up front the only requirement is that both server and client keep their own state, so the two can be compared afterwards – no wire protocol involved, I’d guess.

  24. Miral says:

    The really fun thing is that .NET apps running from a network share pack a major sad if the connection to the server is lost while they’re running.

    What’s weird is that most of the time they don’t even simply abort cleanly, they just go off into la-la land and occasionally mutter about bizarre errors.

  25. Yuhong Bao says:

    Why not add a type of lock that persists across disconnects to Windows Vista and Windows Server "Longhorn"?  Just before disconnecting for suspend or hibernation, the client would hold this lock and the server would track it during the disconnection; it would behave like a normal lock.  When the client wakes up and reconnects, it tells the server that it holds the lock, and the lock is converted back into a normal lock.  If the same client does not tell the server when it reconnects, all such locks are discarded.

  26. you are a fuckin loser

  27. I’m a bit unclear about your description of NFS here as having "no locking semantics" and "no reliable write-through semantics".  NFS has supported fcntl locking for a good long time now, and the write semantics have always been absolutely clear: once a write has been acknowledged to the client, it must have been committed to stable storage.  Now, you can change the behaviour of the server if you prefer high performance to keeping all your data, but I’d advise against that!

    Which is not to say that there aren’t plenty of problems with NFS.  In particular, the fact that it is (unlike, from your description, CIFS) designed to be "as close to the behaviour of local files as possible" means that when the network goes away, applications which are doing IO to network filesystems must block until the network comes back.  And requiring all writes to be committed to disk before they are acknowledged means that NFS is slow.  But these are simply consequences of the stated requirement: unchanged applications must run reliably on the network filesystem.

    (You are also, by the way, correct to say that Mark Crispin has strong views on this as on so many things.)

  28. Chris,

     Everything I’ve heard about NFS is that NFS implements file record locking semantics as advisory – if you flock() a region of the file for write access, someone can still write to that region if they are running in another process using a different handle.

    This is why the documentation for flock explicitly states that it doesn’t work on NFS (http://php.netconcepts.com/manual/en/function.flock.php).  Having said that, that referenced documentation is clearly broken; the comment that flock isn’t supported on Win9x is just silly.

    Advisory record locks make it quite difficult to implement a reliable flat file shared database, since it means that you can’t ensure that the database file isn’t corrupted.  Instead you have to trust that everyone follows the same rules.

  29. It’s certainly true that UNIX file locking is advisory (there is an implementation of mandatory locking on many UNIX systems but it’s not typically used). It’s correct that you have to "trust that everyone follows the same rules" in respect of file locking, but that’s not a big problem with a shared database file, since you already have to trust that they follow the same rules in respect of the format of the data in the file too. Typically you have a single implementation of the database library or whatever, in which case it makes no difference that the locks are advisory: so long as the library always acquires a lock at the appropriate points, it doesn’t matter whether it *could* actually do IO without having done so.

    It is also true (per the PHP documentation you quote) that flock(2) doesn’t lock files across NFS shares in general; only the fcntl(2) call with command F_SETLK ("an fcntl lock") has this effect. (Exception: some systems emulate flock(2) with fcntl locks, and indeed PHP’s flock function is itself implemented in terms of fcntl locks on systems without an flock call.) This is an irritating historical issue, and it’s not often documented as well as it should be (note, for instance, that flock and fcntl locks have rather different semantics) but it’s not relevant to the question of whether file locking is available on NFS.
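    To make the distinction concrete, here is a minimal sketch of the fcntl-style advisory byte-range lock Chris describes (the flavor that can work over NFS); the offsets are hypothetical, and being advisory it only constrains processes that also take the lock:

```c
#include <fcntl.h>
#include <unistd.h>

/* Take an exclusive advisory lock on a byte range of an open file. */
int lock_record(int fd, off_t offset, off_t length)
{
    struct flock fl;
    fl.l_type = F_WRLCK;      /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = offset;
    fl.l_len = length;

    /* F_SETLKW blocks until the lock is granted; it only keeps out other
       *cooperating* processes, because the lock is advisory. */
    return fcntl(fd, F_SETLKW, &fl);
}

/* Release the previously locked range. */
int unlock_record(int fd, off_t offset, off_t length)
{
    struct flock fl;
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = offset;
    fl.l_len = length;
    return fcntl(fd, F_SETLK, &fl);
}
```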