Why does the NT redirector close file handles when the network connection breaks?

Yesterday, Raymond posted an article about power suspend and its behavior in XP and Vista.  It got almost as many comments as a post in the IE blog does :).

I want to write about one of the comments made to the article (by "Not Amused Again"):

James Schend: Nice one. Do you know why the damn programs have to ask before allowing things to go into suspend mode? It's because MS's freakin networking code is freakin *broken*. Open files are supposed to be reconnected after a suspend, and *they are not*, leading to losses in any open files. (Not that saving the files then allowing the suspend to continue works either, as the wonderful opportunistic file locking junk seems to predictably barf after suspends.)

 

A long time ago, in a building far, far away, I worked on the first version of the NT network filesystem (a different version was released with Windows 2000).  So I know a fair amount about this particular issue.

The answer to Not Amused Again's complaint is: "Because the alternative is worse".

Unlike some other network architectures (think NFS), CIFS attempts to provide a reliable model for client/server networking.  On a CIFS network, the behavior of network files is as close to the behavior of local files as possible.

That is a good thing, because it means that an application doesn't have to realize that files are opened over the network.  All the filesystem primitives that work locally also work over the network transparently.  That means that the local file sharing and locking rules are applied to files on the network.

The problem is that networks are inherently unreliable.  When someone trips over the connector to the key router between your client and the server, the connection between the two is going to be lost.  The client can re-establish the connection to the network share, but what should be done about the files opened over the network?

There are a couple of criteria that any solution to this problem must meet:

First off, the server is OBLIGATED to close the file when the connection with the client is disconnected.  It has no ability to keep the file open for the client.  So any strategy that involves the server keeping the client's state around is a non-starter (otherwise you have a DoS scenario associated with the client). Any recovery strategy has to be done entirely on the client. 

Secondly, it is utterly unacceptable to introduce the possibility of data corruption.  If there is a scenario where reopening the file can result in data corruption, then that scenario can't be allowed.

So let's see if we can figure out the rules for re-opening the file:

First off, what happens if you can't reopen the file?   Maybe you had the file opened in exclusive mode and once the connection was disconnected, someone else got in and opened it exclusively.  How are you going to tell the client that the file open failed?  What happens if someone deleted the file on the share once it was closed?  You can't return file not found, since the file was already opened.

The thing is, it turns out that failing to re-open the file is actually the BEST option you have.  The others are actually even worse than that scenario.

 

Let's say that you succeed in re-opening the file.  Let's consider some other scenarios:

What happens if you had locks on the file?  Obviously you need to re-apply the locks, that's a no-brainer.  But what happens if they can't be applied?  The other thing to consider about locks is that a client that holds a lock on a region of the file assumes that no other client can write to that region of the file (remember: network files look just like local files).  So it assumes that nobody else has changed that region.  But what happens if someone else does change that region?  Now you just introduced a data corruption error by re-opening the file.

This scenario is NOT far-fetched.  It's actually the usage pattern used by most file based database applications (R:Base, dBase, Microsoft Access, etc).  Modern client/server databases just keep their files open all the time, but non client/server database apps let multiple clients open a single database file and use record locking to ensure that the database integrity is preserved (the clients lock a region of the file, alter it, then unlock it).  Since the server closed the file when the connection was lost, other applications could have come in, locked a region of the file, modified it, then unlocked it.  But YOUR client doesn't know this happened.  It thinks it still has the lock on the region of the file, so it owns the contents of that region.
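The lost-update problem described above can be sketched with a toy, in-memory model. Everything here is hypothetical (the `ToyServer` class, the record layout, the client names "A" and "B"); it is not a model of the real CIFS protocol or of the NT redirector, only an illustration of why silently re-applying a lock after a disconnect is dangerous.

```python
class ToyServer:
    """In-memory stand-in for a file server (illustrative only)."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.locks = {}  # (offset, length) -> name of the owning client

    def lock(self, client, offset, length):
        # Refuse the lock if another client holds an overlapping region.
        for (o, n), owner in self.locks.items():
            overlaps = offset < o + n and o < offset + length
            if overlaps and owner != client:
                return False
        self.locks[(offset, length)] = client
        return True

    def unlock(self, client, offset, length):
        if self.locks.get((offset, length)) == client:
            del self.locks[(offset, length)]

    def disconnect(self, client):
        # Closing the client's handles releases all of its locks.
        self.locks = {k: v for k, v in self.locks.items() if v != client}

    def read(self, offset, length):
        return bytes(self.data[offset:offset + length])

    def write(self, offset, payload):
        self.data[offset:offset + len(payload)] = payload


def simulate_lost_update():
    server = ToyServer(b"balance=0100")

    # Client A locks the record and reads it, intending to add 50.
    assert server.lock("A", 8, 4)
    a_view = int(server.read(8, 4).decode())

    # The connection drops; the server closes A's file and frees its lock.
    server.disconnect("A")

    # Client B locks the same record, subtracts 30, and unlocks.
    assert server.lock("B", 8, 4)
    b_view = int(server.read(8, 4).decode())
    server.write(8, b"%04d" % (b_view - 30))
    server.unlock("B", 8, 4)

    # Hypothetical "transparent" reopen: A's lock is silently re-applied
    # and A carries on using the value it read before the disconnect.
    assert server.lock("A", 8, 4)
    server.write(8, b"%04d" % (a_view + 50))
    server.unlock("A", 8, 4)

    return server.read(0, 12)
```

In this sketch every individual lock request succeeds, yet B's update is silently overwritten: the record ends up at 150 instead of 120. Nothing ever reports an error to client A, which is what makes the corruption so hard to detect.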

Ok, so you decide that if the client has a lock on the file, we won't allow them to re-open the file.  Not that huge a restriction, but it means we won't re-open database files over the network.  You just pissed off a bunch of customers who wanted to put their shared database on the server.

 

Next, what happens if the client had the file opened exclusively?  That means that they know that nobody else in the world has the file open, so they can assume that the file's not been modified by anyone else.  That means that the client can't re-open the file if it's opened in exclusive mode.

Next let's consider the case where the file's not opened exclusively.  Two share modes and two access rights matter here: FILE_SHARE_READ and FILE_SHARE_WRITE (FILE_SHARE_DELETE isn't very interesting), and FILE_READ_DATA and FILE_WRITE_DATA.

There are four interesting combinations (the cases with more than one writer collapse into the FILE_SHARE_WRITE case), laid out in the table below.

|  | FILE_SHARE_READ | FILE_SHARE_WRITE |
| --- | --- | --- |
| FILE_READ_DATA | This is effectively the same as exclusive mode - nobody else can write to the file, and the client is only reading the file, thus it may cache the contents of the file. | The client is only reading data, and it isn't caching the data being read (because others can write to the file). |
| FILE_WRITE_DATA | This client can write to the file and nobody else can write to it, thus it can cache the contents of the file. | The client is only writing data, and it can't be caching (because others can write to the file). |

For FILE_SHARE_READ, others can read the file but nobody else can write to it, so the client can and will cache the contents of the file.  For FILE_SHARE_WRITE, no assumptions can be made by the client, so the client can have no information cached about the file.

So this means that the ONLY circumstance in which it's reliable to re-open the file is when a file has never had any locks taken on it and when it has been opened in FILE_SHARE_WRITE mode.
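As a minimal sketch, that rule can be written as a single predicate. The two constants use the standard Win32 share-mode flag values, but the function name `can_safely_reopen` and its parameters are hypothetical and only illustrate the reasoning; this is not how any redirector actually implements it.

```python
FILE_SHARE_READ = 0x00000001   # Win32 share-mode flag: others may read
FILE_SHARE_WRITE = 0x00000002  # Win32 share-mode flag: others may write


def can_safely_reopen(share_mode: int, ever_locked: bool) -> bool:
    """Return True only when a transparent reopen cannot hide stale state.

    share_mode  -- the share flags the client originally opened the file with
    ever_locked -- whether the client ever took a byte-range lock on the file
    """
    # A client that ever held a lock may believe it still owns that region.
    if ever_locked:
        return False
    # Without FILE_SHARE_WRITE (exclusive or share-read opens) the client
    # may have cached contents on the assumption that nobody else can write.
    return bool(share_mode & FILE_SHARE_WRITE)
```

For example, an exclusive open (`share_mode == 0`) and a FILE_SHARE_READ-only open both come back `False`, while a FILE_SHARE_WRITE open with no lock history comes back `True`.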

 

So the number of scenarios where it's safe to re-open the file is pretty slim.  We spent a long time discussing this back in the NT 3.1 days and eventually decided that it wasn't worth the effort to fix this.

Since we can't re-open the files, the only option is to close the file.

As a point of information, the LAN Manager 2.0 redirector for OS/2 did have such a feature, but we decided that we shouldn't implement it for NT 3.1.  The main reason for this was that the majority of files opened in OS/2 were opened for share_write access (it was the default), but for NT, the default is to open files in exclusive mode, so the majority of files can't be reopened.