Why is breadth-first searching better for file system tree walking?


Earlier, Eric Lippert discussed one scenario where breadth-first searching is better than depth-first searching. Today, I’ll tell you about another.

If you go back to the old MS-DOS file enumeration functions, you’ll find that there is a “Find first file” function and a “Find next file” function, but no “Find close” function. That’s because the MS-DOS file enumeration functions maintained all their state in the find structure. The FAT file system was simple enough that the necessary search state fit in the reserved bytes and no external locking was necessary to accomplish the enumeration. (If you did something strange like delete a directory while there was an active enumeration on it, then the enumeration would start returning garbage. It was considered the program’s responsibility not to do that. Life is a lot easier when you are a single-tasking system.)

However, moving these functions to a network protocol or a non-FAT file system created problems. When enumerating files over a network, the server needs to allocate additional memory to track the enumeration, memory which it needs to free when the enumeration is complete. Similarly, non-FAT file systems may have additional state to track that (1) doesn’t fit in the reserved space in the find structure, (2) shouldn’t be stored in the reserved space (because it would compromise security), or (3) can’t be stored in the reserved space (because the kernel needs to update it asynchronously when the state of the directory changes).

Since MS-DOS has no “Find close” function, how do these alternate file systems know when it is safe to free the memory associated with the file search? [Context clarified – 10am] Clairevoyance is not yet a viable option, so the file system is forced to guess.

Typically, a file enumeration is considered “abandoned” if it is not used for “a long time” or if too many pending file enumerations are in progress.

Let’s see how depth-first search compares to breadth-first search on these points.

In a typical depth-first search, you recurse into a subdirectory as soon as you see it. Thus, if you are, say, six levels deep in recursion, you will have six active enumerations, one for each level. The most recent one is most active, and the one at the root is least active. This means that the enumeration at the root is likely to be considered “abandoned” by the server because it hasn’t been used for the longest time, and because there is a lot of other enumeration activity going on in the meantime.

On the other hand, with a breadth-first search, when a directory is found, it is not processed immediately but is rather put on a list for later consideration. Consequently, there is only one active enumeration, and it runs to completion before the next enumeration begins. This means that the enumeration is unlikely to be considered “abandoned” by the server.

This problem is not purely theoretical. If you performed a depth-first search on a large directory tree on a network server from a MS-DOS program, you often found that the enumeration ended prematurely because the older enumerations became abandoned by the server. You also saw this from a Win32 program if you were enumerating against an older network server that was designed for MS-DOS clients (and which therefore doesn’t implement FindClose).

You can still use depth-first enumeration while avoiding the abandonment problem. Instead of recursing into a directory immediately after enumerating it, save it on a list. When you are finished enumerating the files in a directory, go back and process the list. You’re still enumerating depth-first, but you are delaying the recursive call until after the directory enumeration has completed.

Explorer uses this “delayed recursion” trick to avoid the problem of prematurely-terminated enumerations when walking through directories on these old servers.

Comments (17)
  1. DrPizza says:

    "Since there is no "Find close" function"

    Huh?

  2. Raymond Chen says:

    That was in the context of MS-DOS. There is no "Find close" function in MS-DOS.

  3. DrPizza says:

    Ben: check out the URL I used for my comment.

    Raymond: Oh, I see.

  4. fragmentation says:

    Another reason to enumerate all entries in a directory before recurse into subdirectories was perfomance. If you didn’t load smartdrv in DOS, the difference was very noticeable because directory entries is usually located near each other (same cluster usually) but far from subdir entries.

  5. A says:

    You also saw this from a Win32 program if you were enumerating against an older network server that was designed for MS-DOS clients (and which therefore doesn’t implement FindClose).

    Do Windows 95 and NT 4.0 count as "older network servers", or is it safe to assume that all 32-bit versions of Windows handle FindClose correctly over a network?

  6. Raymond Chen says:

    You are in better shape if both the server and client are running 32-bit operating systems.

  7. anonymous says:

    You can still use depth-first enumeration while avoiding the abandonment problem. Instead of recursing into a directory immediately after enumerating it, save it on a list. When you are finished enumerating the files in a directory, go back and process the list.

    Um, isn’t that breadth-first enumeration, then?

  8. Raymond Chen says:

    No. You still recurse on all children before returning.

  9. Jon Potter says:

    So you are just changing the order the children are recursed on (as if the subfolders all appeared at the end of the list instead of interspersed with files).

  10. Yet another reason for DOS programs to do depth-first search was memory considerations. We did not have a lot of memory back than so it was cool to not maintain any "state" or list and just do all the work in one scan processing just one file name at a time.

  11. Raymond Chen says:

    Yup. And if your per-file action is slow, you may want to throw *everything* onto the list, close the find handle, then go back and process the list. In that case, you process the list in exactly the same order they came out of the enumerator originally – no change in order.

  12. HA HA HA says:

    ‘clairEvoyance’? jeez chen ur letin me dwon here!

    http://onelook.com/?w=clairevoyance&ls=a

    id aslo liek to sugest in the secand-to-last paragraph chagning ‘abandonment problem’ to ‘abandonment issues’. but i gues tahts a lower proirity.

  13. So given that we all agree that breadth-first is better, why do so many programs like findstr /s or dir /s do depth-first? Seems like any time I use dir /s to find a file it gets mired down in Temporary Internet Files or some similar dumb deep directory. I find myself cursing at it wishing it was going breadth-first instead of depth-first.

  14. Raymond Chen says:

    Because human beings prefer seeing things depth-first.

  15. Rob says:

    I think someone might have missed this lesson when writing the "Find Files" feature for Windows XP. Obviously I’m trolling, but frankly it worked much better 10 years ago in terms of both speed and accuracy. I’d love an explanation.

Comments are closed.