IEnumerable and interleaving of I/O with CPU

Not all problems lend themselves to this, but as a general suggestion: if you're processing a T[], a List<T>/IList<T>, an ArrayList, or a similar collection of items in a single pass (especially items coming from I/O sources - files, network, etc.), you should consider (as with all perf, measure, measure, measure!) going the piecemeal route with IEnumerable<T> (or plain IEnumerable for the non-generic case, of course).

Let's look at an admittedly contrived method that counts the number of regex matches in a file on a line-by-line basis.

static long CountMatches(string filename, Regex regex)
{
    // Read the entire file into memory up front; no matching can
    // start until the last byte has been read.
    string[] lines = File.ReadAllLines(filename);

    long matchCount = 0;
    foreach (string line in lines)
    {
        matchCount += regex.Matches(line).Count;
    }
    return matchCount;
}

As you can imagine, this approach splits the work into two distinct phases: lots of I/O up front as we read the entire file into an in-memory array (is that your working set or are you just happy to see me?), then lots of CPU as we loop over that array.  The entire time we're fetching the file into memory, our CPU likely has plenty of spare cycles that could have been crunching regex.Matches for us.  We're also keeping strong references to all the members of the array (the entire contents of the file), even after we're done processing them and could have made them available for GC.

Switching over to an IEnumerable<string> approach (which I like to think of as a "stream of strings", though I need a better phrase since Stream already has a BCL meaning, of course), we can fix a lot of those issues.

static IEnumerable<string> ReadFileLines(string filename)
{
    using (StreamReader streamReader = File.OpenText(filename))
    {
        while (!streamReader.EndOfStream)
        {
            // Hand back one line at a time; execution pauses here
            // until the caller asks for the next line.
            yield return streamReader.ReadLine();
        }
    }
}

static long CountMatches(string filename, Regex regex)
{
    IEnumerable<string> lines = ReadFileLines(filename);

    long matchCount = 0;
    foreach (string line in lines)
    {
        matchCount += regex.Matches(line).Count;
    }
    return matchCount;
}

Now we're better off on multiple fronts:

  1. We can start the Regex'ing as soon as we get the first line - our CPU isn't going to twiddle its thumbs (today's secret word is anthropomorphize) waiting on the entire file to be read.
  2. We're not forcing the entire file to be in memory - we only keep references to lines as long as we need them, and then they're free for the GC to collect as needed.
  3. If the CPU can keep up (very likely here), the total time for the uncached-file case drops from roughly (load time) + (time for N regex.Matches calls) to roughly (load time) + (time for 1 regex.Matches call), since all but the final Matches call overlap with the I/O.  (You can watch this happen with the quick trace sketched below.)
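If you want to see the interleaving for yourself, here's a quick sketch (Traced and TraceInterleaving are made-up helper names, just for illustration) that timestamps when each line becomes available and when it gets matched.  Fed by ReadFileLines, the first line shows up almost immediately; swap in File.ReadAllLines and the first timestamp only appears after the entire file has been loaded:

static IEnumerable<string> Traced(IEnumerable<string> lines, Stopwatch stopwatch)
{
    foreach (string line in lines)
    {
        Console.WriteLine("[{0,6} ms] line available", stopwatch.ElapsedMilliseconds);
        yield return line;
    }
}

static void TraceInterleaving(string filename, Regex regex)
{
    Stopwatch stopwatch = Stopwatch.StartNew();  // System.Diagnostics

    // Swap ReadFileLines(filename) for File.ReadAllLines(filename) to
    // compare: the first "line available" then only prints after the
    // whole file has been read.
    foreach (string line in Traced(ReadFileLines(filename), stopwatch))
    {
        regex.Matches(line);
        Console.WriteLine("[{0,6} ms] line matched", stopwatch.ElapsedMilliseconds);
    }
}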

Admittedly, this leverages the fact that the various layers (filesystem cache, hard drive, etc.) prefetch/read ahead: once they notice you've read pages/sectors 1..N of a file, they assume you're likely to want N+1..N+x as well and start fetching those early to cut down latency.

It's not a panacea, of course, so it's worth listing the negatives:

  1. Total CPU cycles needed by this process will be higher.  While the IEnumerable introduces a state machine (roughly sketched after this list), my gut feeling is that the bigger CPU-cycle increase comes from potential loss of icache/dcache perf: we traded a nice, tight loop of Matches calls for a loop that bounces between MoveNext, StreamReader internals, our own code, etc.
  2. If there's an IOException halfway through the file (definitely a corner case), you may have wasted a bunch of CPU, depending on whether reporting partial results is acceptable to you.  More generally, you've lost the early knowledge of whether the entire data structure can be built successfully, and you had obviously already lost up-front knowledge of things like total count/size.
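For reference, here's a rough, hand-simplified sketch of the kind of state machine the compiler generates for an iterator method like ReadFileLines.  The names here are made up, and the real generated type has more plumbing (it also implements IEnumerable<string>, handles threading concerns, and so on), but it shows where the extra state and virtual MoveNext calls come from:

// Simplified sketch only - not the actual compiler output.
// (Assumes System, System.Collections, System.Collections.Generic, System.IO.)
class ReadFileLinesEnumerator : IEnumerator<string>
{
    private readonly string filename;
    private StreamReader streamReader;
    private string current;
    private int state;  // which point in ReadFileLines we're "paused" at

    public ReadFileLinesEnumerator(string filename)
    {
        this.filename = filename;
    }

    public string Current { get { return current; } }
    object IEnumerator.Current { get { return current; } }

    public bool MoveNext()
    {
        switch (state)
        {
            case 0:  // first call: run from the top of the method
                streamReader = File.OpenText(filename);  // entering the using
                state = 1;
                goto case 1;
            case 1:  // resumed at the top of the while loop
                if (!streamReader.EndOfStream)
                {
                    current = streamReader.ReadLine();  // the yield return
                    return true;
                }
                Dispose();  // falling out of the loop exits the using block
                return false;
            default:
                return false;
        }
    }

    public void Dispose()
    {
        state = -1;
        if (streamReader != null)
        {
            streamReader.Dispose();
        }
    }

    public void Reset()
    {
        throw new NotSupportedException();
    }
}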

Obviously, this is contrived, as BufferedStream is also available, among other techniques.
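For instance, one such technique (a sketch - ReadFileLinesBuffered is a made-up name, and the 64KB buffer size is an arbitrary choice, so measure for your workload) is to layer a BufferedStream with a larger buffer under the StreamReader:

static IEnumerable<string> ReadFileLinesBuffered(string filename)
{
    using (FileStream fileStream = File.OpenRead(filename))
    using (BufferedStream bufferedStream = new BufferedStream(fileStream, 64 * 1024))
    using (StreamReader streamReader = new StreamReader(bufferedStream))
    {
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}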

If you're making an API for others to consume, likewise consider taking IEnumerable<T> (or plain IEnumerable) if you can operate on your input as such.  This is already in the design guidelines, and it frees your API's consumers to feed you items piecemeal instead of needing to build a data structure ahead of time.  They can still pass a T[] or List<T> if they happen to have one, of course.
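As a sketch of what that looks like here, the counting logic above could be split so the core overload takes any sequence, with the file-based version as a thin wrapper:

static long CountMatches(IEnumerable<string> lines, Regex regex)
{
    long matchCount = 0;
    foreach (string line in lines)
    {
        matchCount += regex.Matches(line).Count;
    }
    return matchCount;
}

// Callers holding a string[] or List<string> can pass it directly;
// the file-based version just wires up the streaming source.
static long CountMatches(string filename, Regex regex)
{
    return CountMatches(ReadFileLines(filename), regex);
}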

And, since it is worth repeating, as with all perf issues, measure, measure, measure!