BatchRunner - using a collection of BackgroundWorker instances for fun

BackgroundWorker is a new class in the .NET 2.0 framework - its purpose in life is to make background operations much easier for developers by providing an event-driven model.  It also uses the thread synchronization context so its ProgressChanged and RunWorkerCompleted events get marshaled back to the necessary thread (for WinForms, the UI thread).  This takes the burden of checking InvokeRequired, and potentially needing to Invoke their logic over to the UI thread manually, off the developer.  It's a huge timesaver and helps developers focus a lot less on plumbing (especially plumbing that's easy to get wrong; multi-threading bugs are notoriously difficult) and more on their code.  In turn, this should help make WinForms UIs better for end users, as developers have a much lower barrier to doing background processing of long-running tasks.  A few good articles related to BackgroundWorker:

  1. Safe, Even Simpler Multithreading in Windows Forms 2.0
  2. Safe multi-threading using BackgroundWorker
  3. Using the Google Web Service (uses BackgroundWorker)
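For readers who haven't used the class yet, here is a minimal console sketch of the event-driven model described above. The class name BackgroundWorkerBasics and the RunOnce helper are made up for illustration; only BackgroundWorker, its DoWork and RunWorkerCompleted events, and RunWorkerAsync come from the framework.

```csharp
using System;
using System.ComponentModel;
using System.Threading;

public static class BackgroundWorkerBasics
{
    // Runs one job on a BackgroundWorker and blocks until its completed handler has fired.
    public static object RunOnce(object argument)
    {
        ManualResetEvent done = new ManualResetEvent(false);
        object result = null;

        BackgroundWorker worker = new BackgroundWorker();

        // DoWork runs on a background thread; e.Argument is whatever RunWorkerAsync was given.
        worker.DoWork += delegate(object sender, DoWorkEventArgs e)
        {
            int limit = (int)e.Argument;
            int sum = 0;
            for (int i = 1; i <= limit; i++)
            {
                sum += i;
            }
            e.Result = sum;
        };

        // RunWorkerCompleted fires once DoWork has returned or thrown.
        worker.RunWorkerCompleted += delegate(object sender, RunWorkerCompletedEventArgs e)
        {
            // Check Error first: reading e.Result after a failure rethrows.
            if (e.Error == null)
            {
                result = e.Result;
            }
            done.Set();
        };

        worker.RunWorkerAsync(argument);
        done.WaitOne();
        return result;
    }

    public static void Main()
    {
        Console.WriteLine("Sum of 1..10 = {0}", RunOnce(10)); // prints "Sum of 1..10 = 55"
    }
}
```

Blocking on an event right after starting the worker defeats the purpose in a real app, of course; it is only there so the console process doesn't exit before the job finishes.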

Ok, enough of the cheerleading :)  The one thing you'll notice about these articles is that they focus on doing background operations from UIs.  That's great, but the BCL team was wise enough to put BackgroundWorker in the System.ComponentModel namespace - they know the usefulness of this class is not limited to just UI operations.

So, one of the tools I use a good bit (usually for quickly checking some response headers) is wget - nice and simple, and works well.  I wanted to do something similar as a C# app but I wanted to do the fetches in parallel.  Since I really like the event-based model of BackgroundWorker, I decided to go with that approach.  However, BackgroundWorker is specific to a single running job - once that job is done you can use the same BackgroundWorker for another job, but it can't do N at the same time.  Hence, I needed to make my own interface that wrapped a collection of BackgroundWorker instances.  I didn't need to make a new class for this, but since the desire to do a batch of N things with a given parallelism seems like something that's pretty unrelated to any particular problem domain, it seemed like it would be a useful class to make.
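The "one job at a time" restriction is easy to see for yourself. The sketch below (class and method names are hypothetical) starts a job that blocks until released, then tries to start a second job on the same instance while IsBusy is still true; RunWorkerAsync rejects the second call with an InvalidOperationException.

```csharp
using System;
using System.ComponentModel;
using System.Threading;

public static class SingleJobDemo
{
    // Returns true if starting a second job on a busy worker was rejected.
    public static bool SecondStartIsRejected()
    {
        ManualResetEvent release = new ManualResetEvent(false);
        ManualResetEvent completed = new ManualResetEvent(false);

        BackgroundWorker worker = new BackgroundWorker();
        worker.DoWork += delegate { release.WaitOne(); };          // hold the first job open
        worker.RunWorkerCompleted += delegate { completed.Set(); };

        worker.RunWorkerAsync("first job");

        bool rejected = false;
        try
        {
            worker.RunWorkerAsync("second job");                   // worker.IsBusy is still true here
        }
        catch (InvalidOperationException)
        {
            rejected = true;
        }

        release.Set();        // let the first job finish
        completed.WaitOne();  // and wait for its completed handler
        return rejected;
    }

    public static void Main()
    {
        Console.WriteLine("Second start rejected while busy: {0}", SecondStartIsRejected());
    }
}
```

So running N jobs in parallel means N BackgroundWorker instances, which is exactly what the wrapper class is for.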

First, there are some design decisions to make.  Specifically, using BackgroundWorker, your time flow is generally: 1) start the job (do-work handler runs) 2) report progress as your job is chugging along 3) completed handler runs.  Since I want to make an interface that's very simple, I decided to dump the "report progress" part for now.  It would be pretty straightforward to add it back in, but the conceptual interface I want is more "here, I have this stack of 200 things to do - run them with a parallelism of 8 and let me know when they're done".  If we're a command-line console application, then reporting progress with parallel jobs running probably doesn't make a whole lot of sense anyway.

Next decision - how related will the jobs be in an instance of this class?  Do we want to handle 200 jobs with varying do-work and worker-completed handlers or keep it simple and make them all the same?  Ok, that's a push-poll phrasing of the question, but yes, to help keep the bar low for usage, I'll keep them all the same.

To make further discussion of the class easier, I'll just refer to the class as BatchRunner since that's what I currently picked for calling it (not terribly exciting, I admit).

As with any kind of utility class or object model (and a lot of other kinds of code), I find you get a lot better results (from test-first/TDD, KISS, and YAGNI angles) if you code the consumer of the code before the code itself.  While many of you would say "duh!" it's amazing how many times you have an idea of what it's going to eventually look like and rush to start coding it instead of the consumer first.

BackgroundWorker already lets you pass in an object so the worker has some input data on his work.  That kind of model should stick around, so developers used to BackgroundWorker will feel at home with BatchRunner.

So what do we want our consumer to look like?  Flow-wise, it should be able to:

  1. Instantiate a BatchRunner, specifying the handlers for doing work and worker completed (remember, we made a design choice to not bother with showing progress of each individual job).  Also, we'll need to specify the parallelism with which to run the jobs - picking a default value for that seems dangerous, as it's far too problem-domain-specific to pick something reasonably intelligent for all situations.
  2. Add the jobs we want to have run.
  3. Wait until they're all done.
  4. We'd want to be able to WaitOne(250ms) or something similar so we'll know as soon as it's done, but can get timeouts to print overall stats numbers.  BatchRunner should provide some useful numbers on how many are in progress, how many are finished, etc. - those would be good to know and to pass along to the user.
  5. Some way of communicating the jobs that have succeeded and failed so far, and for the failed jobs, the specific exceptions.
  6. Method to cancel the runner so all the queued jobs are flushed and all the existing jobs are canceled (via BackgroundWorker's existing cancel mechanism)
  7. Not a hard requirement, but it would be nice to have some way of resetting/clearing the state so if, for instance, you wanted to feed failed jobs back into the runner to have them try again, you wouldn't need to instantiate another runner each time.

So far, we're looking at a BatchRunner interface with these kinds of characteristics:

  1. a ctor with params of DoWorkEventHandler, RunWorkerCompletedEventHandler, and int (the parallelism)
  2. an Add(object) method to add a single new job to run
    1. For ease of use, also an AddAll(ICollection<object>) (similar to AddRange from collections) to batch-add a bunch of jobs.
  3. void Cancel() for mass-canceling the jobs (flush queue, cancel running jobs)
  4. informative properties like:
    1. int CompletedCount
       int CancelledCount
       int ActiveCount
       int QueuedCount
       int ResultsCount
       int ExceptionsCount
    2. it may seem silly to expose these properties as ints instead of the underlying data structures, but I don't want to expose the collections themselves for now, just copies of the collections when desired and counts of the collections otherwise.
  5. methods for getting (copies of, of course) the collections of results and exceptions so far.
    1. Dictionary<object, object> GetResults()
       Dictionary<object, Exception> GetExceptions()
  6. WaitHandle DoneEvent - so the user can Wait infinitely or for a specified timeout or whatever they want

Also, since one of the most common scenarios is "start a ton of jobs, print out console output as to what's going on every so often until they're all done", we'll provide a public static class BatchRunnerConsoleUtil to do just that via 2 static methods void WaitOnBatchRunner(BatchRunner runner) and void PrintBatchRunnerResults(BatchRunner runner).

The only "real" remaining question is locking, since BatchRunner will be used in multi-threaded contexts.  Now, locking isn't something BackgroundWorker itself requires, of course - as many of you already know, you can set the thread synchronization context and BackgroundWorker will happily marshal the worker-completed (and progress-changed, if we were using it, FWIW) handlers over to the specified thread.  However, since BackgroundWorker "out of the box" (IOW, in a console app context) doesn't have that support, I'd rather take the less restrictive route and just assume no marshaling by BackgroundWorker and do my own locking.
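To make the console-app point concrete, here is a small sketch (names are hypothetical) that records which thread the RunWorkerCompleted handler runs on.  With no synchronization context installed, I'd expect the handler to land on a thread other than the one that started the worker, which is why the runner can't assume its handlers are serialized.

```csharp
using System;
using System.ComponentModel;
using System.Threading;

public static class CompletedThreadDemo
{
    // Returns true when RunWorkerCompleted ran on a thread other than the caller's.
    public static bool CompletedRanOnAnotherThread()
    {
        int callerThreadId = Thread.CurrentThread.ManagedThreadId;
        int completedThreadId = callerThreadId;
        ManualResetEvent done = new ManualResetEvent(false);

        BackgroundWorker worker = new BackgroundWorker();
        worker.DoWork += delegate { Thread.Sleep(50); };   // stand-in for real work
        worker.RunWorkerCompleted += delegate
        {
            completedThreadId = Thread.CurrentThread.ManagedThreadId;
            done.Set();
        };

        worker.RunWorkerAsync();
        done.WaitOne();   // the caller is blocked here, so the handler cannot be running on it
        return completedThreadId != callerThreadId;
    }

    public static void Main()
    {
        Console.WriteLine("SynchronizationContext.Current is null: {0}",
            SynchronizationContext.Current == null);
        Console.WriteLine("Completed handler ran on another thread: {0}",
            CompletedRanOnAnotherThread());
    }
}
```

In a WinForms app that starts the worker from the UI thread and doesn't block it, the same handler would be posted back to the UI thread instead.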

Locking is another KISS rule entity.  Very much in line with the "premature optimization is the root of all evil" philosophy, locking is an area where people don't think about relative runtimes much and end up doing more granular/sophisticated locking than necessary because they didn't follow the rule of "do it simple first, then profile, then maybe consider making more complex based on profiling numbers".  Hence, the locking is very simple for this guy - one internal lock that governs all the things with multi-threading issues.  For the usage scenarios I'm targeting, the job runtimes are sufficiently long that this locking is very much lost in the noise.
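Here is one way the "single internal lock" approach could look.  This is a sketch under my own assumptions, not the actual BatchRunner source: the class name BatchRunnerSketch, the field names, and the choice to start DoneEvent signaled while idle are all invented, and Cancel(), AddAll() and the remaining counters are left out to keep it short.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Threading;

// Hypothetical sketch of a BatchRunner-style class; Cancel() and AddAll() are left out.
public class BatchRunnerSketch
{
    private readonly object _sync = new object(); // the single lock guarding every field below
    private readonly Queue<object> _queued = new Queue<object>();
    private readonly Dictionary<object, object> _results = new Dictionary<object, object>();
    private readonly Dictionary<object, Exception> _exceptions = new Dictionary<object, Exception>();
    private readonly ManualResetEvent _done = new ManualResetEvent(true); // signaled while idle
    private readonly DoWorkEventHandler _doWork;
    private readonly RunWorkerCompletedEventHandler _completed;
    private readonly int _parallelism;
    private int _active;

    public BatchRunnerSketch(DoWorkEventHandler doWork, RunWorkerCompletedEventHandler completed, int parallelism)
    {
        if (doWork == null) throw new ArgumentNullException("doWork");
        if (parallelism < 1) throw new ArgumentOutOfRangeException("parallelism");
        _doWork = doWork;
        _completed = completed;
        _parallelism = parallelism;
    }

    public WaitHandle DoneEvent { get { return _done; } }
    public int QueuedCount { get { lock (_sync) { return _queued.Count; } } }
    public int ActiveCount { get { lock (_sync) { return _active; } } }
    public int ResultsCount { get { lock (_sync) { return _results.Count; } } }
    public int ExceptionsCount { get { lock (_sync) { return _exceptions.Count; } } }

    // Arguments double as dictionary keys, so they must be non-null (and distinct to keep every result).
    public void Add(object argument)
    {
        if (argument == null) throw new ArgumentNullException("argument");
        lock (_sync)
        {
            _done.Reset();
            _queued.Enqueue(argument);
            StartQueuedJobs();
        }
    }

    // Caller must hold _sync. Starts workers until the parallelism limit is reached.
    private void StartQueuedJobs()
    {
        while (_active < _parallelism && _queued.Count > 0)
        {
            object argument = _queued.Dequeue();
            BackgroundWorker worker = new BackgroundWorker();
            worker.DoWork += _doWork;
            worker.RunWorkerCompleted += delegate(object sender, RunWorkerCompletedEventArgs e)
            {
                OnJobCompleted(sender, argument, e);
            };
            _active++;
            worker.RunWorkerAsync(argument);
        }
    }

    private void OnJobCompleted(object sender, object argument, RunWorkerCompletedEventArgs e)
    {
        lock (_sync)
        {
            // Check Error before Result: reading Result of a failed job throws.
            if (e.Error != null) _exceptions[argument] = e.Error;
            else if (!e.Cancelled) _results[argument] = e.Result;
        }
        if (_completed != null) _completed(sender, e); // user code runs outside the lock
        lock (_sync)
        {
            _active--;
            StartQueuedJobs();
            if (_active == 0 && _queued.Count == 0) _done.Set();
        }
    }
}

public static class BatchRunnerSketchDemo
{
    public static void Main()
    {
        BatchRunnerSketch runner = new BatchRunnerSketch(
            delegate(object sender, DoWorkEventArgs e)
            {
                int n = (int)e.Argument;
                if (n == 3) throw new InvalidOperationException("simulated failure for 3");
                e.Result = n * n;
            },
            null, 2);
        for (int i = 1; i <= 5; i++) runner.Add(i);
        runner.DoneEvent.WaitOne();
        Console.WriteLine("Results: {0}, exceptions: {1}", runner.ResultsCount, runner.ExceptionsCount);
    }
}
```

The user's completed handler is deliberately invoked outside the lock so that slow or re-entrant handler code can't stall the other workers; the cost is that a handler which throws would leave the active count stuck, something a real implementation would want to guard against.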

So, back to the important point of view, the consumer of the class.  What does he look like now?

Here's the simple little driver that I currently have as a test - it fetches some URLs using a parallelism of 2 (I think IE does the same for a single site, maybe 4 for different sites).  We're currently just passing URLs, but we could just as easily pass more complicated data.  Also, we're passing static methods for the do-work and worker-completed handlers, but we could just as easily pass instance methods.  For instance, if you had a target directory that you wanted all the files to download to, you could set that as an instance variable and the worker method could download the file there instead of the temp location we use now.

using System;
using System.ComponentModel;
using System.IO;
using System.Net;

// Enclosing class assumed from the SimpleWebClientUser.DoWork references in Driver().
public static class SimpleWebClientUser
{
    public static void Driver()
    {
        BatchRunner runner = new BatchRunner(SimpleWebClientUser.DoWork, SimpleWebClientUser.RunWorkerCompleted, 2);

        // The three "bogus" hosts are there on purpose so that some jobs fail.
        string[] pictureUrls = {
            "https://xxxbogus/number1",
            "https://www.cs.berkeley.edu/~efros/images/microsoft-1978.jpg",
            "https://money.cnn.com/2002/10/17/technology/microsoft/microsoft_outside_sign.03.jpg",
            "https://yyybogus/number2",
            "https://research.microsoft.com/~jiangli/portrait/portraitpc.jpg",
            "https://blog.seattlepi.nwsource.com/microsoft/archives/conceptcar.jpg",
            "https://zzzbogus/number3",
        };

        runner.AddAll(pictureUrls);
        BatchRunnerConsoleUtil.WaitOnBatchRunner(runner);
        BatchRunnerConsoleUtil.PrintBatchRunnerResults(runner);
    }

    // Downloads the URL passed as the job argument to a temp file; the file path is the job's result.
    public static void DoWork(object sender, DoWorkEventArgs e)
    {
        string destinationFile = Path.GetTempFileName();
        using (WebClient client = new WebClient())
        {
            client.DownloadFile(e.Argument as string, destinationFile);
        }
        e.Result = destinationFile;
    }

    // Runs once per job, whether the download succeeded or threw.
    public static void RunWorkerCompleted(object sender, RunWorkerCompletedEventArgs e)
    {
        if (e.Error != null)
        {
            Console.WriteLine("Download had an exception: {0}", e.Error.Message);
        }
        else
        {
            Console.WriteLine("Download completed fine to file: {0}", e.Result);
        }
    }
}

 

Nice and simple.  We created a runner, told it what handlers to use for doing the work and for when it's completed, gave it a bunch of work to do, then waited for it to complete.  Note that there are some intentionally bogus URLs in there because we want our worker methods to throw exceptions for some of these and report those problems.

BatchRunnerConsoleUtil isn't terribly interesting in and of itself, but it does show usage of some of BatchRunner's interfaces (DoneEvent, the count properties, GetResults() and GetExceptions()).  I don't do it here, but taking jobs that have failed and retrying them is very simple, since that's one of the design goals.  I tend to think that since retry isn't a given, any desired retry logic should go into the worker method itself, but a consumer of BatchRunner could feed the failed arguments back in to try them again easily enough.  Doing the retry in the worker method has the nice consequence that it can be more intelligent about the particular situation (some exceptions you don't want to bother trying again, some you will), and you have an easy mechanism for not retrying when a particular problem happens (just let the exception bubble out).

using System;
using System.Collections.Generic;

public static class BatchRunnerConsoleUtil
{
    // Polls DoneEvent with a 500 ms timeout, printing the runner's counts after each timeout.
    public static void WaitOnBatchRunner(BatchRunner runner)
    {
        while (!runner.DoneEvent.WaitOne(500, false))
        {
            Console.WriteLine("{0}: Number queued: {1}, Number active: {2}, Number completed: {3}",
                DateTime.Now.ToLongTimeString(),
                runner.QueuedCount,
                runner.ActiveCount,
                runner.CompletedCount);
        }
        Console.WriteLine("********* BatchRunner has completed *************");
    }

    // Dumps the final tallies, keyed by the argument each job was given.
    public static void PrintBatchRunnerResults(BatchRunner runner)
    {
        Console.WriteLine("Got {0} cancelled jobs", runner.CancelledCount);

        Console.WriteLine("Got {0} results", runner.ResultsCount);
        foreach (KeyValuePair<object, object> pair in runner.GetResults())
        {
            Console.WriteLine("Argument {0} gave result: {1}", pair.Key, pair.Value);
        }

        Console.WriteLine("Got {0} exceptions", runner.ExceptionsCount);
        foreach (KeyValuePair<object, Exception> pair in runner.GetExceptions())
        {
            Console.WriteLine("Argument {0} gave exception: {1}", pair.Key, pair.Value.Message);
        }
    }
}
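As for the retry-in-the-worker idea mentioned before the listing, a rough sketch might look like the following.  Everything here is hypothetical: the RunWithRetry helper, the Attempt delegate, and the choice of IOException as the "worth retrying" case are just for illustration (a download worker would more likely look at WebException details), and FlakyOperation only simulates a job that fails twice.

```csharp
using System;
using System.ComponentModel;
using System.IO;

public static class RetryingWorkerSketch
{
    // One attempt at the real work; a stand-in for something like a download.
    public delegate object Attempt(object argument);

    // Retries only IOException (assumed transient here); anything else bubbles out immediately.
    public static object RunWithRetry(Attempt attempt, object argument, int maxAttempts)
    {
        for (int tries = 1; ; tries++)
        {
            try
            {
                return attempt(argument);
            }
            catch (IOException)
            {
                if (tries >= maxAttempts) throw; // out of attempts: let the runner record the failure
            }
        }
    }

    private static int _calls;

    // Simulated flaky operation: fails on the first two calls, then succeeds.
    private static object FlakyOperation(object argument)
    {
        _calls++;
        if (_calls < 3) throw new IOException("simulated transient failure");
        return "done: " + argument;
    }

    // A do-work handler with the same shape as the one handed to the runner.
    public static void DoWork(object sender, DoWorkEventArgs e)
    {
        e.Result = RunWithRetry(FlakyOperation, e.Argument, 3);
    }

    public static void Main()
    {
        DoWorkEventArgs e = new DoWorkEventArgs("job-1");
        DoWork(null, e);
        Console.WriteLine("{0} after {1} calls", e.Result, _calls); // prints "done: job-1 after 3 calls"
    }
}
```

Because the final failure is simply rethrown, it still ends up in the runner's exceptions collection like any other failed job.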

 

Now that we've covered BatchRunnerConsoleUtil, let's look at the output we get when we actually run our "download these URLs" test.  Some of the output is from BatchRunnerConsoleUtil and some from the worker-completed handler, but the interleaving should help make it clear what's going on at runtime.  Note that we get the full Exception objects when a problem happens; I currently display only the Message property to keep from dumping out stack traces that have little additional value.

10:41:51 AM: Number queued: 5, Number active: 2, Number completed: 0
10:41:52 AM: Number queued: 5, Number active: 2, Number completed: 0
10:41:53 AM: Number queued: 5, Number active: 2, Number completed: 0
10:41:53 AM: Number queued: 5, Number active: 2, Number completed: 0
10:41:54 AM: Number queued: 5, Number active: 2, Number completed: 0
10:41:54 AM: Number queued: 5, Number active: 2, Number completed: 0
10:41:55 AM: Number queued: 5, Number active: 2, Number completed: 0
10:41:55 AM: Number queued: 5, Number active: 2, Number completed: 0
Download completed fine to file: C:\Documents and Settings\jmanning\Local Settings\Temp\tmp464D.tmp
10:41:56 AM: Number queued: 4, Number active: 2, Number completed: 1
Download completed fine to file: C:\Documents and Settings\jmanning\Local Settings\Temp\tmp464E.tmp
10:41:56 AM: Number queued: 3, Number active: 2, Number completed: 2
10:41:57 AM: Number queued: 3, Number active: 2, Number completed: 2
10:41:57 AM: Number queued: 3, Number active: 2, Number completed: 2
10:41:58 AM: Number queued: 3, Number active: 2, Number completed: 2
10:41:58 AM: Number queued: 3, Number active: 2, Number completed: 2
Download had an exception: The remote name could not be resolved: 'xxxbogus'
10:41:59 AM: Number queued: 2, Number active: 2, Number completed: 3
Download had an exception: The remote name could not be resolved: 'yyybogus'
10:42:00 AM: Number queued: 1, Number active: 2, Number completed: 4
Download completed fine to file: C:\Documents and Settings\jmanning\Local Settings\Temp\tmp4650.tmp
10:42:00 AM: Number queued: 0, Number active: 2, Number completed: 5
10:42:01 AM: Number queued: 0, Number active: 2, Number completed: 5
Download completed fine to file: C:\Documents and Settings\jmanning\Local Settings\Temp\tmp4651.tmp
10:42:01 AM: Number queued: 0, Number active: 1, Number completed: 6
10:42:02 AM: Number queued: 0, Number active: 1, Number completed: 6
10:42:02 AM: Number queued: 0, Number active: 1, Number completed: 6
10:42:03 AM: Number queued: 0, Number active: 1, Number completed: 6
Download had an exception: The remote name could not be resolved: 'zzzbogus'
********* BatchRunner has completed *************
Got 0 cancelled jobs
Got 4 results
Argument https://www.cs.berkeley.edu/~efros/images/microsoft-1978.jpg gave result: C:\Documents and Settings\jmanning\Local Settings\Temp\tmp464D.tmp
Argument https://money.cnn.com/2002/10/17/technology/microsoft/microsoft_outside_sign.03.jpg gave result: C:\Documents and Settings\jmanning\Local Settings\Temp\tmp464E.tmp
Argument https://research.microsoft.com/~jiangli/portrait/portraitpc.jpg gave result: C:\Documents and Settings\jmanning\Local Settings\Temp\tmp4650.tmp
Argument https://blog.seattlepi.nwsource.com/microsoft/archives/conceptcar.jpg gave result: C:\Documents and Settings\jmanning\Local Settings\Temp\tmp4651.tmp
Got 3 exceptions
Argument https://xxxbogus/number1 gave exception: The remote name could not be resolved: 'xxxbogus'
Argument https://yyybogus/number2 gave exception: The remote name could not be resolved: 'yyybogus'
Argument https://zzzbogus/number3 gave exception: The remote name could not be resolved: 'zzzbogus'

I'm not able to post the source just yet, but the concept is pretty straightforward, and I think you could probably bang it out in a couple of hours.  Just remember to keep the locking simple - you're expecting processing times that far dominate expected lock times, so the KISS principle is your friend here.