An Async Html cache – Part I – Writing the cache

In the process of converting a financial VBA Excel Addin to .NET (more on that in later posts), I found myself in dire need of an HTML cache that can be called from multiple threads without blocking them. Visualize it as a glorified dictionary where each entry is (url, cachedHtml). The only difference is that when you get the page, you pass a callback to be invoked when the html has been loaded (which could be immediately if the html had already been retrieved by someone else).

In essence, I want this:

    Public Sub GetHtmlAsync(ByVal url As String, ByVal callback As Action(Of String))

I’m not a big expert in the .NET Parallel Extensions, but I’ve got help. Stephen Toub helped so much with this that he could have blogged about it himself. And, by the way, this code runs on Visual Studio 2010, which we haven’t shipped yet. I believe that, with some modifications, it can be run in 2008 + .NET Parallel Extensions CTP, but you’ll have to change a bunch of names.

In any case, here it comes. First, let’s add some imports.

    Imports System.Collections.Concurrent
    Imports System.Threading.Tasks
    Imports System.Threading
    Imports System.Net

Then, let’s define an asynchronous cache.

    Public Class AsyncCache(Of TKey, TValue)

This thing needs to store the (url, html) pairs somewhere and, luckily enough, there is a handy ConcurrentDictionary that I can use. Also, the cache needs to know how to load a TValue given a TKey. In ‘programmingese’, that means:

        Private _loader As Func(Of TKey, TValue)
        Private _map As New ConcurrentDictionary(Of TKey, Task(Of TValue))

I’ll need a way to create it.

        Public Sub New(ByVal l As Func(Of TKey, TValue))
            _loader = l
        End Sub

Notice in the above code the use of the Task class for my dictionary instead of TValue. Task is a very good abstraction for “do some work asynchronously and call me when you are done”. It’s easy to initialize and it’s easy to attach callbacks to it. Indeed, this is what we’ll do next:

        Public Sub GetValueAsync(ByVal key As TKey, ByVal callback As Action(Of TValue))

            Dim task As Task(Of TValue) = Nothing
            If Not _map.TryGetValue(key, task) Then
                task = New Task(Of TValue)(Function() _loader(key), TaskCreationOptions.DetachedFromParent)
                If _map.TryAdd(key, task) Then
                    task.Start()
                Else
                    task.Cancel()
                    _map.TryGetValue(key, task)
                End If
            End If

            task.ContinueWith(Sub(t) callback(t.Result))
        End Sub

Wow. Ok, let me explain. This method is divided in two parts. The first part is just a thread safe way to say “give me the task corresponding to this key or, if the task hasn’t been inserted in the cache yet, create it and insert it”. The second part just says “add callback to the list of functions to be called when the task has finished running”.

The first part needs some more explanation. What is TaskCreationOptions.DetachedFromParent? It essentially says that the created task is not going to prevent the parent task from terminating. In essence, the task that created the child task won’t wait for its conclusion. The rest is better explained in comments.

            If Not _map.TryGetValue(key, task) Then ' Is the task in the cache? (Loc. X)
                task = New Task(Of TValue)(Function() _loader(key), TaskCreationOptions.DetachedFromParent) ' No, create it
                If _map.TryAdd(key, task) Then ' Try to add it
                    task.Start() ' I succeeded. I'm the one who added this task. I can safely start it.
                Else
                    task.Cancel() ' I failed, someone inserted the task after I checked in (Loc. X). Cancel it.
                    _map.TryGetValue(key, task) ' And get the one that someone inserted
                End If
            End If

Got it? Well, I admit I trust Stephen that this is what I should do …
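
A note for anyone on the released .NET Framework 4 rather than the pre-release bits: as far as I know, Task.Cancel() and TaskCreationOptions.DetachedFromParent did not make it into the final API, so the method above would need changes there. Below is a minimal sketch of the same “create once, start once” idea using a different technique, ConcurrentDictionary.GetOrAdd combined with Lazy(Of T). The class name AsyncCacheSketch is made up for illustration; this is not the code Stephen and I wrote above.

```vbnet
Imports System
Imports System.Collections.Concurrent
Imports System.Threading.Tasks

' Sketch only (hypothetical class name): same contract as AsyncCache,
' written against APIs that exist in the released .NET Framework 4.
Public Class AsyncCacheSketch(Of TKey, TValue)

    Private ReadOnly _loader As Func(Of TKey, TValue)

    ' Lazy wraps the task so that creating an entry does not start any work.
    Private ReadOnly _map As New ConcurrentDictionary(Of TKey, Lazy(Of Task(Of TValue)))

    Public Sub New(ByVal loader As Func(Of TKey, TValue))
        _loader = loader
    End Sub

    Public Sub GetValueAsync(ByVal key As TKey, ByVal callback As Action(Of TValue))
        ' GetOrAdd may run the factory on several racing threads,
        ' but it stores and returns a single Lazy per key.
        Dim entry As Lazy(Of Task(Of TValue)) = _map.GetOrAdd(key,
            Function(k) New Lazy(Of Task(Of TValue))(
                Function() Task(Of TValue).Factory.StartNew(Function() _loader(k))))

        ' Reading Value starts the task the first time and returns the same task afterwards.
        entry.Value.ContinueWith(Sub(t) callback(t.Result))
    End Sub

End Class
```

The Lazy wrapper is what replaces the Start/Cancel dance: a losing thread's Lazy is simply discarded without its Value ever being read, so the loader should run once per key.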

I can then create my little HTML Cache by using the above class as in:

    Public Class HtmlCache

        Public Sub GetHtmlAsync(ByVal url As String, ByVal callback As Action(Of String))
            _asyncCache.GetValueAsync(url, callback)
        End Sub

        Private Function LoadWebPage(ByVal url As String) As String
            Using client As New WebClient()
                'Test.PrintThread("Downloading on thread {0} ...")
                Return client.DownloadString(url)
            End Using
        End Function

        Private _asyncCache As New AsyncCache(Of String, String)(AddressOf LoadWebPage)

    End Class

I have no idea why coloring got disabled when I copy/paste. It doesn’t matter, this is trivial. I just create an AsyncCache and initialize it with a method that knows how to load a web page. I then simply implement GetHtmlAsync by delegating to the underlying GetValueAsync on AsyncCache.

It is somewhat bizarre to call WebClient.DownloadString when the design could be revised to take advantage of its asynchronous version. Maybe I’ll do it in another post. Next time, I’ll write code to use this thing.
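
For the curious, here is a rough sketch of what the asynchronous variant might look like. It assumes the loader would return a Task(Of String) instead of a String, which the AsyncCache above does not support as written. The module name AsyncDownloadSketch and the function DownloadStringTask are hypothetical names introduced for illustration; the sketch wraps WebClient's event-based DownloadStringAsync in a TaskCompletionSource, so no thread sits blocked waiting for the download.

```vbnet
Imports System
Imports System.Net
Imports System.Threading.Tasks

Public Module AsyncDownloadSketch

    ' Hypothetical helper, not part of the cache above: exposes
    ' WebClient.DownloadStringAsync as a Task(Of String).
    Public Function DownloadStringTask(ByVal url As String) As Task(Of String)
        Dim tcs As New TaskCompletionSource(Of String)()
        Dim client As New WebClient()

        AddHandler client.DownloadStringCompleted,
            Sub(sender As Object, e As DownloadStringCompletedEventArgs)
                ' Translate the event outcome into the task's final state.
                If e.Cancelled Then
                    tcs.TrySetCanceled()
                ElseIf e.Error IsNot Nothing Then
                    tcs.TrySetException(e.Error)
                Else
                    tcs.TrySetResult(e.Result)
                End If
                client.Dispose()
            End Sub

        ' Returns immediately; the handler above fires when the download ends.
        client.DownloadStringAsync(New Uri(url))
        Return tcs.Task
    End Function

End Module
```

With a loader of this shape, the cache could store the returned task directly instead of wrapping a blocking call in a new Task.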

Comments (6)

  1. Greg says:

    It would be much easier to use a normal thread safe collection class.

    Each element would have:


     url (string)

     status (loaded, failed, waiting to load, partially loaded)

     last status change (date/time)

     html_loaded (string)

     last_referenced (date/time)


    Class methods

      Get HTML from URL(boolean lookup_only = false, int max_block_seconds = 0 /* -1 block forever, 0 – don’t block, otherwise block for X seconds*/)

      Get HTML from KEY(boolean lookup_only = false)



    A thread or threads internal to the class would load the html asynchronously and be invoked via a clock timer with ticks a few seconds apart.

    Attaching a callback for each request is much harder to implement.  It is up to the method requesting the URL to decide whether it blocks, needs an asynchronous callback/interrupt, or polls for data.

    The idea is that for nearly all cases, no new threads should be created and no new callbacks should be hooked up.  This keeps your code easier to understand and debug.  Common faults and scenarios are handled easily:

     – requesting thread terminates

     – asynchronous load times out

     – error loading html

     – html hasn’t been used for 5 minutes and can be removed (a tunable cache parameter)

     – memory limit of cache reached and unreferenced html strings can be removed (a tunable cache parameter)

     – duplicate request for a URL/KEY from more than one thread

     – html can be loaded from multiple sources (web, file, network share, ftp, database, etc.).

     – html load failed as html string exceeds the size limit on loaded string (e.g., a tunable cache parameter)

     – The common problem with attempting a callback for a method that is terminated is avoided.  That’s a problem when the callback requires the cache to build a complex packet of data to pass in the callback.

    This is quite similar to the basic page handling algorithm in a virtual memory system (circa 1980).  It’s how one handled this in systems lacking real threading or with non-reentrant GUI message handling (VB6 GUI/MFC GUI posting a message to the current winform indicating that the asynchronous request completed).

  2. lucabol says:

    Thanks Greg, these are good comments.

    We have a different design goal though. Both solutions are valid. I want the method requesting the URL to have the flexibility of deciding what to do (aka have a callback). I do want the exposed API to be async.

    The rest of your comments talk to the difference between writing production code and a conceptual example. I’m doing the latter here.

  3. Greg says:

    The idea of wrapping the asynchronous cache handler in a class is to reduce or eliminate the need for callers to be asynchronous.  This makes coding the caller’s class much easier.

    The other aspect is that the amount of work done in an asynchronous callback should be minimal, since you don’t know when it will be executed.  For example, you get a callback call with the HTML you need whilst you are destroying the caller’s object.  This is more important when dealing with large amounts of data in each cache entry (e.g., large xml strings), since processing each cache entry may take considerable time.
