How does [BlobInput] work?

The Azure WebJobs SDK supports running functions when a new blob is added.  IE, you can write code like this:

         public static void CopyWithStream(
            [BlobInput("container/in/{name}")] Stream input,
            [BlobOutput("container/out1/{name}")] Stream output
            )
        {
            Debug.Assert(input.CanRead && !input.CanWrite);
            Debug.Assert(!output.CanRead && output.CanWrite);

            input.CopyTo(output);
        }

See modelbinding to blobs for how we bind the blob to types like Stream.  In this entry, I wanted to explain how we handle the blob listening.  The executive summary is:

  1. The existing blobs in your container are processed immediately on startup.
  2. But once you’re in steady state, [BlobInput] detection (from external sources) can take up to 10 minutes. If you need fast responses, use [QueueInput].
  3. [BlobInput] can be triggered multiple times on the same blob. But the function will only run if the input is newer than the outputs.

More details…

Blob listening is tricky since the Azure Storage APIs don’t provide this directly. WebJobs SDK builds this on top of the existing storage APIs by:

1. Determining the set of containers to listen on by scanning  the [BlobInput] attributes in your program via reflection in the JobHost ctor. This is a  fixed list because while the blob names can have { } expressions, the container names must be constants.  IE, in the above case, the container is named “container”, and then we scan for any blobs in that container that match the name “in/{name}”.

2. When JobHost.RunAndBlock is first called, it will kick off a background scan of the containers. This is naively using CloudBlobContainer.ListBlobs.

    a. For small containers, this is quick and gives a nice instant feel.  
    b. For large containers, the scan can take a long time.

3. For steady state, it will scan the azure storage logs. This provides a highly efficient way of getting notifications for blobs across all containers without pulling. Unfortunately, the storage logs are buffered and only updated every 10 minutes, and so that means that the steady state detection for new blobs can have a 5-10 minute lag. For fast response times at scale, our recommendation is to use Queues.

The scanning from #2 and #3 are done in parallel.

4. There is an optimization where any blob written via a [BlobOutput] (as opposed to being written by some external source) will optimistically check for any matching [BlobInputs], without relying on #2 or #3. This lets them chain very quickly. This means that a [QueueInput] can start a chain of blob outputs / inputs, and it can still be very efficient.

See Also