Async workflow [how to hibernate async methods, part 2]

Four years ago I wrote a post, "How to hibernate async methods (how to serialize a task)". I decided to dig it up and improve it. Here's what I can now write with it:

await AlphaAsync().RunWithCheckpointing("a.json");
// This kicks off the async method. But it allows the async method to
// serialize its state to disk.

await Checkpoint.Save();
// This serializes a snapshot of the async method’s current state to disk.
// It also serializes the async caller who is awaiting it, and so on
// all the way up the async callstack (up to "RunWithCheckpointing")

throw new Checkpoint.Defer(TimeSpan.FromHours(1));
// I can abort my current method now. A worker will resume
// it in time, once a condition is met (in this case a delay for one hour)

Task t = Checkpoint.ResumeFrom("a.json");
// You can transfer that serialized state across the wire to another machine,
// which can deserialize it to resume work exactly where it left off.
// The task "t" here stands for the top-level AlphaAsync().


My idea is to enable a kind of "async workflow" -- I'd like to write my long-running processes as async methods, using for loops/while loops and exceptions and recursive functions and all that - things where the current execution state is actually a pretty complicated thing to serialize manually, and where I need all the assistance I can get. The "Checkpoint" class provides that assistance.
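
For contrast, here is a minimal sketch of what serializing execution state by hand looks like for the simplest possible case: one loop with one accumulator. The names SumState and ManualCheckpoint are hypothetical and are not part of the Checkpoint code; the sketch assumes System.Text.Json is available. Even here the loop counter and the running total have to be lifted out of the method into a separate class, which is the bookkeeping that nested loops, exceptions and recursion make much harder.

```csharp
using System;
using System.Text.Json;

// Hypothetical type: the hand-written "execution state" of one loop.
public class SumState
{
    public int NextIndex { get; set; }   // where the loop should continue
    public long Total { get; set; }      // the loop's only local variable
}

public static class ManualCheckpoint
{
    // Runs at most maxSteps iterations of "sum the integers below limit",
    // starting from the saved state (or from scratch when json is null),
    // and returns the new state as JSON so the caller can persist it.
    public static string RunSteps(string json, int limit, int maxSteps)
    {
        SumState s = json == null
            ? new SumState()
            : JsonSerializer.Deserialize<SumState>(json);

        for (int steps = 0; s.NextIndex < limit && steps < maxSteps; steps++)
        {
            s.Total += s.NextIndex;
            s.NextIndex++;
        }
        return JsonSerializer.Serialize(s);
    }
}
```

Calling ManualCheckpoint.RunSteps(null, 10, 4) does the first four iterations and returns the saved state; passing that string back in with a larger maxSteps finishes the sum, possibly in a different process.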


Where might it be useful? ...

  • You could kick off a long-running method on Azure, which saves its state periodically. If the Azure VM goes down then another VM can just pick up where it left off.
  • You could write an interactive website which prompts the user to click a button, then saves its state to disk and releases all resources. When the user clicks the button on their website, that incoming HTTP message causes the state to be picked up from where it left off.
  • You could write a mobile agent, which can migrate to another machine to be near to some local resources to do the rest of its work.


I don't know if this idea has legs! It might be completely crazy. It might be that serializing state is too difficult to do reliably.


Source code is here:


Please let me know if you have thoughts on this - feasibility, directions, improvements etc. Next time I revisit this topic, I plan to make the code more robust - make it deal with VB, with ConfigureAwait(), with Task.WhenAll and Task.WhenAny, and most importantly make it use EntityFramework to serialize the current state of async methods in the async callstack. Once that's done I'll package it up as a NuGet library.

Comments (15)

  1. Ben Adams says:

    Reliable execution with Service Fabric (building on reliable state)
    Or a ParallelMap->Pause->Distribute->Resume<-Reduce pattern

  2. Joshua says:


    I nearly fell off my seat today when I stumbled across your original article, and then fully fell off when I noticed you had updated it only a few days ago. I am very interested in your work here. I have been building (and rebuilding) a workflow-style framework using C# for a number of years (always interrupted by other things). I had recently begun to pull together all of my old work and move the project forward again, and I hit this exact roadblock. I have what amounts to a very declarative, continuation-passing style, evaluation structure. The elements of the structure handle business logic as well as basic control flow. More importantly, the individual elements can be arbitrarily composed while still retaining certain fundamental, and highly desirable properties. Unfortunately, the implementations of the individual elements are starting to look like rather ugly, manually coded state machines.

    I thought to myself that it would be perfect if I could somehow leverage the async/await state machine rewrite that MS has already done, but not for awaiting. Instead, I just need the syntactic sugar, and I need the async method to return to the caller, and then have the caller be able to call back to the method when it is ready (maybe having serialized the state, maybe not). Sometimes when I return to the caller, I do want to serialize (my whole stack) and then wait for the next request and continue. Other times, I don’t want to serialize at all. Instead, I just want to avoid the actual call-stack that would normally arise in CPS (also avoid having to figure out some tail-call optimization) so I have a manager at the top that calls each method, and the method then either tells the manager it is done, or asks the manager to call a child for it (so, I am never deeply nested in a call-stack, I am only ever one level below the manager). Having the method ask the manager to call its child (and then of course call the original method again as the continuation), however, makes the method quite ugly – a nasty state machine. It would be beautiful if I could use async/await style and when I want the manager to call a child on my behalf, I return to the manager, and give it the child to call, and my continuation to call afterwards. This means I need to yield to the manager, allow myself to stop (not wait for something as in normal await – it looks like you do this by throwing an exception in your older version, not sure yet how you do it now, exception felt heavy, but maybe it is the best way), give the manager my continuation (my async state machine structure already poised to continue where it left off), and give the manager the child I want it to call. Then, the manager can call the child, and then call my continuation.

    This way I get CPS with no deeply-nested call stack. Plus, I get the ability to allow myself, and any of my children to completely yield, all the way to the top (the original request to the manager), get serialized, and get resurrected at any time in the future, on any machine. This is exactly what I need. It’s really beautiful. I am anxious to read through your code, and I would love to be able to continue a discussion with you. Awesome work, and thank you.

  3. Joshua says:


    I would love to tell you I got my code working after a few hours today, but unfortunately, I spent a couple of hours and found I could not detangle the OnCompleted handlers from the rest of the code.

    I would like to do something like this:

    static void Main()
    {
        JoshTest().Wait();
    }

    static async Task JoshTest()
    {
        IDictionary result = null;
        result = await JoshTesterAsync().RunWithCheckpointing();
        while (true)
        {
            // A null continuation means the method ran to completion
            if ((string)result["_ContinueWithThis"] == null) break;
            Task t = Checkpoint.ResumeFrom((string)result["_ContinueWithThis"]);
            result = await t.RunWithCheckpointing();
        }
    }

    static async Task JoshTesterAsync()
    {
        Console.WriteLine("Josh Tester Starting...");

        int i = 0;
        Console.WriteLine("i = " + i);
        throw new Checkpoint.CallChildException("Child 1");

        Console.WriteLine("i = " + i);

        throw new Checkpoint.CallChildException("Child 2");

        Console.WriteLine("i = " + i);

        throw new Checkpoint.YieldedException("Yielding");

        Console.WriteLine("i = " + i);

        return; // completed
    }

    So, basically, I have added two different types of dispositions. One is CallChild (which will ultimately pass a function to be called, perhaps in the exception itself, or better, in the result), and the other is Yielded, which tells the parent (caller) to yield (serialize the state machine state). I don’t think I need any INotifyCompletion implementations in what I am trying to do, and I can’t seem to determine the continuation I am supposed to be passing to TryGetStateMachineForDebugger(), which allows me to save the state of the current state machine.

    Notice, I don’t want to call Checkpoint.Save(). I just want the Save to happen whenever I throw one of the new exception types (CallChild or Yield), and I want the state to be saved into the result returned to the caller. I have made the result a dictionary and just reserved a special key, _ContinueWithThis, to store the state machine state. On CallChild exceptions, I won’t serialize the state, I will just keep a pointer to it, and then continue with that pointer (after calling the child method). On Yield, I will serialize the state.

    In summary, I can’t figure out what to pass to TryGetStateMachineForDebugger(), within the CheckpointSaveAwaiter, which I would like to call directly from RunWithCheckpointingAwaiter::GetResult(), which catches my new Exceptions (CallChild and Yield). That way, when my exceptions are caught, I can get the state of the state machine, and ferry it back in the result to the caller.

    I would be extremely grateful if you could assist me with this use case.

    Kind Regards,

    1. @Joshua I think it’s not conceptually possible to do what you want with just “throw”. That’s because the moment the throw is executed, control jumps to the exception handler, and information about where the execution was has been definitively lost. You will have to use an await there. “await Checkpoint.YieldedException”.
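
The difference between throw and await described in this reply can be sketched with a custom awaiter. This is an illustration only, not the Checkpoint implementation: Manager, YieldToManager and Demo are hypothetical names. Because the awaiter always reports IsCompleted as false, the compiler-generated state machine hands its continuation to OnCompleted, and a manager loop at the top can later invoke that continuation to resume the method exactly after the await, without a deep call stack. A throw offers no such continuation to capture.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

// Hypothetical holder for the continuation captured by the latest yield.
public static class Manager
{
    public static Action Pending;   // "the rest of the method"
    public static string Reason;    // why the method yielded

    public static YieldToManager Yield(string reason)
    {
        return new YieldToManager(reason);
    }
}

// Hypothetical awaitable that is also its own awaiter.
public struct YieldToManager : INotifyCompletion
{
    private readonly string _reason;
    public YieldToManager(string reason) { _reason = reason; }

    public YieldToManager GetAwaiter() { return this; }

    // Always suspend, so OnCompleted always receives the continuation.
    public bool IsCompleted { get { return false; } }

    public void OnCompleted(Action continuation)
    {
        Manager.Pending = continuation;
        Manager.Reason = _reason;
    }

    public void GetResult() { }
}

public static class Demo
{
    static async Task WorkerAsync(List<string> log)
    {
        log.Add("start");
        await Manager.Yield("Child 1");
        log.Add("after child 1");
        await Manager.Yield("Child 2");
        log.Add("done");
    }

    // The manager: start the worker, then keep resuming it until it finishes.
    public static List<string> Run()
    {
        var log = new List<string>();
        Task t = WorkerAsync(log);   // runs synchronously up to the first await
        while (!t.IsCompleted)
        {
            Action resume = Manager.Pending;
            Manager.Pending = null;
            log.Add("manager handles: " + Manager.Reason);
            resume();                // continues right after the await
        }
        return log;
    }
}
```

Demo.Run() returns the entries "start", "manager handles: Child 1", "after child 1", "manager handles: Child 2", "done", in that order. Serializing the captured state machine between steps is the part this sketch leaves out.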

  4. Ed T says:

    This is very interesting and I think it deserves proper attention from the language design team. The ability to do this is exactly what I’m looking for in a project I’m working on and I think in general it would be a great addition to .NET. I worry about it being published as a separate NuGet since (at a quick glance) it seems to rely on magic names the compiler emits.

  5. Joshua says:

    Well, I tried to convert the code to return the state instead of serializing it to a file. I had no luck at all. This is an awesome idea, but I can’t parse through your hard work. When I finally got the state out of the exception, doing something more akin to your original await Hibernator.Hibernate, the resume just choked on it. I have no doubt this can be done, but it is not at all obvious to me. It would be even better if MS would make this standard instead of having to reflect into the API. It is incredibly useful functionality. I hope you can help provide me with a solution. Thanks.

  6. Hi Lucian. I followed your blog, and for me your first post was mind-blowing.
    I have extended your approach using an IL Code Weaver, to be able to use async calls in all contexts.
    In general the weaving approach is as follows:
    1) On each Async method I insert a call to push the async machine reference into a stack.
    2) I created a Pause object and a PauseAwaiter. The PauseAwaiter always returns false. But the weaver removes the call to the AsyncCompletedCallback

    With that I am able to do things like:

    call asyncMethod1();
    after call ask if a Pause was triggered
    Persist the stack of async machines
    Recover from persistence
    And resume execution.

    I am still exploring the approach. What do you think of it?
    I can share the code if you like

    1. @Mauricio That’s interesting about an IL-weaver.

      There’s a new feature (hopefully coming to C#7) which might let you do weaving around awaits purely within the language, without the need for an external IL weaver.

      Here’s the feature:

      Here’s an example of using it to weave around awaits:

  7. Hi Lucian,

    I just found your Checkpoint class. It’s amazing and is exactly what I was looking for. There are currently 3 TODOs in your source:

    // TODO:
    // (3) Support Task.WhenAll/Task.WhenAny. This will involve digging down into tasks, and will
    // only work in a single-threaded execution context.
    // (4) Support ConfigureAwait(). Needs a bit more state stored in the AsyncMethodState class.
    // (6) VB support in reflection

    Are you going to implement these? Especially (3) and (4) are important for us.
    Do you think such functionality is planned for a future .NET release, too?

    Best regards

  8. Hubix2000 says:


    I found a small issue in your source. OnCompletedRunner is never called for Disposition.Completed.
    So the file can’t be deleted at the end. I’m currently looking for a fix.


  9. Hubix2000 says:

    If I change this in ReconstructStateMachine, an async method can return void as well.
    Just replace

    task.GetType().GetMethod("GetAwaiter").Invoke(task, new object[] { });

    with the following code:

    object taskAwaiter;

    var mi_awaiter = task.GetType().GetMethod("GetAwaiter");
    if (mi_awaiter.DeclaringType.IsGenericType && mi_awaiter.DeclaringType.GenericTypeArguments[0] == Type.GetType("System.Threading.Tasks.VoidTaskResult"))
    {
        // Task<VoidTaskResult>: construct the non-generic TaskAwaiter directly
        taskAwaiter = Activator.CreateInstance(Type.GetType("System.Runtime.CompilerServices.TaskAwaiter"),
            BindingFlags.NonPublic | BindingFlags.Instance, null, new object[] { (Task)task }, null);
    }
    else
    {
        taskAwaiter = mi_awaiter.Invoke(task, new object[] { });
    }

    Best regards.

    1. Hubix2000 says:

      Much simpler is

      // task = sm.Builder.Task
      var task = (Task)Expression.Lambda<Func>(Expression.Property(Expression.Field(Expression.Constant(sm), builderField), builderField.FieldType.GetProperty("Task")))

      return new ReadStateMachineResult
      {
          StateMachine = sm,
          Task = task,
          AwaiterForAwaitingThisStateMachine = task.GetAwaiter(),
          LeafActionToStartWork = awaited.LeafActionToStartWork,
          LeafCheckpointSaveAwaiter = awaited.LeafCheckpointSaveAwaiter
      };

      This works as well. The trick is to cast the result of Expression.Lambda to (Task). Then just call task.GetAwaiter(). 😉

  10. If anyone is interested in hibernating methods,
    I invite you to look at another solution: a virtual machine that executes instruction by instruction, so that the whole virtual machine state can be serialized (hibernated) at “any” time.

    This solution allows you to write composable workflows with recursion

  11. Mason Wheeler says:

    Just ran across this. I’ve got another real-world use case for you:

    Imagine a game engine with scripts. Some scripts run immediately, while others can take some measurable amount of time to run. Some might even take several minutes, or run in an infinite loop for as long as you’re in a particular level.

    At some point, you want to be able to save the game, and then resume where you left off. This requires serializing the state of long-running scripts.

    If your script engine is sufficiently simplistic, you can simply save the current script and line number, but as soon as you start having call stacks and local variables, you need something a lot like this: Make game scripts and all APIs that take a measurable amount of time async, and then use this solution to serialize in-progress scripts and reload them.

  12. Aaron Stainback says:

    I want to write code from the perspective of the cluster, not the perspective of the individual instance. This is almost there except the flow should be handled in a framework of sleeping when a network I/O call happens and resuming on a different node when the I/O call is complete.
