Long Running Business Processes in SharePoint (and other Environments)

I'd like to revisit a topic which I have previously written about: Long Running Processes. It will be a general discussion which can be applied to any environment or system, but SharePoint will be used as an example. This post is a bit theoretical, but I think the topic is worth discussing as you may come across this kind of process in the systems you build out there.

What pushed me to write this post is that Paul Andrew recently discussed Patterns for Long Running Activities in Windows Workflow Foundation. In that post he discusses processes running longer than 1/10 of a second, which is a very valid topic. However, as a contrast to his point of view, I'd like to talk about long running processes with the assumption that they live for more than a single business day (not 1/10 of a second).

Two problems with long running processes (using my definition) are:

  • Updates of components in the solution (or system) become more complex
  • The solution becomes sensitive to problems

Updating components

Normally when you update a component you have to choose whether or not to maintain compatibility with previous versions. Perhaps you wish to add a few new columns to the database supporting your component, along with changing the order in which steps are executed. This may cause you to create migration scripts which are responsible for updating old data to conform to the new structure.

Now add the complexity that there are instances of the component already running which depend on the old structure. How do you compromise between the need for change and the need to maintain backward compatibility?

Also, what if you even wish for running instances of the component to actually pick up the changes and start applying the new behaviour?

More than a year ago I wrote something I called Business Process Versioning - Updating Running Business Processes, which discusses how these problems can be addressed when building BizTalk orchestrations. Today I don't work that much with BizTalk, but I still think about the same thing, now in the context of SharePoint and workflows.

When building workflows in SharePoint the problem is basically the same. It is of course possible to start a workflow instance when an invoice arrives and then allow the same instance to live through two approvals, after which it sleeps for 90 days while you delay payment. But what happens if version 2.0 of the solution is deployed after 35 days? Or if a Service Pack for SharePoint is installed after 70 days? Are you absolutely sure nothing will affect your running instances? It is extremely hard to test these upgrade scenarios as there will be thousands of instances in different stages of completion.

An even more probable event is that you deploy a small update for your process to fix a bug you have found. How can you make that change apply to the running instances? If you simply replace the assembly, will the state that has already been serialized be compatible with what the new version expects to deserialize?
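To make the compatibility question concrete, here is a minimal sketch in Python of one generic way to keep old persisted state readable: store a version number with the state and upgrade it step by step when it is loaded. This is not how Windows Workflow Foundation or SharePoint persists workflow instances; the field names (`version`, `currency`), the default value and the helper functions are all hypothetical and only serve as illustration.

```python
import json

# Hypothetical: the version of the state layout that the deployed code expects.
CURRENT_VERSION = 2


def migrate_v1_to_v2(state):
    """Upgrade a version 1 state to version 2.

    Assumption for this sketch: version 2 introduced a "currency" field,
    and old instances get a default value for it.
    """
    upgraded = dict(state)
    upgraded["currency"] = "EUR"  # assumed default, not from the original post
    upgraded["version"] = 2
    return upgraded


# Maps a stored version to the function that lifts it one version up.
MIGRATIONS = {1: migrate_v1_to_v2}


def load_state(raw):
    """Deserialize persisted state and upgrade it to CURRENT_VERSION."""
    state = json.loads(raw)
    # State written before versioning existed is treated as version 1.
    version = state.get("version", 1)
    while version < CURRENT_VERSION:
        state = MIGRATIONS[version](state)
        version = state["version"]
    return state
```

An instance persisted by the old code can then be loaded by the new code with `load_state(raw)`, and every later version only needs one additional migration function. The price is that the migrations themselves become code that has to be tested and maintained.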

One approach to avoid this problem could be to do your best to avoid using long running processes. But of course, when there is a need for human interaction, the process will have a much higher risk of becoming long running.

Sensitive to problems

Let us establish the well known fact that things that can go wrong will go wrong, or more to the point "sh*t always happens"!

This has great relevance when talking about long running processes, because if an unexpected critical problem occurs 7 days into the processing of an item, how will you fix it? Perhaps the problem was catastrophic and caused the business process to simply terminate, or (more likely) you handled the problem in a generic error handler and ended the process in a controlled way.

As you are a very knowledgeable developer you quickly identify the problem in the process (or dependent systems) and create a workaround. Now all you have to do is to deploy the fix and you're done, or are you? What about the processes that have been prematurely terminated? They may have been approved and handled by multiple people, and you don't want to waste their time and have them redo their work!

You could add code to your updated process to somehow handle this, but it will be hard to maintain and when the second fix is done the code may be a mess!

Solution approach: Atomizing the process

Part of the solution to both of these problems is to identify sub-processes in the overall process which can be executed independently, i.e. we should identify the atoms of the process. The atom analogy can be taken further to describe the activities that make up a sub-process: they are like sub-atomic particles, where the same particles appear everywhere but form different sub-processes when they are combined differently.

To be a bit more concrete, there will need to be some sort of über-process which drives the process forward. In my previous post I talked about creating a very light-weight über-process which starts the sub-processes. Using sub-processes will enable you to restart a failed process at an appropriate point in the overall process. However, I'm not so sure anymore that the correct way to proceed is to create an actual process which runs all the time, as it will in itself be sensitive to problems...

One thing is clear to me: the complete business process cannot be seen as a single workflow. The complete business process must consist of a number of workflows executed in sequence, where the end of each workflow initiates another workflow. Currently I'm considering using a status field on the ListItem to determine how far the process has progressed.
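The idea of a status field driving a chain of short workflows can be sketched roughly as follows. This is plain Python, not SharePoint code: the item is a dictionary standing in for a ListItem, and the status values, step functions and the `fail_approval` flag are hypothetical names invented for the illustration.

```python
def receive_invoice(item):
    # First short workflow: register the invoice.
    item["received"] = True


def approve_invoice(item):
    # Second short workflow. The flag simulates an unexpected failure.
    if item.get("fail_approval"):
        raise RuntimeError("approval service unavailable")
    item["approved"] = True


def pay_invoice(item):
    # Third short workflow: pay the approved invoice.
    item["paid"] = True


# Current status -> (workflow to run, status to set when it succeeds).
STEPS = {
    "New": (receive_invoice, "Received"),
    "Received": (approve_invoice, "Approved"),
    "Approved": (pay_invoice, "Paid"),
}


def advance(item):
    """Run the single workflow that matches the item's status.

    The status is only updated after the workflow has succeeded, so a
    failure leaves the item at the last completed step.
    """
    status = item["Status"]
    if status not in STEPS:
        return False  # nothing left to do for this item
    workflow, next_status = STEPS[status]
    workflow(item)
    item["Status"] = next_status
    return True


def run_to_completion(item):
    """Keep advancing until no workflow matches the current status."""
    while advance(item):
        pass
    return item["Status"]
```

If `approve_invoice` fails, the item simply stays at "Received". Once a fix is deployed, calling `run_to_completion` on the same item continues from that point, and the work that was already done is not repeated. No workflow instance has to stay alive between the steps; the progress lives in the status field.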

An upcoming post will provide some more details about this. I just need to finalize my thoughts and make a test implementation. :-)