Completed with discarded messages??? (zombies)

Now that Scott Woodgate has approved my post, here is my first one with real content. (Check out his blog, too, for more info. He actually seems to know a thing or two about this product. :) Kidding ... he knows a lot, I just like to pull his leg.)

This comes out of a discussion we had recently while trying to figure out what we can do about this problem, how prevalent it is, and how to give people a much clearer understanding of what the problem actually is. I have seen numerous postings on different internal and external aliases about this topic ... some complaints, some questions, some random musings. Well, here are my random musings. :)

I admit that sometimes my posts might go over the heads of BTS newbies. Hopefully they will get you thinking about some things, and when you do hit issues, you will know to come back and reread this and everything will all of a sudden become clear. One thing I want to make very clear is that when I talk about things like what we could do, I am simply musing and looking to get feedback from you as to what you think of these ideas. You are, after all, our customer. However, none of this means that any of it will actually come to fruition. There are priorities, timelines and resource limitations with every project, and management people who make tough decisions on what to do and what not to do. So please just consider this a forum and not any type of promise or look into what we are doing in the next release.

First, an orchestration instance gets into the “completed with discarded messages” state when it reaches a terminate point, but messages which were routed to the schedule have not been consumed. To us, this looks like message loss, so we get nervous, suspend your orchestration instance and ask you to take a look at it. If you don't think it is message loss, you can then go ahead and terminate the instance. Internally we call these zombies.

Why do zombies occur? We will break them down for now into three categories:

1)       Terminate control messages – the protocol allows for some type of control message to be sent which basically cancels all currently running work in a specific orchestration instance. Since this is a protocol level control message, the user actually wants to just kill everything and tends to handle the zombies by simply terminating them. A number of Human Workflow related designs tend to use this mechanism as well as various other designs.

2)       Parallel listen receives – in this scenario the protocol waits for 1 of n messages and when it receives certain messages it does some work and terminates. A listen is truly designed to handle the case where only one message is going to come. It is not meant for the case of "well, once one message comes you can do some stuff and finish, because I don't really care about the other messages." Hence, if those messages are correlated together, it is possible to terminate just as we are receiving a message for a different branch. The scenario for this is somewhat less clear. I imagine users would still tend towards just terminating the running instance.

3)       Sequential convoys with non-deterministic endpoints – this is for cases where the user builds a master schedule to handle all messages of a certain type in order to manage some type of protocol requirement. The typical ones are ordered delivery, resource dispenser, and batching, although I am sure there are others. In these cases, the users tend to define a while loop surrounding a listen, with one branch having a receive and the other having a delay shape followed by some construct which sets a variable to indicate that the while loop should stop. This is of course non-deterministic, since the delay could be triggered but a message could still be delivered. Non-deterministic endpoints like this are always prone to zombies.
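To make the third bucket a bit more concrete, here is a toy Python sketch of the race between the delay branch and a late message. It is only a model of the behavior described above, not anything from the product: `ToyConvoy`, `route`, `listen_once`, `finish` and `run` are all made-up names, and the "messagebox" is just an in-memory queue.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyConvoy:
    """Hypothetical stand-in for one orchestration instance (not a BizTalk API)."""
    inbox: deque = field(default_factory=deque)    # messages routed to this instance
    processed: list = field(default_factory=list)  # messages the receive consumed
    state: str = "running"

    def route(self, message):
        # Messages keep getting routed to the instance, even after its loop ended.
        self.inbox.append(message)

    def listen_once(self, delay_fired):
        # One pass of the while loop: a listen with a receive branch and a delay branch.
        if delay_fired:
            return False  # delay branch: set the flag that stops the loop
        if self.inbox:
            self.processed.append(self.inbox.popleft())
        return True

    def finish(self):
        # At the terminate point, unconsumed messages turn the instance into a zombie.
        self.state = "completed with discarded messages" if self.inbox else "completed"
        return self.state


def run(events):
    """events is a list of ("msg", payload), ("tick",) or ("delay",) steps."""
    convoy = ToyConvoy()
    looping = True
    for event in events:
        if event[0] == "msg":
            convoy.route(event[1])
        elif looping:
            looping = convoy.listen_once(delay_fired=(event[0] == "delay"))
    return convoy.finish()
```

Feeding it `[("msg", "a"), ("tick",), ("delay",)]` ends cleanly, while adding a `("msg", "b")` after the delay (or letting the delay win before a routed message is picked up) leaves a message in the inbox at the terminate point, which is exactly the zombie case.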

What options could we provide for the above scenarios:

1)       We can keep what we have, which is that when a zombie is detected, we suspend the orchestration and require the user to investigate and act upon the instance appropriately. We currently fire a WMI event so they can script this to a degree, although that WMI event is only fired for zombies triggered by the messagebox. Virtual zombies are cases where a schedule is terminating, but looks in memory and sees that it has messages loaded for processing by a specific receive shape (subscription) which it has not processed. In that case, the orchestration instance will suspend itself and a suspend event will be fired, but not a zombie event. In either case, you would catch the WMI event and signal someone to take a look.

2)       Allow the user to specify a property on the receive shape which indicates whether or not we should worry about messages delivered to this receive which are unconsumed when a schedule terminates. If the user says not to worry, then at terminate time, we discard those messages and if nothing is left, we cleanly terminate.

3)       Allow the user to specify a property on the terminate shape which indicates if it is a hard terminate or not. If it is a hard terminate, all unconsumed messages are simply discarded and we terminate cleanly, regardless of any properties at the port level. This makes a lot of sense for the terminate control message case.

4)       Allow a property to be set at the schedule level which simply says “no zombies” which in essence sets the discard property for all receive shapes.

5)       Drain – This is the most complicated one and applies mostly to cases which fall into the third bucket from the “why” list. In this case we allow the user to continue looping over a receive shape, but they have somehow disabled that receive and are now draining it. Once there are no more messages available for that receive, the schedule would then follow through and terminate. This is much more technically difficult, and I have no idea how to build something user friendly to do it, but it solves a much different problem than the above cases.
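Options 1 through 4 all boil down to one decision made at terminate time, so they can be sketched as a single function. This is purely illustrative: `terminate`, `discard_ok`, `hard` and `no_zombies` are hypothetical names for the settings mused about above, not existing properties, and the receive shape names in the example are invented.

```python
from enum import Enum


class Outcome(Enum):
    COMPLETED = "completed"
    SUSPENDED = "completed with discarded messages"


def terminate(pending, discard_ok=frozenset(), hard=False, no_zombies=False):
    """Decide what happens to an instance at its terminate point.

    pending    -- maps a receive shape name to its unconsumed messages
    discard_ok -- receive shapes flagged "don't worry about leftovers" (option 2)
    hard       -- hard terminate on the terminate shape (option 3)
    no_zombies -- schedule-level switch, same as flagging every receive (option 4)
    """
    if hard or no_zombies:
        # Everything unconsumed is simply discarded and we finish cleanly.
        return Outcome.COMPLETED, {}
    leftover = {name: msgs for name, msgs in pending.items()
                if msgs and name not in discard_ok}
    if leftover:
        # Option 1, today's behavior: suspend and let someone investigate.
        return Outcome.SUSPENDED, leftover
    return Outcome.COMPLETED, {}
```

For example, `terminate({"rcvOrder": ["m1"], "rcvCancel": ["m2"]}, discard_ok={"rcvCancel"})` still suspends because of the leftover on `rcvOrder`, while passing `hard=True` completes cleanly regardless of the per-receive flags.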

How might we expose these options:

1)       No change … this is what we do right now

2)       A new property on the receive shape which gets translated into an attribute of the subscription by the compiler.

3)       A new “property” on terminate. So now you have terminate “Control message triggered termination” hard

4)       Potentially a setting at the orchestration level. Not sure.

The other thing to consider for all of these settings is that not only do we want to consider exposing these at a designer level for the developer, but the administrator might also like access to this type of information and control. A developer might not fully understand the protocol or understand the IT policy for handling these cases and so would be very hesitant to set a hard terminate or discard messages property. However, the IT admin would notice that a specific orchestration type is causing a lot of zombies and would like to be able to adjust the “zombie policy” without recompiling the orchestration. Not really sure about this.

5)       Okay, I am not completely sure how we would express this. This is the most non-trivial one.

Did we solve anything:

            So the first two cases are problems where it is a bit easier to understand what the customer's goal is. The new flags we would be exposing would simply be about reducing the amount of management infrastructure they would need to build around handling their orchestrations. There is a workaround right now, which is to catch the WMI events and act upon them programmatically, but that might not be enough depending on how much detail is provided in the suspend event fired for non-messagebox triggered (virtual) zombies (as described above). The other thing to consider is that if you have a set of linked orchestrations communicating with each other and one is terminated by some type of control message, the other schedules might continue to try to communicate with it and so get routing failures. Obviously this is what the customer wants to a degree; I am just not sure what we would be expected to do in this case.

            So in the third case, it is all about why they have a non-deterministically terminating schedule. The most common cases are the ones I listed above (ordered delivery, resource dispenser, batching). If we look at these, we have to ask whether we have helped the users by adding drain. For ordered delivery, probably not. There would be a new race condition introduced which would allow multiple orchestrations to execute simultaneously on messages for the same correlation set (one is draining while the other just got kicked off by a new message). This would defeat the point of their ordered delivery implementation. The better solution for us is to help solve the actual ordered delivery problem and not the zombie problem. For the resource dispenser paradigm, it depends on how strict the limitations on those resources are. If they are rigid, then it has the same problem as ordered delivery. If they are slightly more flexible and can handle an occasional blip of one or two extra running scheds, then this would probably help them. For batching, this could help a lot. In fact, this is pretty much what batching wants … to gather up a certain amount of data in one schedule, put it together into one message using some type of logic and send it on. Of course this would only be useful for not particularly strict, time-based batching systems. For systems which require batching of exactly 10 messages, we would need an additional mechanism to make a more deterministic cutoff point for the start of drain. I am not sure how often that is the case. I think the basic key is to understand all of the cases where customers are using non-deterministically terminating protocols and see if we can help in those cases. I am not sure that if we only attack the zombie problem for this case, we will be able to solve anyone's problem fully.
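Here is a rough Python sketch of what drain could look like for a loose, time-based batching schedule. Again, this is just me thinking out loud in code: `DrainingBatcher` and its methods are hypothetical, and "disabling the receive" is modeled as nothing more than a flag on an in-memory queue.

```python
from collections import deque


class DrainingBatcher:
    """Hypothetical batching instance with a drain step (not a BizTalk API)."""

    def __init__(self):
        self.inbox = deque()   # messages routed to this instance
        self.accepting = True  # False once the receive has been disabled
        self.batch = []        # messages gathered so far

    def route(self, message):
        # Returns False once the receive is disabled, meaning the message
        # would have to go to (or kick off) some other instance.
        if not self.accepting:
            return False
        self.inbox.append(message)
        return True

    def receive(self):
        # Normal loop iteration: consume one message if one is waiting.
        if self.inbox:
            self.batch.append(self.inbox.popleft())

    def drain_and_finish(self):
        # Disable the receive first, then consume whatever was already routed,
        # so nothing is left unconsumed at the terminate point.
        self.accepting = False
        while self.inbox:
            self.batch.append(self.inbox.popleft())
        return list(self.batch)
```

After `drain_and_finish()` the inbox is empty, so there is no zombie, and the batch contains everything that was routed before the cutoff. Note that a later `route(...)` returns `False`: that message would start a second instance while the first may still be draining, which is the overlap that hurts ordered delivery but is harmless for loose batching.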

Hopefully this thread has given you a bit more insight into why you get orchestration instances in this “completed with discarded messages” (zombied) state and why we are not always sure what to do. If you have feedback as to why you are getting hit by this, please send it on to me. Since I am a developer, :), this is my outlet to talk to customers and the field.

Hope this has helped some and also feel free to send requests for topics of discussion. Again, I have a job which requires a lot of time, but getting the word out is also part of my job so I will try to make time. Just don't get mad at me if it takes a while or if your topic doesn't make it cause it is way outside of my field. If there is enough demand, I will yell at some of my co-workers and get some more blogs going. Have a good day.

Thanks
Lee

PS: Upcoming post ideas seem to be: “Mapping: Receive / Send Ports or Orchestrations? Why and when?”, “What is this convoy thing people keep talking about”, “Is there a pub/sub system in BizTalk?”, “How do I debug routing failures?”, “What is an orchestration persistence point and why should I care?”, “How does this delivery notification thing actually work?”, “What are service links and roles?”. These are just a couple of ideas. Feel free to send me your ideas. (I have a bad feeling that I might get blasted with that statement, but it will be good to know what people want more info on). Thanks.