Hanging by a thread


Sorry that there hasn’t been an update for a while. I had a bug of a different kind but I am feeling better now.


 


Since we are in the summer doldrums, I will talk about another subject where nothing much seems to happen. Hangs. Hangs are not nice. Fatal errors are nice since they generally let you know just where they died. Hangs don’t tell you anything.


 


Before we start, I would like to explain exactly what I mean when I say a hang. I don’t mean when the whole OS locks up because some driver goes bad. Sure, that is a hang but kernel mode is a big scary thing and I don’t go there. I also don’t mean poor performance. It is astounding how many people do mean poor performance when they say “hang”. Technically, a hang is when it never comes back. Performance issues are only hangs for very low values of never. I will be happy to talk about performance issues if anyone is interested (please use the comments button) but today’s blog is on the real deal. So, a hang is when a request never returns or when a process becomes unresponsive for an extended period.


 


So, how long should you wait before you decide that something is never coming back? Well, technically, it would be an infinite wait. Obviously no-one will wait that long. It depends on how long a task normally takes. If you are doing a compile and the link normally takes 10 minutes (can happen with big complex apps) then you might want to leave it for an hour. A web server that stops serving pages for 60 seconds is probably dead.
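As a rough sketch of that rule of thumb: wait some multiple of the normal duration, but never less than a fixed floor. The names `hang_threshold` and `looks_hung`, the factor of 6 (taken from the 10 minute link and the one hour wait) and the 60 second floor are all assumptions for illustration, not recommended values.

```python
def hang_threshold(typical_seconds, factor=6.0, floor_seconds=60.0):
    """Seconds to wait before treating a task as hung (an assumed rule of thumb)."""
    return max(typical_seconds * factor, floor_seconds)


def looks_hung(elapsed_seconds, typical_seconds):
    """True once a task has run longer than its threshold."""
    return elapsed_seconds > hang_threshold(typical_seconds)


link_limit = hang_threshold(10 * 60)  # a link step that normally takes 10 minutes
web_limit = hang_threshold(0.5)       # a page that normally takes half a second

print(link_limit)  # 3600.0, i.e. one hour
print(web_limit)   # 60.0
```

The right factor depends entirely on how variable the task normally is, so treat these numbers as placeholders.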


 


Hangs used to be so simple in the days when everything was single threaded. Almost all hangs were simply endless loops. You could step through them in a debugger and it was normally pretty obvious what was wrong. They were not timing dependent and would normally reproduce consistently with the same data. They were good days to be a debugger. It was pretty hard to deadlock yourself but not impossible. There was a lovely single threaded deadlock bug in the VB runtime (fixed several service packs ago), although you could easily make the same mistake in C++ code. There are two APIs for sending a message, called SendMessage and PostMessage. PostMessage sends off the message but doesn’t make sure that it was processed. SendMessage is more conscientious and waits for the recipient to process the message. This sounds like a good thing but it isn’t always. What happens if you send a message to a window that isn’t currently pumping messages? Well, you block for a bit. What happens if you send yourself a message? That is an interesting case. Let us assume that you are a single threaded app and you are in the message queue processing logic. You call SendMessage and the message goes to your message queue. You will wait for the message to be processed. It doesn’t get processed because processing messages is your job and you are still waiting on the message to be processed, which can’t happen because you are waiting on the message to be processed. Lather. Rinse. Repeat. A single threaded deadlock, rare but possible.
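The shape of that single threaded deadlock can be sketched with a toy message queue. This is only a model of the scenario described above, not the real Windows message machinery: `MessageLoop`, `post`, `send` and `pump_one` are hypothetical names, and `send` takes a timeout purely so that the demonstration gives up instead of hanging.

```python
import queue
import threading


class MessageLoop:
    """Toy single-threaded message queue (hypothetical, not the Win32 one)."""

    def __init__(self):
        self._queue = queue.Queue()
        self.handled = []

    def post(self, message):
        # Fire and forget: enqueue the message and return immediately.
        self._queue.put((message, None))

    def send(self, message, timeout):
        # Enqueue, then block until whoever pumps the queue marks it done.
        done = threading.Event()
        self._queue.put((message, done))
        return done.wait(timeout)  # False means we gave up waiting

    def pump_one(self, handler):
        # Take one message off the queue and run the handler for it.
        message, done = self._queue.get_nowait()
        handler(message)
        self.handled.append(message)
        if done is not None:
            done.set()

    def pending(self):
        return self._queue.qsize()


loop = MessageLoop()
results = {}


def handler(message):
    if message == "start":
        # We are the only code that pumps the queue, and we are busy right
        # here, so nobody can process "to-myself" while we wait for it.
        results["send_completed"] = loop.send("to-myself", timeout=0.2)


loop.post("start")
loop.pump_one(handler)

print(results["send_completed"])  # False: the wait timed out
print(loop.pending())             # 1: "to-myself" is still sitting in the queue
```

Without the timeout, `send` would wait forever on a message that only its own caller could process.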


 


Things are harder in the multithreaded world. Unless threads are tightly coupled, it is rare for a process to completely hang. It might not be able to do any useful work but some semblance of life is maintained. Let us imagine that you had an app where the UI all happened on thread 0. This is a pretty common scenario since windows (the USER object, not the operating system) have some thread affinity. All the real work is passed to a worker thread which does it and lets the UI thread know when the work is done. The UI thread doesn’t block waiting for the results but polls to see if they are ready yet. This is quite a nice design in a lot of ways. It looks nice on a whiteboard. It is nice and responsive. Unless it has logic to detect that the worker thread has been gone a long time, it will keep processing messages and look like a perfectly healthy application except that it never actually gets the work done. As a little side topic, what can it do if it detects that the worker thread has hung? There is an API called TerminateThread. It does exactly what it says on the tin. It kills a thread with no buts and no maybes. That sounds like a pretty good thing. However, was the thread holding any resources that were not local to the thread? Almost certainly. There is no way in the unmanaged world to know what the thread had access to – managed code is much cleaner in this respect. The thread owned memory. It is gone. It probably held a mutex or a critical section since a lot of hangs involve those. They are gone. How stable is your process state after you call TerminateThread? It isn’t. It is in an unknown state. That isn’t a good thing. If you are going to shoot one of your threads (again, in the unmanaged world) then you will have to shoot all the other ones as well and let the process die.
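Here is a minimal sketch of the polling design with the missing piece added: a deadline after which the polling thread reports that the worker looks hung. The function name `run_with_watchdog` and its parameters are assumptions for illustration. Note that it only reports the problem; it makes no attempt to kill the worker, for the reasons given above.

```python
import threading
import time


def run_with_watchdog(work, poll_interval, give_up_after):
    """Poll a worker without blocking; report 'hung' once the deadline passes."""
    finished = threading.Event()
    result = {}

    def worker():
        result["value"] = work()
        finished.set()

    # daemon=True so a stuck worker cannot keep the interpreter alive at exit.
    threading.Thread(target=worker, daemon=True).start()

    deadline = time.monotonic() + give_up_after
    polls = 0
    while not finished.is_set():
        if time.monotonic() >= deadline:
            return ("hung", polls)
        polls += 1  # stand-in for pumping UI messages
        time.sleep(poll_interval)
    return ("done", result["value"])


never = threading.Event()  # nothing sets this while the watchdog is polling

healthy = run_with_watchdog(lambda: 42, poll_interval=0.01, give_up_after=1.0)
stuck = run_with_watchdog(never.wait, poll_interval=0.01, give_up_after=0.2)
never.set()  # release the stuck worker so the demo exits cleanly

print(healthy)   # ('done', 42)
print(stuck[0])  # hung
```

What to do once `"hung"` comes back is a policy decision; logging it and shutting the whole process down cleanly is usually safer than trying to carry on.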


 


So, why do things hang? There are two types of hang: busy hangs and idle hangs. A busy hang is where one or more threads are doing something that will never end. An example of this is chasing around a circular linked list looking for a value that is never going to be there. There are many others. An idle hang is where threads are like the actors in “Waiting for Godot”. They are waiting for something that never happens. A popular form of this is the classic deadly embrace, better known as a deadlock.
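The circular linked list case can be sketched as follows. A naive search would spin forever on a circular list when the value is missing; this version trails a second pointer at half speed (a variation on Floyd's cycle detection) and gives up once the search pointer laps it. The `Node` class and `find` function are hypothetical names for illustration.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None


def find(head, target):
    """Search a singly linked list, refusing to loop forever on a circular one."""
    slow = head   # trails behind at half speed
    node = head   # the pointer doing the actual search
    steps = 0
    while node is not None:
        if node.value == target:
            return node
        node = node.next
        steps += 1
        if steps % 2 == 0:
            slow = slow.next
        if node is slow:
            # The search pointer has lapped the slow one, so the list is
            # circular and every node in it has already been checked.
            raise ValueError("list is circular and the target is not in it")
    return None


# Build a -> b -> c -> back to a.
a, b, c = Node("a"), Node("b"), Node("c")
a.next, b.next, c.next = b, c, a

found = find(a, "c")
try:
    find(a, "z")
    missing = "returned"
except ValueError:
    missing = "cycle detected"

print(found.value)  # c
print(missing)      # cycle detected
```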


 


You quite often get busy hangs in error handling conditions. Imagine that you have some code that is looping through a recordset or resultset or whatever brand of data container you are using today. You have some logic that loops through until it reaches the end of the data set. I have written that code. I bet that you have as well. What happens if we can’t move on to the next record? Well, there is an exception that your code handles and the code logs that there was a problem with that record. Then it tries the next record. That makes sense unless you consider what the current record will be if the attempt to move to the next failed. It will be the record before the one that you can’t get to. So, the next is the record that you can’t get to. Which gives an error and around we go again. We fixed a very similar hang in the processing of radio buttons where we looped forever when using the keyboard to move backwards through a group where the first control was disabled and therefore inaccessible.
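A small sketch of that trap and one way out of it. `Cursor` is a hypothetical stand-in for a recordset whose `move_next` fails on one row without advancing; it is not any real data access API. The loop checks whether the position actually changed after a failure and stops if it did not, instead of logging and trying again forever.

```python
class Cursor:
    """Hypothetical cursor: move_next() fails before one row and does not advance."""

    def __init__(self, rows, bad_index=None):
        self.rows = rows
        self.bad_index = bad_index
        self.position = 0

    def at_end(self):
        return self.position >= len(self.rows)

    def current(self):
        return self.rows[self.position]

    def move_next(self):
        if self.position + 1 == self.bad_index:
            raise OSError("cannot read next record")
        self.position += 1


def read_all(cursor):
    """Read rows until the end, or until a failed move makes no progress."""
    seen = []
    while not cursor.at_end():
        seen.append(cursor.current())
        before = cursor.position
        try:
            cursor.move_next()
        except OSError as exc:
            if cursor.position == before:
                # No progress: retrying would fail the same way forever.
                return seen, f"stopped at row {before}: {exc}"
    return seen, None


good = read_all(Cursor(["a", "b", "c", "d"]))
bad = read_all(Cursor(["a", "b", "c", "d"], bad_index=2))

print(good)  # (['a', 'b', 'c', 'd'], None)
print(bad)   # (['a', 'b'], 'stopped at row 1: cannot read next record')
```

Replacing the early return with a logged `continue` reproduces the busy hang: the cursor stays on row 1 and the same error comes back on every pass.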


 


Idle hangs are normally accidents of timing. If something is a 1,000,000 to 1 chance and you do 2,000,000 operations then it will probably happen. There is a classic example of resource contention called the “Dining Philosophers”. They taught it when I was at college and I am sure that they still do. A classic way to shoot yourself in the foot is to have 2 bits of code that acquire resources in a different order. Let us assume that you have 2 mutexes, called MutexRead and MutexWrite. Thread x calls some code that claims MutexRead and then MutexWrite, which is what a coder might do if the routine was mainly about reading. Thread x+1 calls another routine that is mainly about writing, so it claims MutexWrite then MutexRead. 99.999% of the time, this works just fine. It will work even better if it is a single processor system since the threads are less parallel. What can happen is that thread x gets MutexRead at the same time (more or less) as thread x+1 gets MutexWrite. Thread x then wants MutexWrite, which it can’t have because it has been claimed, so it blocks. Thread x+1 wants MutexRead but it has already been claimed so it blocks too. Thread x waits for thread x+1 and thread x+1 waits on thread x. No-one gets to go anywhere. Of course, there are other causes such as orphaned critical sections (see earlier blogs) or starting to wait for something after it has already happened.
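The lock ordering problem can be reproduced on demand. In this sketch, `mutex_read` and `mutex_write` stand in for MutexRead and MutexWrite, a barrier forces the unlucky timing, and the second acquire uses a timeout so the demonstration reports the deadlock instead of suffering it. The function names and the timeout value are assumptions for illustration. The second function shows the usual cure: every routine takes the locks in the same order.

```python
import threading

mutex_read = threading.Lock()
mutex_write = threading.Lock()


def opposite_order_demo(timeout=0.2):
    """Two threads take the same two locks in opposite orders."""
    barrier = threading.Barrier(2)
    outcome = {}

    def routine(name, first, second):
        with first:
            barrier.wait()  # force the bad timing: each thread now holds one lock
            got_second = second.acquire(timeout=timeout)
            barrier.wait()  # keep holding `first` until both threads have tried
            if got_second:
                second.release()
            outcome[name] = got_second

    threads = [
        threading.Thread(target=routine, args=("x", mutex_read, mutex_write)),
        threading.Thread(target=routine, args=("x+1", mutex_write, mutex_read)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


def consistent_order_demo(rounds=1000):
    """Both threads always take mutex_read first, then mutex_write."""
    counter = {"value": 0}

    def routine():
        for _ in range(rounds):
            with mutex_read:
                with mutex_write:
                    counter["value"] += 1

    threads = [threading.Thread(target=routine) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]


deadlocked = opposite_order_demo()
total = consistent_order_demo()

print(deadlocked)  # {'x': False, 'x+1': False}: neither got its second lock
print(total)       # 2000
```

With a plain blocking acquire in place of the timeout, the first demo would never return, which is exactly the idle hang described above.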


 


So, how to debug this sort of thing? I think that would be a good subject for my next blog.

