In the field, a customer has a large Windows application written in C/C++ with the Windows API and MFC. The application is used by 16,000 users. It presents multiple windows on the same screen, such as MDI child browser windows, Win32 apps and more, and it organizes those windows into perspectives.
The problem is that, sometimes, when the windows are rearranged, some panels contain empty windows and the app runs into an infinite loop. Something goes wrong on the display side. There are multiple threads and a lot of logic built on the Windows messaging subsystem. It took me hours to find where the code could fail and loop forever. I put traces on some methods. Several areas of the code could send the app into an infinite loop, so I was fighting in several places at once.
There was already a trace mode that logged the entry into each method along with its parameters. The problem is that when the code base is huge, this does not help much. So I added my own level of traces and started learning the method calls by myself. It was hard because the internal data structures mix Win32 messages, custom messages and business code. There is a bit of spaghetti in there!
After some failed attempts to analyze the code, I was stuck on a method called GetView() which, when the app begins to hang, returns NULL. And displaying NULL as an MDI child window is not very acceptable. It works all the time but fails under one special business scenario. I was sure there was exception-handling code somewhere in which I could see a system exception. The code uses SEH (structured exception handling) and there were traces in the __except() blocks. I reactivated the original tracing system and saw exception messages. Multiple exception messages. And my GetView() function was called immediately after the exception, which could explain why I got NULL. When the exception occurs, some parts of the code are not executed, and the resulting flow of Windows messages produces a hang.
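To make the idea of a traced handler concrete, here is a minimal sketch. It is not the customer's code: it uses standard C++ try/catch as a portable stand-in for MSVC's __try/__except so that it compiles anywhere, and the names Trace, g_trace and RunTraced are hypothetical. In a real SEH version, the trace call would sit inside the __except block and would typically log the value of GetExceptionCode().

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical trace sink. The real application has its own tracing system;
// this one keeps the messages in memory and echoes them to stderr.
static std::vector<std::string> g_trace;

void Trace(const std::string& msg) {
    g_trace.push_back(msg);
    std::cerr << msg << '\n';
}

// Runs `body`. If it throws, records where and why before returning false,
// so that a failure is never swallowed silently by the handler.
bool RunTraced(const char* where, const std::function<void()>& body) {
    try {
        body();
        return true;
    } catch (const std::exception& e) {
        Trace(std::string("exception in ") + where + ": " + e.what());
    } catch (...) {
        Trace(std::string("unknown exception in ") + where);
    }
    return false;
}
```

A caller would wrap the risky step, for example RunTraced("LoadView", [] { /* work that may throw */ }), and treat a false result as "the following steps did not run". That is exactly the information I was missing when GetView() started returning NULL.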
After some days of learning the method names and the program flow, I decided to rely on the existing trace system. After trying several scenarios, I found a business case that makes the app hang. Same actions, and the app hangs every time. A good candidate. The result is clear: the cache subsystem hangs and fails to return the cached object. That explains why my GetView() returns NULL.
The devil is in the details. I decided to remove the thread of the cache subsystem. That subsystem stores windows and destroys them after 5 minutes of inactivity. With it removed, there are no more problems.
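For readers who have not met this kind of cache, here is a rough sketch of the behavior described above: entries are kept while they are used and dropped after an idle period, and a lookup on a missing entry gives back a null pointer. Everything in it (View, ViewCache, Put, Get, EvictIdle) is a hypothetical name of mine, the clock is passed in explicitly to keep the example deterministic, and there is no thread at all. The real subsystem runs on its own thread, and I have not established why it fails.

```cpp
#include <chrono>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a cached window or view.
struct View {
    std::string name;
};

class ViewCache {
public:
    using Clock = std::chrono::steady_clock;

    explicit ViewCache(std::chrono::seconds idleLimit) : idleLimit_(idleLimit) {}

    // Stores a view and marks it as used at time `now`.
    void Put(const std::string& key, std::shared_ptr<View> view,
             Clock::time_point now) {
        entries_[key] = Entry{std::move(view), now};
    }

    // Returns the cached view and refreshes its last-use time,
    // or nullptr when the entry is absent. Callers must handle that case.
    std::shared_ptr<View> Get(const std::string& key, Clock::time_point now) {
        auto it = entries_.find(key);
        if (it == entries_.end()) return nullptr;
        it->second.lastUsed = now;
        return it->second.view;
    }

    // Drops every entry that has been idle for at least the limit.
    void EvictIdle(Clock::time_point now) {
        for (auto it = entries_.begin(); it != entries_.end();) {
            if (now - it->second.lastUsed >= idleLimit_) {
                it = entries_.erase(it);
            } else {
                ++it;
            }
        }
    }

private:
    struct Entry {
        std::shared_ptr<View> view;
        Clock::time_point lastUsed;
    };

    std::chrono::seconds idleLimit_;
    std::map<std::string, Entry> entries_;
};
```

The point of the sketch is the contract of Get(): once an entry can disappear, every caller has to cope with a null result, which is what the GetView() call in this story did not do well.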
To finish the job, I would still need to fix the cache subsystem, but the app can run without it, so it is up to the customer to decide whether we continue the investigation or not.
So the old-school advice is: always trace your exception handlers.