A customer called in and complained angrily: “A call to the ShellExecute API, passing in the path of a local TXT file, occasionally opens a GIF file along with the TXT file! Definitely!” The customer was sure that the parameter was correct, and the return value of ShellExecute indicated that the call had succeeded.
I thought about it for two minutes and told the customer, “Impossible, if you are doing it right.” The only possible causes I could think of were that the parameter was a batch file or that the system had been compromised. And even then, the behavior should be consistent, not occasional.
I was wrong. After checking the screenshot and reproducing the problem locally with the project the customer sent me, I had to accept that a single call to ShellExecute sometimes opens two files.
The analysis below shows how this happens. But the important lesson is: believe the evidence, not your experience. If I had told the customer that the ShellExecute API dates back to NT4 and is so robust that the only possibility was some antivirus software, I would have lost the chance to find out the truth.
The application is a traditional MFC dialog application. It contains an HTMLView that renders a local HTML file, and the HTML has an IMG tag that points to a local GIF. Right-clicking the GIF shows a customer-defined context menu. When a menu item is clicked, its message handler calls ShellExecute to open a local TXT file. The TXT extension is associated with UltraEdit, so the file is opened by UltraEdit. When the problem occurs, UltraEdit opens the TXT file, but it also opens an additional file in binary mode. The additional file is the GIF that was right-clicked.
The customer uses the following code to pop up the context menu, which replaces the default WebControl context menu:
CMenu *pMenu = menu.GetSubMenu(0);
pMenu->TrackPopupMenu(TPM_LEFTALIGN, pt.x, pt.y, this);
OK, the additional GIF is related to the HTMLView, but what are the steps to troubleshoot this?
I reasoned about the problem like this. Since the GIF file gets opened as a result of the ShellExecute call, the GIF must somehow be related to that call. ShellExecute must know the GIF file’s path; otherwise, how could it open the file? Some action plans I could try:
1. Use Windbg and set a breakpoint on ShellExecute. Check the parameters and step in to trace how the GIF gets opened.
2. Perform further research on the customer’s code to narrow down the problem.
After a few attempts, I gave up on the first approach. First, the file is actually opened by UltraEdit, not by ShellExecute, and it would be crazy to set breakpoints in both ShellExecute and UltraEdit. Second, I could not reproduce the issue consistently.
So think about the code. ShellExecute is called in the menu event (WM_COMMAND) handler. To isolate the relationship between ShellExecute and Windows messages, I created a timer and invoked the menu event handler directly from the timer function. In that test, the issue no longer occurred, so I knew the problem must be related to the context menu. The next step was to check the context-menu code, which lives in PreTranslateMessage.
The customer uses PreTranslateMessage to intercept the WM_RBUTTONDOWN message and then shows the context menu. Pay attention: the return value is FALSE! MSDN describes the function as follows:
FALSE means the message should be considered not handled and should be passed on along the normal handling chain. In reality, the customer’s code has already handled the message by showing the context menu. So I changed the return value to TRUE, and the issue stopped.
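The return-value contract can be illustrated with a small model. This is only a sketch in portable C++, not MFC code: Msg, Trace, PreTranslate and PumpOne are hypothetical names made up for the illustration, and the real pump does more work than this. It assumes only what is described above: the pump dispatches a message to its target window when the pre-translate step reports "not handled".

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a Windows message; only the ID matters here.
enum class Msg { RButtonDown, MouseMove };

// Records what happened, in order, so the two cases can be compared.
struct Trace {
    std::vector<std::string> events;
};

// Models the customer's override: it shows the custom menu on a
// right-button-down and then reports `reportHandled` back to the pump.
bool PreTranslate(Msg m, bool reportHandled, Trace& t) {
    if (m == Msg::RButtonDown) {
        t.events.push_back("custom menu shown");
        return reportHandled;
    }
    return false;  // other messages are never consumed in this model
}

// Models one turn of the pump: the message reaches the target window
// only when the pre-translate step says it was NOT handled.
void PumpOne(Msg m, bool reportHandled, Trace& t) {
    if (!PreTranslate(m, reportHandled, t)) {
        t.events.push_back("dispatched to HTML view");
    }
}
```

With reportHandled set to false, the trace contains both the custom menu and the dispatch to the HTML view, which matches the customer's original code. With true, the view never sees the right-button-down.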
Never ignore any tiny piece of evidence or any resource you have. Try to work out cause and effect. Once a clue is identified, stick to it.
The real problem just begins.
I was lucky. It did not take me long to get a solution. However, it was too early to stop. Changing the return value from FALSE to TRUE caused a new problem, and the analysis above did not explain why a single ShellExecute call opens two files.
The new problem is that, in the menu item handler, the customer uses some HTMLDocument properties to check which HTML tag the user operated on, such as an <img> tag or a <div> tag. After changing the return value to TRUE, that check no longer works. This is easy to explain: since HTMLView no longer receives the mouse message, it is reasonable that the user's action is not visible to HTMLView.
The confusing part is still the two opened files. Different return values from PreTranslateMessage cause two different behaviors, so a good way to analyze the problem is to compare the two executions. First I disassembled ShellExecute, but found it too complex to follow. A better way is to use the wt command in Windbg to trace the ShellExecute code flow and call stack. After some research, I found that ShellExecute depends on DDE to open the target file, and DDE depends on Windows messages. So I set conditional breakpoints on PostMessageW and SendMessageW. Every time either function was hit, I printed the message parameters and the call stack in Windbg.
At last, I captured the call stack when the problem occurred. Can you figure out the root cause from the call stack below?
From the CDropTarget::Drop function, we can guess how the GIF gets opened. The strange thing is how ShellExecute gets involved with mshtml. Recalling the problem, we missed a very important clue:
PreTranslateMessage returns FALSE, which means the message should flow on to HTMLView, so why does the IE default context menu not show up?
Based on the callstack, the story is:
1. The user presses the right mouse button (without releasing it), and the system generates a WM_RBUTTONDOWN message.
2. In PreTranslateMessage, the code pops up the context menu, and the application blocks inside the TrackPopupMenu call.
3. TrackPopupMenu displays the menu and waits for the user's choice.
4. The user releases the right button and clicks a menu item.
5. After the menu item is clicked, the TrackPopupMenu call returns, and the system places WM_COMMAND in the posted message queue.
6. PreTranslateMessage returns FALSE, so WM_RBUTTONDOWN flows on to HTMLView. Here the system also generates an additional WM_MOUSEMOVE message, which stays pending as mouse input.
7. The WM_RBUTTONDOWN message on the HTMLView control has the same effect as the user pressing the right button on the GIF without releasing it. The IE default context menu is shown on WM_RBUTTONUP. Since the pop-up menu swallows the WM_RBUTTONUP message, the IE context menu does not show up.
8. The application returns to the message pump. Posted messages are retrieved before pending mouse input, so WM_COMMAND is handled first, and ShellExecute gets called.
9. ShellExecute opens the TXT file and brings the UltraEdit window to the foreground.
10. ShellExecute uses DDE for follow-up communication with UltraEdit, such as querying the result. Since DDE depends on Windows messages, and ShellExecute is a UI-independent API, ShellExecute has to run its own message pump. This is what SHProcessMessagesUntilEventEx does.
11. Because of the message pump inside ShellExecute, the WM_MOUSEMOVE from step 6 gets dispatched. From the call stack, the message goes to HTMLView.
12. As described in step 7, the GIF is in a captured state because WM_RBUTTONUP never arrived. The foreground window is UltraEdit, and the WM_MOUSEMOVE message comes in. These unexpected messages cause HTMLView to start a drag operation incorrectly, just as if the mouse had dragged the GIF from HTMLView and dropped it onto UltraEdit. The GIF gets opened.
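The core of the story, a handler that runs its own message pump and thereby dispatches a message that was still waiting in the queue, can be sketched as a small simulation. This is a simplified model in portable C++, not the real Windows or mshtml logic: App, OpenFile, Simulate and the log strings are hypothetical names, and the drag decision is reduced to a single flag.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Hypothetical message IDs for the model.
enum class Msg { Command, MouseMove };

struct App {
    std::deque<Msg> queue;         // pending messages, in retrieval order
    bool viewButtonDown = false;   // view saw button-down but never button-up
    bool insideOpenFile = false;   // true while the nested pump is running
    std::vector<std::string> log;

    void Dispatch(Msg m) {
        if (m == Msg::Command) {
            OpenFile();            // menu handler, stands in for ShellExecute
        } else if (viewButtonDown && insideOpenFile) {
            // The view believes the button is still held and now sees motion
            // while another window is in front: it treats this as a drag.
            log.push_back("drag started inside OpenFile");
        } else {
            log.push_back("mouse move ignored");
        }
    }

    // Stands in for an API that pumps messages while it waits.
    void OpenFile() {
        insideOpenFile = true;
        log.push_back("open txt");
        Drain();                   // nested pump: re-enters Dispatch
        insideOpenFile = false;
    }

    void Drain() {
        while (!queue.empty()) {
            Msg m = queue.front();
            queue.pop_front();
            Dispatch(m);
        }
    }
};

// Runs the scenario: WM_COMMAND first, then the pending mouse move.
std::vector<std::string> Simulate(bool viewButtonDown) {
    App app;
    app.viewButtonDown = viewButtonDown;
    app.queue = {Msg::Command, Msg::MouseMove};
    app.Drain();
    return app.log;
}
```

Calling Simulate(true) models the FALSE return value, where the view received the button-down: the mouse move is dispatched by the nested pump and triggers the drag. Simulate(false) models the TRUE return value, where the same mouse move is harmless.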
There is another question. A drag involves three steps: mouse down, mouse move and mouse up. Here we do not see the source of the mouse move. The WebControl implementation is too complex at this point, and I did not go further. My guess is that this uncertainty is why the problem occurs randomly. I hope you can make the story 100% complete :)
In step 6, I am not sure why an additional WM_MOUSEMOVE message is generated, but this is how it works in Windows XP. Can anyone tell me why it is designed this way?
The answer comes from theoldnewthing again; see the first comment. I put it here:
The problem was reproduced with IE6. Due to some changes in IE7, it may not be reproducible with IE7.
I feel like a binary machine, Windows + MFC, that executes the code strictly and responds to Windows messages, even though this case is full of randomness. Do not rely on blind faith; respect the CPU and think in a binary way. A single bit, TRUE or FALSE, is the main clue in this case.
The case was solved. With the root cause identified, it is easy to find the right way to show a context menu in the WebBrowser control:
How to disable the default pop-up menu for CHtmlView in Visual C++
1. Trust evidence, not experience.
2. Be sensitive to any tiny clue.
3. Use comparison for effective analysis.
4. Think in a binary way to understand computer programs.
Is MSDN the most trustworthy resource?
Not only experience but also MSDN is less trustworthy than evidence. After careful testing, I identified a bug in MSDN and requested a knowledge base article to explain it:
Description of a documentation error in the "Assembly.Load Method (Byte)" topic in the .NET Framework Class Library online documentation
The first line is:
“The "Assembly.Load Method (Byte)" topic in the Microsoft .NET Framework Class Library online documentation contains an error”
Do you dare to say the CPU is broken?
This case was handled by a US CPR in the COM+ team. There were two load-balanced mainframe-class machines with 64 CPUs each. Their software and hardware configurations were the same. One machine worked fine, while MSDTC occasionally crashed on the other. Based on his debugging, the engineer said to the customer, “I suspect CPU #17 is broken; try replacing it with a new one.” The customer then skeptically brought in an Intel engineer. After the CPU was hot-swapped, the problem no longer occurred. I wonder: if someday I see that eax is non-zero after executing xor eax,eax, would I dare to say the CPU is broken? Look:
There's an awful lot of overclocking out there
Meanwhile, some rootkits are now able to deceive a debugger. And sometimes a new OS feature causes interesting behavior:
VS 2003 crashes when pushing EDI: (Chinese)
SEH,DEP, Compiler,FS:, LOAD_CONFIG and PE format
It is not easy to tell the truth from the lies…
DWORD and file length:
Staying sensitive to certain numbers helps in finding clues. A customer created a file class that behaved strangely when the file length exceeded 4 GB. Guess the cause immediately!
4 GB is just beyond the largest value a DWORD can hold. If the customer uses a DWORD to represent the length, then…
To solve the issue, the customer should use GetFileSizeEx instead of GetFileSize alone. In fact, this is not a “stupid” error that only entry-level programmers make. Look at the following KB:
FIX: Generation 1 garbage collections and generation 2 garbage collections occur much more frequently on computers that have 4 GB or more of physical memory in the .NET Framework 1.1
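The DWORD truncation behind the file-length case can be shown with a few lines of portable C++. This is a minimal sketch, not the customer's class: Dword is a stand-in for the 32-bit unsigned Windows DWORD, and TruncateToDword and CombineSize are hypothetical helper names. CombineSize mirrors how the low-order and high-order halves reported by GetFileSize fit together into the 64-bit value that GetFileSizeEx returns directly.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the Windows DWORD type (32-bit unsigned).
using Dword = std::uint32_t;

// What happens when a 64-bit file length is stored in a DWORD:
// only the low 32 bits survive.
Dword TruncateToDword(std::uint64_t size) {
    return static_cast<Dword>(size);
}

// Rebuilds the full 64-bit length from its high and low 32-bit halves.
std::uint64_t CombineSize(Dword high, Dword low) {
    return (static_cast<std::uint64_t>(high) << 32) | low;
}
```

For example, a 5 GiB file (5368709120 bytes) stored in a DWORD is seen as 1073741824 bytes, exactly 1 GiB, because the 4 GiB part sits in the discarded high half. Combining a high half of 1 with a low half of 1073741824 restores the real length.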
(There is still some content in the original Chinese paper. For lack of time, I skip it here.)
Next, I will discuss a boring but typical difficult case: ASP.NET session loss.