It's hard to pick a favorite, but this is a recent bug story:
Report: Customer support alerted the development team that several users were complaining about a recent release. Intermittently, the application would slow down so badly that they had to reboot their machines.
System: A popular J2EE web application framework using Ajax.
Team: Small, skilled XP team with a product owner.
Actions: Programmers ran the automated JUnit tests and the FitNesse acceptance tests. They had the product owner and customer support try to reproduce the problem, and a couple of developers worked on it for a day or two. They chalked it up to an intermittent issue.
Next action: Programmers asked me to help them track an intermittent bug.
- Talked to customer support, and tried to find patterns. Were there any specific areas of the app that this occurred in? What were the users doing when they saw the behavior? How long had they used the application? What web browser were they using? What did they mean by "it slowed down"? What slowed down? Their web browser? Something else?
- Talked to the development team and did a structural analysis. Had them walk me through their design and tests, and looked at the system structure.
- Talked to the product owner and found out what typical workflows might be.
- Using the above information, I started working on manual, exploratory test scenarios but didn't find anything in my first afternoon of work.
- Knowing that allowing time to pass is often a key to tracking down an intermittent bug, I left my web browser open overnight, on a certain part of the application that is used frequently over long periods of time.
- When I came in the next day, my machine was almost frozen and I had to reboot.
- I repeated the same action, but only left the browser open on that page for 2 hours - same result.
- I touched base with the product owner about the area I left my browser open to, and talked to the system administrator who looked after the system. He suggested I try running Wireshark to look at the HTTP traffic to see if there was a clue there.
- I repeated the test with Wireshark running on my machine for an hour. My entire machine slowed down, and my web browser's memory consumption climbed steadily the whole time - it was acting like it had a massive memory leak.
- I opened my recorded Wireshark traffic and found that the last few packets were massive in size. I couldn't open them without crashing my machine. I eventually opened an HTTP response in the middle of my captured traffic stream, and when my text editor eventually refreshed, I found that the whole HTTP body was full of the same repeated HTML and CSS.
- I then talked to a programmer about my findings, and asked what code was being called there - it seemed to refresh the page every 30 seconds with an Ajax call.
- The programmer slapped his forehead ("Why didn't I think of that?!!") and we quickly narrowed in on the code. In spite of it being unit tested, someone had accidentally appended a space at the end of the HTML/CSS contents to clear it instead of deleting the object and starting over.
Result: If an end user was using certain reporting functions of the application, Ajax calls were made every 30 seconds to dynamically update their report dashboard. The HTML for that portion of the page was created dynamically on the server, then passed back to the client machine via an XMLHttpRequest call and response. Because the old content was never cleared, that HTML grew with every refresh, resulting in very, very large HTTP responses that were too much for the client's web browser to process, which would cause the machine to slow down. Hence the memory-leaking behavior of the browser.
The fix was a one-line code change to clear the object instead of appending a space. Appending a space just ensured the content was appended over and over - clearing the object ensured that only a small amount of HTML was passed in the object, ever.
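The original code isn't reproduced here, but a minimal sketch of the kind of mistake described might look like the following. It assumes the server kept the dashboard fragment in a long-lived, reusable `StringBuilder`; the class name `DashboardFragment` and the methods `renderBuggy` and `renderFixed` are hypothetical and only illustrate the append-versus-clear difference.

```java
// Hypothetical reconstruction: the names and the StringBuilder design are
// assumptions made for illustration, not the project's actual code.
public class DashboardFragment {
    // Long-lived buffer that is reused on every Ajax refresh.
    private final StringBuilder html = new StringBuilder();

    // Buggy version: appending a space clears nothing, so each refresh
    // adds a fresh copy of the markup after all the previous copies.
    String renderBuggy(String reportMarkup) {
        html.append(" ");
        html.append(reportMarkup);
        return html.toString();
    }

    // Fixed version: reset the buffer so only the current markup is returned.
    String renderFixed(String reportMarkup) {
        html.setLength(0);
        html.append(reportMarkup);
        return html.toString();
    }

    public static void main(String[] args) {
        String markup = "<div class=\"report\">42 open tickets</div>";
        DashboardFragment buggy = new DashboardFragment();
        DashboardFragment fixed = new DashboardFragment();
        int buggyLength = 0;
        int fixedLength = 0;
        // 120 refreshes is one hour at one refresh every 30 seconds.
        for (int i = 0; i < 120; i++) {
            buggyLength = buggy.renderBuggy(markup).length();
            fixedLength = fixed.renderFixed(markup).length();
        }
        System.out.println("buggy response length after 1 hour: " + buggyLength);
        System.out.println("fixed response length after 1 hour: " + fixedLength);
    }
}
```

A unit test that calls either method once sees identical output apart from one leading space, which is one plausible way a bug like this could slip past a fast test suite.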
This bug was attributed to a late Friday pair session where both programmers were tired. The unit tests passed so they called it a night. It ended up becoming a learning lesson, and part of the lore of that project team. "Don't just append a space!" people would joke. The offending line of code was also pasted into a wiki quotes page and never failed to get a laugh in a stressful moment.
Moral of the story: you can get into trouble if you rely only on fast-running continuous integration tests, automated unit tests that can't check state, integration or memory management, and simple automated acceptance tests built on scenarios that don't take time and timing into account. Also, Ajax can have strange unintended consequences that we haven't experienced much in web applications before. If your application has Ajax, start sniffing the traffic, and see what happens over time.
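One way to bring time into an automated check is to call the same endpoint repeatedly and compare the size of the last response with the first. The sketch below is only an illustration of that idea: `ResponseGrowthCheck`, `staysBounded` and the `tolerance` parameter are hypothetical names, and a `Supplier<String>` stands in for a real HTTP call to the Ajax endpoint.

```java
import java.util.function.Supplier;

// Hypothetical helper: the Supplier stands in for "fetch the Ajax response".
public class ResponseGrowthCheck {
    // Calls the endpoint `refreshes` times and returns true if the last
    // response is at most `tolerance` times the size of the first one.
    static boolean staysBounded(Supplier<String> endpoint, int refreshes, double tolerance) {
        int first = endpoint.get().length();
        int last = first;
        for (int i = 1; i < refreshes; i++) {
            last = endpoint.get().length();
        }
        return last <= Math.max(1, first) * tolerance;
    }

    public static void main(String[] args) {
        // Simulated leaky endpoint: appends to a shared buffer on every call.
        StringBuilder leaky = new StringBuilder();
        Supplier<String> leakyEndpoint =
                () -> leaky.append(" ").append("<div>report</div>").toString();
        // Simulated healthy endpoint: always returns the same small fragment.
        Supplier<String> healthyEndpoint = () -> "<div>report</div>";

        System.out.println("leaky stays bounded: " + staysBounded(leakyEndpoint, 120, 2.0));
        System.out.println("healthy stays bounded: " + staysBounded(healthyEndpoint, 120, 2.0));
    }
}
```

With these simulated endpoints the leaky one reports `false` and the healthy one reports `true`. Against a real application the threshold would need tuning, since legitimate report data can change size between refreshes.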
Do you have a bug whose story you love to tell? Let me know!