Trials of Working on the WF Designer Test Team

In a different vein from my usual posts, I’m going to talk about regular work: testing. For the release of WF 4.0 and Visual Studio 2010, our test team did a lot of testing via UI automation. These UI automation tests would run in the lab daily, with selected tests also running as pre-checkin validation many times a day. Testing via UI automation on machines set up by the lab team occasionally presents challenges that you don’t meet when testing manually.

The worst difficulties we faced early in our project were massive failures of automation in the daily run or pre-checkin validation environments. Some failures in our test runs occur every day, one now and then. Even after months of product and test stabilization, this is par for the course. But massive failures are different. In daily test runs they first give you no useful results to compare to yesterday’s, and second tie up all your test team’s time and energy, leaving you blind to incoming bugs in the product. And massive failures in pre-checkin validation? Spurious failures block both developers and testers from getting their work done and cause bad feelings, and those feelings are multiplied by the length in hours of the checkin queue (unfortunately, during peak times our checkin queue was about 2 or 3 days long!). Ultimately your tests get disabled, and you lose the protection that pre-checkin validation is supposed to provide.

There were two relatively minor issues which occurred early in the development cycle but left us in a lot of pain. The reason they were super painful was that we didn’t have a good way to diagnose them. Our test codebase was still immature, and:

1) although our tests had periodic logging statements once launched, full exception logs were not available when a test, for example, failed to start up
2) we had no useful record of the machine state that would allow us to figure out why a test would fail to start up on a particular lab machine
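
To make those two gaps concrete, here is a minimal Python sketch of the kind of safety net we were missing. It is not our actual test harness: the names run_with_diagnostics and machine_state are made up for illustration, and the fields recorded are just my guess at a useful minimum (SESSIONNAME is an environment variable Windows normally sets in interactive sessions; it may be absent elsewhere).

```python
import logging
import os
import platform
import socket
import traceback


def machine_state():
    """Collect a small, hypothetical snapshot of the machine running the test."""
    return {
        "host": socket.gethostname(),
        "os": platform.platform(),
        "user": os.environ.get("USERNAME") or os.environ.get("USER", "<unknown>"),
        "session": os.environ.get("SESSIONNAME", "<unset>"),
        "cwd": os.getcwd(),
    }


def run_with_diagnostics(startup, log=None):
    """Run a startup callable; on failure, log the full traceback and machine state.

    The exception is re-raised so the test still fails, but the log now
    explains what happened and where.
    """
    log = log or logging.getLogger("test.startup")
    try:
        return startup()
    except Exception:
        log.error(
            "test failed to start up\n%s\nmachine state: %r",
            traceback.format_exc(),
            machine_state(),
        )
        raise
```

The point of wrapping startup separately is that the failure is recorded even when the test never reaches its first ordinary logging statement.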

Occasionally we would try to enable one of our automated tests as a pre-checkin validation, it would run on the lab machine, it would fail, and we would wonder what on Earth had happened. Even worse, occasionally we would succeed in enabling one of our automated tests as a pre-checkin validation, and then one or two days later it would start mysteriously failing and blocking many people’s check-ins. At this point there would be cries of protest from the affected parties, and our test would be disabled, which put us back at square one.

If you don’t have enough information, you have to somehow get the information you need.

Our first mysterious failure became a lot less mysterious once we had full logs and screenshots, especially once we started capturing a screenshot in every test whenever it failed. We found that one of our failures was due to a little error dialog from a Windows service popping up during system startup and blocking us from interacting with the product.

But this screenshot code brought its own interesting issues. Occasionally our tests would crash in the screenshot code, after taking a screenshot that was completely black (no taskbar, no nothing). When this happened, BitBlt() had failed and GetLastError() returned 5: ERROR_ACCESS_DENIED. Much more rarely, BitBlt() would fail with a different error and not generate a screenshot at all.
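
In hindsight, the screenshot step should never have been able to crash the test. Here is a small Python sketch of a defensive wrapper, under some loud assumptions: capture is a hypothetical callable that returns the raw pixel bytes of the screen (or raises OSError when the underlying call fails), and "completely black" means every byte in that buffer is zero. Neither name comes from our real code.

```python
import logging


def is_blank_capture(pixels: bytes) -> bool:
    """True if the buffer is non-empty and every byte is zero (an all-black image)."""
    return len(pixels) > 0 and not any(pixels)


def safe_screenshot(capture, log=None):
    """Call a capture function without letting screenshot trouble crash the test.

    Returns the pixel bytes, or None if the capture failed or came back
    completely black.
    """
    log = log or logging.getLogger("test.screenshot")
    try:
        pixels = capture()
    except OSError as exc:
        # Assumption: the capture function reports API failures as OSError.
        log.warning("screenshot failed: %s", exc)
        return None
    if is_blank_capture(pixels):
        log.warning("screenshot is completely black; is the test's desktop active?")
        return None
    return pixels
```

Returning None instead of raising keeps the original test failure as the thing that gets reported, with the screenshot problem demoted to a warning in the log.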

Both of these issues were intermittent, but the black screenshot issue was by far the more frequent of the two. Asking on internal mailing lists, I found someone else who had encountered the same problem. Their explanation was that the logged-in user’s desktop session, the one the test is running in, isn’t the currently active desktop session. You can reproduce this behavior by

1) locking your screen (Windows key + L) while your test is running
2) connecting to the test machine by remote desktop, launching tests, and then minimizing remote desktop

What’s the story behind this?

Well, I had another chance to play with this recently. Windows exposes some APIs that let you create a new desktop, which brings up yet another (purer?) way to reproduce the behavior:

- Create a new desktop session, “foo”
- Using SwitchDesktop(), make “foo” the current desktop. (You will see the screen change to a blank desktop with just your wallpaper, no taskbar, etc.)
- In your main thread, which is still associated with the “default” desktop, take a screenshot.

Voila, the black screenshot of doom. :-)

It turns out that only one desktop is allowed to receive input from the user at a time. This active desktop is sometimes called the input desktop. It also seems that the input desktop is the only desktop that lets you read the pixels of the display with BitBlt(): while a desktop is not active for input, its screen basically doesn’t exist. This was a little disappointing to find out, because I had hoped each desktop would have its own separate input/output system. I guess Desktop is just not the abstraction I was looking for…
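
If you want a test to notice this situation before it takes a screenshot, one option is to compare the name of the calling thread’s desktop with the name of the input desktop. Below is a rough Python sketch using ctypes and the Win32 calls OpenInputDesktop(), GetThreadDesktop() and GetUserObjectInformationW(). The function name running_on_input_desktop is made up, treating a failed OpenInputDesktop() as "not on the input desktop" is my own assumption, and I have not checked how it behaves for the minimized remote desktop case. On anything other than Windows it simply returns None.

```python
import ctypes
import sys

UOI_NAME = 2                  # GetUserObjectInformationW: ask for the object's name
DESKTOP_READOBJECTS = 0x0001  # access right requested when opening the input desktop


def _desktop_name(user32, hdesk):
    """Return the name of a desktop handle, or None if the query fails."""
    buf = ctypes.create_unicode_buffer(256)
    needed = ctypes.c_ulong(0)
    ok = user32.GetUserObjectInformationW(
        hdesk, UOI_NAME, buf, ctypes.sizeof(buf), ctypes.byref(needed))
    return buf.value if ok else None


def running_on_input_desktop():
    """True/False on Windows; None on other platforms, where this does not apply."""
    if sys.platform != "win32":
        return None
    user32 = ctypes.WinDLL("user32", use_last_error=True)
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    user32.OpenInputDesktop.restype = ctypes.c_void_p
    user32.OpenInputDesktop.argtypes = [ctypes.c_ulong, ctypes.c_int, ctypes.c_ulong]
    user32.GetThreadDesktop.restype = ctypes.c_void_p
    user32.GetThreadDesktop.argtypes = [ctypes.c_ulong]
    user32.GetUserObjectInformationW.restype = ctypes.c_int
    user32.GetUserObjectInformationW.argtypes = [
        ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p,
        ctypes.c_ulong, ctypes.POINTER(ctypes.c_ulong)]
    user32.CloseDesktop.argtypes = [ctypes.c_void_p]
    kernel32.GetCurrentThreadId.restype = ctypes.c_ulong

    input_desk = user32.OpenInputDesktop(0, False, DESKTOP_READOBJECTS)
    if not input_desk:
        # Assumption: if we cannot open the input desktop, we are not on it.
        return False
    try:
        # The handle from GetThreadDesktop does not need to be closed.
        mine = user32.GetThreadDesktop(kernel32.GetCurrentThreadId())
        input_name = _desktop_name(user32, input_desk)
        my_name = _desktop_name(user32, mine)
        return input_name is not None and input_name == my_name
    finally:
        user32.CloseDesktop(input_desk)
```

A test could call this right before capturing and log a clear message when it returns False, instead of saving a black image and leaving someone to puzzle over it later.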

Oh, and how did the black screenshot problem get solved in the end? Well, from what we know so far, our best guess is that these failures were caused by someone logging in to the lab machines and then logging out again. The lab team has since locked down remote desktop to make unsafe connections much harder, and the issue is now nice and rare.

Plug: does this job sound like fun? Of course it is! We have positions in workflow designer test open right now. Send resumes or CVs to (tilovell at guess the software company dot com).