Why UI Automation Is Not All That And A Bag Of Chips

I spent some time today looking at an issue on a Korean system where a test was searching for a specific item in a combo box.  The test would iteratively select each item in the combo box, examine it and continue looping until it found the right entry.  The only problem was, this particular test was failing.  Oddly enough, when I watched the test run, I could see it selecting the right item in the combo box.  Upon debugging, I found that even though the right item was selected, the call to get the selected item was returning the text of the previously selected item.  So what was going on?

It turns out that the test was interacting with the combo box by sending keystrokes.  At first glance, this might seem like a good approach--after all, that is what users are going to do, right?  The problem is that there is a fundamental difference between users and test automation.  Users can take advantage of visual feedback to know when the system has actually processed their input.  Automation cannot do this.

The authors of the SendKeys code clearly realized this, and they attempted to ascertain when the input had actually been processed.  Their strategy was to make a SendMessageCallback call immediately after sending the keystrokes.  In case you are not familiar with it, SendMessageCallback sends a message to the specified window and then returns immediately; after the window's WndProc handles the message, the specified callback function is invoked and the result is passed back.  At first glance, this seems like a sound strategy.  Since the call is cross-process (and therefore cross-thread), the end result of the SendMessageCallback call is that a message is placed in the thread's message queue after the keystroke messages.  Since messages are processed serially, in order for the callback to be called, the keystroke processing must have completed, right?  Wrong.  The oversight here is re-entrancy.  If the keystroke handling code itself pumps messages, then the callback message will be completely processed before the keystroke handling has completed.  In other words, the callback message gets processed in the middle of the keystroke handling.  This is exactly what was happening in the bug I was investigating: the code to get the currently selected item in the combo box was getting called before the combo box had a chance to update its selection.
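To make the failure mode concrete, here is a minimal sketch of the strategy just described.  The helper name, the use of WM_NULL as a no-op message, and the polling loop are my illustrative assumptions, not the actual SendKeys implementation:

    #include <windows.h>

    static volatile LONG g_inputProcessed = 0;

    // SendMessageCallback invokes this (in our thread, while we pump
    // messages) after the target window has handled our message.
    static VOID CALLBACK InputDoneProc(HWND, UINT, ULONG_PTR, LRESULT)
    {
        InterlockedExchange(&g_inputProcessed, 1);
    }

    void SendKeysAndWait(HWND hwndTarget, INPUT* inputs, UINT count)
    {
        InterlockedExchange(&g_inputProcessed, 0);

        // Queue the keystrokes for the Raw Input Thread.
        SendInput(count, inputs, sizeof(INPUT));

        // Send a no-op message behind them; the hope is that by the time
        // the callback fires, the keystrokes have been fully processed.
        SendMessageCallback(hwndTarget, WM_NULL, 0, 0, InputDoneProc, 0);

        // Pump until the callback signals completion.  But if the target's
        // keystroke handler pumps messages itself, the WM_NULL gets handled
        // inside that handler, the callback fires, and this loop exits
        // before the keystroke handling has actually finished.
        while (!g_inputProcessed)
        {
            MSG msg;
            while (PeekMessage(&msg, nullptr, 0, 0, PM_REMOVE))
            {
                TranslateMessage(&msg);
                DispatchMessage(&msg);
            }
            Sleep(1);  // crude wait; purely for illustration
        }
    }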

The other very serious problem with this mechanism is that the SendKeys implementation uses the Win32 SendInput function.  All SendInput does is place input primitives on the system event queue for processing by the Raw Input Thread (RIT); once it has done that, it returns immediately.  On single-processor machines this works, because the RIT runs at a higher priority and a context switch will occur.  On multi-processor machines, however, the scheduling may work out such that the RIT runs concurrently on another processor and no context switch away from the SendKeys thread ever occurs.  It is then possible for the SendMessageCallback to execute before the RIT has even posted the keyboard messages to the application thread's queue.  The result is the same as in the re-entrancy scenario: SendKeys returns before the application's keystroke handler has actually executed.
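For reference, here is roughly what the input-generation side looks like; the helper name is mine, and the key point is the comment on the final call:

    #include <windows.h>

    // Synthesize a single key press and release.
    void PressKey(WORD vk)
    {
        INPUT inputs[2] = {};

        inputs[0].type = INPUT_KEYBOARD;           // key down
        inputs[0].ki.wVk = vk;

        inputs[1].type = INPUT_KEYBOARD;           // key up
        inputs[1].ki.wVk = vk;
        inputs[1].ki.dwFlags = KEYEVENTF_KEYUP;

        // Returns as soon as the primitives are queued for the RIT.  It
        // says nothing about when (or whether) the target application has
        // processed them; on a multi-processor machine, the RIT may not
        // even have posted the messages yet when this call returns.
        SendInput(2, inputs, sizeof(INPUT));
    }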

This very issue is the fundamental problem with all test code that attempts to automate user input by synthesizing keyboard and mouse events.  While it is often trivial to generate the inputs, it is very difficult to reliably determine when the input has actually been completely processed.  It is even more difficult to do this in a generic fashion (such as what SendKeys was attempting).  The end result is that UI automation tends to be flaky because it is so timing-dependent.  Inevitably, the inherent timing problems lead to a proliferation of Sleep statements strewn throughout test cases, and the automation gradually gets slower and slower over time.

In addition to the timing problems, UI automation turns into a nightmare when localized builds of the product must be tested.  The localized versions of all of those UI strings must somehow be located if you want to test a different language.  Typically, this involves groveling through the binaries you are testing for string resources and trying to map resource identifiers to strings.  The problem gets harder if you have duplicate strings.  For example, you might have several "OK" strings that differ only by context.  Unfortunately, those strings are duplicates only in English; in other languages, the differing contexts produce completely different strings.  So even when you are working from the English resources, you cannot assume that the first "OK" you find is the right string.
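As a rough illustration of what that groveling involves (the module path and the assumption that resource identifiers are stable across language builds are mine):

    #include <windows.h>
    #include <string>

    // Load a string out of a (possibly localized) binary's string table.
    // If resource IDs are stable across languages, looking up the same ID
    // in the Korean build yields the translation of the English string.
    std::wstring LoadStringFromModule(const wchar_t* modulePath, UINT id)
    {
        HMODULE hmod = LoadLibraryExW(modulePath, nullptr,
                                      LOAD_LIBRARY_AS_DATAFILE);
        if (!hmod)
            return L"";

        wchar_t buf[512] = {};
        LoadStringW(hmod, id, buf, 512);
        FreeLibrary(hmod);
        return buf;
    }

Even with a helper like this, the duplicate-string problem remains: the hard part is not loading a string by ID, but determining which of several identical English "OK" entries carries the ID you actually need.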

So all of these issues raise the question: why are we trying to automate the UI at all?  The typical answer is "well, we have to test what our users will actually do."  However, I would argue that such a view is overly simplistic.  For example, let's take the case of the combo box.  The combo box is a self-contained Win32 control.  User input has already been tested by the people who developed the control.  The combo box is just a state machine; there is no question about what will happen in response to any input a user can throw at it.

The part that has not been tested is how the application using the combo box reacts to its state changes.  In fact, the code we are responsible for testing is completely independent of the combo box UI.  All we need to do to fully test it is put the combo box into a given state and/or cause it to generate the events we are listening for.  We can do this by bypassing the user interface and coding directly against the combo box API, as sketched below.  To put it another way, we can theoretically achieve 100 percent code coverage without ever automating the UI.  Assuming that to be true, how could it be argued that we have not met our burden as testers?
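For the combo box, a minimal sketch of that approach might look like this.  The dialog handle and control ID are illustrative; note that CB_SETCURSEL does not send a CBN_SELCHANGE notification, so we raise it ourselves:

    #include <windows.h>

    // Drive the selection through the combo box API instead of the UI.
    void SelectComboItem(HWND hwndDialog, int controlId, int index)
    {
        HWND hwndCombo = GetDlgItem(hwndDialog, controlId);

        // Put the control into the desired state directly.
        SendMessage(hwndCombo, CB_SETCURSEL, index, 0);

        // Raise the notification the application's handler listens for,
        // exactly as the control would after a real user selection.
        SendMessage(hwndDialog, WM_COMMAND,
                    MAKEWPARAM(controlId, CBN_SELCHANGE),
                    reinterpret_cast<LPARAM>(hwndCombo));
    }

Because SendMessage does not return until the receiving thread has processed the message, the timing problem disappears: when the call returns, the selection-change handler has run.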

Certainly, there are those who would cry, "but what if the combo box is broken and clicks don't do what they are supposed to do?"  While that is possible, the reality is that nobody relies solely on automation to ship a product.  The rare bug that truly is a UI issue will be discovered through manual testing.

The other thing to consider is what we are trying to accomplish through automation.  We automate tests because we expect to have to execute them many times before the product ships.  Why is this?  So we can ensure that changes to the product do not erode its quality--that is, we use automation to catch regressions.  So in the context of UI, how often do we really expect regressions in UI behavior?  The answer is: not often.  Moreover, such bugs are generally simple, low-risk fixes--for example, re-adding some event hookup code that was incorrectly removed.  These are the kind of fixes that can be applied as a patch to a daily build if they do actually happen.

At the end of the day, testing is about trade-offs.  This is particularly true with respect to automation.  We simply cannot automate everything, so we must make smart choices about where we choose to invest our resources.  In my mind, the costs of UI automation are tremendous, yet the benefits are marginal compared to approaches that test immediately below the UI layer.  The amount of time spent developing, maintaining, and most importantly, analyzing failed test cases is far greater than the time it would take to just test the UI manually.  In the face of such costs and difficulties, it would seem that the only time full UI automation is appropriate is when the team is directly responsible for testing UI primitives that it has developed.  Even then, it may still be cheaper to test the keyboard and mouse interaction manually and restrict automation to more direct calls.