What's better than usability testing?

Catching up on my reading, today I finally read this Microsoft Watch article on the Office 2007 redesign: How Microsoft Wrapped the "Ribbon" in a Bow.

Among other things, it talks about how the Office team placed early (way pre-Beta) copies with testers in Fortune 500 companies in the Seattle area, with the aim of evaluating how people adapt to the new "Ribbon" UI in Office 2007 over time (what we call in the biz a 'longitudinal study'). On top of this, consider the 1.3 billion sessions of user data Microsoft has collected about how users make use of the features of Office 2003.

Now consider how much usage data you are able to collect when considering the UI design for your application or product. ... Sheesh, it's OK if you're a global mega-company, huh?

For most of us, the best usage data we get to influence our designs is the humble usability test. Microsoft does heaps of usability testing too, but through the sorts of activities above, they (we - must remember that) have been able to address a couple of fundamental issues with usability testing:

  1. Usability testing almost always only tests people's initial reactions to a user interface (because we don't have time/money to watch people use our products for a long time). This is problematic for systems like Office, where people's long-term performance is arguably more important than their initial reactions.
  2. Sample sizes for usability testing are usually very small. We justify this by pointing out that even if we only test with a small number of users, the problems we see are still problems. We just don't know how representative they are.
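
To put a rough number on that second point, here is a minimal sketch (my own illustration, not something from the article) that uses the standard Wilson score interval to estimate what share of the wider user population might hit a problem, given how many test participants hit it. The function name and the 3-out-of-5 figures are made up for the example, and it assumes your participants are a fair random sample of your users, which they rarely are.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Approximate 95% Wilson score interval for the proportion hits/n.

    hits: number of test participants who ran into the problem
    n:    total number of participants
    z:    normal quantile (1.96 gives roughly 95% confidence)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)

# Hypothetical example: 3 of 5 participants hit the same problem.
low, high = wilson_interval(hits=3, n=5)
print(f"Plausible share of all users affected: {low:.0%} to {high:.0%}")
```

With only five participants the interval is very wide, which is the point: the problem is real, but a small test says little about how common it is.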

Now there are those who are not big fans of usability testing for these and other reasons. Let me say that I am not one of them. Even when you think you are a genius designer, there is still stuff to learn from usability testing. Teena Harkins taught me this lesson many years ago. I was designing the UI for part of an application for a federal government department. As usual, the timeframes were ridiculous, and I decided that I needed to spend the time available designing, not testing. So Teena sat down next to my desk with a couple of sample users and printouts of my PowerPoint screen mockups. She ran them through the UI design in an informal usability test while I half-listened from my desk. The stuff Teena discovered from this simple usability testing exercise caused me to make some fundamental changes to the UI design.

So, by all means go forth and usability test, but where you have a chance, consider how you can address the concerns of sample size and long-term usage.

If you are working on a web site, the news is good. Firstly, most websites are designed with an emphasis on the 'first use' scenario, since usage of almost all websites is discretionary. Secondly, you can do as many organisations do today and trial design alternatives, and completely new ideas, with actual users in live sessions, and observe their behaviour through logging.
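
As a rough idea of what trialling alternatives with logging can look like, here is a minimal sketch under my own assumptions: two hypothetical variants "A" and "B", a user ID string for each visitor, and a log of (user ID, task completed) pairs. None of these names come from any particular product; real setups will have their own assignment and logging machinery.

```python
import hashlib
from collections import Counter

VARIANTS = ("A", "B")  # hypothetical design alternatives

def assign_variant(user_id: str) -> str:
    """Map a user to a variant deterministically, so that repeat
    visits by the same user always show the same design."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return VARIANTS[digest[0] % len(VARIANTS)]

def completion_rates(log):
    """Tally task completion per variant.

    log: iterable of (user_id, completed) pairs from live sessions,
         where completed is True if the user finished the task.
    Returns a dict mapping each variant seen to its completion rate.
    """
    seen, done = Counter(), Counter()
    for user_id, completed in log:
        variant = assign_variant(user_id)
        seen[variant] += 1
        done[variant] += int(completed)
    return {variant: done[variant] / seen[variant] for variant in seen}
```

Hashing the user ID rather than picking at random on each visit keeps the experience stable for returning users, which matters if you also care about behaviour beyond the first visit.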

Designers of line-of-business applications (like Office) have it worse. One advantage of an in-house audience, however, is that you might at least be able to do some longitudinal studies by getting the same people to come back for a series of usability tests over time (make sure you keep track of how many sessions each user has attended so you can interpret your observations properly). On the other hand, you may not be able to try out multiple versions of your UI 'live' on users - either because you don't have the resources to make multiple field-ready versions, or possibly because you have a relatively small audience of users to start with. If you have a small pool of users to draw from, then you have less scope to try out multiple design concepts, simply because too many different versions will add to the confusion overall. That is, in the interest of finding the most usable UI, you can find yourself adversely affecting usability by confusing people.
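
On the point about keeping track of how many sessions each user has attended, something as simple as the following sketch would do. The class and method names are my own invention for illustration; a spreadsheet works just as well.

```python
from collections import defaultdict

class SessionLog:
    """Records usability-test observations along with how many
    sessions each participant had attended at the time."""

    def __init__(self):
        self._counts = defaultdict(int)
        self.observations = []  # (participant, session_number, note)

    def record(self, participant, note):
        """Log one session for a participant and tag it with
        that participant's running session number."""
        self._counts[participant] += 1
        self.observations.append((participant, self._counts[participant], note))

    def by_experience(self):
        """Group observations by session number, so first-session
        reactions can be compared with later ones."""
        grouped = defaultdict(list)
        for participant, session_no, note in self.observations:
            grouped[session_no].append((participant, note))
        return dict(grouped)

# Hypothetical usage
log = SessionLog()
log.record("pat", "could not find Print")
log.record("sam", "could not find Print")
log.record("pat", "found Print straight away")
print(log.by_experience())
```

Grouping by session number lets you separate 'everyone struggles at first' from 'people still struggle on their third visit', which are very different findings.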

Sheesh, I knew I should have become a bicycle courier instead...

What experiences have you guys had trying to address these usability testing issues of sample size and long-versus-short-term usage?