Thoughts on testing speech applications

Personally, I feel one area that we did not address well in Speech Server is letting developers test their speech applications.  Granted, this is missing from many platforms these days, but I think it is especially important for speech applications because they can be tricky to test.

First, "testing" is a very broad topic.  There are many types of testing that you can (and should) do on a speech application, including but not limited to the following.

Functionality - This testing basically makes sure your application does not crash and works as designed.  This is what most people think of as "testing" and there are a number of publicly available test frameworks (such as NUnit) that can help you here.

Performance - For speech applications, this means your applications respond in an appropriate amount of time such that users do not feel that the application is slow to respond.  Our tuning tools can help you determine whether this is an issue, but we provide no automated way to do this.

Load - This involves running as many instances of your application as possible and verifying that they work reasonably under high server load.

Long haul - This involves running your application over a long period of time.  We performed long haul testing on Speech Server itself, but you want to make sure your application can handle running 24/7 for a long period of time.

Usability - Are your users able to use the system?  Actually we do provide a mechanism to test this - using the debugger that ships with the Speech Server tools.  The debugger was designed with specific usability testing features.

Globalization - Does your application work properly in other languages and locales?

There are other forms of testing, but I think for speech applications the above are the most useful.  For most companies, performance, load, and long haul testing are performed in pilot runs.  Functionality testing is often performed by the engineers themselves, though larger companies may have test teams.  Globalization testing is often folded into functionality testing.
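As a rough illustration of what a combined load and performance check could look like outside a pilot run, here is a minimal Python sketch.  Everything in it is an assumption made for the example: handle_turn is a hypothetical stand-in for one dialog turn of your application (it is not a Speech Server API), and the latency budget is arbitrary.  The idea is simply to run many simulated calls at once and record how long the slowest one took.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_turn(utterance: str) -> str:
    """Hypothetical stand-in for one dialog turn of the application."""
    time.sleep(0.01)  # pretend recognition and processing take 10 ms
    return f"You said {utterance}"


def timed_call(call_id: int) -> float:
    """Run one simulated call and return how long it took, in seconds."""
    start = time.perf_counter()
    handle_turn(f"call {call_id}")
    return time.perf_counter() - start


def run_load(concurrent_calls: int, max_latency_s: float) -> dict:
    """Run the simulated calls in parallel and summarize the latencies."""
    with ThreadPoolExecutor(max_workers=concurrent_calls) as pool:
        latencies = list(pool.map(timed_call, range(concurrent_calls)))
    worst = max(latencies)
    return {
        "calls": len(latencies),
        "worst_latency_s": worst,
        "within_budget": worst <= max_latency_s,
    }


if __name__ == "__main__":
    print(run_load(concurrent_calls=8, max_latency_s=2.0))
```

Running the same function in a loop for hours would be a crude form of long haul testing.  A real harness would have to place actual calls against the deployed application, which this sketch does not attempt.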

What would be nice is to have a way to automatically determine whether your changes broke another feature.  This is the way the majority of testing occurs at Microsoft - with automated tests that run regularly.  This automated testing occurred not just for the platform itself, but also for the tools and sample applications that we shipped.
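To make the idea concrete, here is a minimal Python sketch of such a regression test.  It assumes the dialog logic can be driven with scripted recognition results instead of live audio; dialog_step, run_script, and the pizza-ordering flow are all hypothetical names invented for this example, not part of Speech Server.

```python
from typing import List, Tuple


def dialog_step(state: str, recognized: str) -> Tuple[str, str]:
    """Hypothetical dialog logic: returns (next_state, prompt_to_play)."""
    text = recognized.strip().lower()
    if state == "ask_size":
        if text in ("small", "medium", "large"):
            return "confirm", f"You want a {text} pizza. Is that right?"
        return "ask_size", "Sorry, what size would you like?"
    if state == "confirm":
        if text == "yes":
            return "done", "Your order is placed. Goodbye."
        return "ask_size", "OK, what size would you like?"
    return "done", ""


def run_script(recognitions: List[str]) -> List[str]:
    """Feed scripted caller input to the dialog and collect the prompts."""
    state = "ask_size"
    prompts = []
    for recognized in recognitions:
        if state == "done":
            break
        state, prompt = dialog_step(state, recognized)
        prompts.append(prompt)
    return prompts


def test_happy_path() -> None:
    assert run_script(["large", "yes"]) == [
        "You want a large pizza. Is that right?",
        "Your order is placed. Goodbye.",
    ]


def test_unrecognized_size_reprompts() -> None:
    assert run_script(["purple"]) == ["Sorry, what size would you like?"]


if __name__ == "__main__":
    test_happy_path()
    test_unrecognized_size_reprompts()
    print("all scripted dialogs passed")
```

If a later change alters a prompt or a transition, the scripted transcript no longer matches and the test fails, which is exactly the "did my change break another feature" signal.  Note that this only covers the dialog logic; it says nothing about recognition accuracy or audio.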

Within our test and development teams, we had several platforms we could use to test the platform and applications.  Unfortunately, none of these will be publicly released, but I have a pet project to create something separate that I can release here (don't get your hopes up though - chances are I won't be able to finish it).

When we were shipping Speech Server, we tested our sample applications by automating the debugger.  As we had an extensive test framework that automated and verified the debugger, it was easy to expose this to the application testers.  This worked fine for testing our applications, but it breaks down with larger applications that make use of external resources such as databases.  For external users, this is also a difficult approach because it requires automating Visual Studio and the debugger, neither of which is a simple task.

The two remaining methods are UCMA and UCCA.

UCMA would be ideal here because we can create multiple connections to the application and therefore run functionality, stress, and performance tests.  Another nice thing about using UCMA is that a framework built this way could also test non-speech apps built using UCMA.  Unfortunately, UCMA does not currently have a media stack.  This makes sending audio to the application and verifying the responses it sends back exceedingly difficult.

The other option is UCCA.  UCCA does allow us to send and receive media, but we cannot create peer-to-peer connections.  For Speech Server, we shouldn't need to register our client.  I have not experimented with UCCA in depth, but it would also seem difficult to perform stress testing with it.  Still, when I have time to investigate UCCA I may look into the feasibility of this approach.