My Experience with Microsoft Speech Server 2007

01/26/2007

I just completed building my first MSS07 application and I thought I would jot down my thoughts and findings.

My Application

My application is very simple: I take one piece of information from the user, query a database via a web service, and play back a prompt based on the results of the query. Simple, right? Well, one requirement was to support both English and Spanish, and since I do not speak Spanish that made it a little difficult. I decided to go with a managed workflow application: I am very green to speech development but very familiar with managed code, so it seemed like a good choice. The application currently handles only inbound calls, but I hope to extend it to place outbound calls too. The entire development process took a couple of months, working only at night.
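The flow described above can be sketched in a few lines. This is only an illustration in plain Python, not Speech Server or workflow code: the prompt file names, the `PROMPTS` table, and the `lookup` callable standing in for the web service are all hypothetical.

```python
# Hypothetical prompt table: one recorded file per outcome and language.
PROMPTS = {
    "found": {"en": "found_en.wav", "es": "found_es.wav"},
    "not_found": {"en": "not_found_en.wav", "es": "not_found_es.wav"},
    "error": {"en": "error_en.wav", "es": "error_es.wav"},
}


def choose_prompt(language, lookup, user_input):
    """Return the prompt file to play for one caller turn.

    `lookup` is an assumed stand-in for the web service query; it takes
    the caller's input and returns a truthy value when a record exists.
    """
    # Fall back to English when the language is not one we recorded.
    lang = language if language in ("en", "es") else "en"
    try:
        result = lookup(user_input)
    except Exception:
        # Any failure in the lookup plays the apology prompt.
        key = "error"
    else:
        key = "found" if result else "not_found"
    return PROMPTS[key][lang]
```

Keeping the outcome-to-prompt mapping in one table means adding a third language later is a data change rather than a change to the call flow.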

Lessons learned

Speech is hard. I have written many command prompt, Windows, control, service, and web applications in my 10 years at Microsoft, but developing a good speech application was not a trivial task. Part of the problem is captured in my #2 lesson learned, but speech is just inherently hard to do correctly. I grew to appreciate just how difficult speech and IVR development can be, and I was really glad that the Speech team went with Windows Workflow Foundation (WF) as the primary interface. Developing speech applications with WF works really well; it is a very nice partnership of technologies. Some of the challenges I had were dealing with multiple languages, switching languages, handling silence, handling barge-ins, developing a custom grammar, and handling no recognition. Don't get me wrong, MSS helps tremendously here, but it cannot do everything.
Read the docs. Yeah, I hear this one all the time too, but for Speech Server, especially while it is in Beta, it pays off. The dividends are saved time and frustration. Fortunately for me, the Speech Server team is passionate not only about developing a solid speech product but also about helping out folks such as myself who tend to get stuck from time to time. Their quick and detailed responses to my questions not only resolved my issues but also offered insight into how the product works.
The samples are your friend. I typically learn by example, and the sample applications that ship with MSS07 Beta are very helpful when you need to solve a particular task such as playing a thinking sound, changing languages, or handling silence and non-recognition.
A well thought out, user-friendly UI directly impacts your user's experience. For speech applications, prompts and grammars are your UI. Do not underestimate the value of professional voice talent when developing a speech application. For my first attempt at prompts I sent a woman who works for a friend of mine to a studio to record them. Yuck, that was a $400 lesson. I then used Digital Base Productions; they were easy to work with and very reasonably priced. It is best not to engage a voice talent producer until you are locked in on your prompts, because going back and rerecording prompts costs time and money (yeah, I did this too). I created a spreadsheet (Excel of course) for my prompts with two columns, one for English and the other for Spanish. I had a friend of mine do the translation to Spanish and found that not everything translates well; for example, is the # key 'pound' or 'hash'? I used both. Digital Base was able to record my prompts and deliver them in only a couple of days. When you have prompts recorded in several languages, they will most likely be recorded by different individuals, so make sure the voices match well, that no one voice overpowers the others, and that the volume levels of the wav files are consistent.
Unless you speak the language (and I do not), the telecoms will get really confused when trying to match a solution to what you think your needs are. I went to both AT&T and Verizon asking for a VoIP solution where I would get SIP from them, which could then be sent directly to my MSS07 server. I got the feeling this is really new for them, and AT&T really had a hard time with the request. Verizon finally got it and then offered a solution that required my purchasing Cisco's Call Manager. This was a deal breaker when I priced out Call Manager, WOW$$. I received some good advice from one of the guys on the Speech Server team: keep it simple. So I went with a standard PRI T1. I found that Verizon and AT&T are really close on price; however, since I already had AT&T and their long distance ('LD' in telecom speak) costs were a bit less than Verizon's, I ordered the T1 from AT&T. A PRI T1 has 23 voice channels (phone lines), which is way more than I need.
I purchased a Mediant 2000 by Audiocodes with a single T1 card. I live in Dallas, and there just happened to be an ISV here locally that sold me the Mediant along with 4 hours of setup, in which the tech guys connected to the box and configured it. We probably only used about 30 minutes of that time, however, so if you decide to go with a Mediant you may want to try to set it up yourself. It has a web server interface much like many home routers. The one change we did have to make, buried pretty deep in the UI, was configuring it to send SIP over TCP, since its default is UDP. After that it worked, on the 10th test phone call (see next lesson)!
So my Mediant did not work on the first call, but I used Netmon3 to sniff the traffic and found that Netmon has a really nice SIP parser built right in. My problem, of course, was that MSS supports SIP over TCP while the Mediant was sending SIP over UDP. With Netmon it was a breeze to figure this out and get it changed on the Mediant.
I used a ton of Debug.WriteLine() statements to learn the flow my speech application took through the various turns and to learn when the various events were raised. This was critical to learning where to place code within the application.
AT&T provided me with 100 phone numbers, and to keep things simple I just mapped them all to my application. As you can imagine, I get plenty of wrong numbers. The occasional wrong number is not such a bother; however, I noticed I was getting repeated calls from the same numbers. To keep my LD costs down and keep from tying up my channels, I wrote a small piece of code that takes the calling number, checks it against a denied/black list, and drops the call if the number is on the list. I monitor my logs for phone spam and fax machines hitting my application and add their calling numbers to my black list. I wish MSS would provide this, as I am sure I am not the first person to have to write a kind of firewall for speech.
To decide on how I wanted my application to flow I called a ton of different IVR applications to see how they solved various problems and how they handled someone trying to trip them up. I found some I knew I did not want my application to resemble and others that I really liked. One in particular that I liked was American Airlines.
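The caller screening from the lesson about wrong numbers and phone spam can be sketched as follows. This is a minimal illustration in plain Python, not the managed code from my application and not a Speech Server API: the `CallScreen` class, the `normalize_number` helper, and the US-style number handling are all assumptions made for the example.

```python
import re


def normalize_number(raw):
    """Reduce a phone number string to its digits.

    Assumes US-style numbers: a leading country code 1 on an
    11-digit number is dropped so both forms compare equal.
    """
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


class CallScreen:
    """Hypothetical deny list checked before the call is answered."""

    def __init__(self, denied_numbers):
        self.denied = {normalize_number(n) for n in denied_numbers}
        # An empty entry must never match callers with no caller ID.
        self.denied.discard("")

    def should_drop(self, caller_id):
        number = normalize_number(caller_id)
        return number != "" and number in self.denied


screen = CallScreen(["(214) 555-0100", "+1 972 555 0199"])
drop_spammer = screen.should_drop("12145550100")   # on the list
drop_stranger = screen.should_drop("2145550123")   # not on the list
drop_anonymous = screen.should_drop("")            # no caller ID
```

Normalizing both the list entries and the incoming number avoids misses caused by formatting differences. Callers with no caller ID are let through here; whether to drop them instead is a policy choice.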

The hardware

Dell Precision 1850 with x64 Xeon processors, 4 GB of memory, and teamed 1 Gb NICs. It runs Speech Server like a champ.

Audiocodes Mediant 2000 with a single T1 card.

I developed my application on a Toshiba laptop. Although it has a built-in microphone, I decided to purchase a Plantronics headset with a microphone, since I mainly worked on this at night and I kept waking up my wife by "talking to my computer". Money well spent.

Thanks

Todd
