At the Convergence conference in Copenhagen I wanted a demo that showed another way of integrating with Ax, so I wrote one that leverages Speech Server 2007 to accept spoken commands over the telephone for entering sales orders. As it happens, this has some customer value beyond the purely technical demonstration: nearly everyone has access to a telephone, but not everyone has a smartphone or a phone capable of connecting to the internet, and those that do often have user interfaces that are not rich enough for entering sales orders. It would be useful to allow orders to be entered with any phone, portable or otherwise, using a simple keypad and the spoken word.
It turns out that there is a toolkit from Microsoft that is very well suited to the kind of application I was contemplating: the Microsoft Office Communications Server 2007 Speech Server SDK. In fact, this framework is a large superset of what is actually needed for this application. In a production environment you would have an Office Communications Server 2007 hooked up to the incoming phone line through a piece of dedicated hardware (a Voice-over-IP/VoIP gateway). The speech application then runs in Windows Workflow Foundation (WF) on the Speech Server. The Speech Server 2007 Developer Edition (download from http://www.microsoft.com/downloads/details.aspx?FamilyId=BB183640-4B8F-4828-80C9-E83C3B2E7A2C&displaylang=en) leverages Windows Workflow Foundation in that the application takes the form of a series of activities linked together in a dialog workflow that describes the phone conversation. The system features a rich design surface where these activities can be wired together and their properties set.
You can find out more about Speech Server in the MSDN forums at http://forums.microsoft.com/unifiedcommunications/default.aspx?siteid=57. People there can address the questions you may have, both technical and product-related. There is also a lot of really solid information on http://gotspeech.net/.
The workflow created for this demo is discussed below. Please bear in mind that the scenario may not be very realistic; it exists to present the underlying technology, not to solve a customer problem. The flow is shown in the workflow diagram. The code for the demo can be found at http://cid-b2f15fddcf82cd4e.skydrive.live.com/self.aspx/Public/VoiceResponseOrdering2.zip.
As you can see, the system starts by greeting and welcoming the user phoning in (after setting up the Ax connection in a code activity). Then an activity solicits the user's company by uttering:
“Enter your customer I.D, either by using the phone keypad or by saying your customer number or by saying your company name.”
The user has some choices here: he can key in the customer number on the phone's keypad (like 4-0-0-1), say the digits (“four”, “zero”, “zero”, “one”), say the number (“four thousand and one”), or say the name of the company (“The Glass Bulb”). The Question/Answer activity is built so that the question soliciting the company is repeated until an answer is provided. When a number is entered by any of the means above, it is validated as denoting a company in Ax. For efficiency, the initialization phase of the workflow fetches all the customers into an XML document (through the business connector); the validation then consists of an XPath query over this document. In this way the chattiness between the workflow and Ax is reduced. If the validation fails, the user is informed that the input did not designate a valid customer and is asked for the ID again through the execution of a Goto activity.
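The validate-against-a-cached-XML-document idea can be sketched as follows. This is not the demo's actual C# code; it is a minimal Python illustration of the same technique, and the XML shape (a Customers element with AccountNum and Name attributes) and the second customer are assumptions for the example:

```python
import xml.etree.ElementTree as ET

# Hypothetical shape of the customer list fetched once from Ax
# through the business connector at workflow initialization.
customers_xml = """
<Customers>
  <Customer AccountNum="4001" Name="The Glass Bulb"/>
  <Customer AccountNum="4002" Name="Contoso"/>
</Customers>
"""

doc = ET.fromstring(customers_xml)

def validate_customer(spoken_or_keyed):
    """Return the matching Customer element, or None when the input
    does not designate a valid customer (triggering a re-prompt)."""
    # ElementTree supports a limited XPath subset; match either the
    # account number or the company name against the caller's input.
    for path in (f".//Customer[@AccountNum='{spoken_or_keyed}']",
                 f".//Customer[@Name='{spoken_or_keyed}']"):
        hit = doc.find(path)
        if hit is not None:
            return hit
    return None

print(validate_customer("4001").get("Name"))   # The Glass Bulb
print(validate_customer("9999"))               # None -> ask again
```

Because the lookup runs against the in-memory document, no round trip to Ax is needed for each attempt the caller makes.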
At this point, the user has supplied a valid company. We now need to get the items that the user is ordering. This is done in a two-phase process: the first step is to solicit an item by uttering:
“Enter an order item. You may say either the item number or the item name.”
This is very similar to what was done above to get the customer ID. In this case, however, the item IDs do not lend themselves easily to entry through the keypad. The user may say an item ID, like “CN-01”, or an item name, like “Chrome Architect Lamp”. Once the item has been entered and validated (again by consulting an XML document fetched from Ax up front through the business connector), the user is asked for the quantity he wants to order:
“How many items do you wish to order? Use the keypad or say the number.”
The user can say “Three” or use the 3 key on the phone to achieve the same result. At this point the user will be notified of the item and how many were ordered:
“You added three CN-01 to your cart”
The information about the items ordered is stored in a list for later use. At this point the process of identifying an item and a quantity restarts. To exit this loop, the user can press the hash key (#) on the keypad or say “stop” where an item is expected. When this is done, execution picks up at the code activity that adds the sales order to Ax, again using the business connector. The user is then told that the order will be shipped to the address that the system has on file for the customer.
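The item/quantity loop with its hash-key or “stop” exit can be sketched as below. Again, this is a hedged Python illustration of the control flow, not the workflow's actual activities; the word list and the input format (alternating recognized item and quantity answers) are assumptions for the example:

```python
def collect_order_lines(answers):
    """answers: a sequence of recognized utterances / key presses,
    alternating item and quantity, e.g. ["CN-01", "three", ...].
    Returns the list of (item, qty) pairs that would later be turned
    into a sales order through the business connector."""
    words = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
    lines = []
    it = iter(answers)
    for item in it:
        if item in ("#", "stop"):      # caller exits the ordering loop
            break
        qty_input = next(it)
        qty = words.get(qty_input.lower())
        if qty is None:
            qty = int(qty_input)       # keypad entry, e.g. "3"
        lines.append((item, qty))      # stored for later use
    return lines

print(collect_order_lines(["CN-01", "three", "CN-02", "2", "#"]))
# [('CN-01', 3), ('CN-02', 2)]
```

In the real workflow the same effect is achieved with Question/Answer activities and a loop back via a Goto, but the shape of the state being accumulated is the same.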
The demo as presented here was written in a hurry, mainly in a hotel room with jetlag, as is the norm before conferences. There is one thing I would have liked to invest more time in that would have made the solution more suitable for production, but which wasn't a concern in a tightly controlled demo situation. This has to do with how the grammars that recognize the user's utterances are created. Currently the grammars are built from the contents of the Ax database (through the XML files that I alluded to above). While this does work in a demo, it is an inflexible solution because it only allows the user to say exactly what is expected, as in:
“chrome architect lamp”.
What is needed is a conversational grammar so the user would be able to say:
“I’d like a chrome architect lamp”,
“Gimme a chrome architect lamp please”
The Speech Server SDK does provide for such grammars, but I did not have the time to create the conversational grammars needed for this, most notably time to learn the editor used for creating them. In that tool you can define all the sentences a caller can use. I hope to learn it in a future project, so that Speech Server can understand flexible, natural-language input based on content from a database. The tool seems really powerful!
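To make the idea concrete, a conversational grammar in the W3C SRGS (GRXML) format wraps carrier phrases around a rule whose alternatives come from the database. The sketch below generates such a grammar in Python; the carrier phrases and the second item name are illustrative assumptions, and this is emphatically not the output of the SDK's grammar editor, just the general shape of the format:

```python
# Item names as they would be pulled from the Ax database.
item_names = ["chrome architect lamp", "halogen floor lamp"]

items = "\n".join(f"      <item>{name}</item>" for name in item_names)

# Optional carrier phrases (repeat="0-1") let the caller say
# "I'd like a chrome architect lamp please" as well as the bare name.
grammar = f"""<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="request">
  <rule id="request" scope="public">
    <item repeat="0-1">
      <one-of>
        <item>I'd like a</item>
        <item>gimme a</item>
      </one-of>
    </item>
    <ruleref uri="#itemName"/>
    <item repeat="0-1">please</item>
  </rule>
  <rule id="itemName">
    <one-of>
{items}
    </one-of>
  </rule>
</grammar>"""

print(grammar)
```

Regenerating the itemName rule whenever the database changes would keep the flexible phrasing while still grounding the recognizable vocabulary in Ax data.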