Choosing speech API features

James makes a good point that limiting API features to known examples of applications potentially lowers the ceiling on how well an API can adapt to unanticipated needs. I think he’s right, but it’s only one factor. When multiple new features are competing for a slot in the development schedule, another important factor will always be the relative demand for each feature. So it’s always good to understand what scenarios a feature will enable, whether demand already exists, and, as James mentions, the strategic value of widening the API and enabling new innovation, as well as other things like cost and completeness.

Comments (3)

  1. not to be a punk … but it feels like i’ve been waiting so long for a new version of SAPI that i don’t mind waiting longer for more features.

    i also think phoneme segmentation would provide more visibility into the ‘magic’ behind speech technologies. if this allowed more people to understand the ‘magic’ … then i think it would bring more developers into the speech reco field.

    it would also be good if i had an API that i could provide the word as text, and then get the phonemes for that word returned. not sure if you’d make that public, but it could be useful as well.

  2. Casey: try CMUDICT; what you want is already free in a flat text file. All you have to do is make sure the phoneme alphabet is correct for your target API (and perform the necessary set of global search/replaces if it isn’t) and then use the lookup table of your choice. 125,000+ unique mappings is often too much for a linear (grep) search, but these days 3.5 MB is usually a pretty easy allocation, even in a general hash table.

    Robert: Thank you for your kind words and careful thoughts. I remember making this same request of Mike Rozak more than ten years ago now(!) at AVIOS’94, and I was very glad when phoneme segmentation made it into SAPI 4.0a (but not so glad about access to it being so obscured as to be essentially restricted to the WEDIT.EXE demo and having to wait for someone in Germany to reverse-engineer the magic access values out of it.)

    I understand the trade-offs which were made to achieve recognition results access methods, and everything else, at as high a level as possible for SAPI 5, but it was still very disappointing to see phoneme segmentation go away. However, it is heartening to know that lower-level API functions for phoneme results are conceptually easy to add: possibly one method to warn the recognizer that they’re going to be requested (so as to turn on recording of segment endpoints if they aren’t already recorded), and at least one other new function to request the phoneme results in the new format.

    I don’t know whether their omission in SAPI 5 was more of an attempt to make everything look as simple as possible in the hope of attracting a wider developer base, or a resource constraint, such as just too many other things needing to get done. If the former, the clean break of adding one or more new results methods can be tucked away in an appendix or chapter where it won’t bother the newbies. If it’s still a matter of too many other things of higher priority, well, that’s the way it goes, but I figured you wouldn’t have asked your question about what features people want to see if things were as tight as they seemed to be around the introduction of SAPI 5.
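
The word-to-phoneme lookup suggested in comment 2 could look roughly like the sketch below. It assumes the usual CMUdict text layout (a word, whitespace, then space-separated phonemes; comment lines starting with `;;;`; alternate pronunciations marked with a suffix such as `(1)`). The function names `load_cmudict` and `phonemes_for`, the inline sample lines, and the `TO_TARGET` mapping are all hypothetical; the real mapping depends on the phoneme alphabet your target API expects.

```python
import re
from collections import defaultdict

# Matches the alternate-pronunciation suffix, e.g. the "(1)" in "READ(1)".
_VARIANT = re.compile(r"\(\d+\)$")


def load_cmudict(lines, remap=None):
    """Build a word -> list-of-pronunciations hash table.

    `lines` is any iterable of text lines (an open file works).
    `remap` is an optional dict translating dictionary phoneme symbols
    into the target alphabet; unmapped symbols pass through unchanged.
    """
    remap = remap or {}
    table = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        head, *phones = line.split()
        word = _VARIANT.sub("", head).lower()
        table[word].append([remap.get(p, p) for p in phones])
    return dict(table)


def phonemes_for(table, word):
    """Return every known pronunciation for `word` (empty list if unknown)."""
    return table.get(word.lower(), [])


# Sample lines in the assumed layout; a real run would pass an open file.
SAMPLE = [
    ";;; sample entries only",
    "SPEECH  S P IY1 CH",
    "READ  R EH1 D",
    "READ(1)  R IY1 D",
]

# Hypothetical symbol mapping, standing in for the global search/replace step.
TO_TARGET = {"S": "s", "P": "p", "IY1": "iy", "CH": "ch",
             "R": "r", "EH1": "eh", "D": "d"}

table = load_cmudict(SAMPLE, remap=TO_TARGET)
print(phonemes_for(table, "speech"))
print(phonemes_for(table, "read"))
```

A plain dict is used because, as the comment notes, a table of this size fits comfortably in memory and gives constant-time lookups where a linear scan of the file would not.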