Phoneme segmentation

In a recent comment, James Salsman wrote that “SAPI 4.0a had phoneme segmentation” and asked that we put it back into our newer APIs. (You can see more about SAPI 4 here.)


It’s been a long time since we made an API with this functionality.  I’m curious to know whether anybody else would like to see this, and just as importantly, what scenarios it would enable for you.

Comments (3)

  1. by phoneme segmentation, do you mean i would say "hello" and it would return some string like "HH AH L OW"? this would be kind of like the InkDivider class in the Tablet API.

    if so … how about for speech therapy or learning a new language? i could speak the word the way i thought it was pronounced, and SAPI could return the phonemes that it heard. then my app could compare it against what it expected and correct me. if it returned what phoneme was stressed, as well as confidence … then you could do cooler stuff. e.g. give them some rating about how much they sound like a native speaker.

    or if i could get access to the phonemes as wav segments (or timings) for the actual wav, then maybe i could muck with those wav segments and write some sort of speaker verification biometric?

  2. Casey mentioned the educational uses that I’m most interested in. Enabling the ability to more effectively teach beginning reading or a second language is very important. With phoneme segmentation, you can zoom in on pronunciation errors, and compare likely alternatives to each phoneme in a word, pinpointing mispronunciations. My product ( ) does this.

    However, there are other industrial applications. Suppose you have a vocabulary of 1,000 names, 10 pairs of which are similar enough that wrong numbers are unacceptably frequent. You can use phoneme segmentation to "zoom in" on the most discriminant portion of each of the 20 suspect names, using a second-pass recognition on the discriminant segment (and the neighboring phonemes; in my experience, trying to do recognition on a single phoneme segment hardly ever works) to make the final decision.

    Another example is use by the speech scientists who use recognition to automate manual speech transcription. It is much easier and more reliable to have two human proofreadings of the same imperfect machine transcription (and resolve any conflicts with a third) than it is to have human(s) do the entire transcription by hand.

    For a completely unrelated example, phoneme segmentation can help animators automate mouth movements keyed to a voice recording. The dollar size of the animation industry is probably bigger than the speech-based educational software, IVR, and speech science industries combined.

  3. I should also mention the philosophical point of providing API use possibilities to your developers which will maximize — and I mean this very seriously — their freedom to innovate. We can’t anticipate everything that someone might someday do with phoneme segmentation, so trying to decide whether to include it in an API by attempting to make an exhaustive list of all the known use cases is probably a mistake.

    The Microsoft GUI APIs have ways to draw complex high-level shapes in layered windows along with hundreds of fonts, but they also all have ways to provide the programmer with raw bitmap access to each pixel of the display. Likewise, we can use sophisticated edit controls in all Microsoft operating systems, but also access the up/down state and corresponding events for each key and hardware button. Think of the outcry that would result if suddenly those capabilities were removed. How many unanticipated yet useful applications have such low-level functions enabled over the years?

    For the same reasons, access to phoneme recognition results and their temporal endpoints (and preferably their posterior probabilities or confidence scores) should not be withheld. The most fundamental argument for exposing phoneme segmentation in the API is not a particular set of use cases. It is the information-theoretic argument that the segmentation is a meaningful recognition result, so obscuring it truncates the universe of possibilities available to developers and their users and limits the functionality of the applications possible with the API.

    If people don’t need it, it won’t get in their way. You have to ask yourself: Would you rather offer only the tools which have been required by most people so far, or also the tools foreseeably needed for solutions in the future?
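
The pronunciation-feedback scenario from the first comment can be sketched roughly as follows. This is only an illustration in Python, not SAPI code: the PhonemeSegment type, its fields (phoneme, start_ms, end_ms, confidence), and the sample values are all hypothetical stand-ins for whatever a segmentation API might return. The comparison uses the standard library's difflib.SequenceMatcher to align the expected phonemes with the ones that were heard.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class PhonemeSegment:
    """Hypothetical segmentation result: one heard phoneme and its time span."""
    phoneme: str
    start_ms: int
    end_ms: int
    confidence: float


def compare_pronunciation(expected, heard):
    """Align expected phoneme symbols with heard segments.

    Returns a list of (operation, expected_phonemes, heard_segments) tuples,
    one per region where the two sequences differ. The operation is
    "replace", "delete", or "insert", as reported by SequenceMatcher.
    """
    heard_symbols = [seg.phoneme for seg in heard]
    matcher = SequenceMatcher(a=expected, b=heard_symbols, autojunk=False)
    mismatches = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            mismatches.append((op, expected[i1:i2], heard[j1:j2]))
    return mismatches


# Invented sample data: the speaker says "hello" with the wrong first vowel.
expected = ["HH", "AH", "L", "OW"]
heard = [
    PhonemeSegment("HH", 0, 80, 0.91),
    PhonemeSegment("EH", 80, 190, 0.62),
    PhonemeSegment("L", 190, 260, 0.88),
    PhonemeSegment("OW", 260, 420, 0.93),
]

mismatches = compare_pronunciation(expected, heard)
for op, wanted, got in mismatches:
    spans = [(seg.phoneme, seg.start_ms, seg.end_ms) for seg in got]
    print(op, "expected:", wanted, "heard:", spans)
```

Because each mismatch carries the time span of the offending segment, an application could replay or re-recognize just that stretch of audio. The same kind of alignment, run on the phoneme strings of two similar names, would likely be one way to find the discriminant portion mentioned in the second comment.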
