Scientific American recently posted an overview article on text-to-speech. I know that other bloggers have been mentioning it, but since I’m a TTS-centric blogger, I thought that I should put in a plug for it here.
One of the interesting questions raised by the authors is the following: “Should machine speech be indistinguishable from a human speaker, as in the well-known Turing test for artificial intelligence?” The authors conclude, “probably not.” They say that a “better goal” (rather than a voice that could ‘trick’ a human listener) is a voice that is “pleasing [and expressive] to which people feel comfortable listening.” Do the authors truly believe this? Or do they mean a more realistic goal rather than a better goal? I find it hard to believe that every TTS engineer isn’t trying to make TTS voices that sound as realistic as possible. Just because you can make a voice that ‘tricks’ a user doesn’t mean that it has to be deployed that way. That is, if you can create a voice that would trick a user, you could just as easily tweak it so that it’s less realistic and perhaps more suited for a warning system or “video games” (I’m not sure the authors are gamers, or else they wouldn’t have suggested that natural human speech is not the most appropriate choice for video games). You can bet that if the folks at AT&T Labs could make a voice that tricks a user, they would be writing a much different ending to their article.