Speech Grammars in F#

People say that Vim keys are a grammar for talking to your editor, and that's exactly what they are. One weekend a while back I had fun making VimSpeak to see how well mapping English words to Vim keys would work. It turned out quite nicely, and some pieces of how it was built (in particular the grammar description format) might be useful to others, so here's how it works. Here's a demo of VimSpeak in action:

[View:https://www.youtube.com/watch?v=qy84TYvXJbk]

If you peruse the code, you can actually learn quite a bit about the grammar of Vim itself. You'll notice that it's a very declarative set of definitions. The API provided by System.Speech.Recognition is very imperative and somewhat ugly, so I made this little helper:

open System
open System.Speech.Recognition

// A declarative description of a speech grammar. 'a is the type of the
// optional semantic value attached to a word.
type GrammarAST<'a> =
    | Word       of string * 'a option
    | Optional   of GrammarAST<'a>
    | Repeatable of GrammarAST<'a>
    | Sequence   of GrammarAST<'a> list
    | Choice     of GrammarAST<'a> list
    | Dictation

// Translate the AST into the GrammarBuilder that System.Speech expects.
let rec speechGrammar = function
    | Word (say, Some value) ->
        let g = new GrammarBuilder(say)
        // The semantic value is stored in its string form.
        g.Append(new SemanticResultValue(value.ToString()))
        g
    | Word (say, None) -> new GrammarBuilder(say)
    // Zero or one occurrence.
    | Optional g -> new GrammarBuilder(speechGrammar g, 0, 1)
    // One or more occurrences.
    | Repeatable g -> new GrammarBuilder(speechGrammar g, 1, Int32.MaxValue)
    | Sequence gs ->
        let builder = new GrammarBuilder()
        List.iter (fun g -> builder.Append(speechGrammar g)) gs
        builder
    | Choice cs -> new GrammarBuilder(new Choices(List.map speechGrammar cs |> Array.ofList))
    | Dictation ->
        // Accept either free dictation or letter-by-letter spelling.
        let dict = new GrammarBuilder()
        dict.AppendDictation()
        let spelling = new GrammarBuilder()
        spelling.AppendDictation("spelling")
        new GrammarBuilder(new Choices(dict, spelling))

This lets you construct nice-looking, declarative grammars from the discriminated union and then run them through the speechGrammar function to get the GrammarBuilders that System.Speech.Recognition consumes.

You can have simple words and optionally associate them with some meaningful value. Restricted grammars are recognized much more accurately than free dictation and spelling, but free dictation is available too. You can have optional bits of grammar, bits that may be repeated, sequences of things that must be said in a particular order, choices from a set of options, etc.
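To get a feel for what each case accepts, here is a minimal sketch that lists sample phrases for a grammar without touching the speech engine at all. The phrases helper and the command grammar are made up for illustration and are not part of VimSpeak. Repeatable is expanded to a single repetition and Dictation to a placeholder, so the output is a sample rather than everything the grammar allows.

```fsharp
// Same union as in the helper above, repeated so this sketch stands alone.
type GrammarAST<'a> =
    | Word       of string * 'a option
    | Optional   of GrammarAST<'a>
    | Repeatable of GrammarAST<'a>
    | Sequence   of GrammarAST<'a> list
    | Choice     of GrammarAST<'a> list
    | Dictation

// Hypothetical helper: list sample phrases that a grammar accepts.
let rec phrases grammar : string list =
    match grammar with
    | Word (say, _) -> [ say ]
    | Optional g    -> "" :: phrases g          // absent or present
    | Repeatable g  -> phrases g                // one repetition only
    | Choice cs     -> List.collect phrases cs  // any one alternative
    | Dictation     -> [ "<dictation>" ]        // placeholder for free speech
    | Sequence gs   ->
        // Combine every phrase of each part with every phrase of the rest.
        List.foldBack
            (fun g rest ->
                [ for head in phrases g do
                    for tail in rest ->
                        (head + " " + tail).Trim() ])
            gs
            [ "" ]

// Made-up example grammar that uses each combinator once.
let command : GrammarAST<string> =
    Sequence [
        Optional (Word ("please", None))
        Choice [ Word ("open", None); Word ("close", None) ]
        Repeatable (Word ("the file", None)) ]

phrases command |> List.iter (printfn "%s")
```

For this grammar the sketch prints "open the file", "close the file", "please open the file" and "please close the file", in that order.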

A demo should make it clear enough. Let's start by letting someone introduce themselves. We could have a grammar listing choices of possible names, but here we'll just let them dictate their name. However, the phrase preceding the name is restricted to the grammar:

let name = Dictation

let intro =
    Sequence [
        Choice [
            Word ("My name is", None)
            Word ("I'm", None)]
        name]

This lets you say, "My name is Ashley" or "I'm Fred", etc. Let's let them say various greetings and goodbye phrases as well:

let greeting =
    Sequence [
        Choice [
            Word ("Hello", Some "greeting")
            Word ("Howdy", Some "greeting")
            Word ("Hi",    Some "greeting")]
        Optional name]

let goodbye =
    Sequence [
        Choice [
            Word ("Goodbye", Some "goodbye")
            Word ("See ya",  Some "goodbye")
            Word ("Ciao",    Some "goodbye")]
        Optional name]

Now we can say "Hello Joe", "Howdy", "See ya Mr. Bean", "Ciao", ... Notice now we're attaching a semantic value indicating whether it's a "greeting" or a "goodbye". This makes it easy (without parsing) to pull this information out of recognized phrases later.
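Since the value slot is generic, the semantic values don't have to be strings. Here is a sketch that uses a small union instead, assuming (as in the helper above) that the value reaches the recognizer in its ToString() form. The Intent type and the parseIntent function are hypothetical names introduced only for this illustration.

```fsharp
// Same union as in the helper above, repeated so this sketch stands alone.
type GrammarAST<'a> =
    | Word       of string * 'a option
    | Optional   of GrammarAST<'a>
    | Repeatable of GrammarAST<'a>
    | Sequence   of GrammarAST<'a> list
    | Choice     of GrammarAST<'a> list
    | Dictation

// Hypothetical semantic type replacing the "greeting"/"goodbye" strings.
type Intent =
    | Greeting
    | Goodbye

let greetingWords : GrammarAST<Intent> =
    Choice [
        Word ("Hello", Some Greeting)
        Word ("Howdy", Some Greeting)
        Word ("Hi",    Some Greeting) ]

// Hypothetical inverse of the ToString() call: map the semantic value that
// comes back from recognition (an obj, possibly null) to an Intent.
let parseIntent (semantic : obj) : Intent option =
    match semantic with
    | null -> None
    | value ->
        match value.ToString() with
        | "Greeting" -> Some Greeting
        | "Goodbye"  -> Some Goodbye
        | _          -> None

// Round trip without a microphone: string Greeting is "Greeting".
printfn "%A" (parseIntent (box (string Greeting)))
```

The pay-off is that the handler can match on Intent cases, so a typo in a case name becomes a compile error instead of a phrase that is silently ignored.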

We can create and initialize the speech recognition engine:

let reco = new SpeechRecognitionEngine()
try reco.SetInputToDefaultAudioDevice() 
with _ -> failwith "No default audio device! Plug in a microphone, man." 

reco.LoadGrammar(new Grammar(speechGrammar greeting)) 
reco.LoadGrammar(new Grammar(speechGrammar intro)) 
reco.LoadGrammar(new Grammar(speechGrammar goodbye)) 

And for the heck of it, let's throw in some speech synthesis while we're at it:

open System.Speech.Synthesis

let synth = new SpeechSynthesizer()
synth.SelectVoiceByHints(VoiceGender.Female)

let speak (text : string) =
    reco.RecognizeAsyncStop()
    synth.Speak text |> ignore
    reco.RecognizeAsync(RecognizeMode.Multiple)

Funnily enough, it is possible for the machine to talk to itself: the recognizer can pick up the synthesizer's speech through the microphone. This is why the speak function temporarily stops recognition.

Finally, we can use all this for a simple demo:

reco.SpeechRecognized.Add(fun a ->
    let res = a.Result
    if res <> null then
        // Echo the recognized text and the engine's confidence.
        printfn "%s (%f)" res.Text res.Confidence
        let sem = res.Semantics.Value
        if sem <> null then
            match sem.ToString() with
            | "greeting" -> speak "Hello there!"
            | "goodbye"  -> speak "See you later!"
            | _          -> ()) // ignore any other semantic value
reco.RecognizeAsync(RecognizeMode.Multiple)

// Keep the process alive while recognition runs in the background.
Console.ReadLine() |> ignore

Here we just echo back what we think we heard and also speak back depending on the semantic value of what was said.
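One thing you might try next is pulling the "what should I say back" decision out of the event handler so it can be exercised without a microphone. The sketch below does that and also skips results the engine isn't confident about. The replyFor name and the 0.5 cut-off are assumptions of mine, not something taken from VimSpeak, so tune the threshold to taste.

```fsharp
// Hypothetical helper: decide what to say for one recognition result.
// minConfidence is an assumed cut-off, not a value from the article.
let replyFor (minConfidence : float32) (confidence : float32) (semantic : obj) : string option =
    if confidence < minConfidence then
        None // too uncertain, stay quiet
    else
        match semantic with
        | null -> None // phrase carried no semantic value
        | value ->
            match value.ToString() with
            | "greeting" -> Some "Hello there!"
            | "goodbye"  -> Some "See you later!"
            | _          -> None

// Exercising it with made-up confidence values:
printfn "%A" (replyFor 0.5f 0.9f (box "greeting"))
printfn "%A" (replyFor 0.5f 0.2f (box "greeting"))
printfn "%A" (replyFor 0.5f 0.9f null)
```

Inside the handler this would be used as replyFor 0.5f res.Confidence res.Semantics.Value |> Option.iter speak, with speak being the function defined earlier.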

Take this and have some fun with it!