Tuesday, August 16, 2011

Hello Kinect : Speech recognition

If you have ever used the speech recognition built into recent versions of Windows, you will know the troubles it has. It is good, but not good enough to make you want to use it on an everyday basis. It is good when it works, and painful at other times. Handwriting recognition, or merely typing, is much less annoying when you need to get real work done. Most of the trouble comes when you are in a room where other people are also talking: the engine gets confused.
With Kinect, however, things are better, as the speech recognizer also uses additional input from the multi-array microphone to estimate the direction of the sound source it should recognize. That makes speech recognition quite interesting.
So what do we need to recognize speech using the Kinect SDK?
First of all you need to add Microsoft.Speech.dll to your project, and then make the following two imports:

using Microsoft.Speech.AudioFormat;
using Microsoft.Speech.Recognition;

All of the speech recognition work is then essentially handled by two classes: KinectAudioSource and SpeechRecognitionEngine. The latter is part of the Microsoft Speech API and provides a generalized framework for speech recognition.

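// Fields of the class that owns the recognizer.
private KinectAudioSource kinectSource;
private SpeechRecognitionEngine sre;

// Find the installed recognizer whose Id matches RecognizerId
// (a string defined elsewhere in the class). Where() needs System.Linq.
RecognizerInfo ri = SpeechRecognitionEngine.InstalledRecognizers().Where(
r => r.Id == RecognizerId).FirstOrDefault();
if (ri == null) return; // no matching recognizer installed
sre = new SpeechRecognitionEngine(ri.Id);

// Build a grammar that accepts only the words "hello" and "kinect".
var helloChoice = new Choices();
helloChoice.Add("hello");
helloChoice.Add("kinect");
var gb = new GrammarBuilder();
gb.Append(helloChoice);
var g = new Grammar(gb);
sre.LoadGrammar(g);

// Callbacks for accepted, tentative and rejected results.
sre.SpeechRecognized += sre_SpeechRecognized;
sre.SpeechHypothesized += sre_SpeechHypothesized;
sre.SpeechRecognitionRejected += sre_SpeechRecognitionRejected;

// Open the Kinect audio stream on a separate thread (Thread needs System.Threading).
var t = new Thread(StartKinectAudioStream);
t.Start();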

For the speech recognizer to work correctly, you need to provide the words that need to be identified. This is done by constructing a 'grammar' for them. In the above code we construct a simple grammar that recognizes only two words, 'hello' and 'kinect'. Next we register event handlers, which are the 'callbacks' invoked when the SpeechRecognitionEngine recognizes (or fails to recognize) something that is spoken.
After this we open the Kinect's audio stream and start listening to it in a different thread.
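
As a small variation on the grammar above, here is a sketch that pins the grammar to the culture of the chosen recognizer. The post's code does not do this; the HelloGrammar class and its Build method are names made up for this example, and the assumption is that a grammar built under a different system culture than the recognizer's may fail to load.

```csharp
using Microsoft.Speech.Recognition;

static class HelloGrammar
{
    // Builds the same two-word grammar as above, but tied to the
    // recognizer's culture instead of the current system culture.
    public static Grammar Build(RecognizerInfo ri)
    {
        // Choices accepts the alternatives directly in its constructor.
        var words = new Choices("hello", "kinect");

        var gb = new GrammarBuilder { Culture = ri.Culture };
        gb.Append(words);

        return new Grammar(gb);
    }
}
```

It would be used as sre.LoadGrammar(HelloGrammar.Build(ri)); in place of the lines that build g by hand.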

The body of the StartKinectAudioStream() function is as follows:

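kinectSource = new KinectAudioSource();
// Microphone-array processing only, without acoustic echo cancellation.
kinectSource.SystemMode = SystemMode.OptibeamArrayOnly;
// FeatureMode must be on for the settings below to take effect.
kinectSource.FeatureMode = true;
// Automatic gain control is turned off for speech recognition.
kinectSource.AutomaticGainControl = false;
// Let the array steer its beam adaptively.
kinectSource.MicArrayMode = MicArrayMode.MicArrayAdaptiveBeam;
var kinectStream = kinectSource.Start();
// 16 kHz, 16-bit, mono PCM: 32000 bytes per second, block alignment of 2 bytes.
sre.SetInputToAudioStream(kinectStream, new SpeechAudioFormatInfo(
                                               EncodingFormat.Pcm, 16000, 16, 1,
                                               32000, 2, null));
sre.RecognizeAsync(RecognizeMode.Multiple); // keep recognizing until stopped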

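The code above basically configures the microphone array for adaptive beamforming, so the array itself steers its listening beam toward the sound source. The skeletal tracker is not involved in this code; the camera only comes in if you want to connect the sound source to a tracked skeleton.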

Finally, the signatures of the event handlers for the speech recognizer are as follows:

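void sre_SpeechRecognitionRejected(object sender, SpeechRecognitionRejectedEventArgs e)
{
   // called when something was heard but did not match the grammar
}
void sre_SpeechHypothesized(object sender, SpeechHypothesizedEventArgs e)
{
   // called with tentative matches while the speaker is still talking
}
void sre_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
   Console.Write("\rSpeech Recognized: \t{0}", e.Result.Text);
   lastRecognizedWord = e.Result.Text; // string field of the class
}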
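
Since the trouble with speech recognition is mostly false matches in a noisy room, one thing you may want to try is to look at the confidence score before accepting a word. The following is only a sketch: the ConfidenceFilter class, the OnSpeechRecognized name and the 0.7 cut-off are assumptions made for this example and are not part of the code above. Only e.Result.Text and e.Result.Confidence come from the speech API.

```csharp
using System;
using Microsoft.Speech.Recognition;

class ConfidenceFilter
{
    // Assumed cut-off, not from the post; tune it for your room.
    private const float MinConfidence = 0.7f;

    // Last word that passed the confidence check.
    public string LastRecognizedWord { get; private set; }

    // Attach with: sre.SpeechRecognized += filter.OnSpeechRecognized;
    public void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        // Confidence is a relative score assigned by the engine.
        if (e.Result.Confidence < MinConfidence)
        {
            Console.WriteLine("Ignored: {0} ({1:F2})",
                              e.Result.Text, e.Result.Confidence);
            return;
        }

        LastRecognizedWord = e.Result.Text;
        Console.WriteLine("Accepted: {0} ({1:F2})",
                          e.Result.Text, e.Result.Confidence);
    }
}
```

The score is relative rather than a true probability, so the right cut-off is something to find by experiment.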

Here is a short video:

More to come soon :)

5 comments:

rkelly3001 said...

ganesh nice example. i've done a similar thing, but have also used the windows text to speech to respond back to my commands. check it out on youtube.

http://www.youtube.com/watch?v=OvqKMQNxF1I&feature=player_embedded

V. Ganesh said...

rkelly, this is quite good stuff. keep me posted of your progress :)

Prassanna said...

Nice Post.

But if I am not wrong, the Kinect has an audio sensor array, which tries to determine the source angle and confidence. This angle has nothing to do with the camera system.

Correct me if I am wrong.

V. Ganesh said...

@Prassanna Thanks for pointing this out, I had overlooked the text (updated the post).
The only reason a camera is involved is to connect the sound source to a skeletal object.