Kinect SDK 1.0 - 5 - Speech Recognition

22. April 2012 21:04 by Renaud
 1. Introduction to the API
 2. Use the ColorImageStream
 3. Track the users with the SkeletonStream
 4. Kinect in depth!
 5. Speech recognition

Here is the last post of this series about the Kinect SDK 1.0. In the previous ones, we saw how to display the ColorImageStream video, how to track the Skeleton, and how to produce a 3D video with the DepthImageStream.

In this final post, we will see how to add speech recognition capabilities to your application!

Think "user experience"

When you develop an application using Kinect, you should also try to think as if you were a user. How are you going to use your Kinect? Do you want the user to take a specific posture or to perform a gesture in order to trigger an action? Is it a repetitive task?

Keep in mind that gestures aren't easy to recognize with high confidence, because everybody performs them differently. Remember also that the range of gestures is limited! You can easily discern a right-hand wave from a left-hand wave, but some other gestures aren't that easy. For example, to distinguish a fist from an open hand, you need to process the image stream yourself!

Also, how would you trigger several actions at the same time?

If, in the end, your gestures are too complicated or difficult to execute, your users will quickly grow weary of them, and that results in a bad UX.

SDK 1.0 and speech recognition

By installing the SDK 1.0, you also installed the Microsoft Kinect Speech Recognition Language Pack (en-US). For the moment, this pack is only available in English, but more languages have been announced for release 1.5, such as French, Italian, Japanese, and Spanish.

You should know that speech recognition doesn't require a Kinect: you could do it with any microphone.
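
For instance, here is a minimal sketch (assuming a speechRecognizer field created as shown later in this post) that listens to the default microphone instead of the Kinect:

            // Minimal sketch: the same recognizer can read from the default
            // microphone instead of the Kinect audio stream.
            // (Assumes speechRecognizer was created as shown later in this post.)
            speechRecognizer.SetInputToDefaultAudioDevice();
            speechRecognizer.RecognizeAsync(RecognizeMode.Multiple);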

Hands-on

As usual, here is a simple project that you can use to follow this lab:

WpfKinect - 5 - Speech Recognition (sample project)

For this example, you can use any of your existing projects. I chose to reuse a previous project that draws the skeletons of the two tracked users. For each tracked user, we draw spheres for the hands and the head.

To that basic application, we will add speech recognition capabilities to change the shape used to draw the skeletons (circles or squares), and the color of each user.

The goal is to be able to command the application with a phrase such as "Use red squares for player one!" and see player one updated accordingly.

1/ Initialize the Recognizer

First of all, you'll have to add a reference to the Microsoft.Speech.dll assembly. Be careful to use that one and not System.Speech.dll: the latter doesn't give you access to the installed Kinect recognizer.

The SDK documentation gives us a helper method to retrieve the Kinect recognizer. We need this method because you probably have more than one recognizer installed, and we want to use the Kinect one.

This method creates a function that takes a RecognizerInfo as a parameter and returns a Boolean. The RecognizerInfo is an object describing a speech recognition engine installed on your machine. It has a dictionary property containing information about the recognizer, such as the supported cultures and languages, the name, the version number, and so on.

This function returns true if the value of the "Kinect" key in the AdditionalInfo dictionary is "True", and if the culture is "en-US" (the only one available for now).

Then we use this function in a LINQ query: for each installed recognizer, we check whether it fits the criteria, and we return the first one that matches.

        private static RecognizerInfo GetKinectRecognizer()
        {
            // Matches recognizers that are flagged as Kinect recognizers
            // and that target the en-US culture.
            Func<RecognizerInfo, bool> matchingFunc = r =>
            {
                string value;
                r.AdditionalInfo.TryGetValue("Kinect", out value);
                return "True".Equals(value, StringComparison.InvariantCultureIgnoreCase)
                    && "en-US".Equals(r.Culture.Name, StringComparison.InvariantCultureIgnoreCase);
            };

            // Return the first installed recognizer that fits the criteria, or null.
            return SpeechRecognitionEngine.InstalledRecognizers().Where(matchingFunc).FirstOrDefault();
        }

Here, we add a method to instantiate the SpeechRecognitionEngine. If an error occurs, we display a message box (this code also comes from the SDK documentation).

        private void InitializeSpeechRecognition()
        {
            RecognizerInfo ri = GetKinectRecognizer();

            if (ri == null)
            {
                MessageBox.Show(
                    @"There was a problem initializing Speech Recognition.
Ensure you have the Microsoft Speech SDK installed.",
                    "Failed to load Speech SDK",
                    MessageBoxButton.OK,
                    MessageBoxImage.Error);
                return;
            }

            try
            {
                speechRecognizer = new SpeechRecognitionEngine(ri.Id);
            }
            catch
            {
                MessageBox.Show(
                    @"There was a problem initializing Speech Recognition.
Ensure you have the Microsoft Speech SDK installed and configured.",
                    "Failed to load Speech SDK",
                    MessageBoxButton.OK,
                    MessageBoxImage.Error);
            }
            if (speechRecognizer == null)
                return;

            // Add the rest here!
        }
2/ Organize your keywords

To make the code clearer, we will map each keyword to a meaningful value:

        #region Phrase mapping

        private Dictionary<string, Shape> Shapes = new Dictionary<string, Shape>
        {
            { "Circle", Shape.Circle },
            { "Square", Shape.Square },
        };

        private Dictionary<string, SolidColorBrush> BgrColors = new Dictionary<string, SolidColorBrush>
        {
            { "Yellow", Brushes.Yellow },
            { "Blue", Brushes.Blue },
            { "Red", Brushes.Red },
        };

        private Dictionary<string, int> PlayerIndexes = new Dictionary<string, int>
        {
            { "One", 0 },
            { "Two", 1 },
        };

        #endregion

The Shape enumeration contains all the shapes supported by the application. If we want to add more shapes, they will appear here:

    public enum Shape
    {
        Circle,
        Square
    }
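
For instance, supporting a hypothetical Triangle shape (not in the original project) would only require touching the enumeration and the Shapes dictionary; the grammar built in the next step picks the new keyword up automatically:

    // Hypothetical extension: a third shape, declared in the enumeration...
    public enum Shape
    {
        Circle,
        Square,
        Triangle
    }

    // ...and added as a matching entry in the Shapes dictionary:
    // { "Triangle", Shape.Triangle },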
3/ Build the grammar

At the end of the InitializeSpeechRecognition method, we will add some code to build the phrase that we expect the user to say. We will build Choices objects; each one defines the set of alternatives accepted at a given position in the phrase.

Finally, we build the phrase from "static" values and Choices. For example, the sentence should always start with "Use", followed by any of the possible colors, any of the possible shapes, and so on...

            // Create choices containing values of the lists
            var shapes = new Choices();
            foreach (string value in Shapes.Keys)
                shapes.Add(value);

            var colors = new Choices();
            foreach (string value in BgrColors.Keys)
                colors.Add(value);

            var playerIndexes = new Choices();
            foreach (string value in PlayerIndexes.Keys)
                playerIndexes.Add(value);

            // Describes what the phrase should look like
            var gb = new GrammarBuilder();
            // Specify the culture to match the recognizer, in case we are running in a different culture.
            gb.Culture = ri.Culture;
            // It should start with "Use"
            gb.Append("Use");
            // And then we should say any of the colors value
            gb.Append(colors);
            // Then one of the two possible shapes
            gb.Append(shapes);
            // then again the words "for player"
            gb.Append("for player");
            // and finally the player that we want to update
            gb.Append(playerIndexes);

            // Create the actual Grammar instance, and then load it into the speech recognizer.
            var g = new Grammar(gb);

            speechRecognizer.LoadGrammar(g);

We can easily build more complex grammars! The whole GrammarBuilder could itself be added to a Choices object, as one alternative alongside other, even more complex GrammarBuilders!
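
As an illustration, here is a sketch of that idea (the "Reset everything" phrase is a hypothetical command added for this example, not part of the sample project):

            // Sketch only: combine two complete phrases into a single grammar.
            // "Reset everything" is a hypothetical command, for illustration.
            var updateCommand = new GrammarBuilder { Culture = ri.Culture };
            updateCommand.Append("Use");
            updateCommand.Append(colors);
            updateCommand.Append(shapes);
            updateCommand.Append("for player");
            updateCommand.Append(playerIndexes);

            var resetCommand = new GrammarBuilder("Reset everything") { Culture = ri.Culture };

            // Either complete phrase is now accepted by the same grammar.
            var anyCommand = new Choices(updateCommand, resetCommand);
            speechRecognizer.LoadGrammar(new Grammar(new GrammarBuilder(anyCommand)));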

4/ Start the recognition

Then, we can subscribe to the speech recognizer events:

            speechRecognizer.SpeechRecognized += speechRecognizer_SpeechRecognized;
            speechRecognizer.SpeechHypothesized += speechRecognizer_SpeechHypothesized;
            speechRecognizer.SpeechRecognitionRejected += speechRecognizer_SpeechRecognitionRejected;

Next, we start the Kinect audio stream, set it as the input source of the speech recognizer, and finally start the recognition!

            var audioSource = this.Kinect.AudioSource;
            audioSource.BeamAngleMode = BeamAngleMode.Adaptive;
            var kinectStream = audioSource.Start();

            speechRecognizer.SetInputToAudioStream(
                    kinectStream, new SpeechAudioFormatInfo(EncodingFormat.Pcm, 16000, 16, 1, 32000, 2, null));
            speechRecognizer.RecognizeAsync(RecognizeMode.Multiple);

The RecognizeMode enumeration indicates whether you want the recognition to stop after a first recognition event (Single) or to continue until you stop it manually (Multiple).
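
When your application closes, you will also want to stop everything cleanly. Here is a minimal cleanup sketch (the StopSpeechRecognition method name is mine, not from the sample project):

        private void StopSpeechRecognition()
        {
            if (speechRecognizer != null)
            {
                // Completes the recognition in progress before stopping;
                // RecognizeAsyncCancel() would abort it immediately instead.
                speechRecognizer.RecognizeAsyncStop();
            }

            if (this.Kinect != null)
            {
                // Stop capturing audio from the Kinect microphone array.
                this.Kinect.AudioSource.Stop();
            }
        }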

5/ Process the results

Now that the recognition has started, we can process what the Kinect heard. The first two event handlers are fired, respectively, when a phrase is rejected (because the confidence is too low), and when a phrase is hypothesized, which means the engine has recognized something that is still ambiguous because it matches several accepted results. In both cases, we just display the result text.

        void speechRecognizer_SpeechRecognitionRejected(object sender, SpeechRecognitionRejectedEventArgs e)
        {
            Console.WriteLine("Rejected: " + e.Result.Text);
        }

        void speechRecognizer_SpeechHypothesized(object sender, SpeechHypothesizedEventArgs e)
        {
            Console.WriteLine("Hypothesized: " + e.Result.Text);
        }

The event that we really need is SpeechRecognized. It gives us a result with a Confidence property. The confidence indicates how likely it is that this result is the right one compared to the other possibilities. Those possibilities are stored in the Alternates property, which contains a collection of RecognizedPhrase objects.

The recognized phrase is stored as a string in the Text property, and as a collection of RecognizedWordUnit in the Words property. Each of those words has its own Confidence property.
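
For instance, this small sketch (for illustration only, inside the SpeechRecognized handler) dumps each word with its individual confidence:

            // Illustration only: inspect the confidence of each recognized word.
            foreach (RecognizedWordUnit word in e.Result.Words)
            {
                Console.WriteLine(word.Text + ": " + word.Confidence);
            }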

In this last code sample, we analyze the result and modify the settings of the corresponding user:

        void speechRecognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            // Confidence indicates the likelihood that a phrase is recognized correctly
            // compared to alternatives. 0.8 confidence doesn't mean 80% chance of being correct.
            if (e.Result.Confidence < 0.50)
            {
                Console.WriteLine("Rejected: " + e.Result.Text + ", " + e.Result.Confidence);
                if (e.Result.Alternates.Count > 0)
                {
                    // Print the alternatives
                    Console.WriteLine("Alternates available: " + e.Result.Alternates.Count);
                    foreach (RecognizedPhrase alternate in e.Result.Alternates)
                    {
                        Console.WriteLine("Alternate: " + alternate.Text + ", " + alternate.Confidence);
                    }
                }
                return;
            }
            Console.WriteLine("Recognized: " + e.Result.Text + ", " + e.Result.Confidence);

            // The accepted phrase is "Use {color} {shape} for player {index}",
            // so the interesting words sit at fixed positions:
            // 1 = color, 2 = shape, 5 = player index.
            var index = PlayerIndexes[e.Result.Words[5].Text];

            var playerConfig = playerConfigs[index];

            if (playerConfig != null)
            {
                playerConfig.Brush = BgrColors[e.Result.Words[1].Text];

                playerConfig.Shape = Shapes[e.Result.Words[2].Text];
            }
        }

Comments (6)

Juan
5/30/2012 5:05:02 PM

Hi, I have a problem with "the name 'speechRecognizer' doesn't exist in this context" and I want to understand it. Can you help me?

Renaud
5/30/2012 10:05:45 PM

Hi Juan,

Did you add a reference to the Microsoft.Speech.dll assembly to your project?

Geraldine
8/22/2012 8:08:33 AM

Hi Renaud,
Thanks for explaining the speech recognition so well! Unfortunately I can't get past the first error message: "there was a problem..."! I've downloaded your project and it runs fine for me, and so does the sample Shapes project, so I can't see what the problem is. I've added the Microsoft.Speech.dll and it builds fine too. Any ideas what might be wrong?

Thanks,
Geraldine

Renaud
8/22/2012 9:08:35 AM

Hi Geraldine,

So obviously the GetKinectRecognizer() method returns null. Did you put a breakpoint in there to check which recognizers are available?

This line will give you the collection of installed recognizers:
SpeechRecognitionEngine.InstalledRecognizers()

Also, could you compare the location (by double-clicking the reference in Visual Studio) of the Microsoft.Speech.dll assembly in one of the working projects and in your project?

lunalightsworld
10/26/2012 4:10:23 PM

Hi, I'm a beginner with Kinect.

In your tutorial the keywords are in English. If I want to add keywords in another language, for example Indonesian, can the Kinect recognize my keywords in my language (Indonesian) correctly (not with an English accent)?

Or can I make a keyword from a sound file, for example a .wav?

Thanks for your help!

*Sorry for my bad English

Renaud
11/6/2012 10:11:39 PM

I don't think it will work correctly with a different language. With the SDK 1.5 they added support for English/Great Britain, English/Ireland, English/Australia, English/New Zealand, English/Canada, French/France, French/Canada, Italian/Italy, Japanese/Japan, Spanish/Spain and Spanish/Mexico.

You should try one of those languages ^^ I wish you success!

Source: blogs.msdn.com/.../what-s-ahead-a-sneak-peek.aspx

About the author

I'm a developer, blog writer, and author, mainly focused on Microsoft technologies (but not only!). I've been a Microsoft MVP in Client Development since July 2013.

I'm currently working as an IT Evangelist with an awesome team at the Microsoft Innovation Center Belgique, where I spend time and energy helping people to develop their projects. I also give training to enthusiastic developers and organize afterworks with the help of the Belgian community.

Take a look at my first book (French only): Développez en HTML5 pour Windows 8
