Using Speech APIs in Windows Phone 8

When the app uses different PhraseList elements for the voice commands, the speech recognition is pretty accurate. You can also use lists to constrain the text against which the speech recognizer must match. This significantly improves accuracy. For example, each recipe requires a specific skill level and you don't want the user to have to select the skill level from a dropdown list. You can provide the list to the speech recognizer in a similar way to how you defined the elements for the PhraseList.

The following lines constrain the possible results of speech recognition to the values included in the skillLevels List<string> by calling the skillLevelRecognizer.Recognizer.Grammars.AddGrammarFromList method:

var skillLevels = new List<string>()
{
    "good cooker",
    "great chef"
};

using (var skillLevelRecognizer = new SpeechRecognizerUI())
{
    // Prompt and example text shown in the recognition UI
    skillLevelRecognizer.Settings.ListenText = "Which is the skill level required for this recipe?";
    skillLevelRecognizer.Settings.ExampleText = string.Join(", ", skillLevels);

    // Constrain recognition to the phrases in the list
    skillLevelRecognizer.Recognizer.Grammars.AddGrammarFromList("skillLevel", skillLevels);
    var result = await skillLevelRecognizer.RecognizeWithUIAsync();

    if ((result.ResultStatus == SpeechRecognitionUIStatus.Succeeded)
        && (result.RecognitionResult.TextConfidence != SpeechRecognitionConfidence.Rejected))
    {
        var skillLevel = result.RecognitionResult.Text;
    }
}

The code creates a new instance of Windows.Phone.Speech.Recognition.SpeechRecognizerUI, sets the prompt to "Which is the skill level required for this recipe?", and builds the example text by joining the strings in the list (Figure 9). This way, the user knows what to say.

Windows Phone 8 App Development Part 3
Figure 9: A speech recognition session asking the user for the skill level the recipe requires.

If the speech recognizer understands what you said, it displays the recognized text and the phone's voice reads it back to you (Figure 10). Constraining the recognizer to the list makes recognition noticeably more accurate.

Figure 10: The speech recognition results provide feedback to the user.

You can also add your own grammar definitions to the speech recognizer by using an XML file that conforms to the Speech Recognition Grammar Specification (SRGS) W3C standard. With SRGS, you can improve recognition accuracy in complex scenarios. If you want to dive deeper into SRGS, check out the SRGS 1.0 specification.
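As a rough illustration, the skill-level list used earlier could be expressed as an SRGS grammar along the following lines. This is a minimal sketch based on the XML form of SRGS 1.0; the rule name skillLevel and the layout of the file are assumptions made for illustration and are not taken from the sample app.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Hypothetical SRGS grammar: the rule name "skillLevel" is an illustrative assumption -->
<grammar version="1.0" xml:lang="en-US" root="skillLevel"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="skillLevel" scope="public">
    <!-- The recognizer must match exactly one of these phrases -->
    <one-of>
      <item>good cooker</item>
      <item>great chef</item>
    </one-of>
  </rule>
</grammar>
```

For a flat list of phrases like this one, the list-based grammar is simpler. An SRGS file becomes worthwhile when phrases contain optional words or nested alternatives, which SRGS can express with the repeat attribute on item elements and with references between rules.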

Providing a Voice Response with Text-to-Speech

If you want to have an app that provides a voice-driven UX, you must use Text-to-Speech, also known as TTS, in order to turn text into spoken words. If the user is speaking to the phone, he won't want to read the output on the screen. Instead, he will expect the phone to provide voice feedback for each interaction.

The basic use of TTS is pretty simple. Add the following using statement to your code:

using Windows.Phone.Speech.Synthesis;

Now you only need to create a new instance of Windows.Phone.Speech.Synthesis.SpeechSynthesizer and await its asynchronous SpeakTextAsync method, passing the text that the phone's voice must read back to the user. The following lines show an example of TTS informing the user that the recipe with a specific main element has been added to the wish list:

  var mainRecipeElement = "tomatoes";

  var speechSynthesizer = new SpeechSynthesizer();
  await speechSynthesizer.SpeakTextAsync(string.Format("I've added the new recipe with {0} to your wish list.", mainRecipeElement));

The SpeakTextAsync method is useful when you want the phone's voice to read one sentence. However, if you want the phone to read all the necessary steps for a recipe, you probably want to introduce breaks between the steps. The speech synthesizer supports the W3C Speech Synthesis Markup Language (SSML) standard with minor differences. You can use SSML to provide hints to the synthesizer on how to read the text.

The following lines show a simple example of three recipe steps that the code uses to generate an SSML string, which the synthesizer will read:

var recipeSteps = new List<string>()
{
    "Cut one tomato into 5 pieces",
    "Add olive oil to the tomato's pieces",
    "Cut three small potatoes"
};

var recipeSSMLBuilder = new System.Text.StringBuilder();
recipeSSMLBuilder.Append("<speak version=\"1.0\" xml:lang=\"en-us\">");
foreach (var step in recipeSteps)
{
    // Append each step followed by a one-second pause
    recipeSSMLBuilder.Append(string.Format("{0}{1}", step, "<break time=\"1s\" />"));
}
// Close the root element so the SSML is well-formed XML
recipeSSMLBuilder.Append("</speak>");

var recipeSSML = recipeSSMLBuilder.ToString();

var speechSynthesizer = new SpeechSynthesizer();
await speechSynthesizer.SpeakSsmlAsync(recipeSSML);

I've used a StringBuilder that will produce the following SSML XML when converted to a string (shown here with line breaks and indentation added for readability):

<speak version="1.0" xml:lang="en-us">
    Cut one tomato into 5 pieces<break time="1s" />
    Add olive oil to the tomato's pieces<break time="1s" />
    Cut three small potatoes<break time="1s" />
</speak>

This way, the speech synthesizer pauses for one second after reading each recipe step. Once the SSML XML is built, the code creates a new instance of Windows.Phone.Speech.Synthesis.SpeechSynthesizer and awaits its asynchronous SpeakSsmlAsync method, passing the SSML XML. SSML allows you to further customize the speech output. If you want to dive deeper into SSML, consult the SSML 1.0 specification.
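To give a sense of what further customization might look like, the following fragment marks up two of the recipe steps with sentence, emphasis, and prosody elements defined in the SSML 1.0 specification. Treat it as a sketch: the choice of elements is illustrative, and how closely the Windows Phone synthesizer honors each one is an assumption that is worth checking on a device.

```xml
<speak version="1.0" xml:lang="en-us">
  <!-- Sketch only: elements come from the SSML 1.0 spec; platform support is assumed, not verified -->
  <p>
    <!-- Stress the quantity so it is easier to catch while cooking -->
    <s>Cut one tomato into <emphasis>5</emphasis> pieces</s>
    <break time="1s" />
    <!-- Read this step more slowly than the default rate -->
    <s><prosody rate="slow">Add olive oil to the tomato's pieces</prosody></s>
  </p>
</speak>
```

A string with this content could be passed to SpeakSsmlAsync in the same way as the generated recipe SSML.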

By using voice commands, speech recognition, and TTS capabilities, you can provide a complete speech-driven UX in Windows Phone 8 apps. Because many Windows Phone 8 apps take advantage of the speech features by default, users are expecting more apps that provide similar experiences. Give 'em what they want, and they'll be happy customers!

Gaston Hillar is a frequent contributor to Dr. Dobb's.

Related Articles

Windows Phone 8 App Development: Getting Started

Windows Phone 8 App Development: Using Voice Commands
