Examining Microsoft's Speech SDK


Jul99: Programmer's Toolchest

Peter is coauthor of Windows Undocumented File Formats and can be contacted at [email protected].


Command-and-control is a user interface that lets users interact with applications by speaking to the computer rather than using traditional input devices such as the mouse or keyboard. Several commercially available toolkits make it possible to implement command-and-control voice recognition in Win32-based programs, including IBM's VoiceType and IBM ViaVoice (see http://www.software.ibm.com/is/voicetype/dev_home.html), and Dragon Systems' DragonXTools (see "Examining the Dragon Speech-Recognition System," by Al Williams, DDJ, July 1998, and http://www.dragonsys.com/). For a comprehensive list of companies that provide speech-recognition engines and related tools, see http://www.tiac.net/users/rwilcox/speech.html.

In this article, I'll focus on the Microsoft Speech (SAPI) SDK. In addition to documentation, tools, and sample code, the Microsoft Speech SAPI SDK includes code that can be freely redistributed with applications or speech engines. At this writing, the current version of the SDK is 4.0a. The SDK itself is freely available at http://research.microsoft.com/stg/. In examining the Speech SDK, I'll demonstrate how to add command-and-control voice recognition to applications using the Microsoft SDK. The sample code (available electronically; see "Resource Center," page 5) includes a program that shows how voice control is implemented and used, plus routines and classes you can reuse in your own applications. The Speech SDK handles both command-and-control voice recognition and text-to-speech. Command-and-control can be used in almost any application and has some definite advantages, especially in applications where your hands need to be busy with something other than typing. Text-to-speech, on the other hand, can be used for the visually handicapped, as well as for verification of command-and-control commands without users needing to see the screen.

This COM-based API is extensive; covering it in full is beyond the scope of this article. For the near future, most applications will use command-and-control, and only a subset of the complete API. Still, with this subset, you can easily give users an extensive vocabulary to work with.

The Voice-Command Object

The first thing you need to do to add command-and-control voice recognition to applications is create a voice-command object. Assuming that you have voice-recognition software installed, all you need to do is call CoCreateInstance():

CoCreateInstance(CLSID_VCmd, NULL, CLSCTX_LOCAL_SERVER,
                 IID_IVoiceCmd, (LPVOID*)&pIVoiceCmd)

where CLSID_VCmd is the class ID for the voice-command object, IID_IVoiceCmd is the interface ID for the interface you want, and pIVoiceCmd is of type PIVOICECMD (defined in Microsoft's SPEECH.H header file for the Speech API). CoCreateInstance() places an indirect pointer to the voice-command object's voice-command interface in pIVoiceCmd. You can now access your voice-command object.

The Voice Notify Sink

The voice notify sink is a COM object you create as part of your application. You make this object available to the voice-command object, which then lets you know when commands are spoken via function calls to your voice notify sink object. It serves the same purpose that callback functions do in the Windows SDK -- it's just a little more tedious to write. The good news is that you can use the template CIVoiceNotifySink class I provide here in your own application with little or no modification. I've reused it for several applications with only minor changes.

You create the voice notify sink much as you do any COM object. You have the IUnknown interface (QueryInterface(), AddRef(), Release()) and several additional functions -- the most important of which is CommandRecognize() -- that are called by the voice-recognition engine. The voice-recognition engine calls CommandRecognize() when it has recognized a command from your list of commands, with parameters letting you know the specifics.

Voice-Command Menus

Voice-command menus are lists of commands your application is expecting. The reason it's called a "voice-command menu" is that the structure is similar to that of Windows menus. This is odd, because Microsoft's own research (see its Speech SDK documentation) shows that users prefer the keyboard or mouse over voice when it comes to executing menu commands. While I suppose there's some value in this menu structure, it makes things more tedious in the long run, and I have yet to find a use for it other than replicating my Windows menu structure.

Because they are tedious to build, voice-command menus are the only difficult part of the Speech API. Voice-command menus are made up of an array of VCMDCOMMAND structures. All of your VCMDCOMMAND structures need to be allocated in the same contiguous block of memory. Example 1 defines this structure. The fields are:

  • dwSize, the size, in bytes, of the VCMDCOMMAND structure (needed because the abData field is variable in size).

  • dwFlags, a set of flags; the only important one for most applications is VCMDCMD_VERIFY. Set this flag for any command you want users to verify before executing. For example, "Format C Drive" deserves verification before execution. You'd hate to drop your microphone by accident and see a dialog box with "Formatting C: Drive" pop up.

  • dwID, a nonunique ID number you can assign to a command. That it's not unique is actually quite handy. For example, you could assign the same ID to "Run Wordpad" and "Start Wordpad." This lets you have synonyms for different command words or phrases.

  • dwCommand, an offset to the command phrase itself; for example, "Open File," relative to the beginning of the VCMDCOMMAND data structure.

  • dwDescription, an offset to the description of the action performed by the command (used by some applications, but I usually leave it the same as the command itself). Also relative to the beginning of VCMDCOMMAND.

  • dwCategory, an offset to a category ("File," "Edit," and so on) in which the command belongs. This is where the similarity to Windows menus comes in. The Speech API documentation recommends 20 or fewer categories for performance reasons.

  • dwAction, an offset to the "Action" data in abData. This is essentially user-defined data that is passed to the application whenever the command is issued by users. You can put in whatever you want here.

  • dwActionSize, the size of the "Action" data in bytes.

  • abData, an array of bytes where your command, description, category, and action data are stored. Command, description, and category are all null-terminated strings (which is why the "Action" is the only one that requires a size field).

The reason voice-command menus are difficult is that all of the data in the abData field must be DWORD aligned -- the command, description, category, and action each start on a DWORD boundary, and the abData field itself is DWORD aligned as well. As I said, it's not difficult because it's complicated; it's just tedious.

To help ease this, I've created the DWALIGN macro, #define DWALIGN(len) (((len) + 3) & ~3), which takes a length parameter and rounds it up to a DWORD (32-bit) aligned length. You can pass in the length of each of your strings (command, description, and category), as well as the length of your action structure. Then add these to get the total size of your abData field.

Once you've created your array of VCMDCOMMAND structures, you're almost ready. Next, fill in an SDATA structure, which consists of two fields: dwSize, the size of your entire list of VCMDCOMMAND structures, and pData, a pointer to the beginning of those structures.

Creating the Voice-Command Menu

Before creating your voice-command menu, you must fill out two more structures. The first is the VCMDNAME structure, a simple structure that also contains two fields. The first field is szApplication, which holds the null-terminated application name string. The second field is szState, which contains the state in which the menu is considered active. Essentially, you assign a state to menus that are available at different times. For example, if you have menus for each of several dialogs in your program, you would only want the menu associated with the currently displayed dialog to be active, as commands from the other dialogs would not be valid. In my sample program, only one menu and one dialog are used, so I assign szState the name "Main."

The next structure you need to fill in is the LANGUAGE structure, which consists of a LanguageID field and an szDialect field. These tell the voice-recognition engine which language and dialect you want operating for this menu. You can pass a NULL instead of a pointer to this structure to the MenuCreate() function and it will pick the default. In my case, I use the LanguageID of LANG_ENGLISH and the dialect of "US English."

Finally, you're ready to call MenuCreate() using pIVoiceCmd->MenuCreate(&vCmdName, &language, VCMDMC_CREATE_TEMP, &pIVCmdMenu);. The third parameter is a flag whose value determines the lifetime of your menu. You can add the menu to the permanent database for your voice-recognition system, in which case your application doesn't have to rebuild the VCMDCOMMAND structures every time it runs. Also, most voice-recognition systems need some sort of training for new commands that you add; if you make your commands a permanent part of the database, you will not have to retrain the engine when you rerun your application. In this case, I've provided a sample program and don't want to waste space in your voice-recognition system's word database, so my menu exists only until the pIVoiceCmd object it's associated with is released.

The fourth and final parameter is the address of a pointer to an IVCmdMenu interface. This is how you now work with the voice-command menu. The first thing you want to do is deactivate the voice menu by calling pIVCmdMenu->Deactivate(). Once this is done, you can add your commands by calling pIVCmdMenu->Add(nCommands, sData, NULL), in which nCommands is the number of commands in the sData structure. The third parameter, which I pass as NULL in my application, is the address of a DWORD. Commands can be added to a menu one at a time or in groups. This third parameter tells you what the starting command number was for the group that you added. For example, if there are 33 commands currently in your command menu, and you add 15 more, the first of those 15 will be command number 34, and that is what will be placed in the DWORD.

Finally, you're ready to activate the voice menu using pIVCmdMenu->Activate(hWnd, NULL). The first parameter is the window handle of a window. When that window has the focus, this voice menu is active. The second parameter, which you pass as NULL, tells the engine to only recognize these commands when the recognition engine is awake. The other option, the flag VWGFLAG_ASLEEP, tells the engine to recognize this command only when the recognition engine has been put to sleep. Its purpose is to add a command that tells your recognition engine to wake up and start recognizing commands.

Back to CommandRecognize()

The CommandRecognize() function in the voice notify sink gets called each time the recognition engine recognizes one of your commands. Example 2 is the prototype for CommandRecognize().

Only a few fields in Example 2 are really important. The first is dwID, which contains the dwID you assigned to the command; this is useful for identifying the command. dwFlags tells you whether the VCMDCMD_VERIFY flag was assigned to the command. dwActionSize and pAction are the size of, and a pointer to, the Action data you placed in the VCMDCOMMAND structure. Finally, pszCommand is a pointer to the null-terminated string containing the command that was recognized. This may be useful if you've assigned a single ID to several commands.

That's all there is to it. Your program can now recognize voice commands you've created. As always, you should read the SDK documentation to become more familiar with the other aspects of the SDK. You never know, sometimes you find features that you didn't realize you needed.

The sample program (available electronically; see "Resource Center," page 5) is an example of how to use voice recognition. Figure 1 shows this program running. Because this is command-and-control, you can't say someone's name unless you add the name to the voice-command menu. Instead, I add the command "Name," which takes you to the Name field so that you can type the name. For the sex and age fields, you just say a sex and an age, and those fields are filled in. In this case, I could have said "Twenty Nine. Male." or "Male. Twenty Nine." Either way, the fields would be filled. If you say "Sex" or "Age," the focus moves to those controls to allow quick typing. Finally, there are three checkboxes; to toggle their states, simply say "Disabled," "Admin," or "Unlimited." As for the age field, I have to admit to being lazy: it only accepts ages between 18 and 53. I got tired of spelling out numbers.

DDJ


Copyright © 1999, Dr. Dobb's Journal
