
SALT: The Speech Application Markup Language


June 2004

The convergence of speech technologies, the Web, and PCs

Robert is a principal software engineer for Vytek. He can be contacted at hartmanr@hotmail.com.


Nearly everyone is familiar with using the telephone to conduct business with an automated service, such as checking bank account balances or verifying airline flight schedules. Many of these systems let callers speak answers to questions or selections from a list of options, and they often respond to the caller's spoken words with prerecorded or synthesized speech of their own. These Interactive Voice Response (IVR) systems have been around for a number of years and were many people's first experience of speech technology.

However, after the World Wide Web took off in the 1990s and web technologies became standardized and widespread, speech technology developers began looking for ways to integrate speech with the Web. The most natural place for this to first occur was in the blending of previously proprietary IVR/telephony engines with the power of web server infrastructures. It quickly became clear that there was a need for a domain-specific, standardized markup language based on XML. This would enable the definition of semantics and flow control logic for IVR applications in a web context. The response to this need was the development of VoiceXML by the VoiceXML Forum (http://www.voicexml.org/), a consortium established in 1999 by AT&T, IBM, Lucent, and Motorola, which now includes hundreds of member companies.

The introduction of VoiceXML not only produced more open and flexible IVR solutions, it also allowed speech-based access to web applications. As a case in point, several years ago, I developed a digital media commerce application that was accessible through a standard commerce web site and also through a speech-based interface. The speech interface was implemented in VoiceXML and let users browse, purchase, and initiate downloads from the web site, all via a telephone.

Parallel to these developments in speech and web integration, personal computers were becoming powerful enough to handle both of the primary tasks of speech technology: speech recognition (speech as input) and speech synthesis (speech as output). This is no small feat given that, as recently as the 1970s, it took over 50 computers for Carnegie Mellon University's HARPY system to perform reliable recognition of continuous speech (that is, speech at a natural, conversational pace).

As personal computers matured, and as speech technologies themselves matured, computer users began to experience speech technologies on their own machines. Today, the average computer user has access to speech products from a wide variety of vendors, supporting activities that range from controlling a computer with spoken commands to dictating documents directly into a word processor.

Software developers also reaped the benefit of the integration of speech technologies with the personal computer. A wide variety of options for integrating speech technologies into applications now exists for the developer. These range from SDKs provided by platform vendors, such as Microsoft's Speech API (SAPI) and Apple's PlainTalk, to third-party tools such as the speech and biometrics SDKs from NeuVoice.

With this increasing power of personal computers and their ability to handle speech technologies, a new concept in speech integration began to emerge: multimodal applications. "Multimodal" refers to an application's ability to support client-side user interaction through voice, in addition to the more standard client input/output mechanisms of keyboard, mouse, and screen. As web browsers became increasingly popular and capable of presenting rich user interfaces, it became apparent that the development of multimodal applications required a tighter integration of voice technologies with the data and execution models that permeated the Web, such as event-driven scripting languages, the Document Object Model (DOM), and HTML.

Though VoiceXML was moving to extend itself to provide this level of integration, some in the industry felt that a fundamentally different model was required. This led to the development of Speech Application Language Tags (SALT) by the SALT Forum (http://www.saltforum.org/), established in 2002 by Microsoft, Intel, Cisco, and Philips, and now including over 70 member companies. The stated goal of the SALT effort is to integrate speech technologies into a broad range of user-oriented computing devices, from PCs to PDAs, through the enabling of multimodal applications. To accomplish this, SALT has taken a different approach than VoiceXML.

How VoiceXML and SALT Differ

VoiceXML was originally intended to provide a comprehensive environment for defining speech applications. As such, it specifies elements for defining the data (forms and the fields within forms) and execution path (dialog flow), and provides a self-contained execution environment for interpreting VoiceXML markup at runtime. Speech/web integration is typically provided by interactions between VoiceXML servers and web servers.

For example, in a typical VoiceXML application, the telephone functions as a web browser, with the user's speech input being sent to a VoiceXML server that interprets the speech against markup documents consisting of VoiceXML elements. The markup in the document can instruct the VoiceXML server to connect to a URL on a web server, where web script (JavaScript, for example) on the web server interacts with the web site/application and responds with appropriate VoiceXML markup. The VoiceXML server receives the markup from the web server, interprets it, and sends appropriate speech output over the telephone to users.
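To make the contrast with SALT concrete, here is a minimal, illustrative VoiceXML fragment (the field name, grammar file, and URL are placeholders, not drawn from any particular application). Note how the form, the field, the prompt, and the dialog flow all live inside the VoiceXML document itself, with the recognized value submitted to a web server that replies with more VoiceXML:

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="balanceQuery">
    <!-- The field defines both the data item and the dialog step. -->
    <field name="account">
      <prompt>Please say your account number.</prompt>
      <grammar src="account.grxml" type="application/srgs+xml"/>
      <filled>
        <!-- Hand the recognized value to a web server, which responds
             with further VoiceXML for the interpreter to execute. -->
        <submit next="http://example.com/balance" namelist="account"/>
      </filled>
    </field>
  </form>
</vxml>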

In contrast, SALT is intended to provide a minimal environment for defining speech applications. It is restricted to the definition of data and behavior specific to the speech interface, such as listening for input and specifying grammars to be used in interpretation of that input. All other data, such as the definition of forms and form elements, is left to the markup language within which the SALT is embedded, such as HTML.

To provide this level of integration, SALT tags are implemented as a set of XML elements defined to function as natural extensions of existing markup languages, such as HTML, XHTML, and WML. Further, SALT elements expose DOM interfaces, which allow them to function as full-fledged members of the data model of the hosting page. This means that they possess methods, properties, and event handlers that can be referenced directly from ECMA-compliant scripting languages such as JavaScript in the same way as any other element in a DOM-compliant web page.
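For instance, using the element ids that appear in the sample later in this article (the function names here are purely illustrative), page script can drive SALT elements just as it drives any other scriptable element:

<SCRIPT language="JavaScript">
    // SALT elements are DOM members: they can be referenced by id and
    // expose methods such as Start() and Stop() directly to script.
    function BeginDialog()
    {
        promptWelcome.Start();   // begin TTS playback of the prompt
    }
    function CancelDialog()
    {
        promptWelcome.Stop();    // halt playback
        listenTag.Stop();        // halt recognition
    }
</SCRIPT>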

These design features enable SALT to participate in the runtime execution environment of the host, typically the event-driven script engines of web browsers. In short, SALT is specifically designed to fit directly into the data model and execution environments used by today's web developers. The complete SALT specification is available at http://saltforum.org/devforum/default.asp.

SALT in Action

To see how all this looks in practice, I present a basic multimodal application using SALT. I will walk through four of the most important tag elements in the SALT specification, then assemble them all into a simple application.

First, you need a way of prompting users for input. The <prompt> SALT element is used to specify content spoken to the user, such as a greeting, instructions, or confirmation of actions. The <prompt> element allows the specification of prompt resources such as text to be spoken or audio files to be played. While audio files are simply played back to provide the output, text is transformed into speech by the text-to-speech (TTS) engine provided by the platform environment on which the application is running.

If text is used as a prompt resource, it can be augmented with markup that complies with the W3C's Speech Synthesis Markup Language (SSML) specification (http://www.w3.org/TR/speech-synthesis/), which lets you specify in detail how the TTS engine renders the text. The <prompt> element also provides methods to start, stop, pause, and resume playback, along with a number of event handlers. Example 1 uses the <prompt> element to speak the text "Welcome to a SALT multimodal application." An inline value specifies the text to be rendered as speech, and an oncomplete event handler starts a named <listen> element when playback is complete.
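A <prompt> element matching that description (borrowing the element id and oncomplete handler from Listing One later in this article) would look something like this:

<salt:prompt id="promptWelcome" oncomplete="listenTag.Start()">
    Welcome to a SALT multimodal application.
</salt:prompt>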

The <listen> element provides control over how a speech recognizer (SR), provided as part of the platform environment, translates speech into text and how the recognition results should be handled. The output from an SR is outside the scope of the SALT specification, but it is expected to conform to an XML-based interoperability specification such as the W3C's Semantic Markup Language (SML) (http://www.w3.org/TR/2000/WD-nl-spec-20001120). Like the <prompt> element, the <listen> element provides methods for starting and stopping speech recognition, and it raises several recognition events covering successful recognition, timeouts, and recognition errors. Example 2 shows the <listen> element in use. Two child elements tell the listener how it should operate: The <grammar> element specifies recognizable tokens for the SR to use in recognition, essentially telling the listener what spoken utterances the application will accept, and the <bind> element indicates that a specific value from the SR's output should be assigned to another page element, such as an HTML control.
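The <listen> element from Listing One (later in this article) illustrates this structure:

<salt:listen id="listenTag" onreco="ProcessInput()">
    <salt:grammar src="TagGrammar.xml" />
    <salt:bind value="//Tag" targetElement="selectTag" />
</salt:listen>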

The <grammar> element specifies exactly what the SR can recognize and it can be thought of as a candidate list. The SR translates the digital signal representing the speaker's voice into phonemes that are matched against the phonetic representation of the words in the candidate list specified in the grammar. If a match is found, then the result specified by the grammar for that candidate is returned. The grammar itself can either be specified inline or can exist in a separate file pointed to by the <grammar> element's src attribute.

The specifics of grammar definition are outside the scope of SALT but, in general, as with SR output, grammars are expected to conform to an XML-based interoperability specification. The W3C's Speech Recognition Grammar Specification (SRGS) (http://www.w3.org/TR/2003/PR-speech-grammar-20031218) is one such specification.

The previous <listen> element example references a grammar contained in the external file "TagGrammar.xml"; see Example 3 for a listing of this grammar file. While a detailed explanation of grammar syntax is beyond the scope of this article (see the W3C's SRGS specification for more information), note that the line:

<rule id="Tag" scope="public">

begins the definition of a grammar rule that identifies the words "prompt," "listen," "grammar," and "bind" as the candidate list of recognizable tokens. The rule also specifies (by the <tag>$._value=...</tag> for each candidate item) the actual text value that is returned to the application when a spoken word is recognized as corresponding to a particular item, and it indicates that the value should be returned in a <tag> node within the XML output from the recognizer. For example, if a user speaks the word "listen," the SR should be able to make a phonetic match between the utterance and the listen item in the grammar, at which point the returned XML will contain the node <tag>listen</tag>, in accordance with the listen item's definition in the grammar.
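A TagGrammar.xml along the following lines fits that description. The header attributes shown here (the SRGS namespace, root rule, and Microsoft-style tag-format declaration) are assumptions rather than a copy of Example 3; the rule body reflects the candidate items and <tag> values discussed above:

<grammar version="1.0" xml:lang="en-US" root="Tag"
         xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics-ms/1.0">
    <rule id="Tag" scope="public">
        <one-of>
            <item>prompt <tag>$._value = "prompt"</tag></item>
            <item>listen <tag>$._value = "listen"</tag></item>
            <item>grammar <tag>$._value = "grammar"</tag></item>
            <item>bind <tag>$._value = "bind"</tag></item>
        </one-of>
    </rule>
</grammar>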

The <bind> element is a child element typically used to assign a value from the recognition results to another element on the page, such as an HTML text box. More specifically, the value can be assigned to another element within the scope of the DOM within which this <bind> element is a member. The line (repeated from the previous <listen> example):

<salt:bind value="//Tag" targetElement="selectTag" />

indicates that the value found in the <tag> node of the XML returned from the recognizer (note the standard XPath syntax used to reference the node in the value attribute) should be assigned to the page element identified by the targetElement attribute; in this case, the element identified as selectTag.

A Multimodal Application Sample

With these four primary SALT elements, we can assemble a sample application as shown in Listing One. Note first that the page must reference the SALT namespace by the initial line <HTML xmlns:salt="http://www.saltforum.org/2002/SALT">. With that important detail out of the way, take a look at how the markup operates. First, when the page loads into a browser, the line:

<body onload="promptWelcome.Start()">

starts playback of the prompt element identified as promptWelcome. This results in the TTS engine speaking the words "Welcome to a SALT multimodal sample. You may select a SALT tag from the dropdown list either by speaking or by using a mouse." Then, after the playback of the prompt is complete, the oncomplete event of the promptWelcome element starts the <listen> element identified in the DOM as listenTag. This starts the SR, which listens for speech input. In both these cases, note that SALT taps directly into the page's event model. When the SR detects input, it attempts to recognize it based on the grammar loaded by the <grammar> child element of the listener that started the SR. In this case, the grammar is found in a file named "TagGrammar.xml," as referenced by:

<salt:grammar src="TagGrammar.xml" />

When recognition occurs, the following <bind> child element of the listener is executed:

<salt:bind value="//Tag" targetElement="selectTag" />

This assigns the value of the <tag> node found in the XML output of the recognizer to the selectTag element, which is defined in the HTML of this example as a drop-down list. This causes the corresponding option in the HTML SELECT control to be selected. After the bind is executed, a successful recognition event is caught by the listen element's onreco event handler, which calls the JavaScript function ProcessInput(). The script in ProcessInput() then displays a definition for the text currently displayed in the HTML SELECT control. Note that ProcessInput() is also called by the onchange event handler of the SELECT control, which is fired when the user selects an option by using the mouse or keyboard. When this occurs, the calls in ProcessInput() to stop any prompt or listener activity result in a form of "barge-in," where a user is able to interrupt the speech interface by choosing to use the mouse and keyboard to interact with the application. The script then displays the text definition just as it does when called by the <listen> element's onreco event handler.

The net effect of this simple multimodal application is to demonstrate how SALT enables speech input to control page elements while still allowing the user to interact with those elements in the traditional mouse-and-keyboard manner. From a programming perspective, it also illustrates the seamless interactions between SALT elements and the data model and execution environments of a standard web page.

Serving SALT With Microsoft Web Technologies

The tight integration of SALT with the prevalent data model and execution environments of the Web (HTML, DOM, XML, and scripting) also supports the full integration of SALT into the development environments used by many web developers. Microsoft's incorporation of SALT tools into ASP.NET through the Speech Application SDK (SASDK) for .NET is a case in point. Microsoft's support for SALT in .NET and Visual Studio is a topic for another time. For now, let's see how Microsoft supports SALT at runtime in the context of Internet Information Server (IIS) and Internet Explorer (IE), with no dependencies on .NET or the SASDK. It takes just a few steps to get our sample SALT application running in this environment.

First, on the desktop client (mobile clients such as Pocket PC devices require special considerations beyond the scope of this article), you must install the Microsoft Speech Add-in for Internet Explorer. This provides DLLs that plug into IE 6.0 and perform the interpretation of SALT markup in the browser. Note that if you have installed the SASDK on the same computer (for example, you may be running your IIS server and IE client on the same machine for ease of development and testing), then this has already been done for you.

Second, create a directory on the IIS server and set the directory as an application root directory. In the new directory, create a file containing the markup and script of our simple SALT application (found in the "A Multimodal Application Sample" section of this article) and name it with an appropriate suffix, such as SimpleSALT.slt. Then create another file containing the grammar and name it appropriately, TagGrammar.xml, for example.

Finally, set the MIME type on your IIS server for the page type that contains SALT markup. For example, if your file containing the markup is named SimpleSALT.slt, then you need to associate *.slt files to the SALT MIME type. One way to do this is to go to the Internet Services Manager tool, select the web site or the application root, then open its property page, select the HTTP Headers tab, click the File Type button in the MIME Map section, and create a new type with ".slt" as the associated extension and "text/salt+html" as the content type. This mapping causes the Microsoft Speech Add-in to automatically instantiate the SALT objects that will provide a SALT implementation to IE.

You can now use your browser to navigate to the SALT-enabled web page in the application directory you created and run the application. If you "view source" in the browser, you will notice that, as a result of the MIME mapping you created, code similar to the following has been added by IIS to your web page prior to serving it to the browser:

<object id="saltobject26239220"
    CLASSID="clsid:DCF68E5B-84A1-4047-98A4-0A72276D19CC"
    VIEWASTEXT WIDTH=0 HEIGHT=0></object>
<?import namespace="salt" implementation="#saltobject26239220" />
<?import namespace="mm" implementation="#saltobject26239220" />

IIS inserted these lines to direct the IE client to instantiate the Speech Add-in that will process the SALT elements in your markup.

Conclusion

Though our discussion of SALT in this article has focused on the client side of a multimodal application, it should be noted that SALT also supports traditional voice-only telephony applications with no visual interface. In this case, where a telephone functions as the terminal, a voice-only SALT interpreter integrates with telephony and speech servers in a manner similar to the traditional VoiceXML model. In support of telephony scenarios, SALT provides elements for DTMF (touch-tone) processing, as well as call-control mechanisms such as outbound calling and call transfer.
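As a rough sketch only (the element id, grammar file, handler, and bind path below are illustrative and not taken from the SALT specification text), a DTMF handler follows the same pattern as <listen>, pairing a grammar of acceptable key sequences with bind elements that route the result into the page:

<salt:dtmf id="dtmfMenu" onreco="HandleKeys()">
    <!-- Grammar of acceptable key presses; file name is a placeholder. -->
    <salt:grammar src="MenuDigits.xml" />
    <!-- Copy the recognized choice into a page element. -->
    <salt:bind value="//Choice" targetElement="txtChoice" />
</salt:dtmf>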

Though Microsoft technologies were used in deploying the sample in this article, numerous other vendors are implementing SALT support in their products (see the SALT Forum web site for a partial list). In addition, there is even an open-source SALT implementation developed by the OpenSALT project (http://hap.speech.cs.cmu.edu/salt/) at Carnegie Mellon University, which is a contributor to the SALT Forum. OpenSALT has produced a SALT 1.0-compliant open-source browser based on the open-source Mozilla web browser and using the open-source Sphinx recognition and Festival speech synthesis software.

DDJ



Listing One

<HTML xmlns:salt="http://www.saltforum.org/2002/SALT">
    <HEAD>
        <TITLE>A SALT Multi-modal Example</TITLE>
        <SCRIPT language="JavaScript">
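            // Stop any prompt or recognition in progress (barge-in), then
            // display the definition for the currently selected SALT tag,
            // whether it was chosen by voice, mouse, or keyboard.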
            function ProcessInput() 
            {
                promptWelcome.Stop();
                listenTag.Stop();

                switch(selectTag.value) {
                    case "prompt":
                        divDefinition.innerText = "The prompt element " +
                            "is used to specify the content of audio " +
                            "output, either as inline or referenced text, " +
                            "variable values, or links to audio files.";
                        break;
                    case "listen":
                        divDefinition.innerText = "The listen element " +
                            "is used for recognition and/or recording, " +
                            "and contains one or more grammars and " +
                            "optionally a set of bind elements to inspect " +
                            "and copy input.";
                        break;
                    case "grammar":
                        divDefinition.innerText = "The grammar element " +
                            "is used to specify possible user inputs " +
                            "with rules identified either inline or " +
                            "by reference.";
                        break;
                    case "bind":
                        divDefinition.innerText = "The bind element " +
                            "is used to bind values from spoken input " +
                            "into the page, or to call methods on " +
                            "page elements.";
                        break;
                    default:
                        divDefinition.innerText = "";
                }    
            }
        </SCRIPT>
    </HEAD>
    <body onload="promptWelcome.Start()">
        Select a SALT tag to view its definition:<P>
        <SELECT id="selectTag" onchange="ProcessInput()">
            <OPTION value="prompt">prompt</OPTION>
            <OPTION value="listen">listen</OPTION>
            <OPTION value="grammar">grammar</OPTION>
            <OPTION value="bind">bind</OPTION>
            <OPTION SELECTED value="">-- Select a Tag ---</OPTION>
        </SELECT>
        <P>Definition:<P>
        <DIV id="divDefinition">
        </DIV>
        
         <salt:prompt id="promptWelcome" oncomplete="listenTag.Start()">
            Welcome to a SALT multi-modal sample. You may select a SALT tag from 
            the dropdown list either by speaking or by using a mouse.
           </salt:prompt>

        <salt:listen id="listenTag" onreco="ProcessInput()">
            <salt:grammar src="TagGrammar.xml" />
            <salt:bind value="//Tag" targetElement="selectTag" />
        </salt:listen>

    </body>
</HTML>