VoiceXML & Instant Messaging


Jan04: VoiceXML & Instant Messaging

Moshe is a speech technology consultant and can be contacted at [email protected] and http://www.Disaggregate.com/.


When I'm at the airport and my flight is cancelled, I call my travel agent who gives me travel alternatives. But copying down new itineraries—flight numbers, confirmation numbers, addresses, and telephone numbers—can require several frantic minutes of repetition and scribbling. However, if my travel agent could send me instant messages, the interaction would be quite different. Instant messages would help me decipher complicated lists of choices. When we finished, instant messages would contain my entire itinerary.

Consequently, I decided to build a prototype system that enables speech technologies—speech recognition, text-to-speech, and speech biometrics—to interact with instant messaging. To do this, I needed a "speech server" to provide speech technologies, a "telephony server" to make/receive calls from the telephone network, and an instant messaging server. Instead of installing and maintaining my own servers, I decided to use servers that are freely available on the Internet.

Architecture

For my speech server and telephony server, I chose Voxeo's VoiceXML hosting service. VoiceXML is an XML-based specification from the W3C for "voice browsers." A voice browser uses speech technologies to interact with the user and Internet protocols to interact with information repositories; in other words, the audio equivalent of a web browser. Put another way, VoiceXML is a scripting language for speech technology applications. Voxeo's servers provide a VoiceXML interpreter, connections to the public telephone network, and all the speech technologies: recognition, text-to-speech, and biometrics. Voxeo has several competitors, all of which offer a comparable arrangement. For more information on VoiceXML, Voxeo, and similar services, see "VoiceXML and the Voice/Web Environment," by Lee Anne Phillips (DDJ, October 2001).

Voxeo's hosting service (http://www.voxeo.com/) associates a telephone number with a URL. When I call the telephone number, the VoiceXML interpreter fetches the application from that URL and starts the application. Voxeo supports developers by providing free incoming calls, free outgoing calls (within reason), and even HTTP-initiated outgoing calls. Voxeo has superb e-mail and live technical support, again free for registered developers. (One shortcoming is that Voxeo's system does not provide a syntax checker. When a VoiceXML application fails, you have to figure out where and why on your own.)

For my instant messaging (IM) server, I chose Jabber—an open-source instant messaging system with many publicly accessible servers (http://www.jabber.org/). Other presence and messaging protocols are available, but I chose Jabber because it has open-source software for clients, client APIs, and servers, as well as publicly accessible messaging servers. To the best of my knowledge, other protocols do not currently enjoy that level of community support.

Finally, I needed some way to tie all these servers together. Given VoiceXML's support of objects with URLs, I decided to use CGI scripts. My corporate web site supports CGI, so I placed CGI scripts there; I'll refer to that server as the "CGI server." The CGI scripts are written in Python, using the jabber.py package (http://jabberpy.sourceforge.net/).

Figure 1 shows the architecture of the system. When I dial the application's assigned telephone number, Voxeo's VoiceXML interpreter pulls the application from my "document server" (yes, it's actually yet another server). The VoiceXML interpreter loads the application and the application eventually determines what information I want sent to me. The VoiceXML interpreter sends that information, along with my user information, to the CGI server. The CGI server sends the message to the instant messaging server. The IM server sends it to me on my IM client. Simple, eh?

Tricking VoiceXML

VoiceXML applications consist of one or more documents. Broadly speaking, each document contains a series of forms, and each form in turn contains a series of fields. When users speak, the fields are filled in with data, and the data are then available to the application. The documents, as well as other objects such as voice files, are accessed by URIs.

In Figure 1, I need to send data collected by the VoiceXML interpreter to the CGI server; for example, the name of the city I wish to fly to. VoiceXML-based voice browsers, like their web browser counterparts, do not support data exchanges: The voice browser cannot receive simple data, only valid documents.

VoiceXML supports dynamic documents. VoiceXML's <submit> tag uses GET/POST methods to send data to a URI. In return, the interpreter receives a VoiceXML document to execute. That's a straightforward way to send information to the CGI server, but the penalty is that the interpreter will transfer control to the received document. Mixing dynamic document generation with simple data transfers struck me as a bad design choice, and doing this while attempting to debug a simple prototype would be doubly unwise.

VoiceXML also provides the <subdialog> tag that transfers control to the VoiceXML document named in the tag's URI. The subdialog document contains a <return> tag, which returns control to the original document. The <subdialog> tag supports GET/POST. I use subdialogs to move data back and forth between the speech server and the CGI server.

The VoiceXML Application

Listing One presents highlights of the VoiceXML application. (The complete source code and related files are available electronically; see "Resource Center," page 7.) Listing One(a) shows how a field collects data. The <prompt> tag gives text, which is transformed into audio by the text-to-speech resource on the speech server. The <grammar> tag creates a list of valid utterances. If users do not speak, the <noinput> tag reprompts. If the user's utterance is not understood, the <nomatch> tag gives a list of valid choices. (Do not argue with users; just reprompt with a list of valid utterances.) When the user's utterance is valid, the <filled> tag defines the action—the data collected are placed into a variable of document scope. The form-level <filled> tag then becomes active, and its <goto> transfers control to the next form.

That form, shown in Listing One(b), does some trickery. The VoiceXML interpreter queues prompts and plays them only when it reaches a <field> or <subdialog> tag. Without the form in Listing One(b), the next tag to trigger playback would be the <subdialog> tag in Listing One(c), and that trigger would fire only after the CGI script returned the subdialog document, a delay of several seconds. The delay would confuse users, because the prompt is what confirms that the utterance was correctly received. To finesse this problem, the <field> in Listing One(b) performs a bogus recognition lasting 1 millisecond with a 1-millisecond silent prompt, which unqueues the real prompt. The real prompt warns the user that there may be a short silence while the IM is sent. The <goto> transfers control to the form in Listing One(c).

The <subdialog> tags in Listing One(c) are fetched while the prompt is playing, limiting audible delays. <subdialog> sends the variables listed in the "namelist" attribute to the CGI server. The first CGI script generates flight information, and the subdialog document that it returns places flight information into the variable flightInfo. The second <subdialog> calls the CGI server with this flight information, and that CGI script sends an instant message. For both of these scripts, the subdialog documents that they return do not contain any fields or prompts.
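On the CGI side, the namelist variables have to be read back out of the POST request. The following is a minimal sketch in Python 3, not the article's script. It assumes the interpreter posts the variables as ordinary form-encoded fields under their own names; the function name read_namelist and the sample body are hypothetical.

```python
from urllib.parse import parse_qs

def read_namelist(body: str) -> dict:
    """Decode a form-encoded POST body into a {name: value} dict.

    Assumes each namelist variable arrives once, under its own name.
    """
    fields = parse_qs(body, keep_blank_values=True)
    # parse_qs returns a list per key; keep the first value of each.
    return {name: values[0] for name, values in fields.items()}

# Hypothetical body, as the interpreter might post it for Listing One(c).
body = "callerid=4125550100&flight.city=new+york&flight.timeofday=morning"
params = read_namelist(body)
print(params["flight.city"])
```

A real CGI script would read the body from standard input, using the length given in the CONTENT_LENGTH environment variable.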

Both subdialog documents throw an event called "normal" on normal completion. The VoiceXML interpreter expects something to happen when it calls a subdialog, and throwing this event satisfies the interpreter.

The Speech UI When Trouble Strikes

The subdialogs in Listing One(c) both contain <error> tags. Why not just ignore errors? After all, when web browsers fail to fetch documents, the web browser does not attempt to recover. The web browser either displays an error page sent by the remote server or puts up a local dialog box. The user usually tries again by clicking on "reload" or "back" buttons. The page display is static, and the user can spend as much time as needed to read it and puzzle out what went wrong.

But speech applications are dynamic: Users either listen or talk—either of which interferes with thinking. Speech applications usually impose a strict time limit on the user's response. There is no "back" or "reload" button on a voice browser, and in almost all telephony applications, users must reenter all data if forced to hang up and dial again.

Users judge your VoiceXML service against a live human agent, not a web browser. The union between the Internet and telephony is an inherently unnatural act—the U.S. public telephone network has 99.999 percent reliability for calls in progress, but Internet connections frequently fail. Tying the Internet to the telephone network while providing a decent user experience requires ingenuity and care.

Rather than relying on the default error handling of the VoiceXML interpreter—essentially, roll over and die—I define my own error handlers. Use the <error> tag with extreme caution, and always provide a sensible exit from the tag. In my testing, I managed to get the speech server caught in an infinite loop of errors; now I use the count attribute to break out of the loop if I get more than one error.

The CGI Server Sends a Message

The CGI script (sendInfo.cgi) that sends the instant message is available electronically. The script is written in Python and uses the jabber.py API to implement a rudimentary Jabber client. The CGI script uses a local database to associate the user's caller ID with the user's IM address.
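The caller ID lookup can be illustrated with a small table. This is a sketch in Python 3 under stated assumptions: the article does not say what kind of database the script uses, so the in-memory SQLite table, its column names, and the sample address below are all hypothetical.

```python
import sqlite3

def open_directory(rows):
    """Create an in-memory table mapping caller ID to IM address."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (callerid TEXT PRIMARY KEY, im_address TEXT)")
    db.executemany("INSERT INTO users VALUES (?, ?)", rows)
    return db

def lookup_im_address(db, callerid):
    """Return the IM address registered for callerid, or None if unknown."""
    row = db.execute(
        "SELECT im_address FROM users WHERE callerid = ?", (callerid,)
    ).fetchone()
    return row[0] if row else None

# Hypothetical directory with a single registered traveler.
db = open_directory([("4125550100", "traveler@jabber.example")])
print(lookup_im_address(db, "4125550100"))
print(lookup_im_address(db, "0000000000"))
```

Returning None for an unknown caller lets the calling script decide whether to throw an error event back to the VoiceXML application.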

To log into the server, function imLogin() creates a jabber Client object. If the object's connect() method reaches the server, the object's auth() method logs in with a previously registered username/password. The script registers callback functions to handle routine IM messages.

The imSend() function uses imLogin() to connect to the IM server, and the client object's send() method to send the message. imSend() breaks the connection immediately with the disconnect() method.
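The connect, log in, send, disconnect sequence might be sketched as follows in Python 3. RecordingClient is a stand-in written for this example so that the sketch runs without a server. It is not the jabber.py Client class, and its method signatures are assumptions. Only the method names connect(), auth(), send(), and disconnect() come from the description above.

```python
class RecordingClient:
    """Stand-in for an IM client object; it only records method calls."""

    def __init__(self, reachable=True):
        self.reachable = reachable
        self.calls = []

    def connect(self):
        self.calls.append("connect")
        if not self.reachable:
            raise IOError("server unreachable")

    def auth(self, username, password):
        self.calls.append("auth")
        return True

    def send(self, address, text):
        self.calls.append(("send", address, text))

    def disconnect(self):
        self.calls.append("disconnect")


def im_send(client, username, password, address, text):
    """Log in, send one message, and always disconnect afterward.

    Returns True if the message was handed to the client, False otherwise.
    """
    try:
        client.connect()
    except IOError:
        return False            # never connected, so nothing to disconnect
    try:
        if not client.auth(username, password):
            return False
        client.send(address, text)
        return True
    finally:
        client.disconnect()     # runs on success and on failed login alike


client = RecordingClient()
print(im_send(client, "agent", "secret", "traveler@jabber.example", "Flight 1022"))
print(client.calls)
```

The boolean result gives the calling script a simple way to decide between a "normal" event and an error event.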

The endProgram() function sends a valid VoiceXML document back to the speech server. The first lines print headers needed by HTTP and then the headers used by VoiceXML. If endProgram() is called because of an error, the resulting document will throw a relevant error event; otherwise, the document throws a "normal" event.
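A stripped-down version of that response might look like the sketch below, written in Python 3 rather than the Python 2 of the listings. It is an illustration, not the article's endProgram(): it omits the DOCTYPE declaration, and the function name subdialog_response and its content_type parameter are hypothetical. The default content type and the event names mirror Listing Three(b).

```python
from xml.sax.saxutils import quoteattr

def subdialog_response(failed=False, eventname=None, content_type="text/plain"):
    """Return HTTP headers plus a VoiceXML document that throws one event."""
    if failed:
        # Same naming scheme as Listing Three(b).
        suffix = eventname if eventname else "cgi.failed"
        event = "error.com.disaggregate." + suffix
    else:
        event = "normal"
    lines = [
        "Content-type: " + content_type,
        "",                                   # blank line ends the HTTP headers
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<vxml version="1.0">',
        "  <form>",
        # quoteattr adds the surrounding quotes and escapes XML specials.
        "    <block><return event=%s/></block>" % quoteattr(event),
        "  </form>",
        "</vxml>",
    ]
    return "\n".join(lines)

print(subdialog_response())
print(subdialog_response(failed=True, eventname="nologin"))
```

Building the attribute with quoteattr rather than by string concatenation keeps the document well-formed even if an event name ever contains a character that XML treats specially.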

Sending Instant Messages to VoiceXML

My original goal was just to send instant messages, and what I've just described accomplishes exactly that. However, after I got it working I decided to see if I could also use the IM client to send data to the VoiceXML application. This struck me as unlikely at first. The speech server is closed: The speech server is not extensible by remote users, and I can't add a module to it that detects IM messages and throws events to the VoiceXML interpreter.

I resorted to some familiar trickery. Listing Two is a form with two form items. The first form item, the field choiceVoice, solicits voice input. While the prompt plays and the VoiceXML interpreter waits for voice input, the interpreter fetches the subdialog document that constitutes the next form item, resultIM. That subdialog document is generated by a CGI script that also checks for incoming instant messages. The two form items bounce back and forth between each other until one or the other generates valid data.

The count attributes in the <prompt> tags define which prompt introduces the first attempt to fill the field, the second attempt, and the third and all subsequent attempts. The timeout attribute says that if the user does not begin speaking within four seconds after the prompt ends, recognition fails.

The count attributes in the <catch> tags select which <catch> tag is active for the first 10 attempts, and which is active for later attempts. The <reprompt/> tag forces the VoiceXML interpreter to prompt the next time it reaches a field; without the tag, the prompt does not play on retries.
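The selection rule behind the count attribute can be modeled in a few lines. This Python 3 sketch reflects my reading of the VoiceXML specification: among the candidate tags, the interpreter picks the one with the highest count that does not exceed the current attempt number. It is not interpreter code, and the function name select_by_count is hypothetical.

```python
def select_by_count(counts, attempt):
    """Pick the handler whose count is the highest value <= attempt.

    counts: the count attribute of each candidate tag (a missing count is 1).
    Returns the chosen count, or None if no tag qualifies.
    """
    eligible = [c for c in counts if c <= attempt]
    return max(eligible) if eligible else None

prompt_counts = [1, 2, 3]       # the three <prompt> tags in Listing Two
catch_counts = [1, 11]          # the two <catch> tags in Listing Two

for attempt in (1, 2, 3, 7):
    print("prompt", attempt, "->", select_by_count(prompt_counts, attempt))
for attempt in (1, 10, 11, 12):
    print("catch", attempt, "->", select_by_count(catch_counts, attempt))
```

Under this rule the third prompt serves every attempt from the third onward, and the catch with count 11 takes over only once the eleventh event arrives.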

This script may leave time windows in which speech recognition is not active—after the speech recognition times out but before the CGI script returns the subdialog document—but in my tests, this form seems to work well. If users send an instant message, there will likely be a delay until the speech recognition times out and the subdialog document executes, but I judge that delay acceptable since the user will perceive it as an Internet delay. Ignoring the user's utterance is far less acceptable, so I prompt users with a short phrase ("waiting") to solicit input when the recognizer is definitely available. If the CGI script fails twice, the test in the cond attribute prevents the CGI script from executing again.

Listing Three is a fragment of the CGI script that sends the data to the VoiceXML server. In Listing Three(a), rcvIM() calls imLogin() and specifies a callback function that places each incoming message into a tuple (IM address, message) and appends the tuple to a list. After checking for incoming messages, rcvIM() returns a list of the tuples whose IM addresses match that of the user.

Listing Three(b), endProgram(), prints the subdialog document. The list of tuples becomes an ordinary string. The string and its length (which serves as a flag) are returned as part of the subdialog document. The <filled> tag in the <subdialog> (Listing Two) checks the flag and uses the string.
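The filter, join, and flag steps can be sketched on their own. This Python 3 example is an illustration under stated assumptions, not the article's code: senders are plain strings rather than jabber.py address objects, and the helper names js_string and result_vars are hypothetical. It also escapes apostrophes, since an unescaped one in a message would end the single-quoted ECMAScript string early.

```python
from xml.sax.saxutils import quoteattr

def js_string(text):
    """Return text as a single-quoted ECMAScript string literal."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    escaped = escaped.replace("\n", " ").replace("\r", " ")
    return "'" + escaped + "'"

def result_vars(messages, address):
    """Build the <var> tags and <return> for messages sent from address.

    messages is a list of (sender, text) tuples with plain-string senders.
    """
    texts = [text for sender, text in messages if sender == address]
    joined = " ".join(texts)
    if not joined:
        # Nothing from this user: throw the innocuous event instead.
        return '<block><return event="normal"/></block>'
    return "\n".join([
        # quoteattr supplies the attribute quotes and escapes XML specials.
        "<var name=\"message\" expr=%s/>" % quoteattr(js_string(joined)),
        # The length doubles as the flag that the caller tests.
        "<var name=\"flag\" expr=\"%d\"/>" % len(joined),
        '<block><return namelist="flag message"/></block>',
    ])

inbox = [("traveler@jabber.example", "1022"), ("other@jabber.example", "7")]
print(result_vars(inbox, "traveler@jabber.example"))
```

Messages from other senders are dropped before the join, so the flag stays zero unless the caller's own IM address has written something.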

Improvements

A quick look at the complete source code shows that the CGI scripts are aimed at a single-user system: Each time I send an instant message or check for one, the application executes a separate login/logout, which is clearly a load on the public server. A multiuser system would use a local Jabber server with a custom module, or perhaps a client daemon that remains logged in to the server and is accessible by the CGI script.

In this application, once the message has been sent, the application does not speak the flight information because I assume that the IM client received the message—not necessarily a valid assumption. A custom IM client could send acknowledgment messages. Ideally, the user's instant messaging client would support a simple point-and-tap interface for use with PDAs—a tap on the desired flight would transmit that choice to the speech server. (Jabber supports XHTML Basic, which includes forms.)

When waiting for the flight number, the speech recognizer's grammar is set to "number" and accepts any valid number spoken in a natural fashion ("eleven," "ten twenty-two," and so on). Since there's a short list of valid flights, I should restrict the user's choices to these flights—and add words like "help," "send message," and "start over." These changes would greatly increase recognition accuracy.
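A restricted grammar of this kind could be generated rather than typed by hand. The Python 3 sketch below formats a list of phrases in the bracketed style used by the <grammar> tag in Listing One(a), with multiword phrases in parentheses. The function name to_gsl is hypothetical, and the sketch covers only the formatting; mapping flight numbers to their spoken forms is left out.

```python
def to_gsl(phrases):
    """Format phrases as a bracketed list like the grammar in Listing One(a).

    Single words are written bare; multiword phrases are parenthesized.
    """
    items = []
    for phrase in phrases:
        words = phrase.lower().split()
        if not words:
            continue                    # skip empty entries
        if len(words) == 1:
            items.append(words[0])
        else:
            items.append("(" + " ".join(words) + ")")
    return "[ " + " ".join(items) + " ]"

commands = ["help", "send message", "start over"]
print(to_gsl(commands))
```

A CGI script that already knows the short list of valid flights could emit such a grammar as part of a dynamically generated field.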

Since Voxeo supports HTTP-initiated calls, transactions can start by sending an IM (via the CGI server) to the speech server, which would result in a phone call to the user's telephone number.

Finally, Jabber provides an excellent platform for this test. However, it is currently structured to provide IM functions—not capability discovery. In this architecture, the association between the telephony connection and the IM service is not intrinsic to the telephony or IM connection; the association is made through a database lookup.

In an alternative architecture, the user's client would dynamically report to the application what capabilities it has available, such as a large or small display screen, voice input, speaker output, gestures, text, and so forth. The application would then decide what capabilities to utilize and send appropriate messages to the appropriate destination(s). Capability negotiation protocols are often based on SIP, Session Initiation Protocol. The IETF's SIMPLE working group is working on an extension to SIP for instant messaging; this means that SIP could provide both the IM and capability-negotiation protocols.

If you would like to try this system without installing it yourself, go to my web site at http://www.disaggregate.com/ and find the "demos" link.

Hardware and infrastructure providers are bringing us faster wireless networking and less-expensive wireless PDAs. Speech recognition and text-to-speech work well and are improving steadily, while voice biometrics provide secure access. Taken together, it's clear that multimodal interfaces are the wave of the future—and now's the time to get your feet wet.

DDJ

Listing One

(a)
<var name="flight.city"/>
<var name="flight.timeofday"/>
<var name="callerid" expr="session.telephone.ani"/>
<form id="flight_information">
    <field name="cityName">
        <prompt>
            What city did you want to fly to?
        </prompt>
        <grammar>
            [ chicago (new york) pittsburgh (san francisco) ]
        </grammar>
        <!-- User/ASR Errors -->
        <nomatch>
           <prompt>
              Choices are Chicago, New York, Pittsburgh, and San Francisco.
           </prompt>
           <reprompt/>
        </nomatch>
        <noinput>
           <reprompt/>
        </noinput>
        <filled>
            <assign name="flight.city" expr="cityName" />
        </filled>
    </field>
    <filled>
        <goto next="#sendFlightInformationAnnounce"/>
    </filled>
</form>
(b)
<form id="sendFlightInformationAnnounce">
    <block>
        <prompt>
            One moment please while I send you choices for flights
            to <value expr="flight.city"/>
            leaving in the <value expr="flight.timeofday"/>.
        </prompt>
    </block>
    <field name="scratch" type="digits?minlength=22">
        <prompt timeout="1"><break msecs="1"/></prompt>
        <catch event="noinput nomatch filled">
            <goto next="#sendFlightInformation"/>
        </catch>
    </field>
</form>
(c)
<form id="sendFlightInformation">
    <subdialog name="flightInfo" 
                        src="http://disaggregate.com/cgi/flightInfo.cgi"
            namelist="callerid flight.city flight.timeofday" method="post">
        <error count="1">
        </error>
        <error count="2">
            <assign name="flightInfo.message"
                    expr="'Actually, we couldn\'t retrieve the data. Fake it.'"/>
            <goto nextitem="sendInfo"/>
        </error>
        <catch event="normal">
                <goto nextitem="sendInfo"/>
        </catch>
    </subdialog>
    <subdialog name="sendInfo"
               src="http://disaggregate.com/cgi/sendInfo.cgi"
               namelist="callerid flight.city flight.timeofday flightInfo.message"
               method="post">
        <error count="1">
            <prompt>I had trouble sending to you. Let me try again.</prompt>
        </error>
        <error count="2">
            <goto next="#readFlightInformation"/>
        </error>
        <catch event="normal">
            <goto next="#getChoice"/>
        </catch>
    </subdialog>
</form>

Listing Two

<form id="getChoice">
    <var name="noDialogProblems" expr="0"/>
    <field name="choiceVoice" type="number">
        <prompt count="1" timeout="4s">
            Please look at the list of flights.
            From that list of flights, what is the flight number you prefer?
        </prompt>
        <prompt count="2" timeout="4s">
            Please take your time. I will wait for up to a minute for you
            to make up your mind. If I don't hear you, just speak again.
        </prompt>
        <prompt count="3" timeout="4s">
            Waiting:
        </prompt>
        <catch event="nomatch noinput">
           <goto nextitem="resultIM"/>
        </catch>
        <catch event="noinput nomatch" count="11">
            <goto next="#noChoice"/>
        </catch>
        <help>
            <prompt>
                Please say the flight number, or write the flight number,
                of the flight you prefer.
             </prompt>
        </help>
        <filled>
            <prompt>
                Thank you for choosing 
                          flight <value class="digits" expr="choiceVoice"/>.
            </prompt>
            <goto next="#topMenu"/>
        </filled>
    </field>
    <subdialog cond="noDialogProblems==0" name="resultIM"
            src="http://disaggregate.com/cgi/rcvInfo.cgi" 
                               namelist="callerid" method="post">
        <error>
            <reprompt/>
            <goto nextitem="choiceVoice"/>
        </error>
        <error count="2">
            <reprompt/>
            <assign name="noDialogProblems" expr="1"/>
            <goto nextitem="choiceVoice"/>
        </error>
        <catch event="normal">
            <reprompt/>
            <goto nextitem="choiceVoice"/>
        </catch>
        <filled>
            <if cond="resultIM.flag != 0">
                <prompt>
                    Thank you for choosing flight <value class="digits" 
                                               expr="resultIM.message"/>.
                </prompt>
                <goto next="#topMenu"/>
            </if>
        </filled>
    </subdialog>
</form>

Listing Three

(a)

def rcvIM (address):
    """receive IM message: log in, check for message, log out"""
    con = imLogin(receive,rcvObj=receiveMessageCB)
    con.process(0.5)
    con.disconnect()
    return [ x for x in messageList if x[0].getStripped() == address ]
(b)
import sys

# quote delimits XML attributes; tick delimits the string inside expr
quote = '"'; tick = "'"; space = " "
def endProgram(messageList="", failed=0, eventname=None) :
    """End program by printing out VoiceXML
    """
    if failed :
        if eventname is None :
            eventname = "error.com.disaggregate.cgi.failed"
        else :
            eventname = "error.com.disaggregate." + eventname
    print "Content-type: text/plain"
    print ""
    print """<?xml version="1.0" encoding="UTF-8" ?>
    <!DOCTYPE vxml PUBLIC '-//Nuance/DTD VoiceXML 1.0//EN' 
     'http://voicexml.nuance.com/dtd/nuancevoicexml-1-3.dtd' >
    <vxml version="1.0">
    <form>
    """
    # Tell VoiceXML script that called us to throw an exception if, 
    # for some reason, we were unable to send the IM
    if failed :
        print '<block><return event="' + eventname + '"/></block>'
    else :
        if messageList :        # only if there are messages
            # assign return text to variable message, return message variable
            stringList = [ x[1] for x in messageList ]  # list of just strings
            messageString = " ".join(stringList)       # into one long message 
                                                       # separated by spaces
            # create variable message with value
            retval = '<var name=' + quote + 'message' + quote + space
            retval += 'expr=' + quote + tick
            retval += messageString + tick + quote
            retval += '/>'
            print retval
            # flag if message has actual info
            retval = '<var name=' + quote + 'flag' + quote + space
            retval += 'expr=' + quote
            retval += str(len(messageString)) + quote
            retval += '/>'
            print retval
            print '<block><return namelist="flag message"/></block>'
        # if there is no input, throw innocuous event
        else :
            print '<block><return event="normal"/></block>'
    print """</form></vxml>"""
    sys.exit()          # "successful" exit

