VoiceXML and the Voice-Driven Internet


Apr01: VoiceXML and the Voice-Driven Internet

David is a technologist for The Technical Resource Connection. He can be contacted at [email protected].


Wireless data services are growing at a phenomenal rate, driven to a large extent by the popularity of the Internet services they are delivering. These wireless-enabled Internet services are generally accessible not only by standard web browsers, but also by some mix of web phones, two-way pagers, and wireless organizers. The adoption of these modes of Internet access is being accelerated by the effects of mainstream Internet usage maturing from an initial novelty/hype phase into a ubiquitous set of services we use as common tools in everyday life. In this mode of use, how information is presented is less important than being able to get to the particular information you require easily, when and where you need it.

As with most Internet-related technologies, however, wireless Internet services have been over-hyped and still suffer from considerable limitations. These include usability issues such as incomplete wireless network coverage, as well as limited wireless device input, output, and bandwidth. A large percentage of wireless services are accessed from within cars and other situations requiring alert eyes and hands-free operation, yet the current mix of mobile wireless devices does not support this mode of use. Furthermore, initial and sustained costs associated with buying and activating wireless devices present a considerable hurdle that is limiting the penetration of wireless Internet services into the marketplace. This has set the stage for a consumer backlash and a reality check for wireless infrastructure and application providers. Web phones, wireless organizers, and two-way pagers are great for many applications, but not everything for which they are often promoted. The underlying technologies also have some maturing to do before they can deliver some of the promises on which they have been sold.

As the fog of hype surrounding the wireless Internet clears and the real strengths and effective applications of wireless services become apparent, unfulfilled needs are appearing. It is desirable to provide users with access to Internet services via a familiar and pervasive global communications network so that the issues of ease of use and incomplete network coverage are minimized. The interface for this access should improve usability and leverage existing infrastructure and services to avoid costs associated with buying and activating new wireless devices and services. These key unfulfilled needs are the drivers for the rapid growth of a complementary new mode of accessing Internet services, implementations of which are called "voice portals."

Voice portals leverage both the most natural form of communication, speech, and the most pervasive and familiar communications network, the global telephone network. This network is accessible from the standard wired phones and mobile cellphones users already have, under the service plans they already pay for, so users incur no additional cost to access Internet services via voice portals. This eliminates the expense barriers that are currently limiting the penetration of wireless services into the marketplace. Phones also permit eyes- and hands-free operation, enabling Internet service usage via voice portals in situations where wireless devices will not suffice.

In this article, I'll discuss the concept of voice portals and the associated architecture. I'll then show how simple design patterns — together with XML and XSL — can be used to deliver Internet content and services cost effectively not only to web browsers and various wireless devices, but also to any telephone via VoiceXML (for more information on the VoiceXML Standard, see http://www.voicexml.org/). I'll then present an implementation of this architecture that uses software that is freely available on the Internet. Finally, I'll examine key business and technical issues associated with voice-driven applications.

Voice Portal Application Architecture

Figure 1 shows the architecture of a typical voice portal application. To highlight the role that each component of the architecture plays in delivering Internet services via voice, I'll discuss the responsibilities of each component. This is followed by a review of a key use-case scenario that exposes, in chronological order, the collaborations that occur in a typical voice portal application. XML, XSL, VoiceXML, and Java snippets are included where applicable in the scenario to reinforce key concepts. The full documents, templates, and source code are available electronically; see "Resource Center," page 5.

Voice Portal. This component provides a speech gateway to Internet services. It consists of infrastructure services typically outsourced by application developers to a company that manages the hardware and software required to facilitate voice access to Internet services. The voice portal is responsible for integrating the Internet with the telephone network and providing an open programming model that enables developers to write VoiceXML to create voice applications that are executed by the voice portal.

Voice Browser with VoiceXML Interpreter and Controller. Web browsers are used to navigate the Internet, deliver content to users, and facilitate user input. They rely heavily, if not completely, on visual interfaces to achieve these objectives. Voice browsers are analogous, with the important difference that they rely completely on an audio interface to achieve these objectives. The analogy is close, with voice browsers even supporting bookmarks, stop, go back, go forward, and other well-established web-browsing navigation concepts. Where web browsers are driven by HTML, voice browsers are driven by VoiceXML. One important capability of the full-duplex voice browser, called "barge-in," lets users issue commands to direct the browser during a session, even while being prompted. This delivers significantly more flexibility to voice interfaces than older half-duplex prompt-and-response systems, where users must wait for prompts to complete before giving input. Furthermore, VoiceXML permits layering of voice command sets in a hierarchical manner that enables users to rapidly navigate to the information they require without having to wade through tedious and rigid prompt/response structures like those deployed in early touch-tone automated telephone interfaces.

Speech Recognition. Just as web browser input is given by either clicking on a link or providing input through a form, equivalent voice browser input is given verbally and interpreted by the speech recognition engine of the voice browser. The VoiceXML driving the voice browser defines the valid input that may be given by users at any given time.

Speech Synthesis. Where web browsers display information visually, voice portals read information to users via either prerecorded audio playback or on-the-fly speech synthesis in the voice browser. Which audio is played or what speech is synthesized at any given time is again determined by the VoiceXML driving the voice browser.

Telephony Hardware. This is the hardware required to integrate the voice browser with the telephone network. It includes the ability to answer incoming calls and initiate voice browsing sessions. This hardware delivers audio input from the user via the telephone network to the speech recognition engine of the voice browser. Similarly, this component delivers audio played back or speech synthesized by the voice browser to the telephone network, phone, and ultimately, the user.

Telecommunications Infrastructure. This is the global telephone network consisting of many types of links and exchanges, and it includes both standard landline phones as well as mobile cellphones. This network is more pervasive than the Internet and has the capability to deliver Internet services to where they are not yet accessible, such as developing countries.

Phone User. This actor in the architecture is a person using natural speech and a standard wired telephone to access Internet services via a voice portal.

Mobile Phone User. Similarly, this actor in the architecture is a mobile user with a mobile cellphone that is accessing Internet services via the speech interface of the voice portal.

Web Server. A standard Internet web server accessed either by HTTP or HTTPS. This same web server may be used to deliver the same services to voice portals as well as web browsers, wireless devices, or external servers in a business-to-business extranet collaboration.

Servlet/JSP Engine. The servlet and/or JSP engine hosts Java servlets or Java Server Pages (JSPs) that act as gateways to deliver Internet services from back-end business and infrastructure services.

Product Data Servlet. The product data servlet is responsible for talking to back-end business services, infrastructure services, or databases, and where necessary, converting replies into XML. This component could also be implemented as a Java Server Page. Clients that require content or services in raw, presentation-neutral XML can talk directly to this component. Examples of such clients include external servers in a B2B collaboration or thick clients such as Java applications.

Product Presentation Servlet. Just as clients that require raw content talk directly to the data servlet, clients that require content in a presentation format, such as HTML for a web browser or VoiceXML for a voice browser, may talk directly to the presentation servlet. This component serves to take the XML content from the data servlet and transform it using XSL into the format suitable for the requesting client. To facilitate this, it must identify the type of client requesting services, either implicitly via the HTTP request header or explicitly via a CGI parameter specified by the client. Once the client type has been identified, this component loads the appropriate XSL stylesheet template to transform the XML into the target format required by the given client. Similarly, this component could also be implemented as a JSP.
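The client-identification step can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the stylesheet file names, the WebBrowser and WapPhone client types, and the User-Agent check are hypothetical; only the ClientType parameter and its VoicePortal value come from the example application.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: picks an XSL stylesheet for a request based on the client type. */
final class StylesheetSelector {
    // Hypothetical mapping from client type to stylesheet file. Only the
    // "VoicePortal" client type appears in the article; the rest are assumed.
    private static final Map<String, String> STYLESHEETS = new LinkedHashMap<String, String>();
    static {
        STYLESHEETS.put("VoicePortal", "products-voicexml.xsl");
        STYLESHEETS.put("WebBrowser", "products-html.xsl");
        STYLESHEETS.put("WapPhone", "products-wml.xsl");
    }

    /** An explicit ClientType parameter wins; otherwise guess from the User-Agent header. */
    static String resolveClientType(String clientTypeParam, String userAgent) {
        if (clientTypeParam != null && STYLESHEETS.containsKey(clientTypeParam)) {
            return clientTypeParam;
        }
        // Assumed heuristic: desktop web browsers of the period identify as "Mozilla".
        if (userAgent != null && userAgent.contains("Mozilla")) {
            return "WebBrowser";
        }
        return null; // unknown client: the caller may fall back to raw XML
    }

    /** Returns the stylesheet name for a client type, or null if there is none. */
    static String stylesheetFor(String clientType) {
        return clientType == null ? null : STYLESHEETS.get(clientType);
    }

    public static void main(String[] args) {
        String type = resolveClientType("VoicePortal", null);
        System.out.println(type + " -> " + stylesheetFor(type));
    }
}
```

In a servlet, the two arguments would typically come from the request's ClientType parameter and its User-Agent header. Returning null for an unrecognized client leaves the fallback policy to the caller.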

XSL Templates. A repository of XSL templates used by the presentation servlet to transform XML content and services into output formats suitable for various clients. This component could be as simple as a file system or as complex as an enterprise database.

Web and XML Content. In any given Internet service, some content is static and some dynamic. Static content does not vary across requests. On the other hand, dynamic content may vary across requests depending, for example, on request parameters given by the client. The Web Content repository serves client-specific static content directly to clients without transformation, such as HTML for web browser clients or prerecorded audio for voice browsers. On the other hand, the XML content repository serves client-independent static content either directly through the data servlet, or with XSL transformation through the presentation servlet.

Product Service. This component is a business service responsible for managing and disseminating product-related information, for example, to facilitate product browsing. It is accessed via some type of middleware such as CORBA, EJB, or MQ Series.

Product Database. Repository that manages persisted product information.

An Example

Assume in this example that a phone user is browsing products via a voice portal. In chronological order, the collaborations that take place during a typical voice-driven Internet service are:

1. Using either a standard landline or mobile cellphone, users dial a phone number associated with the voice portal and product-service application. This phone number is managed by the voice portal and is typically a toll-free number. The mapping from a phone number to the product service application is done beforehand during configuration of the voice portal service.

2. The voice portal answers the call, starts a voice-browsing session, and instructs the voice browser to load the VoiceXML application template for the product-browsing application. The template to load is again specified during configuration of the voice-portal service. To load the VoiceXML template, the voice browser issues an HTTP(S) request to the web server. The web server responds by loading the VoiceXML template, which is typically a static document from the web content, and returning it to the voice browser. The voice browser then interprets the VoiceXML and begins to execute it. This involves playing back prerecorded audio and/or synthesizing speech to guide the user through the application. The VoiceXML shown in Listing One welcomes the user to the product-browsing service. As shown, the VoiceXML high-level structure is composed of one or more forms that may be used to either prompt users and get input or deliver information to users.

3. In the case of this Product Browsing application, users are asked to first select the product group. Valid options include books, music, or video. The grammar of the VoiceXML defines the valid inputs. A given selection may be chosen via one or more words. For example, "video," "videos," "movie," and "dvd" are all acceptable inputs used to select the VIDEO product group (see Listing Two). Any valid option may also be selected using keys on a touchtone phone. For example, the BOOKS product group may be selected by pressing key 1 on a touchtone phone. This is specified by the DTMF assignments in the grammar of the VoiceXML.

4. Users may respond to this prompt by saying, for example, "books," "music," or "video." In this example, users respond by saying "books." The user's response is interpreted via the speech-recognition engine in the voice browser, and the associated value "BOOKS" defined in the grammar is then stored by the voice browser in a local variable named document.generic.ProductGroup (Listing Two), which is associated with this field of the VoiceXML form.

5. Once users have selected the product group, the VoiceXML instructs the voice browser with a goto statement to navigate to the next form with the ID ChooseBooksType. This next form (see Listing Three), in turn, prompts users for the category of book they are interested in. Two options are presented, "architecture" and "art."

6. In this example, users select "architecture." The user's response is again interpreted by the speech-recognition engine of the voice browser, and the value BKS-ARCH is stored in the local variable document.generic.ProductCategory, which is associated with the current form.

7. Once the product group and category have been selected, the VoiceXML instructs the voice browser with a goto statement to navigate to the form named "DoSearch"; the voice browser then sends a request to the web server for books on architecture via an HTTP request, as in Listing Four. For a secure application, HTTPS could be used instead of HTTP.

8. The web server delegates this request to the product presentation servlet.

9. The product presentation servlet identifies the type of client via the ClientType CGI argument submitted with the request that, in this case, has the value VoicePortal. The product presentation servlet then delegates this request to the product data servlet.

10. The product data servlet responds by interpreting the parameters of the request. In this case, the parameters specify the product group as books and the category as architecture. The data servlet then issues a request to the product service for the requested products, generally using some kind of middleware such as EJB or CORBA.

11. The product service responds by loading the requested products from the product database and returning them to the product data servlet.

12. The product data servlet reformats the returned product information into XML by first creating an XML Document Object Model (DOM) using a Java API, then serializing this DOM to return the XML results to the product presentation servlet. Listing Five is a snippet of this XML code.

13. The product presentation servlet then loads, from the XSL templates repository, the XSL stylesheet used to transform XML product data into VoiceXML for the voice portal. The complete XSL stylesheet for the product browser application is also available electronically.

14. The product presentation servlet uses the XSL stylesheet for the voice portal to transform the XML product data into VoiceXML using an XSL Transformation (XSLT) Java API. Listing Six is the key Java code used to do this transformation.

15. The result of this transformation is the VoiceXML that is returned to the voice browser via the web server.

16. As instructed by the VoiceXML, the voice browser and speech-synthesis engine read the details of the retrieved products to users.

17. After listening to the results, users are asked if they would like to hear the products again. As specified by the grammar, users can respond with either yes or no. The VoiceXML includes significant support for error handling during input. For example, "no input" and "no match" conditions are handled if users don't respond or give an invalid response, respectively. Users may also ask for "help," in which case information is given to guide them through the required input.

18. In this example, users respond by saying "no." Users are then similarly given the option of either browsing for other products or ending the session.

19. After the user chooses to end the session, the voice portal cleans up the browsing session and the user's phone call is disconnected.
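Step 12 above, building a DOM and serializing it, can be sketched as follows. This is a minimal illustration rather than the article's servlet code: it uses the standard JAXP interfaces (javax.xml.parsers and javax.xml.transform) instead of calling Xerces classes directly, the class and method names are hypothetical, and it emits only three of the elements shown in Listing Five.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Sketch: builds a ProductList DOM from rows of {id, name, cost} and serializes it. */
final class ProductListXml {
    static String toXml(String[][] products) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element list = doc.createElement("ProductList");
            doc.appendChild(list);
            for (String[] p : products) {
                Element product = doc.createElement("Product");
                list.appendChild(product);
                appendText(doc, product, "ID", p[0]);
                appendText(doc, product, "Name", p[1]);
                appendText(doc, product, "Cost", p[2]);
            }
            // A Transformer created without a stylesheet copies its input
            // unchanged, which serializes the DOM tree to text.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not build product XML", e);
        }
    }

    /** Appends <tag>text</tag> to parent; the serializer escapes the text. */
    private static void appendText(Document doc, Element parent, String tag, String text) {
        Element child = doc.createElement(tag);
        child.appendChild(doc.createTextNode(text));
        parent.appendChild(child);
    }

    public static void main(String[] args) {
        System.out.println(toXml(new String[][] {{"7", "Invisible New York", "21.0"}}));
    }
}
```

Building the document through the DOM rather than by string concatenation means characters such as `&` in product names are escaped by the serializer, so the presentation servlet always receives well-formed XML.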

Building the example just presented involves a number of freely available tools and technologies. These include:

Voice Portal. Tellme Networks (http://www.tellme.com/) provides the VoiceXML development studio and live testing area used to develop and test the product browser application.

Servlet/JSP Engine. JRun from Allaire (http://www.allaire.com/) was used to implement the servlet/JSP Engine. It provides standards-compliant APIs for servlets and Java Server Pages (JSPs), and serves to decouple servlets and JSPs from the web server. While some web servers may facilitate running servlets and/or JSPs directly, it is desirable to use a separate servlet/JSP engine in order to move load off the web server as well as shield servlets and JSPs from proprietary APIs that may couple them to a particular web server product. This approach also enables complete flexibility in swapping the particular web server implementation. Similarly, because the servlet/JSP engine used provides standards-compliant APIs, it too may be swapped for a different implementation, giving further deployment flexibility.

Java Development Kit (JDK). JDK1.2.2 (http://www.java.sun.com/) was used to run all Java code, including the servlets.

Java XML API. Xerces (http://xml.apache.org/) was used for the standards-compliant XML API implemented in Java. This API was used to create the XML product data in the product data servlet.

Java XSL API. Xalan (http://www.apache.org/) was used for the standards-compliant XSL API implemented in Java. This API was used by the servlets to transform the product list XML into the product list VoiceXML using the XSL template. These components are available electronically; see "Resource Center," page 5.

Java CORBA API. VisiBroker 3.4 (http://xml.inprise.com/) was used to implement the product service as a CORBA server using Java.

Repositories and Databases. The product database was implemented using Oracle 8i. All other repositories in this prototype used the file system but could be scaled to a database for better performance and management in a full production deployment.

Web Server. Microsoft Internet Information Server (IIS) 4.0 was used for the web server in this prototype.

Business and Technical Analysis

Many early voice applications managed the complexity of speech recognition and synthesis software and hardware in-house. In contrast to this, the architecture I present here enables application providers to outsource this complexity to voice portals, removing entry barriers associated with both cost and complexity that have limited prior use of these technologies. This, in turn, enables application providers to focus on their added business value rather than the underlying infrastructure.

Speech recognition and synthesis required for voice applications need considerable computing resources, as well as audio input and output capabilities. When faced with the challenge of delivering voice applications to remote clients, whether wired or wireless, it is currently more cost effective to use the remote client as an audio input and output device and depend on a central server to provide the computing resources to run the speech recognition and synthesis required for the voice application. This is the "thin client" voice application architecture currently used by voice portals. In this case the telephone is the simple remote client device with the required audio input and output capabilities. Although not cost effective for mainstream applications today, in the longer term as adequate computing resources and audio input and output capabilities on client devices become more prevalent, more of the speech recognition and synthesis tasks may be delegated to the client device, lessening demand on the central server. In this approach, the client device becomes thicker, enabling the client to operate in a standalone mode, decreasing response time, and improving the system scalability.

Voice portals may be secured on two levels. Communications may be secured with authentication, encryption, and data integrity measures using existing telephony security technologies in conjunction with HTTPS between the voice portal and web server. In addition, the application itself may be secured with authentication and access control. In voice applications, at least three approaches, outlined below, may be used for authentication.

The simplest form of authentication is the standard username/password security currently used pervasively throughout the Internet. For obvious reasons, it is not desirable to have the user vocalize username or password in a voice application. The alternative is to have the user key in this information using the phone keypad. This quickly becomes tedious, error prone, and requires the attention of the user's eyes and at least one hand. It is, therefore, desirable to minimize or eliminate this mode of authentication.

Voice browsers and voice applications written in VoiceXML are able to identify the phone number of the calling device. In cases where a given phone may be mapped to a single user, this enables a more convenient form of authentication. In this case, no username needs to be supplied since the device ID (phone number) is used instead to identify users. All that users need to enter in this form of authentication is a password or PIN, generally a short numeric code that may be conveniently keyed in on the telephone keypad.
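A minimal sketch of this caller-ID-plus-PIN check might look like the following. The class, its methods, and the sample numbers are hypothetical, and PINs are held in plain text in memory purely for brevity; a production system would presumably store them hashed and limit failed attempts.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: the calling number identifies the user, a keyed-in PIN authenticates. */
final class CallerIdAuthenticator {
    // Normalized phone number -> PIN (plain text only to keep the sketch short).
    private final Map<String, String> pinsByPhone = new HashMap<String, String>();

    void register(String phone, String pin) {
        pinsByPhone.put(normalize(phone), pin);
    }

    /** True only if the calling number is known and the keyed PIN matches. */
    boolean authenticate(String callerId, String keyedPin) {
        if (callerId == null || keyedPin == null) {
            return false; // caller ID withheld or nothing keyed in
        }
        String expected = pinsByPhone.get(normalize(callerId));
        return expected != null && expected.equals(keyedPin);
    }

    /** Strips punctuation so "(407) 555-0100" and "407-555-0100" compare equal. */
    static String normalize(String phone) {
        return phone.replaceAll("[^0-9]", "");
    }

    public static void main(String[] args) {
        CallerIdAuthenticator auth = new CallerIdAuthenticator();
        auth.register("(407) 555-0100", "4821");
        System.out.println(auth.authenticate("407-555-0100", "4821"));
    }
}
```

Because the phone number stands in for the username, the scheme only fits cases where a phone maps to a single user, as noted above.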

Voice recognition may also be used for authentication in the case of speech-driven applications. In this case, the valid user's voice prints are acquired and archived at the time of account setup. Later, at each time of authentication, voice prints are acquired for the user being authenticated. These voice prints are compared to voice prints associated with the valid user. In the case of a match, authentication succeeds. This form of bio-authentication inherently built into voice applications has potential to significantly improve the security of applications while being convenient for users in that it does not require the user to remember usernames or passwords and enables eyes- and hands-free operation.

Interface Design Challenges

Where visual interfaces are more parallel with multiple inputs being processed by the user at a given time, voice interfaces are more serial. This fundamental difference makes voice application design challenging and presents one of the most significant obstacles to the widespread acceptance of voice-driven Internet services. This challenge may be met through several strategies. Robust fault tolerance may be built into voice applications, enabling them to recover and behave gracefully, for example, in the event of absent or unrecognized input. Help may also be built into voice applications and may be accessed at any point in a dialog with the user simply saying "help." The user should also be given clear cues regarding what input is expected at any point. Valid choices at any point should be limited in number to avoid overwhelming the user with too many options presented linearly. Where possible, voice applications should borrow heavily from well-established browsing navigation paradigms including go back, go forward, stop, bookmarks, and so forth. These features are all supported by voice portals driven by VoiceXML and are available for use by application developers in voice applications.
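The recovery behavior described above can be modeled outside VoiceXML as a small loop. The sketch below is an assumption-laden simulation, not how a voice browser is implemented: recognized utterances arrive as strings (null standing for silence), the event names are made up for illustration, and a request for help does not count against the retry limit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Sketch: one voice form field with no-input, no-match, and help handling. */
final class FieldDialog {
    private final Map<String, String> grammar = new HashMap<String, String>();
    private final int maxAttempts;
    /** What the application "said", recorded so the flow can be inspected. */
    final List<String> events = new ArrayList<String>();

    FieldDialog(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** Declares a spoken word as valid input that selects the given value. */
    void accept(String word, String value) {
        grammar.put(word, value);
    }

    /** Returns the selected value, or null once the failed attempts are used up. */
    String run(Iterator<String> utterances) {
        int failures = 0;
        events.add("PROMPT");
        while (failures < maxAttempts && utterances.hasNext()) {
            String heard = utterances.next();
            if (heard == null) {
                failures++;
                events.add("NOINPUT");      // user stayed silent
            } else if ("help".equals(heard)) {
                events.add("HELP");         // guidance is free, not a failure
            } else if (grammar.containsKey(heard)) {
                return grammar.get(heard);  // valid choice
            } else {
                failures++;
                events.add("NOMATCH");      // input is outside the grammar
            }
        }
        return null;
    }
}
```

Keeping the grammar small, as the paragraph above recommends, also keeps the number of no-match failures down, since there are fewer near-miss words to confuse.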

When navigating familiar web pages, it is common for users to select an option on a page before the web browser completes loading that page. This is especially true with experienced users. By the same token with voice applications, experienced users may know what option they want to select before a given prompt completes and may become frustrated with having to listen to the same prompts every time they use the application. For this reason, voice portals with VoiceXML support "barge in" where users may give input at any point in a dialog and interrupt a prompt. In this way, voice applications may be developed to enable inexperienced users to listen to prompts and navigate the system with help, while at the same time enabling experienced users to navigate through the application as fast as they are comfortable.

The VoiceXML Standard

VoiceXML is a new standard with significant industry backing. It promises to create a level playing field on which voice portals may compete for outsourcing the hosting of voice applications. This will drive down cost and improve quality of service for both application providers and their customers. From the application provider's standpoint, creating voice applications using VoiceXML has the advantage that content is portable across different voice portals, delivering flexibility with respect to choosing voice portals to host voice applications.

Conclusion

Voice portals driven by VoiceXML provide a powerful complementary new mode of access that empowers users with more options regarding when, where, and how they consume Internet services. Using speech as the most natural form of communication, the existing familiar global telephone network as the most pervasive communications network, and enabling eyes- and hands-free operation, this new mode of access promises to further accelerate the growth and maturity of Internet services into a ubiquitous set of tools we use every day.

DDJ

Listing One

<?xml version="1.0"?>
<vxml application="http://resources.tellme.com/lib/universals.vxml">
<form id="Introduction">
  <block>
    <audio>Welcome to the TRC Product Browser.</audio>
    <pause>200</pause>
    <audio>Using this service you may browse products including books, 
                                                 music and video.</audio>
    <pause>200</pause>
    <goto next="#ChooseProductType"/>
  </block>
</form>

Listing Two

<form id="ChooseProductType">
 <field name="document.generic.ProductGroup">
  <grammar>
   <![CDATA[
           [
           [ (dtmf-1) book books] {<option "BOOKS">}
           [ (dtmf-2) music cd cds] {<option "MUSIC">}
           [ (dtmf-3) video videos movie dvd] {<option "VIDEO">}
           ]
   ]]>
  </grammar>
  <prompt>
   <audio>Please select a product type. Choose books, 
                                            music or video now.</audio>
   <pause>2000</pause>
  </prompt>
  <filled>
   <result name="BOOKS">
     <audio>You selected books.</audio>
     <goto next="#ChooseBooksType"/>
   </result>
 ... 
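The grammar in Listing Two amounts to a lookup table from spoken words and touch-tone keys to option values. The Java sketch below models only that mapping, using the words and values from the listing; the class name and the dtmf-N key spelling for key presses are illustrative assumptions, and no speech recognition is involved.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Sketch: the Listing Two grammar as a table of accepted inputs. */
final class ProductGroupGrammar {
    private static final Map<String, String> OPTIONS = new HashMap<String, String>();
    static {
        add("BOOKS", "dtmf-1", "book", "books");
        add("MUSIC", "dtmf-2", "music", "cd", "cds");
        add("VIDEO", "dtmf-3", "video", "videos", "movie", "dvd");
    }

    /** Registers every accepted input as a synonym for one option value. */
    private static void add(String value, String... inputs) {
        for (String input : inputs) {
            OPTIONS.put(input, value);
        }
    }

    /** Returns the option value for an input, or null when nothing matches. */
    static String match(String input) {
        if (input == null) {
            return null;
        }
        return OPTIONS.get(input.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(match("movie"));  // VIDEO
        System.out.println(match("dtmf-1")); // BOOKS
    }
}
```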


Listing Three

<form id="ChooseBooksType">
 <field name="document.generic.ProductCategory">
  <grammar>
   <![CDATA[
           [
           [ (dtmf-1) architecture] {<option "BKS-ARCH">}
           [ (dtmf-2) art] {<option "BKS-ART">}
           ]
   ]]>
  </grammar>
  <prompt>
   <audio>Please select a book type. Choose architecture or art now.</audio>
   <pause>2000</pause>
  </prompt>
  <filled>
   <result name="BKS-ARCH">
     <audio>You selected architecture.</audio>
     <goto next="#DoSearch"/>
   </result>
 ... 


Listing Four

<form id="DoSearch">
  <block>
   <audio>I am now searching for products.</audio>
     <goto next="http://www.trcinc.com/ProductPresentationServlet.jrun" 
       method="post" submit="ProductGroup={document.generic.ProductGroup}&amp;
       ProductCategory={document.generic.ProductCategory}&amp;
       ClientType=VoicePortal"/>
  </block>
</form>
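The request in Listing Four carries three form parameters. As a rough sketch of what the voice browser submits, the Java below builds the equivalent URL-encoded body; the class and method names are hypothetical, and only the parameter names and the VoicePortal value are taken from the listing.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: the form-encoded body for the product search request. */
final class SearchRequest {
    /** Collects the three parameters in the order used by Listing Four. */
    static Map<String, String> productSearch(String group, String category) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("ProductGroup", group);
        params.put("ProductCategory", category);
        params.put("ClientType", "VoicePortal");
        return params;
    }

    /** Joins name=value pairs with '&', URL-encoding each name and value. */
    static String formBody(Map<String, String> params) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return body.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(formBody(productSearch("BOOKS", "BKS-ARCH")));
    }
}
```

The same body could be sent by any HTTP client, which makes it easy to exercise the presentation servlet without placing a phone call.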


Listing Five

<?xml version="1.0"?>
<ProductList>
  <Product>
    <ID>7</ID>
    <Name>Invisible New York : The Hidden Infrastructure of the City 
                                      (Creating the North American</Name>
    <ShortDescription>Invisible New York : The Hidden Infrastructure of 
                   the City (Creating the North American</ShortDescription>
    <Cost>21.0</Cost>
    <Active>false</Active>
    <CreatedOn>2000-02-01 00:00:00.0</CreatedOn>
    <LastUpdateOn>2000-02-01 00:00:00.0</LastUpdateOn>
    <ImageURL>http://www.trcinc.com/mobile/images/
                                  books/invisiblenewyork.gif</ImageURL>
  </Product>
 ... 


Listing Six

XSLTProcessor processor = XSLTProcessorFactory.getProcessor();
processor.process( new XSLTInputSource( productDataInputStream ),
                  new XSLTInputSource( stylesheetInputStream ),
                  new XSLTResultTarget( transformationOutputStringWriter ) );
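
Listing Six uses the Xalan processor classes of the time. For comparison, here is a sketch of the same transformation step written against the standard JAXP javax.xml.transform interfaces. The inline stylesheet and the VoiceXML it emits (including the ProductResults form ID) are simplified assumptions for illustration, not the article's stylesheet, which is available electronically.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Sketch: transforms ProductList XML into VoiceXML-style markup with XSLT. */
final class VoiceXmlTransform {
    // Hypothetical, heavily simplified stylesheet: one <audio> per product.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/ProductList\">"
      + "<vxml><form id=\"ProductResults\"><block>"
      + "<xsl:for-each select=\"Product\">"
      + "<audio><xsl:value-of select=\"Name\"/>, "
      + "<xsl:value-of select=\"Cost\"/> dollars.</audio>"
      + "</xsl:for-each>"
      + "</block></form></vxml>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the given stylesheet text to the given XML text. */
    static String transform(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSL transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<ProductList><Product><ID>7</ID>"
                   + "<Name>Invisible New York</Name><Cost>21.0</Cost>"
                   + "</Product></ProductList>";
        System.out.println(transform(xml, STYLESHEET));
    }
}
```

Swapping in a different stylesheet string is all it takes to target another client type, which is the point of keeping presentation in XSL rather than in servlet code.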


