VoiceXML and the Voice/Web Environment

Oct01: Programmer's Toolchest

Lee Anne is the author of Using XML, Special Edition (Que, 2001) and Practical HTML (Que, 1999). She works as a freelance author and consultant and can be contacted at

From the time telephones were invented in 1876, people have wanted to use them to gather timely, up-to-date information and request real-time services. But even then, live operators were expensive and telephone companies soon realized that many of these requests could be automated. Consequently, it didn't take long for people to start fiddling with systems to let users do things such as set their watch by listening to a system of coded clicks and buzzes.

Nevertheless, data services were viewed as adjuncts to basic telephone service and applications had to be produced on specialized hardware designed to interface with the telephone network. Since then, the only thing that has really changed is the sophistication of applications, which now range from interactive voice response to automatic call distribution and outbound dialing systems. All of these require expensive hardware, programming talent familiar with the Byzantine complexity of telco protocols, and regulatory wizards skilled in dealing with the local telephone authorities and tariffs in the service area. In the case of nationwide or worldwide service areas, some of these tasks are nearly monumental.

Merging Voice and Data Networks

The Internet, however, is making inroads into the public switched telephone network (PSTN), as the same shared network medium is being used for everything from standard telephone calls to stock reports and online banking. With Voice-over-IP protocols (H.323, SIP, MGCP, H.248/MEGACO, and the like), you can now assign IP addresses to phones, bypassing the PSTN completely. In short, the distinction between telephones and data terminals is disappearing. Consequently, the next frontier for developers involves merged voice and data applications for the telephone. And because most wireless devices know approximately where you are when you use them, the information presented can be customized to an unprecedented degree.

The XML family of standards from the World Wide Web Consortium (W3C) helps to simplify this customization by letting you access a coherent set of tools that address the entire range of web applications, from specialized electronic data interchange formats to data definition languages to device-class-specific information rendering vocabularies. The power of XML lies not so much in individual languages but in the entire package, since all parts of the XML-related standards are designed to work well with each other and complement one another's strengths. So it's no longer necessary to choose the best compromise between competing tools. You can pick the best tool for each task, letting XML-aware processors determine the capabilities of each target device and select the control flow and rendering engine that use the XML vocabulary designed to support that target.

XML is a tree-structured language construction system that allows vocabularies of arbitrary complexity to be easily created. Elements of the language are primarily simple pairs of <tag>...</tag>, similar to the common HTML tags. Each element can carry attributes that apply within its own scope, and each has a content model that describes the sorts of elements or other content allowed to nest within it. Because each element carries its own content model, content that is not permitted directly within one element may be allowable within another element nested inside it.
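
To make the idea of per-element content models concrete, here is a minimal sketch using a made-up three-element vocabulary. The element names and the inline DTD are assumptions for illustration only; they are not taken from VoiceXML or any other W3C vocabulary. The DTD says that a menu must contain one prompt followed by one or more choice elements, while prompt and choice may hold only text.

```xml
<?xml version="1.0"?>
<!-- Hypothetical toy vocabulary, declared inline for illustration only -->
<!DOCTYPE menu [
  <!-- Content model: exactly one prompt, then one or more choices -->
  <!ELEMENT menu (prompt, choice+)>
  <!-- prompt and choice may contain only character data -->
  <!ELEMENT prompt (#PCDATA)>
  <!ELEMENT choice (#PCDATA)>
  <!-- Attribute local to choice: where to go when it is selected -->
  <!ATTLIST choice next CDATA #REQUIRED>
]>
<menu>
  <prompt>Say time or weather.</prompt>
  <choice next="#Time">time</choice>
  <choice next="#Weather">weather</choice>
</menu>
```

A validating parser would reject a choice placed inside prompt, even though choice is legal one level up, inside menu.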

Specialized vocabularies have already been defined for many needs. One in particular, VoiceXML, is well suited for telephone application development. VoiceXML has mechanisms available for speech recognition and text-to-speech (TTS) rendering, as well as TouchTone (DTMF) recognition capabilities and the ability to play prerecorded audio files. Human speech is much more pleasant to the ear than current TTS, so most applications use as much human voice as possible.
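
The fragment below is a rough sketch of how these capabilities might appear together in a single VoiceXML 1.0 form. The form name, field name, and audio file name are hypothetical. It relies on digits, one of the built-in field types in VoiceXML 1.0, which can be satisfied either by speaking or by pressing keys.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="GetCode">
    <!-- Built-in "digits" type: accepts spoken digits or TouchTone key presses -->
    <field name="code" type="digits">
      <prompt>
        <!-- Prerecorded audio is preferred; the enclosed text is the TTS
             fallback if the file cannot be played -->
        <audio src="enter-code.wav">Please say or key in your code.</audio>
      </prompt>
      <filled>
        <!-- Read the recognized value back to the caller using TTS -->
        <prompt>You entered <value expr="code"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```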

VoiceXML is currently under development as the basis of the W3C's Dialog Markup Language (DML). Vendors are nonetheless deploying VoiceXML rapidly, even before the standards process is final, and with full knowledge that many areas of concern have to be addressed before DML is completely suited to the rapidly evolving world of web-aware telephony. The commercial pressures to deliver working products now, as opposed to later, mean that no one is willing to wait and see.

Rapid Voice Application Deployment Platforms

To speed development, the emerging class of "voice service providers" (VSPs) supply platforms and perform needed telephone call handling, letting you concentrate on creating applications. But call handling is only part of the story. You also have to create aural pages that perform the interaction between users and applications, just as with any web application, except aural interactions are inherently linear, without the ability to present and handle the large two-dimensional matrix of choices on visual web pages. Instead of dozens (or even hundreds) of potential paths accessible from a visual page, users on aural pages can only keep track of a few links — a maximum of five is the number most often recommended, and three to four is better. This means that user interaction may have to be redesigned. The most natural metaphor for aural human/machine conversation involves flowcharts.
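
As an illustration of an aural page that stays within that limit, here is a minimal VoiceXML 1.0 menu offering three choices. This is a hand-written sketch, not output from any of the tools discussed; the form names and prompt wording are assumptions.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- dtmf="true" lets the platform assign keys 1, 2, and 3 to the
       choices in document order, so callers can speak or press a key -->
  <menu id="MainMenu" dtmf="true">
    <prompt>Main menu. Please say one of: <enumerate/></prompt>
    <choice next="#Time">time</choice>
    <choice next="#Weather">weather</choice>
    <choice next="#Goodbye">goodbye</choice>
    <!-- Repeat the short list of choices if the caller is silent or unclear -->
    <noinput>I didn't hear you. Please say one of: <enumerate/></noinput>
    <nomatch>I didn't understand. Please say one of: <enumerate/></nomatch>
  </menu>
  <form id="Time">
    <block>The time would be spoken here. <goto next="#Goodbye"/></block>
  </form>
  <form id="Weather">
    <block>The weather would be spoken here. <goto next="#Goodbye"/></block>
  </form>
  <form id="Goodbye">
    <block>Goodbye.</block>
  </form>
</vxml>
```

Keeping the list this short means the whole menu can be read out by enumerate in a few seconds, which is about as much as a listener can hold in mind.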

Flowcharts are a metaphor used by VoiceXML-based design tools similar to those I discuss in this article. Flowcharts replace the text-based coding methods common to the first attempts at creating a VoiceXML programming environment. Among the graphical telephony design tools that support flowcharting are Voxeo's Visual Designer, Covigo's Studio, Nuance V-Builder, IBM WebSphere Studio, VoiceGenie Developer Workshop, VBVoice 4.31, VBVoice 4.0, and iConverse Mobile Studio 2.0.

To illustrate how you use tools such as these to build VoiceXML-based applications, I'll focus on two of these toolkits that I've used — Covigo Studio and Voxeo Visual Designer 2.0.

Covigo Studio

Covigo Studio is a Java-based tool that supports telephony language and platform choices as plug-in components. Consequently, the environment supports round-trip VoiceXML, CallXML, cHTML, W-HTML, WML, and WAP development for any platform.

In fact, one application can support multiple access devices with Covigo-supplied platform logic controlling device detection, conditional execution, output media type, and other needed housekeeping tasks. To make it as flexible as possible, you can drag-and-drop symbols representing arbitrary "glue" — adapters to external data sources such as HTML, XML, EJB, JSP, and voice objects, or CRM tools to connect voice applications to existing databases, web applications, and enterprise systems. The resulting code can be modified with XSL or JSP to create dynamic pages for almost any purpose imaginable.

While Covigo doesn't supply voice connectivity directly, the tool can automatically generate appropriate code for most common VSPs such as Nuance, Voxeo, BeVocal, or Tellme. In addition, Covigo applications can run on other platforms, supporting a wide variety of portable devices.

As an example of how you can use Covigo, I'll present a currency conversion application that can be used by telephone, WAP phone, network-capable PDA, or even a computer terminal. The program lets users speak-in or type-in a currency conversion and have the application respond with the equivalent in any of seven currencies either by voice or data display. The complete source code and related files are available electronically; see "Resource Center," page 5.

Covigo Studio is organized into three separate workspaces, allowing application flow, presentation, and integration to be performed simultaneously by different developers with different skill sets. Figure 1 shows the application flow workspace, where each spherical node in this directed graph represents an application state, while the arcs represent possible transitions between states. States with little square inserts at the upper left represent a dialog with the user or other interaction. Some of these states may allow decisions that cause the application to conditionally branch to other paths. The green box on the display is a cursor that highlights a selected state, and the box in the lower right shows the actual voice file supplied for the selected presentation dialog. Each resource used in the application can be readily assigned by means of these breakout dialog boxes. Application logic is implemented in easy-to-understand jump tables, as in Figures 2 and 3. The if condition is defined on the screen in Figure 2, and the corresponding jump table, based on the value of voiceToCurrency, is in Figure 3.
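
Purely as an illustration of what a jump table keyed on voiceToCurrency could look like once rendered as VoiceXML 1.0, here is a hand-written sketch. It is not Covigo's generated output; the currency values, the initial assignment, and the form names are all assumptions.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Assumed to have been set by an earlier dialog state -->
  <var name="voiceToCurrency" expr="'euro'"/>
  <form id="Dispatch">
    <block>
      <!-- Each branch corresponds to one row of the jump table -->
      <if cond="voiceToCurrency == 'euro'">
        <goto next="#ToEuro"/>
      <elseif cond="voiceToCurrency == 'yen'"/>
        <goto next="#ToYen"/>
      <else/>
        <!-- Default row: the value matched no known currency -->
        <goto next="#Unknown"/>
      </if>
    </block>
  </form>
  <form id="ToEuro"><block>Converting to euros.</block></form>
  <form id="ToYen"><block>Converting to yen.</block></form>
  <form id="Unknown"><block>Sorry, that currency is not supported.</block></form>
</vxml>
```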

Note the almost complete dissociation from any specific application language. The task is modeled as logical nodes and transitions in an abstract state machine, with the actual translation to specific languages or XML vocabularies performed as needed for any supported platform and device type.

Presentation elements on the left side of Figure 4 create all the user interface functionality needed for today's devices. If a new platform, type of device, or language is necessary, it's a simple matter to incorporate support for it. The executable code produced by Covigo Studio for this instantiation of the application is available on an associated web site.

Voxeo Visual Designer 2.0

Designed for any development project targeting the Voxeo or Nuance networks, Visual Designer from Voxeo is available at no cost. Combined with the Voxeo development tutorials and "jump start" applications, it makes it easy to get started in voice web development with projects such as the time and weather application under development in Figure 5.

The development desktop is divided into three main portions: a tool selection menu at the far left, the visual workspace in the middle, and a property inspector at the far right. The workspace shows one form, Time, highlighted by a blue box. The property inspector shows the possible VoiceXML properties and values that could be associated with that individual item. Only the ID value is filled in.

Since every nested element also has possible properties, the property inspector automatically changes to offer the appropriate fields when an element is selected. It's really quite straightforward.

In use, callers are asked whether they want to hear the time or the weather. Based on their choice, the application plays an appropriate response. The entries for "noinput" and "nomatch" in the left TalkingClock block handle the possibility that the user is tongue-tied or incoherent.

While this scheme could easily be either elaborated or simplified, it demonstrates the minimal useful features of VoiceXML and the simplicity of the Voxeo design tool.

Toward the middle of the left block, a field that will receive the user's input is defined, a grammar is identified to limit the possible responses, and the field is tested, with control passing to whichever of the two choices has been made. Green arrows show the alternative paths to the two possible destinations. From there, control passes to the Goodbye block, where a farewell message is spoken and the application exits.

The illustrated application doesn't actually do anything; it merely presents aural stubs to demonstrate the potential call flow. If the application were elaborated, it would access the system time or a weather database or recording to generate the actual response, possibly checking the incoming Caller ID, ANI, or DNIS to enable automatic localization of either time or weather. The visual call flow would then become more complex, however, and more difficult to present quickly.
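
As a hedged sketch of that elaboration, VoiceXML 1.0 exposes the caller's number and the dialed number as the session variables session.telephone.ani and session.telephone.dnis, where the platform supplies them. The form below passes both to a server-side script that could choose a time zone or forecast region. The URL, the script behind it, and the variable names are hypothetical placeholders.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="Localize">
    <!-- Copy the platform-supplied numbers into dialog variables;
         either may be undefined if the network does not provide it -->
    <var name="callerNumber" expr="session.telephone.ani"/>
    <var name="dialedNumber" expr="session.telephone.dnis"/>
    <block>
      <!-- Hypothetical URL: the server is assumed to return a VoiceXML
           page localized for this caller -->
      <submit next="http://example.com/weather"
              namelist="callerNumber dialedNumber"/>
    </block>
  </form>
</vxml>
```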

In contrast to the Covigo generalized approach, Voxeo's Visual Designer is tied more closely to the actual target language. This tool knows the structure of VoiceXML and will gray out any element that cannot be used within a particular structure. This helps when creating the program because it's impossible to make a mistake that would create an invalid document. Voxeo also supports CallXML, an XML vocabulary with features added to permit better call handling.

Visual design tools have to translate their displays into standard code to be able to generate applications that can be served over ordinary HTTP. The visual display in Figure 5 translates directly into the standard VoiceXML code in Listing One.

The <?voxeo-designer ... ?> processing instructions are placeholders that retain layout information for the visual display and don't affect the functionality of the code.

Since VoiceXML has the ability to throw and catch events, it's a fairly simple matter to interface to an actual database or external application. In some cases, application functionality can be enhanced by using PHP, Microsoft ASP, Java, ColdFusion, or scripting tools to generate VoiceXML dynamically rather than serving a static page. As VoiceXML becomes more integrated with other W3C standards, it will also be possible to use XSL/XSLT to access and modify a VoiceXML document directly.
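
The event mechanism can be sketched as follows, again assuming VoiceXML 1.0. The event name com.example.weather.unavailable is made up for illustration, and the throw here merely stands in for whatever would signal a failed lookup in a real application; the form and audio file names are borrowed from Listing One.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Document-level handler: applies to every form in this document -->
  <catch event="com.example.weather.unavailable">
    <audio src="noweather.wav"/>
    <goto next="#Goodbye"/>
  </catch>
  <form id="Weather">
    <block name="SayWeather">
      <!-- Stand-in for a failed lookup: signal the condition as an event -->
      <throw event="com.example.weather.unavailable"/>
    </block>
  </form>
  <form id="Goodbye">
    <block name="SayGoodbye">
      <audio src="goodbye.wav"/>
    </block>
  </form>
</vxml>
```

Because the handler sits at document level, any form added later can throw the same event without repeating the recovery logic.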


The two visual design tools examined here aren't the only ones available, and every such tool has both strengths and weaknesses. You may want to look at several tools from different vendors to determine which might be the most appropriate for your projects.


Listing One

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE vxml PUBLIC "-//Nuance/DTD VoiceXML 1.0b//EN"
<vxml lang="en" version="1.0">
   <form id="TalkingClock">
      <!-- Played when the caller says nothing -->
      <noinput>
         <audio src="noinput.wav"/>
      </noinput>
      <!-- Played when the response doesn't match the grammar -->
      <nomatch>
         <audio src="nomatch.wav"/>
      </nomatch>
      <field name="Destination">
         <prompt count="">
            <audio src="welcome.wav"/>
         </prompt>
         <?voxeo-designer collapse="true"?>
         <grammar src="VXML.gsl#Timegrammar"></grammar>
      </field>
      <!-- Branch to the form that matches the caller's choice -->
      <filled namelist="Destination">
         <if cond="Destination == &apos;time&apos;">
            <goto next="#Time"/>
         <elseif cond="Destination == &apos;weather&apos;"/>
            <goto next="#Weather"/>
         </if>
      </filled>
      <?voxeo-designer x="296" y="252" ?>
   </form>
   <form id="Time">
      <block name="SayTime">
         <audio src="notime.wav"/>
         <goto next="#Goodbye"/>
      </block>
      <?voxeo-designer x="287" y="433" ?>
   </form>
   <form id="Weather">
      <block name="SayWeather">
         <audio src="noweather.wav"/>
         <goto next="#Goodbye"/>
      </block>
      <?voxeo-designer x="522" y="397" ?>
   </form>
   <form id="Goodbye">
      <block name="SayGoodbye">
         <audio src="goodbye.wav"/>
      </block>
   </form>
</vxml>
