Call Control XML & The Voice Conference Manager


April, 2005: Call Control XML and The Voice Conference Manager

Using CCXML to take charge of your phone system

Moshe is a speech technology consultant and can be contacted at [email protected] and http://www.Disaggregate.com/.


While most office telephones let you set up calls, there's an important limitation: If you start the call and then you hang up, the call ends and everyone else is cut off. In telephony jargon, these telephones have "first-party call control" because the "first party"—the person who originates the call—must participate or the call ends. Telephone operators don't have this limitation. When operators run conference calls, they place each party into the conference and then drop out of the call, but the call goes on. The operator is exercising "third-party call control," a capability that is crucial for advanced telephone services.

To build an application that uses third-party call control, you'll need two things: a telephony server to control the telephone network and an API to control the telephony server. Call Control XML (CCXML) is an API for third-party call control published by the World Wide Web Consortium's Voice Browser Working Group (http://www.w3.org/Voice/). CCXML is an outgrowth of the VoiceXML speech recognition API. Besides providing call control, CCXML can attach a VoiceXML server to the call to provide speech recognition, text-to-speech, and voice biometrics. CCXML can also send/receive data to any server that supports standard Internet protocols.

In this article, I present Voice Conference Manager (VCM), a CCXML/VoiceXML application that initiates telephone conferences. VCM can be found at http://vcm.sourceforge.net/. (The complete source code and related files are available electronically; see "Resource Center," page 5.) Figure 1 shows the VCM architecture. The telephony server's CCXML interpreter interfaces with the Public Switched Telephone Network (PSTN) and with SIP-based telephony. Speech recognition and text-to-speech are provided by the VoiceXML server. A database server contains the names and phone numbers of parties who can be included in conference calls; it also contains a list of parties authorized to initiate conference calls and any security-related information. A web server uses CGI to query the database server and transform the answers into appropriate formats: CCXML and VoiceXML scripts, as well as other CCXML-specific formats.

Rather than maintain my own CCXML and VoiceXML servers, I usually develop on publicly accessible servers. At the time of this writing, only Voxeo (http://community.voxeo.net/) offers public CCXML servers; a current list of other public telephony servers can be found through the VCM web site. In addition to the usual amenities, Voxeo allows both script-initiated and HTTP-initiated outbound telephone calls. Voxeo also supports inbound and outbound calls over SIP, which means I can test the system to my heart's content without tying up my office telephone lines.

Like many telephony APIs, CCXML uses events to drive a state machine. Telephony favors event-driven APIs for a simple reason: We are interacting with human beings who do odd things such as hang up in the middle of a call (not to mention a sometimes-fallible telephone network), and a state machine model lets the application respond to events as they occur. Figure 2 shows two of the state machines used by CCXML, those for incoming and outgoing calls. Both are intuitive: "Alerting" means a telephone is trying to get someone's attention, when the call is successfully picked up there's "connected," and so on. CCXML also has other state machines, such as the one that governs conference objects. The fine points of the different state machines are explained in the official W3C documentation.
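To make the event-driven idea concrete, here is a minimal Python sketch of an incoming-call state machine. It is only an illustration, not the W3C definition: the text names just the "alerting" and "connected" states, so the "idle" starting state, the failure and disconnect transitions, and the table layout are my assumptions.

```python
# Simplified model of an incoming call as an event-driven state machine.
# Only "alerting" and "connected" come from the article; "idle", "failed",
# and "disconnected" (and the events leading to them) are assumptions.
INCOMING = {
    ("idle", "connection.alerting"): "alerting",
    ("alerting", "connection.connected"): "connected",
    ("alerting", "connection.failed"): "failed",
    ("alerting", "connection.disconnected"): "disconnected",
    ("connected", "connection.disconnected"): "disconnected",
}


def run(table, events, state="idle"):
    """Feed events through the table; an event with no entry leaves the state unchanged."""
    for event in events:
        state = table.get((state, event), state)
    return state
```

A caller who hangs up while the phone is still ringing simply drives the machine from "alerting" to "disconnected"; no special-case code is needed.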

But just responding to events of the state machine isn't enough to create an application because the application itself must have states—a hangup at the beginning of the application probably means a disgruntled customer, but a hangup at the end of the call is entirely normal. CCXML, therefore, incorporates a "user" state machine that defines the state of an application and that can be different for each script in the application. The interplay of all these different state machines requires rigorous attention to detail.

The Voice Conference Manager Application

Let's refer to the person who initiates the conference call as the "clerk." The clerk places a call to the application's telephone number and the call arrives at the telephony server. The telephony server consults its internal database to determine what to do with calls that arrive for that telephone number; it then creates an instance of the CCXML interpreter and passes it the URL of the application's initial CCXML script (Listing One).

Listing One(a) shows initial declarations of ECMAScript variables. The first few lines declare the names for the user-defined state machine; I recommend this design pattern, which prevents many hard-to-find typographical errors later in the script. Listing One(b)'s <eventprocessor> tag, of which there can be only one in any CCXML script, is a container for all tags that describe how the script responds to incoming events. The <eventprocessor> tag's statevariable attribute declares that the user variable callState contains the state information for the script's optional user-defined state machine; callState is initialized to the value s_init in Listing One(a).

The application runs by handling events, and <transition> tags define how events are handled. When an event arrives, the interpreter looks at each <transition> tag in script order and executes the first one that matches the event; if none match, the event is silently discarded—no need to fret over irrelevant events. A <transition> tag's attributes can restrict the events the tag will match. The state attribute restricts the <transition> tag to match only when the user-defined state is at some particular value. The event attribute selects which event or events the <transition> matches. The value of event can be a wildcard; event='error.*', for instance.
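The selection rule can be modeled in a few lines of Python. This is a rough sketch rather than the interpreter's actual algorithm: the tuple layout and handler names are hypothetical, and glob-style matching from Python's fnmatch module only approximates CCXML's event-pattern rules.

```python
from fnmatch import fnmatchcase

# Each entry: (required state or None for "any state", event pattern, handler name).
# The handler names are hypothetical labels used only for this illustration.
TRANSITIONS = [
    ("s_init", "connection.alerting", "answer"),
    ("s_init", "connection.disconnected", "quit"),
    ("s_welcomeBase", "connection.connected", "greet"),
    (None, "error.*", "log_error"),
]


def select_transition(transitions, state, event):
    """Return the handler of the first transition, in script order, that matches.

    Returns None when nothing matches, which models the event being discarded.
    """
    for wanted_state, pattern, handler in transitions:
        if wanted_state is not None and wanted_state != state:
            continue  # state attribute does not match the current user state
        if fnmatchcase(event, pattern):
            return handler
    return None
```

Because the scan stops at the first match, the order of the transitions matters: a broad wildcard placed early would shadow every more specific transition after it.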

Listing One(b) shows several <transition> tags. The first telephony-related event to arrive is connection.alerting. This event means that the script's metaphorical telephone is ringing, not unexpected because this script was activated by an incoming call. The first <transition> tag matches this event, and the tag responds by metaphorically answering the phone with the <accept> tag. Another valid response is to <reject> the call.

This same <transition> takes care of several other tasks. To verify that the caller is authorized to use the application, and to determine what services the caller is allowed to use, a <send> tag queries the CGI server (Figure 1) with the caller ID of the incoming call. (For a discussion on security and telephone applications, see my article "Voice Biometrics and Application Security," DDJ, November 2002.) How do you know the caller ID? CCXML events are objects, and the connection.alerting event has a callerid attribute. How is the attribute accessed? The <transition> tag's name attribute is used to define an alias for the incoming event; in this <transition> the alias is "evt." Therefore, the caller ID is available as "evt.callerid." The caller ID is sent to the CGI server as GET data by using the <send> tag's namelist attribute. The answer to the CGI query will arrive as an event, which we'll discuss shortly.

The second <send> tag sends a user-defined event; the target attribute has a value of session.id, which is how the interpreter identifies this instance, so the event comes to this script. The user event doesn't have to be declared in advance—a user event is automatically defined by sending the event. Send checkCallerID_timeout and you will receive user.checkCallerID_timeout. The goal of this event is to make certain the CGI script responds within a reasonable time, so the <send> tag's delay attribute delays the event by 20 seconds (20,000 milliseconds). If the event arrives and the CGI server hasn't responded, the script can use this event to take appropriate action.
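The timeout pattern is easy to model outside CCXML. The Python sketch below uses a simulated millisecond clock and a priority queue; the EventQueue class and the check_caller function are hypothetical names of my own, and the code illustrates only the race between the reply and the delayed event, not how any interpreter schedules events.

```python
import heapq


class EventQueue:
    """Events ordered by delivery time on a simulated millisecond clock."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker: same-time events are delivered in send order

    def send(self, name, now_ms, delay_ms=0):
        heapq.heappush(self._heap, (now_ms + delay_ms, self._seq, name))
        self._seq += 1

    def drain(self):
        """Yield (time_ms, name) pairs in delivery order."""
        while self._heap:
            when, _, name = heapq.heappop(self._heap)
            yield when, name


def check_caller(reply_after_ms, timeout_ms=20000):
    """Return 'ok' if the reply is delivered before the timeout event, else 'timeout'.

    reply_after_ms is None when the server never answers.
    """
    queue = EventQueue()
    queue.send("user.checkCallerID_timeout", now_ms=0, delay_ms=timeout_ms)
    if reply_after_ms is not None:
        queue.send("user.calleriddata", now_ms=0, delay_ms=reply_after_ms)
    for _, name in queue.drain():
        if name == "user.calleriddata":
            return "ok"
        if name == "user.checkCallerID_timeout":
            return "timeout"
    return "timeout"
```

Whichever event is delivered first decides the outcome, which is exactly why the delayed event is sent at the same moment as the query.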

The <transition>'s final task is to change the user-state machine to its next state. To change the state, simply change the value; the <assign> tag sets callState to s_welcomeBase. The next event to arrive (which might already be waiting) will be processed according to this new state; in other words, events are processed according to the current state of the script, not the state when they arrive at the interpreter.

Once the script is in the s_welcomeBase state, there's a bit of a quandary: The script should accomplish several tasks before leaving the state, and while the best way to do this is to maintain some memory about the state, memory taints a purely event-driven design. One way to avoid state memory is to create several more states, and another is to use a separate script running in a separate instance of the interpreter. But the expected events won't arrive in any particular order, which argues that the application should stay in one state and use a few flags, and that's how the script is designed. The state s_welcomeBase accomplishes these tasks:

  • When the call is established (that is, when the event connection.connected is received), it starts a VoiceXML script using the <dialogstart> tag. The VoiceXML script greets the clerk and asks the clerk to wait a moment; see Listing One(c).
  • When the event dialog.exit arrives (it is sent when the VoiceXML script is over), it examines flags. If all other tasks are finished, it goes to the next state; see Listing One(d).
  • The response from the CGI query sent in the previous state can arrive at any time. The correct format of the response from the CGI script is in Listing Two. The first line, calleriddata, generates the event user.calleriddata. The next two lines are data that accompany the event. The script checks the event's valid attribute to see if the caller ID is recognized, then examines flags to see if it's time to transition to the next state; see Listing One(e).
  • If the attempt to query the CGI script fails (that is, if the script receives an error.send event) or the timeout event arrives first, the script exits with <exit/>; see Listing One(f).
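The "one state plus a few flags" approach can be sketched in Python as follows. This is a simplified model, not a translation of Listing One: the WelcomeBase class and the "exited" pseudo-state are my own inventions, while the event names and the s_getConfereeList target state are taken from the listing.

```python
class WelcomeBase:
    """Tracks the two tasks that must both finish before leaving s_welcomeBase."""

    def __init__(self):
        self.state = "s_welcomeBase"
        self.finished_greeting = False
        self.got_data = False
        self.sent = []  # events the script sends to itself

    def handle(self, event, valid=True):
        if self.state != "s_welcomeBase":
            return  # transitions for this state no longer apply
        if event == "dialog.exit":
            self.finished_greeting = True
        elif event == "user.calleriddata":
            if not valid:
                self.state = "exited"  # unrecognized caller ID
                return
            self.got_data = True
        elif event in ("error.send", "user.checkCallerID_timeout"):
            self.state = "exited"
            return
        else:
            return  # unmatched events are discarded
        # Whichever task finishes last moves the script to the next state.
        if self.finished_greeting and self.got_data:
            self.state = "s_getConfereeList"
            self.sent.append("nextstate")
```

The outcome is the same whether the greeting or the CGI reply finishes first, which is the point of keeping flags instead of adding a state for every possible ordering.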

Unless there's been an error, just before leaving s_welcomeBase, you send the event nextstate. The event is needed because no other events are expected in the next state, and without an event, none of that state's <transition> tags will execute. Of course, there's an alternative design: You could move a few tasks from the next state into s_welcomeBase, then change states and wait for the events. However, I prefer this design pattern. If the tasks were in s_welcomeBase, the code would need several identical copies of the tasks, one for each <transition> tag that can change the state. The nextstate design pattern requires only a single copy of the tasks. It's cleaner and less error prone.

In Listing One(g), the <dialogstart> tag launches a VoiceXML script. The <dialogstart> tag's namelist attribute passes ECMAScript variables to the VoiceXML script as GET data. Unfortunately, the current VoiceXML interpreter does not do anything with this GET data. If dynamic data are truly necessary (as they would be, for example, if you try to restrict users' access to various functions and want the menu choices to reflect those restrictions), then the VoiceXML script must be dynamically generated on a server that can accept GET or POST data. The connectionid attribute is mandatory here; it tells the VoiceXML script which telephone call to attach to. A call is implicit when the <transition> is in response to a telephony-related event, but not when it's in response to a user-defined event.

The VoiceXML script (Listing Three) solicits the clerk's choice of what to do during the call; the service allows the choices of "start conference call," "add new users," and "edit list of users." The <exit> tag's namelist attribute sends an ECMAScript variable with the result from the VoiceXML script to the CCXML interpreter. Happily, the CCXML interpreter can accept data sent this way and the data are accessible to the script as attributes of the event object.

The clerk is passed to a different VoiceXML script and is prompted to speak either the names or telephone numbers of parties to invite to the conference call. Once all names are gathered, the service is ready to start the conference call; the service politely hangs up on the clerk—third-party call control, remember? To create the conference call, the service dials out to each party, and for each of these "call legs" the service must:

  • Initiate an outbound call to the party.
  • Connect the call if the party answers; otherwise redial or report that the party is not connected.
  • Play an announcement to each party about the conference call.
  • Place each party into the conference.

This is a series of complex actions for each call leg—how should the application handle this problem? One way would be to track the progress of each call leg in a series of arrays, modifying the state of each leg as each receives its events. While feasible, and somewhat tempting at first, this is a dangerous move—if you make that choice, you quickly end up duplicating the state-tracking functions of the CCXML interpreter itself. When designing CCXML applications, it's always best to let the interpreter handle state machines on your behalf—that's what the interpreter is there for.

Instead, the application launches multiple instances of the CCXML interpreter. In Listing One(h), the <fetch> tag preloads a CCXML "conference manager" script; <createccxml> starts the script as a new instance, and the conference manager script manages the actual conference while the original script exits. Listing Four(a) shows the conference manager script creating a conference object; Listing Four(b) shows the conference manager using <fetch> and <createccxml> to spawn multiple simultaneous instances of a "call leg manager" script to create and track call legs. By using a separate instance for each call leg, the problem of managing multiple call legs reduces to that of writing a simple script for a single call leg.

Listing Five(a) shows a fragment of the call leg manager script. The script is called by using the URL of the script, but the present CCXML interpreter cannot pass data included as GET or POST into the script. Instead, the conference manager script sends the call leg script an event ("user.info") with the phone number of the party to call and the conference object to join, neatly avoiding the necessity of using CGI to dynamically generate the script. The <createcall> tag calls the party, and if the party answers, the party receives an announcement and is placed into the conference call via the <join> tag; Listing Five(b).
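The benefit of one instance per call leg shows up clearly in a small Python model. This is a sketch under stated assumptions: the CallLeg class, the start_conference helper, and the "cl_joined" state are hypothetical, a Python set stands in for the conference object, and no real call is placed. The event names and the cl_init and cl_callInProgress states follow Listing Five.

```python
class CallLeg:
    """One call leg with its own private state, like one interpreter instance."""

    def __init__(self):
        self.state = "cl_init"
        self.phone = None
        self.conf = None

    def handle(self, event, **data):
        if self.state == "cl_init" and event == "user.info":
            self.phone = data["currentPhone"]
            self.conf = data["confID"]
            # a real script would place the outbound call here
        elif self.state == "cl_init" and event == "connection.connected":
            self.state = "cl_callInProgress"  # the announcement starts playing
        elif self.state == "cl_callInProgress" and event == "dialog.exit":
            self.conf.add(self.phone)  # stands in for joining the conference
            self.state = "cl_joined"   # hypothetical state name


def start_conference(phones):
    """Create one CallLeg per phone number and tell each which conference to join."""
    conference = set()
    legs = []
    for phone in phones:
        leg = CallLeg()
        leg.handle("user.info", currentPhone=phone, confID=conference)
        legs.append(leg)
    return conference, legs
```

Each leg advances only when its own events arrive, so a slow or unanswered party never complicates the logic for the others.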

How does the call end? In this version of Voice Conference Manager, we simply wait for the parties to hang up or for a fixed timer to expire. As each party hangs up, its call leg manager sends an event to the conference manager. When the count of connected parties drops to one or the fixed timer expires, the conference manager sends a "dropout" event to the remaining call leg manager(s). Each call leg manager removes its party from the conference, as in Listing Five(c), plays an announcement, sends a return event back to the conference manager, and exits. The conference manager then destroys the conference object and exits. Since all application-related instances of CCXML and VoiceXML have now ended, the application is over.
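The termination rule reduces to a small amount of bookkeeping, sketched here in Python. The ConferenceManager class and its method names are hypothetical, and the sketch skips the return events that the real conference manager waits for before destroying the conference object.

```python
class ConferenceManager:
    """Decides when to send 'dropout' to the call legs still in the conference."""

    def __init__(self, legs):
        self.connected = set(legs)
        self.dropouts_sent = []  # legs that were told to drop out, in order
        self.finished = False

    def on_hangup(self, leg):
        """A call leg manager reports that its party hung up."""
        self.connected.discard(leg)
        if len(self.connected) <= 1:
            self._drop_remaining()

    def on_timer_expired(self):
        """The fixed conference timer fired."""
        self._drop_remaining()

    def _drop_remaining(self):
        for leg in sorted(self.connected):
            self.dropouts_sent.append(leg)
        self.connected.clear()
        self.finished = True  # the conference object would be destroyed here
```

With three parties, the first hangup changes nothing; the second leaves a single party connected, so that remaining leg receives the dropout event.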

There are a few final design notes. While developing scripts, give each one its own "zombie timer" by sending a user-defined event with a delay; if it fires, the script exits, which catches runaway processes. As for service improvements, if the session ID of a conference in progress were sent to the database server, the clerk could call back and either join the call or add new conferees. The voice-user interface must be improved and the program must expand to include other usage scenarios. But most of all, there's no particular reason why the CCXML script has to be started by the clerk's telephone call; it's just as easy, if not easier, to control the conference call via a web page.

Why CCXML Is Important

VoiceXML and CCXML are the kind of technologies that start revolutions: They both break apart (disaggregate) the pre-existing technological infrastructure. VoiceXML and CCXML break the connection between the telephony server and the location of the application—applications are no longer locked away on inaccessible servers. Because they are standards based, the applications are no longer tied to a particular vendor's API or hardware—applications become portable across servers. Because the standards are based on XML, the APIs maintain synergy with the familiar web tools (CGI, PHP, and so on), and developers can dynamically integrate Internet-based information resources into a telephony application. The creative and commercial potential is enormous.

In just a few short years, VoiceXML has achieved very respectable market acceptance. It's available on commercial platforms for in-house deployment and as a hosted solution from service providers; it's even been integrated into routers for IP telephony. In other words, VoiceXML has proven a very successful model for speech technology services, and CCXML, as a companion specification, is likely to achieve similar success in short order. If you're considering telephony applications, consider CCXML.

DDJ



Listing One (a)

<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0">
<!-- list of symbolic substate names: -->
<var name="s_init"                  expr="'s_init'"/>
<var name="s_welcomeBase"           expr="'s_welcomeBase'"/>
<var name="s_findUserRequest"       expr="'s_findUserRequest'"/>
<!-- user event names -->
<var name="info"                    expr="'info'"/>
<var name="nextstate"               expr="'nextstate'"/>
<!-- Vars used throughout -->
<var name="callState" expr="s_init"/>
<!-- URLs of VoiceXML scripts -->
<var name="vxml_prefix" expr="''"/>
<var name="vxml_type" expr="'application/xml+vxml'"/>
<var name="vxml_greeting" expr=" vxml_prefix + 'greeting.vxml'"/>
<!-- URLs of CGI scripts -->
<var name="url_prefix" expr="'http://www.example.com/cgi-bin/vcm/'"/>
(b)
<eventprocessor statevariable="callState">
<transition state="s_init" event="connection.alerting" name="evt">
    <accept/>
    <send event="url_event" target="url_CheckCallerID" 
                     name="'checkCallerID'" namelist="evt.callerid"/>
    <send event="'checkCallerID_timeout'" target="session.id" 
                       delay="20000" name="'checkCallerID_timeout'"/>
    <assign name="callState" expr="s_welcomeBase"/>
</transition>
<!-- If call disconnects -->
<transition state="s_init" event="connection.disconnected">
    <log expr="'Base call disconnected, exiting, state=' + callState"/>
    <exit/>
</transition>   
(c)
<transition state="s_welcomeBase" event="connection.connected" name="evt">
    <log expr="'Base call connected'"/>
    <dialogstart src="vxml_greeting" type="vxml_type"/>
</transition>
(d)
<transition state="s_welcomeBase" event="dialog.exit" name="evt">
    <if cond="s_welcomeBase_gotdata == 1">
        <assign name="callState" expr="s_getConfereeList"/>
        <send event="nextstate" target="session.id" name="'nextstate_2'"/>
    </if>
    <assign name="s_welcomeBase_finishedgreeting" expr="1"/>
</transition>
(e)
<transition state="s_welcomeBase" event="user.calleriddata" name="evt">
    <if cond="evt.valid != 'True'">
        <exit/>
    </if>
    <var name="grammar_menu" expr="evt.grammar_menu"/>
    <if cond="s_welcomeBase_finishedgreeting == 1">
        <assign name="callState" expr="s_getConfereeList"/>
        <send event="nextstate" target="session.id" name="'nextstate_1'"/>
    </if>
    <assign name="s_welcomeBase_gotdata" expr="1"/>
</transition>
(f)
<transition state="s_welcomeBase" event="error.send.*" name="evt">
    <log expr="'ERROR: ' + evt.error + ' eventid: ' + evt.eventid"/>
    <exit/>
</transition>
<transition state="s_welcomeBase" event="user.checkCallerID_timeout" name="evt">
    <exit/>
</transition>
(g)
<transition state="s_findUserRequest" event="user.nextstate" name="evt">
   <dialogstart src="vxml_menu" type="vxml_type" 
         connectionid="base_connectionid" namelist=
            "grammar_menu s_findUserRequest_count"/>
</transition>
(h)
<transition state="s_makeConfObject" event="user.nextstate" name="evt">
    <fetch next="ccxml_conf_main" fetchid="foo"/> 
</transition>
<transition state="s_makeConfObject" event="fetch.done" name="evt">
    <createccxml fetchid="evt.fetchid" sessionid="conf_session" />
    <send event="info" target="conf_session" 
                              namelist="phoneList phoneListCount"/>
    <dialogstart src="vxml_confstarted" type="vxml_type" 
                                  connectionid="base_connectionid"/>
</transition>


Listing Two
calleriddata
valid=True
grammar_menu=grammar.xml
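
To show how a reply in this format maps onto an event, here is a small Python sketch. The parse_response function is hypothetical and only mirrors the description in the article: the first line names the event (delivered with a "user." prefix), and each following line is a key=value pair that becomes an attribute of the event.

```python
def parse_response(text):
    """Split a reply into (event name, data dict).

    The first non-blank line names the event; every later line must be key=value.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty response")
    name, data = lines[0], {}
    for line in lines[1:]:
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {line!r}")
        data[key] = value
    return "user." + name, data
```

Note that the values arrive as strings, which is why Listing One(e) compares evt.valid against the string 'True' rather than a Boolean.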


Listing Three
<?xml version="1.0"?>
<vxml version="2.0">
<menu>
    <prompt>
        Please choose from the following options: <enumerate/>
    </prompt>
    <choice event="choice" message="conference">
        Start conference call
    </choice>
    <choice event="choice" message="add">
        Add new users
    </choice>
    <choice event="choice" message="edit">
        Edit list of users
    </choice>
    <noinput>
        Please choose from the following options: <enumerate/>
    </noinput>
    <nomatch>
        Please choose from the following options: <enumerate/>
    </nomatch>
</menu>
<!-- find choice by catching event -->
<catch event="choice">
    <var name="menu_choice" expr="_message"/>
    <exit namelist="menu_choice"/>
</catch>
</vxml>


Listing Four (a)
<transition state="cm_makeConfObject" event="user.nextstate">
    <createconference conferenceid="confID"/>
</transition>
(b)
<transition state="cm_addConferees" event="user.addconferee">
    <!-- Check to see if more to add -->
    <if cond="cm_addConfeerecm_listPtr &lt; phoneListCount">
        <!-- Yes. get phone number (work around lack of ECMAScript array) -->
        <assign name="currentPhone" 
                    expr="phoneList.substr(11*cm_addConfeerecm_listPtr,10)"/>
        <fetch next="ccxml_conf_legs" fetchid="foo"/> 
                                      <!-- get ready to create call leg -->
    <else/>
        <!-- No, all added, next state -->
        <assign name="cm_callState" expr="cm_confInProgress"/>
        <send event="nextstate" target="session.id"/>
    </if>
</transition>
<!-- after fetch, start instance -->
<transition state="cm_addConferees" event="fetch.done">
    <createccxml fetchid="evt.fetchid" sessionid="singleCallLegSessionID" />
    <!-- send phone number, conference object to instance just started -->
    <send event="info" target="singleCallLegSessionID" 
                                         namelist="currentPhone confID home"/>
    <!-- check if more calls to add. -->
    <send event="addconferee" target="session.id"/>
</transition>
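
The substr arithmetic in Listing Four(b) implies that the phone list is one string of 10-digit numbers, each followed by a single separator character. The Python sketch below illustrates that packing; the function names are hypothetical and the comma separator is an assumption, since the listing does not show which character is used.

```python
def pack_phones(phones, sep=","):
    """Join 10-digit phone numbers into one string with 1-character separators."""
    if len(sep) != 1:
        raise ValueError("separator must be a single character")
    for phone in phones:
        if len(phone) != 10 or not phone.isdigit():
            raise ValueError(f"not a 10-digit number: {phone!r}")
    return sep.join(phones)


def phone_at(phone_list, index):
    """Equivalent of phoneList.substr(11 * index, 10) in ECMAScript."""
    start = 11 * index  # 10 digits plus 1 separator per entry
    return phone_list[start:start + 10]
```

The fixed-width layout breaks as soon as a number is not exactly 10 digits long, so international numbers would need a different encoding.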


Listing Five (a)
<transition state="cl_init" event="user.info" name="evt">
    <assign name="destPhone" expr="evt.currentPhone"/>
    <assign name="conf" expr="evt.confID"/>
    <assign name="home" expr="evt.home"/>
    <createcall dest="destPhone"/> 
</transition>
(b)
<!-- play welcome message -->
<transition state="cl_init" event="connection.connected" name="evt">
    <!-- play an announcement as each is connected -->
    <dialogstart src="vxml_youarejoining" type="vxml_type"/>
    <assign name="cl_callState" expr="cl_callInProgress"/>
    <assign name="thisCall" expr="evt.callid"/><!--should be "connectionid"-->
</transition>
<!-- join into the conference -->
<transition state="cl_callInProgress" event="dialog.exit">
    <join id1="thisCall" id2="conf"/>
</transition>
(c)
<transition state="cl_dropout_start" event="user.dropout">
    <unjoin id1="thisCall" id2="conf"/>
</transition>