Moshe is a speech-technology consultant and can be contacted at [email protected] and http://www.Disaggregate.com/.
Security experts recommend a three-tier approach to proving identity: "something you have, something you know, something you are." At O'Hare Airport in Chicago, for instance, employees use badges ("something you have") and access codes ("something you know") to open security doors. But as I wait for flights, I've found it fairly easy to see the codes employees enter on the keypads, and employee badges can be stolen, found, and even forged.
Some airports have added the third tier of security: biometrics, which measure "something you are." For example, employees might place their hands on hand-geometry scanners. If the shape of the hand matches previous measurements, the system grants access. Other biometrics for security include retinal and iris scans, facial recognition, and fingerprint readers. But these biometric technologies have a drawback: They require specialized, expensive, and easily vandalized equipment.
Voice biometrics, however, are an excellent option for application security. Voice biometrics, which measure the user's voice, require only a microphone, a robust piece of equipment as close as the nearest telephone. In this article, I prototype an application that uses a telephone call to verify identity using freely available voice biometric resources that have simple APIs. Furthermore, the prototype can be easily integrated with Internet-capable applications.
Identify, Verify, Classify
Voice biometrics provide three different services: identification, verification, and classification. Speaker verification authenticates a claim of identity, similar to matching a person's face to the photo on their badge. Speaker identification selects the identity of a speaker out of a group of possible candidates, similar to finding a person's face in a group photograph. Speaker classification determines age, gender, and other characteristics. Here, I'll focus on speaker verification resources ("verifiers").
Older verifiers used simple voiceprints, which are essentially verbal passwords. During verification, the resource matches a user's current utterance against a stored voiceprint.
Modern verifiers create a model of a user's voice and can match against any phrase the user utters. This is a terrific advantage. First, ordinary dialogue can be used for verification, so an explicit verification dialogue may be unnecessary. Second, applications can challenge users to speak random phrases, which make attacks with stolen speech extremely difficult.
The prototype I present uses a telephony server to connect to the telephone network, a speech-technology server, and an application server to execute my code and control the other two servers; see Figure 1.
For the telephony server, speech-technology resource server, and application server, I use BeVocal's free developer hosting (http://cafe.bevocal.com/). BeVocal hosts VoiceXML-based applications. VoiceXML is an open specification from the W3C's "voice browser" working group (http://www.w3.org/Voice/). XML-based VoiceXML lets you write scripts with dialogues that use spoken or DTMF input, and text-to-speech or prerecorded audio for output. My scripts reside on the Internet and are fetched by the VoiceXML server via HTTP. Since the VoiceXML specification does not define a voice biometrics API, I used BeVocal's extensions to VoiceXML.
Another company that offers voice biometrics hosting is Voxeo (http://techpreview.voxeo.com/); Voxeo uses a different API. Voxeo lets you send tokens through HTTP to initiate calls from the VoiceXML server to users, which is convenient for web-based applications, not to mention more secure, as the application can easily restrict the calls to predefined telephone numbers. Both BeVocal and Voxeo offer free technical support, and they need to, because documentation is often sparse or incorrect. Loggers track script execution and report errors, but you'll need your sleuthing skills to uncover the actual errors.
Before users can use the verifier, the verifier must obtain a model of the user's voice; that is, users must enroll. During enrollment, users speak several phrases, usually similar to those used during verification. Listing One highlights the enrollment application (the complete source code and related files are available electronically; see "Resource Center," page 5).
Users' voice models are stored in a database at the VoiceXML server. Each developer has a separate database, and the developer assigns keys to each user. Generally, users speak or enter ID numbers, which act as the keys.
If users do not speak, the <noinput> tag is activated; if the user's utterance does not match the grammar (is not a four-digit number), the <nomatch> tag is activated. In either case, a counter decrements; when the counter drops to zero, I emulate transfer to a human agent. This counter defends the application against malicious users who tie up the server, and helps users who are having trouble.
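The counter logic described above can be sketched in ECMAScript, the scripting language VoiceXML uses for its expressions. This is only an illustration of the idea, not code from the prototype: the name makeAttemptCounter and the return values "reprompt" and "bounce" are assumptions chosen to echo the <reprompt/> tag and the #bounce target in Listing One.

```javascript
// Hypothetical helper that mirrors the totalAttempts counter in the listings.
function makeAttemptCounter(limit) {
  let left = limit; // attempts the caller still has

  return {
    // Record one <noinput> or <nomatch> event and say what to do next.
    fail() {
      left -= 1;
      return left <= 0 ? "bounce" : "reprompt";
    },
    // Restore the full budget before moving on to the next field.
    reset() {
      left = limit;
    },
    // Report how many attempts remain.
    remaining() {
      return left;
    }
  };
}
```

With a limit of three, the first two failures yield a reprompt and the third hands the caller off, which bounds how long any one caller can hold a speech resource.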
When user utterances match the grammar, the <filled> tag is activated, and <if> compares the utterance with the challenge. This ensures that the verifier hasn't inadvertently collected noise and mistaken it for a valid utterance, and that someone is not trying to spoof the system with prerecorded utterances. If the utterance matches the challenge, the application goes to the next step via the <goto> tag; if they do not match, the recognition result is reset via the <clear> tag, which causes the <register> to execute again. In the remainder of the enrollment application, users repeat a different four-digit number and current date.
To verify a user's identity, a user first claims an identity; in our case, by providing an ID number. Listing Two is the application after a user has made a claim.
The BeVocal API does not let you check whether the database of voice models actually contains the needed model. Instead, if the database key is incorrect, the server interrupts itself in midprompt when the mistake is discovered, which annoys me and users. Fortunately, a little judicious hacking solves the problem. Listing Two(a) starts with the <verify> tag, which activates the verifier and speech-recognition resource, and defines both a field to receive the results and the type of input expected. The identity claim is passed to the verifier via the keyExpr attribute. The <property> tag sets the total time to perform recognition to 1 ms, and the prompt is only 1-ms long. The first <catch> is processed if the key is in the database. The second <catch> is activated if the key is invalid; the user is sent back to the form that collects the ID number. As in enrollment, too many errors will send a user to an operator.
With a valid key, the system moves on to Listing Two(b). Variables are initialized with a random challenge phrase, and an announcement plays to users. The <verify> tag starts a verifier and speech-recognition resource, and users are asked to speak the four-digit challenge number. If users are silent (<noinput>) or say something other than a number (<nomatch>), <reprompt> reprompts them.
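The listings call fourDigitRandom() and fourDigitString() to build the challenge, but their bodies are not shown in this article. One possible implementation is sketched below; it is a guess at their behavior, not the author's code. It assumes the recognizer returns the digits as a string such as "0042", so the number is zero-padded before comparison. Math.random() is not a cryptographically strong source, so a production system would likely generate challenges elsewhere.

```javascript
// Assumed behavior: pick an integer from 0 to 9999 inclusive.
function fourDigitRandom() {
  return Math.floor(Math.random() * 10000);
}

// Assumed behavior: render the number as exactly four digits,
// padding with leading zeros so 42 becomes "0042".
function fourDigitString(n) {
  var s = String(n);
  while (s.length < 4) {
    s = "0" + s;
  }
  return s;
}
```

The padding loop avoids newer string helpers on purpose, since VoiceXML interpreters may only support an older ECMAScript edition.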
If users speak a number, the <filled> tag is activated, and <if> checks the number of attempts. If the user is still under the limit, the first <elseif> compares the utterance to the challenge number. If they are not the same, <clear> resets the results and users try again. Otherwise, <elseif> tags examine the decision of the verifier, which returns one of three confidence levels. If users are accepted, the transaction is approved; if users are decisively rejected, they are sent to operators for further assistance. If neither is true (if the result is "unsure"), users are sent to further <verify> tags (not shown) with a second or third round of challenge phrases. Users who cannot be verified are sent to operators.
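The branching inside <filled> can be restated as a single ECMAScript function, which makes the order of the checks easier to see. This is a paraphrase for illustration only: the function name nextStep and the labels "retry" and "next-challenge" are made up here, while "accepted", "rejected", and "denied" follow the values and form names used in Listing Two(b). The attemptsLeft argument is assumed to be the counter after it has been decremented for the current attempt.

```javascript
// Hypothetical restatement of the <if>/<elseif> chain in Listing Two(b).
function nextStep(attemptsLeft, heard, challenge, decision) {
  if (attemptsLeft <= 0) {
    return "denied"; // too many attempts
  }
  if (heard !== challenge) {
    return "retry"; // wrong digits: clear the field and reprompt
  }
  if (decision === "accepted") {
    return "accepted";
  }
  if (decision === "rejected") {
    return "denied";
  }
  return "next-challenge"; // verifier was unsure: ask for another phrase
}
```

Note that the verifier's decision is consulted only after the spoken digits match the challenge, so a confident match on the wrong phrase never approves a transaction.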
Threats, Hints, Pitfalls
Having worked in speech technology for many years, I regard the current level of achievement as almost magical. Speech recognizers have vocabularies of thousands of words, text-to-speech is very intelligible, and verification is highly reliable.
But I did say "almost." Like all biometric technologies, verifiers by their very nature are prone to error. I occasionally make mistakes guessing at who's speaking during a conference call; speech technology may well be better at guesses than I am, but I don't expect it to be perfect.
Verifiers have two main errors: false negative, denying valid users; and false positive, accepting impostors. Verifiers have a decision threshold that can be adjusted. The verifier can reject almost all impostors but at the cost of rejecting many valid users, or can accept all valid users at the cost of accepting more impostors. The false accept/false reject curves (Figure 2) intersect at the "equal error rate," and a goal of speech technologists is to make the equal error rate as low as possible. Tests with actual users will reveal the error rates for your application.
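To make the threshold trade-off concrete, the sketch below estimates both error rates from test data and looks for the threshold where they are closest. It assumes you have numeric match scores from your own trials (higher meaning "more likely the claimed speaker") and that a caller is accepted when the score is at or above the threshold. The hosted API described in this article reports only a three-level decision, so the scores, function names, and sample values here are all hypothetical.

```javascript
// Error rates at one threshold: accept when score >= threshold.
function errorRates(genuine, impostor, threshold) {
  const falseReject = genuine.filter(s => s < threshold).length / genuine.length;
  const falseAccept = impostor.filter(s => s >= threshold).length / impostor.length;
  return { falseAccept, falseReject };
}

// Try every observed score as a threshold and keep the one where the
// two error rates are closest; report their average as the estimate.
function estimateEqualErrorRate(genuine, impostor) {
  const candidates = [...genuine, ...impostor].sort((a, b) => a - b);
  let best = null;
  for (const t of candidates) {
    const r = errorRates(genuine, impostor, t);
    const gap = Math.abs(r.falseAccept - r.falseReject);
    if (best === null || gap < best.gap) {
      best = { threshold: t, gap, rate: (r.falseAccept + r.falseReject) / 2 };
    }
  }
  return best;
}

// Made-up scores from four valid users and four impostors.
const genuineScores = [0.9, 0.8, 0.7, 0.4];
const impostorScores = [0.6, 0.3, 0.2, 0.1];
const eer = estimateEqualErrorRate(genuineScores, impostorScores);
// Here eer.threshold is 0.6 and eer.rate is 0.25.
```

With so few samples the estimate is coarse; real trials need many more callers before the curves mean anything.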
Unlike passwords, which are either valid or not, verifiers produce results with varying levels of confidence, and sometimes verifiers are wrong. In this prototype, I authorize the transaction if users pass just one check. Depending on what's at stake in an actual application, I might demand more checks or longer challenge phrases, which are more secure because they use more acoustic data.
Verifiers are not immune to identity theft caused by careless procedures or poorly designed applications. For instance, Alice receives a letter in the mail from her credit-card company. She's given a phone number to call, and her ID number is her home phone number. Bob knows that Alice's company is doing this and knows her phone number, so Bob calls up and enrolls, pretending to be Alice. Bob can now make purchases and, pretending to be Alice, verify them.
In another scenario, Bob may be able to steal Alice's identity even after she enrolls. In Listing One, the <register> tag has its mode attribute set to "adapt," which means that each time Alice calls, the voice model further adapts to her voice. (Some companies give users an opportunity to improve their voice models.) But what happens if it's Bob who calls to "improve" Alice's model? Unless the verifier notices the radical shift in voices (and some do detect this kind of attack), Bob can retrain Alice's voice models. Setting the mode attribute to "delete" in the application would simply cause the old models to be discarded immediately and train solely on Bob's voice. If the mode is set to "skip," Bob can't modify Alice's voice models, but neither can Alice.
The lesson from these scenarios is straightforward: A voice model is a credential, and credentials require a chain of authenticity.
If Bob prerecords Alice's utterance, could Bob impersonate Alice? In theory, he could, but most (not all) voice biometric resources detect these attacks. To deter this type of attack, avoid using static challenge phrases such as phone numbers, or even phrases that change only gradually, such as the current date.
The application itself may be threatened, rather than individual users. Speech-technology resources are expensive, and most servers provide only a handful. I use counters or timers to prevent a single user from tying up a server indefinitely.
The human factors of speech applications are a separate subject. Make certain prompts clearly state what users must utter. Test your application using real users from your target population. When the verifier or recognizer cannot decipher the utterance, I avoid arguing with or scolding users ("your response was not understood"). Instead, the application is brisk and goal oriented, and simply asks users to repeat the utterance. I like to use male voices. While telephone applications traditionally use female voices, male voices, with their lower frequencies, are actually more intelligible over the telephone network.
VoiceXML and Your Application
If your application is voice-only and over the phone, adding speaker verification is straightforward. But any Internet-capable application can add VoiceXML.
Figure 3 shows one method. After users fill out a web page and start the transaction, the application generates both a web page and VoiceXML script. The web page, sent to users, directs users to call the telephony server. The VoiceXML script contains a script for the verifier with user-specific data, and the result of verification is sent to the application.
VoiceXML lacks any defined method to send/receive pure data over the Internet. As a simple workaround, Listing Three shows how the VoiceXML server can notify the application of the verifier's result. When the VoiceXML <audio> tag calls a CGI script to fetch an announcement, it also sends the user's ID number. The CGI script relays this ID to the application. This method is obviously vulnerable to attack; production systems would use better security (tokens, for instance).
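One way to harden this notification, in the spirit of the tokens mentioned above, is for the application to issue a random one-time token when it generates the VoiceXML script and to accept a result only if that token comes back with the matching account ID. The sketch below assumes a Node.js application server; the names makeTokenStore and buildResultUrl and the token query parameter are all hypothetical, and a real deployment would also want token expiry and HTTPS.

```javascript
const crypto = require("crypto");

// Hypothetical in-memory store of tokens awaiting a verification result.
function makeTokenStore() {
  const pending = new Map(); // token -> account ID

  return {
    // Called when the application generates the VoiceXML script.
    issue(accountId) {
      const token = crypto.randomBytes(16).toString("hex");
      pending.set(token, accountId);
      return token;
    },
    // Called by the CGI script; each token is accepted at most once.
    redeem(token, accountId) {
      if (!pending.has(token) || pending.get(token) !== accountId) {
        return false;
      }
      pending.delete(token);
      return true;
    }
  };
}

// Build the URL that the generated script would place in resultString.
function buildResultUrl(base, accountId, token) {
  return base +
    "?accountID=" + encodeURIComponent(accountId) +
    "&token=" + encodeURIComponent(token);
}
```

Because the token is unguessable and single-use, someone who merely knows an account ID cannot fake a success notification by requesting the URL directly.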
Biometrics in general, and speech technologies in particular, are imperfect and have a unique capacity for abuse: Voices, faces, and other characteristics can be scanned without knowledge or consent. Still, knowing "something you are" is a powerful security tool when coupled with "something you have" and "something you know."
Listing One
<form id="samples">
  <block>
    <prompt>
      We need several samples of your voice.
      We'll ask you to repeat some numbers.
      Just repeat what we say, and speak naturally.
    </prompt>
  </block>
  <var name="thisSample" expr="fourDigitRandom()"/>
  <var name="thisSampleString" expr="fourDigitString(thisSample)"/>
  <register name="fourDigits_1" keyExpr="key" type="digits?length=4" mode="adapt">
    <prompt>
      Repeat after me: <break size="small" />
      <say-as type="number:digits"> <value expr="thisSampleString"/> </say-as>
    </prompt>
    <noinput>
      <assign name="totalAttempts" expr="totalAttempts - 1" />
      <if cond="totalAttempts &lt;= 0">
        <goto next="#bounce"/>
      </if>
      <reprompt/>
    </noinput>
    <nomatch>
      <assign name="totalAttempts" expr="totalAttempts - 1" />
      <if cond="totalAttempts &lt;= 0">
        <goto next="#bounce"/>
      </if>
      <reprompt/>
    </nomatch>
    <filled>
      <var name="ok" expr="fourDigits_1 == thisSampleString" />
      <!-- check to see that we match -->
      <if cond="!ok">
        <clear namelist="fourDigits_1"/>
        <prompt> We didn't get that, please try again. <break/> </prompt>
      <else/>
        <!-- reset for use by next "register" -->
        <assign name="thisSample" expr="fourDigitRandom()"/>
        <assign name="thisSampleString" expr="fourDigitString(thisSample)"/>
        <assign name="totalAttempts" expr="3" />
      </if>
    </filled>
Listing Two
(a)
<form id="verification">
  <!-- Verify that the user is in the database -->
  <verify name="checkKey" keyExpr="key" type="digits?length=20" >
    <!-- make this a quick timeout -->
    <property name="timeout" value="1ms" />
    <!-- inaudible prompt -->
    <prompt> <break time="1ms" /> </prompt>
    <!-- if there is a key we should timeout immediately and go here -->
    <catch event="noinput nomatch filled">
      <!-- assign value to field so Field Interpreter Algorithm
           doesn't bring us here again -->
      <assign name="checkKey" expr="'123'" />
      <goto nextitem="fourDigits_1" />
    </catch>
    <!-- if the key is not present, we end up here -->
    <catch event="error.verify.keynotfound error.badfetch">
      <!-- likely a bad key - send back to beginning of form -->
      <prompt>
        Sorry, we can't find account number
        <say-as type="number:digits"> <value expr="key" /> </say-as>.
        Let's try again.
      </prompt>
      <goto next="#getid" />
    </catch>
  </verify>

(b)
  <var name="thisSample" expr="fourDigitRandom()"/>
  <var name="thisSampleString" expr="fourDigitString(thisSample)"/>
  <var name="totalAttempts" expr="3" />
  <block>
    <prompt>
      We will ask you to repeat some numbers. Please speak naturally.
    </prompt>
  </block>
  <verify name="fourDigits_1" keyExpr="key" type="digits?length=4">
    <prompt>
      Repeat after me. <break size="small" />
      <say-as type="number:digits"> <value expr="thisSampleString"/> </say-as>
    </prompt>
    <noinput>
      <assign name="totalAttempts" expr="totalAttempts - 1" />
      <if cond="totalAttempts &lt;= 0">
        <goto next="#denied" />
      <else/>
        <reprompt/>
      </if>
    </noinput>
    <nomatch>
      <!-- utterance did not match grammar. Spoof? -->
      <assign name="totalAttempts" expr="totalAttempts - 1" />
      <if cond="totalAttempts &lt;= 0">
        <goto next="#denied" />
      <else/>
        <reprompt/>
      </if>
    </nomatch>
    <filled>
      <assign name="totalAttempts" expr="totalAttempts - 1" />
      <!-- Too many attempts? Did we verify? -->
      <var name="check1" expr="totalAttempts &lt;= 0" />
      <var name="check2" expr="fourDigits_1 != thisSampleString" />
      <!-- check to see that we match -->
      <if cond="check1">
        <!-- too many attempts -->
        <goto next="#denied"/>
      <elseif cond="check2" />
        <!-- person spoke incorrect number. Spoof in progress? -->
        <clear namelist="fourDigits_1"/>
        <reprompt/>
      <elseif cond="fourDigits_1$.decision=='accepted'" />
        <goto next="#accepted" />
      <elseif cond="fourDigits_1$.decision=='rejected'"/>
        <goto next="#denied" />
      <else/>
        <!-- decision was "unsure." Proceed to next field -->
        <!-- reset attempts counter for use by next verify -->
        <assign name="totalAttempts" expr="3" />
      </if>
    </filled>
Listing Three
<form id="accepted">
  <block>
    <prompt>
      Your transaction has been accepted. Thank you. Goodbye.
    </prompt>
    <!-- Inform app of results -->
    <var name="resultString" expr="'http://www.example.com/cgi-bin/success.cgi'" />
    <assign name="resultString" expr="resultString+='?accountID='+key"/>
    <assign name="resultString" expr="resultString+='&amp;outcome=success'"/>
    <prompt> <audio expr="resultString"/> </prompt>
  </block>
</form>