Channels ▼
RSS

The Double Metaphone Search Algorithm


June 2000/The Double Metaphone Search Algorithm

The Double Metaphone Search Algorithm

Lawrence Philips

Computer vision is a broad and deep topic, but computer hearing is no breeze either, particularly when trying to recognize surnames.


Albert Einstein once said, "Everything should be made as simple as possible — but no simpler!" Simplicity, of course, is a fundamental touchstone of quality in engineering and science. If an algorithm accomplishes its task as simply as possible, and demonstrates a touch of intuitive inspiration as well, we compliment it as "elegant." Unfortunately for engineers, human activity can rarely be described elegantly. And although text processing is a critical technology at a time when millions of people are searching the web, the unsystematic and exception-laden English language often demands algorithms that look ugly to engineers.

Finding words or names that might be spelled differently than the search string that was typed in presents an aggravated case of this problem. English is not only well known for having maddeningly irrational spelling practices, but in America we have also accumulated names from all over the world. The first algorithm to deal with this problem, Soundex, is gratifyingly simple, but it is not at all an adequate solution. It often fails to do the job of returning alternatives that are pronounced similarly to the search string.

Soundex and Metaphone belong to a class of algorithms usually known as "phonetic encoding" or "sound alike" algorithms — a heuristic type of fuzzy matching. They input a word or name, and return an encoded key, which should be the same for any words that are pronounced similarly — allowing for a reasonable amount of fuzziness. For these particular algorithms, only the first four consonant sounds are encoded, unless the first letter is a vowel. Therefore, Metaphone encodes "Stephan" as STFN. Since "Steven" and "Stefan" are pronounced similarly, you should be able to find those spellings as well. But Soundex doesn't know much about English spelling peculiarities, so it won't!

I published Metaphone in Computer Language magazine in 1990, as a replacement for Soundex. I coded in many common rules of English pronounciation that Soundex doesn't cover, such as when ‘C' is pronounced as ‘S' and when it is pronounced as ‘K.' Metaphone seems to have improved the hit rate on the axis of similarity; but, through poor design decisions on my part, ended up sacrificing some fuzziness. Despite this, it has been implemented and described fairly often over the past ten years. A good implementation of original Metaphone in C can be found in Binstock & Rex's Practical Algorithms for Programmers. There is a Pascal version in Joe Celko's SQL for Smarties, and in Database Magic with Ken North it is shown as a Java class.

Over the years people have notified me of mistakes and exceptions in the original algorithm. Bryan (BRYN) complained that his name was not matched to Brian (BRN), a design flaw. Mr. MacCafferey complained that Metaphone encoded him to MKKF, an out-and-out bug. Also, I realized that by retaining Soundex's choice of preserving the first letter (in Metaphone, only for words that started with vowels), "Otto" for example, would not be matched to "Auto."

I decided that several things needed to be changed to improve Metaphone's fuzzy rating. All words that start with vowels should simply encode to ‘A' at the beginning. ‘Y' and ‘W' can always, always be spelled with vowels, so they had to go. I had originally encoded ‘B' and ‘P' separately, not really a good idea.

But these are just the really obvious changes. More difficult to deal with, and contributing considerably to inelegance, are the consonants that are pronounced differently in different words — mostly, at first glance, for no apparant reason. English, a language which has changed relatively quickly over the centuries, preserves letters that have not been pronounced for some six hundred years, or have changed in pronounciation, such as the "gh" in "light" and "rough," and the ‘k' in "knight." English also has a tendancy to accumulate a large number of words from non-English sources, notably French, Latin, and Greek. When transliterating from the Greek alphabet, the letter that is pronounced "kh" in Greek (a sound that does not exist in English — think "chutzpah"), is spelled "ch" and pronounced ‘k': "orchestra," "chorus," etc. These two factors, historical preservation and assimilation from other languages, account for most of the weirdness in English spelling.

French words, especially names, can present difficulties because the French, although more consistant about it, are apparently competing with English speakers for the title of most strangely spelled language. I discovered, after eyeballing long lists of words and names, that it was possible to do a much better job of algorithmically predicting the pronounciation of even the most difficult consonant combinations. Better, but not perfect.

Some factors, although also contributing to the inelegance of the algorithm, are logical from the human point of view, such as those that come roughly under the headings of euphony and elision. The "shun" pronounciation of "-tion" endings is a combination of both, as is the dropping of the ‘b' at the end of "dumb" and "plumb." These factors also make some predictions easy: given a name like "Wechsler," it is just about impossible to pronounce it "Wetch-sler," therefore it must be "Wek-sler." And there are names which are pronounced differently in different countries: "Kuczewski" would be "Kootchefski" in Poland, but commonly "Kuhzooski" in the U.S. How can you choose which one to encode to?

Indeed, even some familiar names can just as plausibly be pronounced more than one way. I frequently hear people on TV try both pronounciations for Henry Kissinger and Kim Basinger — is it "Basin-gger" or "Basin-jer?" I'm not really sure myself. People who are familiar with more than one language might pronounce a name for you both in the American and native style — "Cabrillo," for example, is sometimes "Cab-ril-lo," and sometimes "Cab-reeyo," the proper Spanish pronounciation.

That is why I decided to give back two keys for words and names that can be plausibly pronounced more than one way, and that's why the new version is called Double Metaphone. In the case of Kuczewski, there are two ambiguous sounds, so in the second key returned I make both of the changes: "Kuczewski" now comes back as KSSK for the American version, "Kuhzooski," as well as KXFS for "Kutchefski." (I use ‘X' to represent the "sh" sound, and ‘0', zero, to represent "th," as in original Metaphone.) Both versions are likely to be heard in the United States, so it is necessary to try both to be really sure! In the end, however, I find that only about 10% of my sample database of the 100,000 most common American surnames come out with more than one key.

For "dictionary" words, this percentage is likely to be much smaller. But although some common exceptions are accounted for in the code, contradictory usages in ordinary English words still give me nightmares. Particularly letters like ‘g' — why is it pronounced differently between "anger" and "danger?" The letter ‘g' alone will account for most of the the alternate guesses given back in regular words! And, although many of the rules embodied in the code are concerned with surnames, many of these also apply to regular words such as "sugar" and "success," so Double Metaphone will give you a better hit rate for these as well.

The current version of Double Metaphone given here accounts for alternate pronounciations of names from Italian, Spanish, and French, and from various Germanic and Slavic languages. A few exceptions from English names and words, such as "Thames" and "sugar," are accounted for also. The appropriate context for some alternate pronounciations, such as "Zivis-ky" for "Ziwicki," and "Ro-jay" for a French pronounciation of "Roger," are too difficult calculate, and so are not given.

I tried to give back the pronounciation most likely to be heard in the U.S. in the first key, and the native sound in the second key. However, for names like "Artois," most Americans will correctly drop the ‘s,' so it comes back as ART first. Most americans are also likely to give a correct Spanish reading for "Jose" at first glance. Since many families have changed their last names to a more anglicized spelling, I have included "etymological" variations, especially useful for geneological research, a common application of both Soundex and Metaphone. To this end, consonant groups such as "Schm-" and "-wicz" are automatically given back with the common anglicizations, although they do not really sound the same according to the usual standards of phonetic similarity. But they will allow you to match "Smith" to "Schmidt," "Filipowicz" to "Philipowitz," and "Jablonski" to "Yablonsky."

Implementation

I have submitted Double Metaphone as a class called MString, which is implemented as a subclass of the MFC CString class. The MString class has parameterized constructors that allow you to initialize the MString object with the input string that you want to encode. You then supply references to two CString objects as parameters to a member function named DoubleMetaphone, which writes the primary key, and secondary key if there is one, into these CString objects.

Function DoubleMetaphone does some processing, such as uppercasing, on the input string first to normalize it. Then, to create the key, the function traverses the input string in a while loop, sending sucessive characters into a giant switch statement. Before determining the appropriate pronunciation, the algorithm considers the context surrounding each character within the input string. The algorithm breaks out of the loop as soon as one of the keys is four characters long, since that has been shown to be an ideal key length for sound-alike matching, although it can be changed if appropriate. The logic for ‘C,' ‘G,' and ‘S' is particularly frightening — the complete code for the ‘C' case is shown in Figure 1.

GetAt, a standard MFC CString function, returns the character at the given subscript for the input string that the CString object has been initialized to. The variable current is where the function keeps track of the current letter in the input string — you can see that the function sometimes skips letters that it has already looked at in the case statement at hand, so that they are not reconsidered in the next iteration of the loop. Function MetaphAdd adds letters to the output key; I take advantage of overloading to specify a second parameter when the secondary key differs from the primary key. Ultimately, if there is no difference, no secondary key is returned. This secondary key gives you flexibility as to how much fuzziness you want to provide to your users. You can match primary key to primary, secondary to primary, etc., for a possible total of four result lists of decreasing similarity!

The StringAt function is the real workhorse here. I have taken advantage of the C/C++ ability to handle a variable number of parameters to a function. This enables StringAt to compare a substring at a particular location in the input string to any number of constant strings. The first parameter to StringAt is the beginning index of the substring; the second is the substring length. If any of the constant strings in the rest of the argument list match the substring, StringAt returns true. The va_arg functions do require the programmer to keep track of where the argument list ends, however. Here I flag it by providing a NULL string at the end. Unfortunately, C++ does not support variable length argument lists for overloaded operators, so I must use function syntax instead of a == operator!

Although a considerable amount of string comparing is going on, Double Metaphone is not as slow as it looks. Creating keys for 100,000 names took me about twenty seconds, even with a separate disk I/O call to a text file for every name.

Unfortunately, this solution results in eight bytes in the combined keys where there were only four before. Other engineers have already remarked that Soundex keys can be reduced to three bytes, whereas Metaphone always needs four. But, since I have ended up reducing the number of Metaphone symbols to twelve, it is now possible to cut the space required for Metaphone keys in half. This is obvious if you realize that it's easy to map twelve characters to the first twelve hexadecimal digits! C programmers ought to find it easy to write a routine to map the Metaphone keys to four hex digits, and thus convert the four-byte string to a short unsigned integer. To store both keys, map to the high and low ends of the currently "standard" 32-bit integer. For Visual Basic and other languages, this mapping might be more difficult, but writers of ActiveXs should provide this as a service, not forgetting to provide routines that separate back out the high and low ends when both keys are stored in one 32-bit value, as well as a routine to translate the number back into the conveniently readable Metaphone encoding. But if you really can't store both, at least you can use both results if you get two from the name that the user types in at run time.

I have attempted to present this version of the algorthm in self-documenting code that non-C programmers will be able to understand. I will gladly help with implementations in other languages, or in providing additions or changes to the algorithm if the problem is clearly described. Finally, elegance will always be the goal of good engineers, so I would welcome any suggestions from those with sharper minds as to how I might improve the algorithm presented here!

Lawrence Philips is a Software Engineer at Verity, Inc. ("The leading provider of knowledge retrieval solutions for the enterprise and the Internet"), in Sunnyvale, CA. He can be contacted at lphilips@verity.com.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video