We see that both addresses above are "correct" — there are no errors or typos in either of them. They just use different forms of the same word, for example "Fourth" instead of "4th" and "Avenue" instead of "Ave." While these differences present no problem for human postmen, they throw computers completely off. So how do we match "Fourth" and "4th"? This is where the concept of address standardization comes in.
Let's say we agree on a rule that whenever numbers are used in street names, they must always be written down in their numeric form with a suffix, such as "139th" or "3rd." Naturally, we will also have to agree on a long list of exceptions to this rule, such as "Four Corner St" does not become "4th Corner St," and "County Road 22" does not become "County Road 22nd," and so on.
Armed with this rule, we will need to process our input address of "506 Fourth Avenue" before looking for it in the database. We will determine that "Fourth" refers to a numbered street and replace it with "4th."
Similarly, we will need to institute rules for proper use of street type names. For example, we will agree that all Avenues must use the abbreviated type name of "Ave." Of course, there will be exceptions to this rule as well, so that "Avenue of the Stars" does not become "Ave of the Stars."
To get a reasonable coverage of possible address spelling rules, we will need to come up with approximately two hundred of them, each with a list of exceptions. Please note that rules dealing with abbreviations will have to go both ways — when to abbreviate and when not to abbreviate (for example, "Asbury Prk" should not be abbreviated, but spelled out as "Asbury Park"). The rules will need to deal with proper use of street types; street pre-directions and post-directions, such as "123 N 45th E"; formats for secondary designators, such as apartment and suite numbers; numeric, alpha-numeric and fractional house numbers, and on and on.
We do not have to come up with all the rules ourselves. The United States Postal Service has already done it for us. USPS publishes their guidelines of proper use of addresses in the so called Publication 28, complete with a large set of addendums. It is a lengthy document with many grey areas, but it is a reasonable start.
Having all the standardization rules in place, we can update the execution flow of our geocoder to add a step of standardizing the input address before we look it up in the street network database. Please note that our street network database must already be in the standardized format for comparisons to be valid.
In addition to not being standardized, real life addresses also tend to often be incomplete. The most common case would be a missing ZIP code. Without a ZIP code, the address may still be unique and valid, yet we will not be able to locate a matching address in the street network database.
Some parts of address are redundant and can be derived from the remainder of the address. For example, knowing a ZIP code, we can find out city and state; or knowing a city and state, we can suggest at least a set of possible ZIP codes for the location.
Other address parts and not redundant and cannot be easily deduced. For example, a missing house number renders our address damaged beyond repair. However, some missing parts can still be guessed under the right set of circumstances. For example, line “123 Main” can be presumed to mean “123 Main St,” if indeed Main Street exists in that city and state, and Main Ave. and Main Blvd. do not. Techniques for handling this type of incomplete addresses are very similar to the ones used with our next group of address matching issues.
Addresses containing misspellings or typos abound in real life scenarios. In some cases, a human postman can easily figure out the implied address. For example, “Jupiter, FL” and “Jupitar, FL” clearly refer to the same city. However, in other cases even a human would have problems — “Pile St” and “Pale St” may be misspellings of the same street or they may be two different streets.
In addition to typos, incorrect addresses can contain a variety of other mistakes, such as wrong street types or directionals, as in “Park St” instead of “Park Ave” or “N Main St” instead of “S Main St”; have nonexistent streets, or cities, or ZIP codes; conflicting parts of the address; and on and on. In general, there are thousands of ways to write the address wrong, while there is only way to get it right.
However incorrect the address is, we can still check if it is close enough to some correct address so that we can declare a match. A typical approach is to use various forms of “fuzzy” address comparisons. Instead of hoping to find one perfect match in the street network database, we compare our subject address to every record in the database using some set of rules and criteria. All records that are “similar enough” (and it is up to us to define how loose or tight the proximity is) become candidates for a match. We rank these candidates in order of their similarity and see if a clear winner emerges. If we have a clear winner and it is similar enough to the subject, we declare this winner a match. If more than one candidate is equally similar, we call it a tie and declare the match an ambiguous case. If no records are “similar enough,” we have no match.
In practice, the set of rules and criteria for fuzzy address matching represents the bulk of complexity in each geocoder implementation. This is where most of the product development effort is directed, and this work is never finished. As the engine encounters more and more hits and misses, it is constantly updated and fine-tuned to achieve ever higher quality of matching.
Address Matching Quality
Ideally, a matching engine should be at least as good as a human postman. Mistakes of the engine, or mismatches, can be described in terms of false positives and false negatives. A false positive is produced when the engine finds a match, but the match is incorrect. A false negative occurs when the engine returns nothing, while a good match was indeed available.
To grade the quality of address matching engines, USPS administers a so-called CASS (Coding Accuracy Support System) certification program. To obtain the certification, a matching engine must go through a series of tests, which consist of all sorts of incorrect addresses with their corrected counterparts. A passing grade requires that a high enough percentage of them are properly corrected. CASS tests are quite comprehensive and are made of hundreds of thousands of real life “dirty” addresses. It takes a very significant development effort of the part of a product vendor to bring the engine up to the CASS level of matching quality.
Address Matching Performance
CASS tests verify how accurately the engine matches addresses, but they do not mandate how fast it needs to work. However, in real life, the performance of a matching engine is a very important point. As we saw in the case of fuzzy address matching, each subject address needs to be compared potentially to all addresses in the street network dataset, which can be hundreds of millions. At these volumes of data, the ability to perform comparisons quickly is paramount.
The issue of performance is further complicated by the fact that in many applications of geocoding the subject database themselves are often very large as well. So large, in fact, that it becomes difficult to process them within a reasonable time frame. For example, if we have a customer database of several million records, and each records takes 1 second to perform address matching on, we are looking at several months of nonstop processing time to geocode the entire database — not a realistic undertaking.
Combining a high quality of address matching with high performance is a daunting challenge for geocoding engine vendors. While there is a good number of commercially available engines with CASS-level quality of address matching, very few of them reach a performance level of thousands of addresses per second — the level required for processing of million-record databases. Building an engine like that is a work of art and not a mundane software development project. In fact, the industry gossip has it that many of the top level engines were actually developed by one and the same guy, who kept moving from one employer to another, replicating the same engine several times over.
So far, we looked only at the address matching issues and assumed that once the address is matched, our street network database will contain the latitude and longitude for this exact subject. In reality, street network databases do contain exact coordinates for some addresses, but only for a minority of them. This minority can thus be geocoded with an absolute precision, often referred to as the “parcel level” (referring to the parcel of land the address represents) or “roof top” precision. For the rest of addresses, we have more work to do.
A general approach is to find two nearest addresses with known locations and interpolate the coordinates between them. For example, if we have the exact locations for “100 Main St” and “110 Main St,” then “105 Main St” should be located just half the way between them. This is a simple calculation for straight streets, but significantly more complex for winding roads. To trace a segment of a road, we need to know not only the point locations of its ends, but also the exact route that connects them. In rural mountainous regions, roads can make thousands of turns just between the two nearest buildings.
Yet another complicating factor for interpolative geocoding is the format of house numbers, which in addition to straight numerals can often be alphanumeric, such as “123A-45 Main St,” or fractional “123 4/5 Main St.”
At this stage, our geocoder is almost complete. Let’s take one step further and consider what else a geocoding engine can be good for.
Geocoding and Address Correction
Our "Hello World" geocoder returned only the latitude and longitude coordinates of the subject address. However, since we had to perform a full address matching to obtain them, we may just as well return the corrected address as an extra bonus. If the input address was incorrect or non-standard, we’ll have the benefit of knowing the corrected and standardized form of it. In many practical applications, address correction and standardization can come very handy. One example would be data cleansing and de-duplication. If we have two customer records for John Doe, one with address “123 Main St” and the other with “123 Main,” we want to know if they are two different people or one and the same. The answer depends on whether “Main St” and “Main” refer to the same address or not.
If we decide that address correction and standardization are important and we want to employ them in our projects, we need to make sure we know which standard our geocoding engine standardizes addresses to. In many cases it is not immediately obvious. Most engines settle down on the format of addresses provided by their street network suppliers, which usually are mapping companies. They are not strict about USPS standards, and as the result, we get a whole slew of problems: non-standard addresses, incorrect postal codes, lost secondary designators, failed P.O. Box addresses, and so on. In short, geocoding engines without USPS cross-reference are not suitable for address correction and standardization.
We may be tempted to find an easy solution to this problem by adding a second step to our address processing, i.e., performing a USPS correction with a CASS-type tool after completing the geocoding step. Unfortunately, it only magnifies the problem instead of solving it.
A better approach would be to use a single-step process, where one tool combines street networks with USPS datasets and performs geocoding and USPS correction at once. Being difficult and data intensive, very few vendors build engines with these capabilities. There are no more than a handful of products of this type on the market.
From Theory to Practice
Having reviewed what goes on under the hood of geocoding engines, we are now ready to roll up our sleeves and get to work on practical applications. In the next article, I'll go over practical aspects of geocoding — what products are available on the market, how to choose the right one, whether Web services are better than software products, and many other details.