Yuri Diomin (yuri.diomin@yurisw.com) is the president of Yuri Software, Inc., a software development firm in San Diego, California. He builds applications for several vertical industries, including address correction and geocoding tools.
"Hello World" Geocoder
To begin with, let’s build our own basic geocoder and see what it might look like. In the tradition of "Hello World" programs, it will be a bare-minimum tool, meant only to illustrate the principles.
As input, our geocoder will take a postal address in the form of character strings. Let's say these are Line 1 and Line 2 of a common address format, such as:
506 4th Ave
Asbury Park, NJ 07712
As output, it will return the latitude and longitude of the location as floating-point numbers. (In real life, geocoders often return a plethora of other information about the address, but we will limit ourselves to just the coordinates in this example.)
The first step in implementing our geocoder is to build a database of reference addresses and their locations, usually known as the "street network dataset." In our Hello World case, the street network dataset will conveniently consist of just one record:
Address: 506 4th Ave, Asbury Park, NJ 07712
Latitude: 40.223571
Longitude: -74.005973
The execution flow in our geocoder is then obvious. We simply:
- Receive the input address.
- Perform a database lookup by direct string comparison and find the corresponding reference record.
- Return the latitude and longitude from the record as the output.
Mission accomplished!
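The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names REFERENCE and geocode are hypothetical, and a plain in-memory dictionary stands in for the street network dataset.

```python
from typing import Optional, Tuple

# Hypothetical one-record "street network dataset":
# (Line 1, Line 2) of the reference address -> (latitude, longitude).
REFERENCE = {
    ("506 4th Ave", "Asbury Park, NJ 07712"): (40.223571, -74.005973),
}


def geocode(line1: str, line2: str) -> Optional[Tuple[float, float]]:
    """Look up the address by direct string comparison.

    Returns (latitude, longitude) on an exact match, or None on a miss.
    """
    return REFERENCE.get((line1, line2))


print(geocode("506 4th Ave", "Asbury Park, NJ 07712"))
# prints (40.223571, -74.005973)
```

The dictionary lookup succeeds only when both input lines are character-for-character identical to the reference record.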
Address Matching
In real life, the process is much more complicated. As the very first obstacle, we will come across the issue of address matching.
Let's say that our input address is not in the neat form of:
506 4th Ave
Asbury Park, NJ 07712
but rather:
506 Fourth Avenue Apt 1
Asbury Prk, New Jersey
Note the multitude of character differences between the two addresses.
Since addresses come from all sorts of sources, such as forms filled out by customers, dictation over the phone, etc., we cannot expect them to always be neatly formatted and standardized. A human looking at the two addresses above will easily see that they are one and the same. If we put either of them as a “mail to” address on a letter, we will expect the letter to be delivered without any problems. But if the computer performs a simple string comparison of the two addresses, they will not match and our geocoder will get a "miss" instead of a "hit."
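A short Python sketch makes the miss concrete. The names used here (REFERENCE_ADDRESS, INPUT_ADDRESS, is_hit, loosen) are hypothetical and chosen only for illustration. The last comparison assumes we also ignore letter case and extra whitespace, which is still not enough to bridge the gap.

```python
# The reference record and the "messy" input, each as (Line 1, Line 2).
REFERENCE_ADDRESS = ("506 4th Ave", "Asbury Park, NJ 07712")
INPUT_ADDRESS = ("506 Fourth Avenue Apt 1", "Asbury Prk, New Jersey")


def is_hit(candidate, reference):
    """Direct string comparison: every character must be identical."""
    return candidate == reference


def loosen(lines):
    """Lowercase each line and collapse runs of whitespace."""
    return tuple(" ".join(line.lower().split()) for line in lines)


print(is_hit(REFERENCE_ADDRESS, REFERENCE_ADDRESS))  # prints True  (a "hit")
print(is_hit(INPUT_ADDRESS, REFERENCE_ADDRESS))      # prints False (a "miss")

# Ignoring case and spacing does not help: the words themselves differ.
print(loosen(INPUT_ADDRESS) == loosen(REFERENCE_ADDRESS))  # prints False
```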


