Dr. Dobb's | Accelerated Search For the Nearest Line

Accelerated Search For the Nearest Line

Finding the nearest position within a set of lines is not a difficult task -- most of the time.

April 10, 2007
URL:http://www.drdobbs.com/cpp/accelerated-search-for-the-nearest-line/198900559

Stefan is a software developer in Berlin specializing in cross-platform development, user interfaces, and reengineering. He can be contacted at [email protected].

Finding the nearest position within a set of lines is not a difficult task--it's done every time you click the mouse in a drawing program or GIS (geographical information system). But when large amounts of data that have to be batch processed are involved, the situation changes. I recently was faced with the problem of about 1,000,000 lines that were exported from a GIS. The lines were more or less touching each other, thus representing a pipe network, but there were no logical links between them. Furthermore, some lines ended not near the end of another line, but instead their end touched another line (call it "parent") far away from its endpoints. It was necessary to split the parent line and create a new link at this position.

I had already a module implementing a "find nearest point in all lines" in a straightforward linear manner, but it turned out that the task of finding the nearest neighbor for all lines took several days on a high-end desktop machine. In contrast to point-to-point search where algorithms with logarithmic search times exist (see A Template for the Nearest Neighbor Problem by Larry Andrews), I don't know of such implementations for point-to-line search. This may have its cause in the fact that--in contrast to point-to-point search--on each single line there are infinite possible matching points. So what to do?

In this article, I present an optimized algorithm that reduces the search time to a fraction, when searching the nearest position within a set of lines. The algorithm is based on dividing the whole possible spatial area in quadrants and then building indices for each quadrant. All searches are based on using a threshold for the maximum search distance to the line found (the "maximum search width", which is 10 meters in my examples). If there is no line within this distance, nothing will be found. The solution consists of the classes described in Table 1. For comparison, the source code of the different versions of linear search discussed here is also provided. The sample code was tested under Windows, Linux and Solaris using VC++ 6.0 and gcc 3.3.3. The measurements were made using Windows XP

Table 1: List of Classes

Sample Data

The sample main() function creates 1,000,000 lines of 100 meter length for sample data, as in Figure 1. Note, that some line ends are near to the end of another line, whereas other line ends are near the center of another line. Compared to the sample data, in typical GIS-applications lines vary much in their length. Also in typical data, lines (i.e. pipes, cables, streets etc.) mostly are represented by polygons with many points. But this is only a specialization of the problem where all polygons will have to be divided into their straight lines. By my experience though, straight lines going thorough the whole area are very rare.

[Click image to view at full size]

Figure 1: Lines in sample data

The Basics: Which Point Is Nearest?

For a given point p, finding the nearest point within a single straight line is a three-phase process (see Figure 2):

Drop a perpendicular on the (infinitely extended) line.
Check, if the perpendicular is between the start- and endpoints of the line.
If "yes," then return the connecting point of the perpendicular and the line (in cases of p1 and p2 in Figure 2); otherwise return the start- (in case of p3), or endpoint (for p4), whatever is nearest.

[Click image to view at full size]

Figure 2: Finding the nearest point within a line.

This process is implemented in the member function CLineObj::DistToPoint(...) (see Listing 1).

double CLineObj::DistToPoint(CWorldPoint &Point, CWorldPoint *ConnectPoint)

{   
    // Get coordinates of start/end
    double xa = Start().Getx();
    double ya = Start().Gety();
    double xb = End().Getx();
    double yb = End().Gety();

    // Calculate triangular data
    double dist = Start().DistTo(End());
    double xOff = xb - xa;
    double yOff = yb - ya;
    double xRaise = (dist == 0.) ? 0. : xOff / dist;
    double yRaise = (dist == 0.) ? 0. : yOff / dist;

    // Special Case: Line is a point
    if(dist == 0)
        return Point.DistTo(Start());

    // Calculate the perpendicular to a infinite line
    double p = - (xa*yb - ya*xb) / dist;
    double xy4dist = Point.Gety() * xRaise - Point.Getx() * yRaise - p;

    // Calculate Connectionpoint of perpendicular on infinite line
    double x4 = Point.Getx() + (xy4dist  * yRaise);
    double y4 = Point.Gety() - (xy4dist  * xRaise);
    if(ConnectPoint)
       ConnectPoint->Set(x4, y4);

    // Offset p4 <-> start point
    double x4off = x4 - xa;
    double y4off = y4 - ya;
    double d4 = sqrt(x4off*x4off + y4off*y4off);

    if(( (x4off > 0. != xOff > 0. || xOff == 0)  &&   // Connection point is BEFORE start of line
         (y4off > 0. != yOff > 0. || yOff == 0)) ||
          d4 > dist)                                  // Connection point is BEHIND end of line

    {
        if(fabs(x4 - xa) < fabs(x4 - xb) || // Point is nearer to start
           fabs(y4 - ya) < fabs(y4 - yb)) 
        {
            if(ConnectPoint)
              *ConnectPoint = Start();
            return Point.DistTo(Start());   // return offset to start
        }
        else                                // Point is nearer to end
        {
            if(ConnectPoint)
              *ConnectPoint = End();
            return Point.DistTo(End());     // return offset to end
        }
    }
     return fabs(xy4dist); // return length of perpendicular

Listing 1

To find the nearest out of 1,000,000 lines for a single point p, the function just must be called 1,000,000 times -- once for each single line. This needs about 379 milliseconds to execute on 2.8-GHz XEON. If for each of the 1,000,000 lines the nearest neighbor for their start and endpoint is needed, we need to repeat the single point search 1,000,000 * 2 times which results in a total runtime of (2,000,000 * 379ms) 8 days, which is unacceptable.

Optimization 1: Check Bounding Rectangle First

A obvious shortcut is, to first build a rectangle of start and endpoint (plus the maximum search width), and to check if the point p lays within this rect. If it does not, the line is too far away. This is less computation intensive. It reduces the search time to 82ms, reducing the batch run to less than two days.

Optimization 2: Introducing Quadrants

The key for the whole solution presented here is to divide the whole area in quadrants, like on a chessboard. A given point can lay only within one quadrant. A line only can match a limited number of quadrants. If the width of each quadrant is not smaller than the maximum search width, then a shortcut can be used to exclude all lines not within that distance: Then, two points with a distance less or equal to the maximum search width must lay on the same or on neighborhood quadrants. Thereby, if a points own quadrant and its 8 surrounding quadrants don't have any quadrant in common with the quadrants of a line, the line is too far away and further computation can be skipped; see Figure 3.

[Click image to view at full size]

Figure 3: Point p1 has a matching quadrant with line1. p2 does not and thereby cannot lie within the maximum search width from line1.

A CQuadrant object can be used to determine and store the quadrant-ID of a point. The quadrants boundary data has to be statically setup before usage. You must call the static member function:

void CQuadrant::SetUp(double x1, double y1, double x2, double y2, double width)

and pass the minimum and maximum coordinates of the whole graph as well as the width of one quadrant. The width is a very important property. In this stage the quadrant width at best is equal to the maximum search width (10m). The constructor of a CQuadrant object requires the coordinates of the point it should represent and thereby calculates two integers that, combined, build a ID identifying the quadrant (see Listing 2).

void CQuadrant::CQuadrant(CworldPoint &Loc)
{
    mQuadx = ((short)( (Loc.dWrldx() - mxMin) / mWorldMaxOff) );
    mQuady = ((short)( (Loc.dWrldy() - myMin) / mWorldMaxOff) );
}

Listing 2

As lines can have any length and thereby can pass more than one quadrant, the class CQuadSection was introduced. The constructor of CQuadSection requires the start and end coordinates of the line and holds a rectangle of the minimum and maximum quadrants. (For simplicity, in this implementation a line matches all quadrants that are within its bounding rectangle. Of course a improved implementation would check all these quadrants and see if they are really passed by the line).

To see if a point can be within the maximum search width of a pipe the function

CQuadSection::Matches(CQuadrant &Other).

is called, which just checks if one of the 9 quadrants of (and surrounding) the point is contained in the quadrants used by the line (see Listing 3).

int CQuadrant::Matches(CQuadrant &Other)
{
    return (mTopRight  .mQuadx + 1 >= Other.mQuadx  &&
            mBottomLeft.mQuadx - 1 <= Other.mQuadx  &&
            mTopRight  .mQuady + 1 >= Other.mQuady  &&
            mBottomLeft.mQuady - 1 <= Other.mQuady);
}

Listing 3

The quadrants for all lines can be generated once before doing the batch run. This takes about 3640ms. It leads to a reduction of the search time to about 17ms for checking for the nearest match in 1,000,000 lines, thus reducing the total runtime to less than 10 hours (see Table 1).

Optimization 3: Holding Lists for All Matches In Each Quadrant

Still, for every search we have to iterate through all 1,000,000 lines no matter how far they are away. The solution is to build a 3-dimensional container: a 2-dimensional map, representing all quadrants of the area with each item holding a list of all matches in a single quadrant, as in Listing 4. See Figure 4 for illustration.

[Click image to view at full size]

Figure 4: Illustration of 2-dimensional map, that contains the matching lines of each quadrant.

class CQuadMap
{
    // Typedef for 3-Dimensional container holding all quadrant matches
    ///////////////////////////////////////////////////////////////////////////
    // 3d  Dimension: List of all Lines that match a quadrant
    typedef std::list<CLineObj*>LinePtrVec;
    // 2nd Dimension: map of LinePtrVec  for all not empty quadrants within a line
    typedef std::map<CQuadrant::data_type, LinePtrVec>QuadLineEntrys;
    // 1st Dimension: map of QuadLineMap for all not empty rows
    typedef std::map<CQuadrant::data_type, QuadLineEntrys>QuadEntryMap;

    QuadEntryMap mPipeQuadMap;  // Data of all Lines imported by Add
 ...
};

Listing 4

This was the breakthrough. It is implemented in the class CQuadMap.

CQuadMap has a member mPipeQuadMap that holds such a 3-Dimensional container. The class has three important functions:

A constructor, to which the maximum search width is passed. I use 50 meters here.
A function Add(CLineObj &) to insert a Line (see Listing 5). This function adds a pointer of the Line passed to all its matching quadrant entries of mPipeQuadMap;
A function Find(CWorldPoint &Pnt, CLineObj *ExcludeObj) to find the nearest line to point. (see Listing 6). This function tests only those lines that are referenced in the entries of mPipeQuadMap that represent the nine quadrants of and surrounding the given point. The parameter ExcludeObj is needed to prevent, that the line itself is returned, when we search a foreign link to the start or endpoint of a CLine object. After each call to Find(), CQuadMap can be queried for the connection point on the line and for the distance to that point.

void CQuadMap::Add(CLineObj &lineObj)

{
    // Get List of all Qudrants touched by this line
    CQuadSection theQuad(lineObj.Start(), lineObj.End());

    // Add this line to all matching quadrant entries
    for(CQuadrant::data_type x = theQuad.BottomLeft().Quadx(); x <= theQuad.TopRight().Quadx(); x++)
    {
        for(CQuadrant::data_type y = theQuad.BottomLeft().Quady(); y <= theQuad.TopRight().Quady(); y++)
        {
            CQuadrant q;
            q.SetQuads(x, y);
            mPipeQuadMap[x][y].push_back(&lineObj);

            sgPoints++;
        }
    }
}

Listing 5

CLineObj *CQuadMap::Find(CWorldPoint &Point, CLineObj  *ExcludeObj)

{   
    CQuadrant theQuad(Point), // The quadrant of Point
              Quadx;

    CLineObj *foundLine = NULL;
	mLastOff = 1.e20;

    // Check all surrounding 9 quadrants
    for(int Quad = 0; Quad < CQuadrant::NEIGHBORCOUNT; Quad++)
    {
        // Get the list holding all matches for this quadrant
        Quadx = theQuad.GetNeighbor(Quad);

        QuadEntryMap::iterator it_line = mPipeQuadMap.lower_bound(Quadx.Quadx());
        if(it_line == mPipeQuadMap.end())
            continue;

        QuadLineEntrys::iterator it_row = it_line->second.lower_bound(Quadx.Quady());
        if(it_row == it_line->second.end())
            continue;

        LinePtrVec &theMap = it_row->second;
        
        // Test all Lines in the list of this quadrant
        for(LinePtrVec::iterator it = theMap.begin(); it != theMap.end(); it++)
        {
            if(ExcludeObj == *it)
                continue; 

            CWorldPoint pos;
            double d = (*it)->DistToPoint(Point, &pos);
            sgSearches++;
    
            if(d <= mMaxOff && (! foundLine || d < mLastOff))
            {
                mLastPos = pos;
                mLastOff = d;
                foundLine = *it;
                sgMatches++;
            }
        }
    }
    return foundLine;
}

Listing 6

After calling Add() for every line element (which takes 4760ms in the example) we now only have to check a very small fraction of all lines. In doing so the search time was reduced to about 0.007 ms -- which is about factor 36,000 compared to linear search. The batch process for all lines in the sample now needs only 21 seconds in total anymore.

Discussion

In the sample we reduced the total runtime from 8 days to 21 seconds (see Table 2 for details). The sample data of course is different from real-world data (i.e., from a GIS). But the runtime improvements shown here are comparable to those we encountered with data from GIS. Batch processing usually not only searches in data, but it also performs some action on it, which of course costs runtime that is not reduced by improving search. In the task I had to perform, the overall task was reduced from a number of days to a few hours. Also impressive is the difference between optimized and debug compiled code, which came to about factor 10.

Table 2: Comparison of search and batch processing times.

In this solution it becomes crucial to set the right width of the single quadrants: Setting them too small (i.e., 10 meter in the sample) increases the memory and the time to setup the data of CQuadMap. Setting it too large (i.e., 100 meter) more and more reduces the acceleration introduced by the algorithm. On real-world data I found that 25 meters is more appropriate. Below 10 meters, the amount of memory used soon reaches 1 GB and more. Keep in mind, that enlarging the size of the quadrants does not affect the search result though: It only leads to more lines checked as candidate, but it will still return the nearest element. Also still only those elements are returned that are not more far away than the maximum search width, even if the quadrant size is larger then this. It is illegal though to use a smaller quadrant size than the maximum search width.

Of course your units used can also be micrometers or miles.

The time to execute linear search grows linear with the number of lines used. When connecting all lines to each other it grows exponentially by n^2. The search time of CQuadMap turns out to be more or less linear and independent of the number of lines inserted. Connecting 1,000,000 lines to each other of course is a heavy task, but also when doing this with only 10,000 lines the runtime is reduced impressively from more than one minute to 188ms -- making such a task possible in real time (see table 2).

Conclusion

In this article I introduced a algorithm that reduces the search time for the next line of a given point to a fraction compared to linear search. The performance gain lays between 420 and 36,000 depending on the number of lines existing. The algorithm is implemented in C++ and publicly available.