Channels ▼
RSS

Design

Matching Wildcards: An Empirical Way to Tame an Algorithm


Wildcard matching routines are easy to describe and convenient to use, but deceptively challenging to implement. In 2008, when Dr. Dobb's published my first wildcard-matching algorithm, the Internet didn't provide very many alternatives that were both understandable and totally correct, let alone suitable for high-performance programs that need to compare numerous text strings quickly. For example, many of the best-tested varieties were recursive. That made them not only relatively slow compared with similar code that doesn't have the overhead of growing and shrinking stack space as it runs, but also vulnerable to overflowing the stack given certain pathological input.

Times have changed. In these days of myriad fancy hacks, a quick search turns up a broad range of non-recursive wildcard matching algorithms, coded in C, Java, and other languages. Most of them are crash-proof. But do all of them do the job as expected? Do some of them perform better than others? How can you tell?

Wild Pitfalls

A few thorny problems affect wildcard-matching algorithms:

  • To be correct, they need to handle repeating text patterns in the input strings. Repeating patterns can confound pattern matching, either when the pattern almost seems like a match but isn't, or when more than one instance of the pattern, in a single string, could match.
  • They may encounter situations in which the character that designates a wildcard (typically *) appears in the "tame" input string (a string without wildcards) as well as in the "wild" input string. A routine that compares characters from each string and upon finding a match just looks for further character matches, without taking note of the wildcard, is again susceptible to false negatives. When you're comparing the "tame" string "x*yy" with the "wild" string "x*y", you most likely want your wildcard character to be recognized as a wildcard, just as usual.
  • They also need to handle situations in which a wildcard character might appear one or more times at the beginning or end of a string, as well as in the middle. Because a * wildcard can match any number of characters, including none; the string "xy" can match "***x*****y***".   Failure to address all the possibilities can cause misbehavior. Even a careful code review can miss inconsistencies that make a routine handle certain wildcard arrangements differently from others.

If you want to test your choice of wildcard-matching code in a way that covers all of these possibilities and more, is it enough to create test cases that invoke it by way of other routines in your code? Your safest bet may instead be to create a batch of tests that target just the standalone wildcard-matching routine.

If you run that set of tests and don't like the results, you have the option either to change out the routine for one of the many that are available, or to tweak what you've got. Whichever path you choose, Listing One provides a set of tests intended to cover the bases I've just outlined.

Wild Design

When a wildcard matching routine needs to handle thousands, or millions, of text strings while a user waits, you'd like to do more than just get the job done. You want it done fast. The wide range of wildcard-matching algorithms available today is associated with an equally wide a range of performance. Analyzing and tuning your algorithm of choice for performance can be done in many ways. I've come up with an approach that combines test-driven development with performance profiling. This combination delivers measurable benefits.

Listing Two lays out my latest wildcard matching code in C. With it, a question mark (?) matches a single character, and an asterisk (*) can match any number of characters; you can easily substitute the ? and * with other single wildcard characters that may fit your needs. I arrived at this code using the tests of Listing One and studying performance profiler outcomes.

Listing Two

// This function compares text strings, one of which can have wildcards 
// ('*' or '?').
//
bool WildTextCompare(
    char *pTameText,   // A string without wildcards
    char *pWildText    // A (potentially) corresponding string with wildcards
)
{
    // These two values are set when we observe a wildcard character.  They 
    // represent the locations, in the two strings, from which we start once 
    // we've observed it.
    //
    char *pTameBookmark = (char *) 0;
    char *pWildBookmark = (char *) 0;

    // Walk the text strings one character at a time.
    while (1)
    {
        // How do you match a unique text string?
        if (*pWildText == '*')
        {
            // Easy: unique up on it!
            while (*(++pWildText) == '*')
            {
            }                          // "xy" matches "x**y"

            if (!*pWildText)
            {
                return true;           // "x" matches "*"
            }

            if (*pWildText != '?')
            {
                // Fast-forward to next possible match.
                while (*pTameText != *pWildText)
                {
                    if (!(*(++pTameText)))
                        return false;  // "x" doesn't match "*y*"
                }
            }

            pWildBookmark = pWildText;
            pTameBookmark = pTameText;
        }
        else if (*pTameText != *pWildText && *pWildText != '?')
        {
            // Got a non-match.  If we've set our bookmarks, back up to one 
            // or both of them and retry.
            //
            if (pWildBookmark)
            {
                if (pWildText != pWildBookmark)
                {
                    pWildText = pWildBookmark;

                    if (*pTameText != *pWildText)
                    {
                        // Don't go this far back again.
                        pTameText = ++pTameBookmark;
                        continue;      // "xy" matches "*y"
                    }
                    else
                    {
                        pWildText++;
                    }
                }

                if (*pTameText)
                {
                    pTameText++;
                    continue;          // "mississippi" matches "*sip*"
                }
            }

            return false;              // "xy" doesn't match "x"
        }

        pTameText++;
        pWildText++;

        // How do you match a tame text string?
        if (!*pTameText)
        {
            // The tame way: unique up on it!
            while (*pWildText == '*')
            {
                pWildText++;           // "x" matches "x*"
            }

            if (!*pWildText)
            {
                return true;           // "x" matches "x"
            }

            return false;              // "x" doesn't match "xy"
        }
    }
}

As I tweaked and refactored this code, I looked for ways to either postpone or altogether leave out conditional checks and references to variables, even those that could consume just a few clock cycles at a time. Because I had my set of tests ready to let me know when my code was mistakenly skipping a crucial step or three, I was able to aggressively try out any changes that looked as if they might speed things along.

Among my experiments, I tried relocating the checks that determine whether both input strings have more characters to compare as we're walking through them. These checks made their way down into logic that could decide to either continue or leave the main loop. In other words, the checks happen only when they must. I also removed the bMatch return variable prominent in my 2008 routine. Like growing and shrinking the stack in recursive code, setting and checking variables in any code involves clock cycles and cache misses. Eliminating a variable is a sure way to reduce costs.

Most routines of mine don't include more than one return statement, and that one's normally placed at the end, right before the closing bracket. But for this code, once the bMatch variable was gone, I found that the strategic placement of return statements, often near my checks for the ends of input strings, shaved significant time off runs that made large numbers of calls to the routine. As it turns out, there's no reason to insist on having a return at the very end of this code.

As with my 2008 routine, it's curious to watch this routine in the debugger as it walks through the two input strings in parallel. It relies on two bookmarks that are set, one for each of the input strings, when a * wildcard is encountered. If we find a matching pattern after the wildcard, we continue walking through the strings as before. But if that matching pattern ends with a non-match, while there are yet more characters to be compared, then we fall back to the bookmarks, bump the bookmark in the tame string ahead, and retry. From there, we look for the beginning of a matching pattern again. This fallback logic may repeat if we find another non-match.

The loop coded after the "Fast-forward" comment, around line 34 is based on a suggestion from Dr. Dobb's reader Christian Hammer. The idea is that when the routine encounters a * wildcard character, we can skip ahead until we find a match on the character after the wildcard. Only after that, we set new bookmarks as in my 2008 routine.

You might imagine that an optimizing compiler can do just as good a job as I've done. But a compiler is a general-purpose tool, and its optimizations are only as good as what someone has worked out for general-purpose code. Moving things around in pattern-matching logic isn't very general. Advanced tools such as mature optimizing compilers aren't available for some of the new platforms, such as embedded and mobile systems, that have been coming out recently.

 Wild Speed

I use a profiler that can show me how much time a test run has spent on each line of source code, along with percentages of the overall function time per line. Using that annotated source view, I can see where to focus as I consider performance improvements to try: I look for the lines consuming the most time. When the profiler runs in this "line" mode, it assumes that each instruction always takes a predicted amount of time. I can view actual timings if I set the profiler to report measurements per function, rather than per line, and rerun. The result may be closer to the reality that will play out when the code is deployed in the field. Nevertheless, I've found that for this code, the reported function timings have been fairly similar going from the one profiling mode to the other, at least with the profiler I use.

Wildcard
Figure 1: Original performance of the 2008 code.

Figures 1 and 2 depict profiler output revealing this code's roughly 5x improved performance over the code in Listing Two from my 2008 article (given a million repetitions of the tests in my little suite, profiled in line mode).

Wildcard
Figure 2: Improved performance.

That article's Listing Two code, in turn, has been tested to be as much as an order of magnitude faster than the code shown in Listing One of that same article (for typical input). Admittedly, the 2008 code takes some extra cycles because, unlike the new routine, it offers a choice of case-insensitive text comparison and alternate string termination characters. Yet I found that the extra setup step and comparison step fail to account for even half the performance difference in the cases I've observed.

Which wildcard matching algorithm is the best? You can apply some test cases and tools to select one for yourself — or code your own. I doubt that any single algorithm is the best for every application. For instance, there may be some situations where you know with certainty some limits on the input. That might give your code fewer requirements to meet. You might find ways to optimize an existing algorithm for your specific purpose, whether it be case-insensitive matching, internationalized text, the broader world of regular expressions…you name it. Ask yourself just what range of jobs you expect wildcard matching to do for you, and whether it's a big enough bottleneck that you want to tweak it to fit just those jobs. If it is, then happy tweaking.


Kirk Krauss is an intellectual property specialist at IBM.


Listing One

// This is a test program for wildcard matching routines.  It can be used 
// either to test a single routine for correctness, or to compare the timings 
// of two (or more) different wildcard matching routines.
//
#include <stdio.h>
#define COMPARE_PERFORMANCE 1  // Remove this line for correctness testing.

bool test(char *tame, char *wild, bool bExpectedResult)
{
    bool bResult = true;   // We'll do "&=" cumulative checking.
    bool bPassed = false;  // Assume the worst.

    // Call a wildcard matching routine.
    bResult &= WildTextCompare(tame, wild);

#if defined(COMPARE_PERFORMANCE)
    // Call another wildcard matching routine.  Running a debug build of this 
    // code under a performance profiler makes it easy to compare the timings
    // of the two routines.
    //
    bResult &= GeneralTextCompare(tame, wild, /* bCaseSensitive = */ true);

    // Here you can add more calls to more routines, for multi-way timing 
    // comparisons.
#endif

#if !defined(COMPARE_PERFORMANCE)
    // To assist correctness checking, output the two strings in any failing 
    // scenarios.
    //
    if (bExpectedResult == bResult)
    {
        bPassed = true;
    }
    else
    {
        printf("Failed match on %s vs. %s\n",  tame, wild);
    }
#else
    // For performance profiling, no need to worry about pass/fail outcomes.
    // Though for apples/apples comparisons, it would be helpful if the 
    // outcomes are at least consistent.
    //
    bPassed = true;
#endif

    return bPassed;
}

// This main() routine passes a bunch of test strings into the above code.
// In performance comparison mode, it does that over and over.  Otherwise, 
// it does it just once.  Either way, it outputs a passed/failed result.
//
void main(void)
{
    int  nReps;
    bool bAllPassed = true;

#if defined(COMPARE_PERFORMANCE)
    // Can choose as many repetitions as you're expecting in the real world.
    nReps = 1000000;
#else
    nReps = 1;
#endif

    while (nReps--)
    {
        // Cases with repeating character sequences.
        bAllPassed &= test("abcccd", "*ccd", true);
        bAllPassed &= test("mississipissippi", "*issip*ss*", true);
        bAllPassed &= test("xxxx*zzzzzzzzy*f", "xxxx*zzy*fffff", false);
        bAllPassed &= test("xxxx*zzzzzzzzy*f", "xxx*zzy*f", true);
        bAllPassed &= test("xxxxzzzzzzzzyf", "xxxx*zzy*fffff", false);
        bAllPassed &= test("xxxxzzzzzzzzyf", "xxxx*zzy*f", true);
        bAllPassed &= test("xyxyxyzyxyz", "xy*z*xyz", true);
        bAllPassed &= test("mississippi", "*sip*", true);
        bAllPassed &= test("xyxyxyxyz", "xy*xyz", true);
        bAllPassed &= test("mississippi", "mi*sip*", true);
        bAllPassed &= test("ababac", "*abac*", true);
        bAllPassed &= test("ababac", "*abac*", true);
        bAllPassed &= test("aaazz", "a*zz*", true);
        bAllPassed &= test("a12b12", "*12*23", false);
        bAllPassed &= test("a12b12", "a12b", false);
        bAllPassed &= test("a12b12", "*12*12*", true);

        // Additional cases where the '*' char appears in the tame string.
        bAllPassed &= test("*", "*", true);
        bAllPassed &= test("a*abab", "a*b", true);
        bAllPassed &= test("a*r", "a*", true);
        bAllPassed &= test("a*ar", "a*aar", false);

        // More double wildcard scenarios.
        bAllPassed &= test("XYXYXYZYXYz", "XY*Z*XYz", true);
        bAllPassed &= test("missisSIPpi", "*SIP*", true);
        bAllPassed &= test("mississipPI", "*issip*PI", true);
        bAllPassed &= test("xyxyxyxyz", "xy*xyz", true);
        bAllPassed &= test("miSsissippi", "mi*sip*", true);
        bAllPassed &= test("miSsissippi", "mi*Sip*", false);
        bAllPassed &= test("abAbac", "*Abac*", true);
        bAllPassed &= test("abAbac", "*Abac*", true);
        bAllPassed &= test("aAazz", "a*zz*", true);
        bAllPassed &= test("A12b12", "*12*23", false);
        bAllPassed &= test("a12B12", "*12*12*", true);
        bAllPassed &= test("oWn", "*oWn*", true);

        // Completely tame (no wildcards) cases.
        bAllPassed &= test("bLah", "bLah", true);
        bAllPassed &= test("bLah", "bLaH", false);

        // Simple mixed wildcard tests suggested by IBMer Marlin Deckert.
        bAllPassed &= test("a", "*?", true);
        bAllPassed &= test("ab", "*?", true);
        bAllPassed &= test("abc", "*?", true);

        // More mixed wildcard tests including coverage for false positives.
        bAllPassed &= test("a", "??", false);
        bAllPassed &= test("ab", "?*?", true);
        bAllPassed &= test("ab", "*?*?*", true);
        bAllPassed &= test("abc", "?**?*?", true);
        bAllPassed &= test("abc", "?**?*&?", false);
        bAllPassed &= test("abcd", "?b*??", true);
        bAllPassed &= test("abcd", "?a*??", false);
        bAllPassed &= test("abcd", "?**?c?", true);
        bAllPassed &= test("abcd", "?**?d?", false);
        bAllPassed &= test("abcde", "?*b*?*d*?", true);

        // Single-character-match cases.
        bAllPassed &= test("bLah", "bL?h", true);
        bAllPassed &= test("bLaaa", "bLa?", false);
        bAllPassed &= test("bLah", "bLa?", true);
        bAllPassed &= test("bLaH", "?Lah", false);
        bAllPassed &= test("bLaH", "?LaH", true);

        // Many-wildcard scenarios.
        bAllPassed &= test("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab", 
            "a*a*a*a*a*a*aa*aaa*a*a*b", true);
        bAllPassed &= test("abababababababababababababababababababaacacacacaca\
cacadaeafagahaiajakalaaaaaaaaaaaaaaaaaffafagaagggagaaaaaaaab", 
            "*a*b*ba*ca*a*aa*aaa*fa*ga*b*", true);
        bAllPassed &= test("abababababababababababababababababababaacacacacaca\
cacadaeafagahaiajakalaaaaaaaaaaaaaaaaaffafagaagggagaaaaaaaab", 
            "*a*b*ba*ca*a*x*aaa*fa*ga*b*", false);
        bAllPassed &= test("abababababababababababababababababababaacacacacaca\
cacadaeafagahaiajakalaaaaaaaaaaaaaaaaaffafagaagggagaaaaaaaab", 
            "*a*b*ba*ca*aaaa*fa*ga*gggg*b*", false);
        bAllPassed &= test("abababababababababababababababababababaacacacacaca\
cacadaeafagahaiajakalaaaaaaaaaaaaaaaaaffafagaagggagaaaaaaaab", 
            "*a*b*ba*ca*aaaa*fa*ga*ggg*b*", true);
        bAllPassed &= test("aaabbaabbaab", "*aabbaa*a*", true);
        bAllPassed &= test("a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*", 
            "a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*", true);
        bAllPassed &= test("aaaaaaaaaaaaaaaaa", 
            "*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*", true);
        bAllPassed &= test("aaaaaaaaaaaaaaaa", 
            "*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*", false);
        bAllPassed &= test("abc*abcd*abcde*abcdef*abcdefg*abcdefgh*abcdefghi*a\
bcdefghij*abcdefghijk*abcdefghijkl*abcdefghijklm*abcdefghijklmn", 
            "abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*a\
            bc*", false);
        bAllPassed &= test("abc*abcd*abcde*abcdef*abcdefg*abcdefgh*abcdefghi*a\
bcdefghij*abcdefghijk*abcdefghijkl*abcdefghijklm*abcdefghijklmn", 
            "abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*", true);
        bAllPassed &= test("abc*abcd*abcd*abc*abcd", "abc*abc*abc*abc*abc", 
            false);
        bAllPassed &= test(
            "abc*abcd*abcd*abc*abcd*abcd*abc*abcd*abc*abc*abcd", 
            "abc*abc*abc*abc*abc*abc*abc*abc*abc*abc*abcd", true);
        bAllPassed &= test("abc", "********a********b********c********", 
            true);
        bAllPassed &= test("********a********b********c********", "abc", 
            false);
        bAllPassed &= test("abc", "********a********b********b********", 
            false);
        bAllPassed &= test("*abc*", "***a*b*c***", true);

        // A case-insensitive algorithm test.
        // bAllPassed &= test("mississippi", "*issip*PI", true);
    }

    if (bAllPassed)
    {
        printf("Passed\n");
    }
    else
    {
        printf("Failed\n");
    }

    return;
}


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video