RSS

Parallel

Matching Wildcards: An Algorithm

Source Code Accompanies This Article. Download It Now.


Author Footnote

Dear Dr. Dobb's,

Dr. Dobb's reader Benjamin Lunt has kindly done some performance comparisons using my algorithm and a recursive algorithm from wildcard text matching code used by a colleague of his. He found the recursive code slightly faster than the version of my code that appears in Dr. Dobb's, in some circumstances. In other circumstances, my code was faster. But that got me thinking about performance.

I realized that my algorithm could benefit from skipping an extra assignment of the w variable, which occurred any time after a wildcard ("*") character had been encountered. Needless assignments can force needless cache memory updates, possibly leading to serious slowdowns on modern CPUs. Then I thought of a way to correct this. The new performance-tuned code is below. It retains the fixes for repeating character patterns and for case sensitivity that I previously provided. I've tested the new version in every way I could think of, and I've sent it out to Ben and some other interested Dr. Dobb's readers who've also shared with me various thoughts about the algorithm. Ben has very helpfully tested the performance, as he did with the original version; his remarks appear after the new code below.

In answer to Ben's comment about nonmatching situations requiring more cycles than matching situations, let me point out the following. Because this is wildcard-based matching, you may have to do a lot of checking to know for certain that you don't have a match. Perhaps there's no way around that sort of performance limitation in wildcard text matching algorithms. At least I haven't thought of a faster approach, and I have yet to see evidence that one exists -- but then again you never know.

Regards,
Kirk

//This function compares text strings, one of which can have wildcards ('*').
//
BOOL GeneralTextCompare(
        char * pTameText,             // A string without wildcards
        char * pWildText,             // A (potentially) corresponding string with wildcards
        BOOL bCaseSensitive = FALSE,  // By default, match on 'X' vs 'x'
        char cAltTerminator = '\0'    // For function names, for example, you can stop at the first '('
)
{
        BOOL bMatch = TRUE;
        char * pAfterLastWild = NULL; // The location after the last '*', if we’ve encountered one
        char * pAfterLastTame = NULL; // The location in the tame string, from which we started after last wildcard
        char t, w;

        // Walk the text strings one character at a time.
        while (1)
        {
                t = *pTameText;
                w = *pWildText;

                // How do you match a unique text string?
                if (!t || t == cAltTerminator)
                {
                        // Easy: unique up on it!
                        if (!w || w == cAltTerminator)
                        {
                                break;                                   // "x" matches "x"
                        }
                        else if (w == '*')
                        {
                                pWildText++;
                                continue;                           // "x*" matches "x" or "xy"
                        }
                        else if (pAfterLastTame)
                        {
                                if (!(*pAfterLastTame) || *pAfterLastTame == cAltTerminator)
                                {
                                        bMatch = FALSE;
                                        break;
                                }
                                pTameText = pAfterLastTame++;
                                pWildText = pAfterLastWild;
                                continue;
                        }

                        bMatch = FALSE;
                        break;                                           // "x" doesn't match "xy"
                }
                else
                {
                        if (!bCaseSensitive)
                        {
                                // Lowercase the characters to be compared.
                                if (t >= 'A' && t <= 'Z')
                                {
                                        t += ('a' - 'A');
                                }

                                if (w >= 'A' && w <= 'Z')
                                {
                                        w += ('a' - 'A');
                                }
                        }

                        // How do you match a tame text string?
                        if (t != w)
                        {
                                // The tame way: unique up on it!
                                if (w == '*')
                                {
                                        pAfterLastWild = ++pWildText;
                                        pAfterLastTame = pTameText;
                                        w = *pWildText;

                                        if (!w || w == cAltTerminator)
                                        {
                                                break;                           // "*" matches "x"
                                        }
                                        continue;                           // "*y" matches "xy"
                                }
                                else if (pAfterLastWild)
                                {
                                        if (pAfterLastWild != pWildText)
                                        {
                                                pWildText = pAfterLastWild;
                                                w = *pWildText;
                                               
                                                if (!bCaseSensitive && w >= 'A' && w <= 'Z')
                                                {
                                                        w += ('a' - 'A');
                                                }

                                                if (t == w)
                                                {
                                                        pWildText++;
                                                }
                                        }
                                        pTameText++;
                                        continue;                           // "*sip*" matches "mississippi"
                                }
                                else
                                {
                                        bMatch = FALSE;
                                        break;                                   // "x" doesn't match "y"
                                }
                        }
                }

                pTameText++;
                pWildText++;
        }

        return bMatch;
}

From Dr. Dobb's reader Benjamin Lunt...

Hi Kirk,

Yes, much faster. At 10,000,000 iterations,

tame = "mississippi.river";
wild = "*.*";
original code: 57 seconds
new code: 2 seconds
my friends code: 46 seconds

wild = "*ip*";
47, 1, 43 respectively

wild = "m*i*.*r";
77, 3, 66 respectively

wild = "m*x*.*r";
69, 23, 66 respectively

Good job. However, if you will notice, it is extremely faster if it is a match, but quite slow (though still faster than the original) if not a match. I figured it should be the other way around. Once you find the match, you stop looking.

It is finding non matches that should be the faster code.

I didn't look too closely at your code yet, but could it be switched so that it was faster on a non match than on a match?

Anyway, I was surprised in the speed increase from the last code. I thought I did something wrong, but no, it was your code that was just much faster :-)

Thanks,
Ben


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Best of the Web

First C Compiler Now on Github

The earliest known C compiler by the legendary Dennis Ritchie has been published on the repository.

Quick Read

HTML5 Mobile Development: Seven Good Ideas (and Three Bad Ones)

HTML5 Mobile Development: Seven Good Ideas (and Three Bad Ones)

Quick Read

Building Bare Metal ARM Systems with GNU

All you need to know to get up and running... and programming on ARM

Quick Read

Amazon's Vogels Challenges IT: Rethink App Dev

Amazon Web Services CTO says promised land of cloud computing requires a new generation of applications that follow different principles.

Quick Read

How to Select a PaaS Partner

Eventually, the vast majority of Web applications will run on a platform-as-a-service, or PaaS, vendor's infrastructure. To help sort out the options, we sent out a matrix with more than 70 decision points to a variety of PaaS providers.

Quick Read


More "Best of the Web" >>

Most Popular

Video