Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

C/C++

Text Searching, C++ and OOPS, and ANSI Strings


DEC89: C PROGRAMMING

This month we are going to start a new "C Programming" column project. We are going to examine some of the problems associated with the maintenance of large text data bases. You might be familiar with the problem. You have a lot of reference material and you do not always know how to find what you are looking for. Sometimes the reference material is stored in text files. If you keep a lot of word processing files at work, or if you download a lot of software with .DOC or READ.ME files, your material is stored somewhere in ASCII text and would be ready to be read if only you could find it. Our new project will address the problems associated with locating the text files that have the information we need.

Someday everyone will have Optical Character Recognition scanners, and we will store all our books and magazines on disk. Maybe the magazines will begin to sell their editorial content on diskettes or CD-ROM as well. When that happens, then all our reference works will be ready for treatment by a document processor similar to the one we will build here in this column.

There are a number of software packages that do what we will build here. Most CD-ROM distributions of large text data bases include such a system. We will build our own. That way each of us can customize it to our specific requirements. This software will be generic standard C. If you want to integrate it into a particular user platform, it should port with little difficulty. Our project will build a system that we will call TEXTSRCH.

There are a number of disciplines that we must address in the development of such a project. First, we must ask, given a large text data base and someone to search it, how does the search proceed? What are the search criteria and how will we express them to the software? These are the first requirements to consider, because they will determine what techniques we use to store and find things. Next, once we identify files of text that match our search criteria, how must the system present them to us?

The TEXTSRCH Data Base

A document retrieval system such as TEXTSRCH works best in an environment where the data files are static. There is a lot of text and it does not change much. Users control changes with addendum or replacement documents at scheduled intervals. Typical applications are engineering specifications, maintenance manuals, and reference libraries.

To build the data base, you collect the documents in a controlled location, perhaps its own subdirectory, and then you build an index. The index supports searches. Some systems require you to manually examine every document and mark all the phrases that will appear in the index. Others build the index by automatically scanning the text and extracting words. We will use this second approach.

You want to use this kind of system on a static data base because the indexing operation can take a long time.

Retrievals

With the data base in place and the index built, you are ready to perform retrievals. Let's first discuss how such searches will proceed. Someone will want to find all references to a word or phrase in the data base. Perhaps you will want every reference to the phrase, "object-oriented programming." You must tell the system what you want, and it must tell you where you can find those references. But suppose you are interested only in references to OOP that are specifically about Smalltalk. Then you might want your search to deliver only those files where both key phrases appear. Another search might want to specifically exclude references to C++, and so you must be able to tell the system about that requirement as well.

Queries

You can see where we are heading. It is apparent at the outset that we need some form of structured query language with Boolean operators. We need to be able to parse an expression such as this one:

  "object oriented" and "smalltalk" but not "C++"

Our expressions will have operators with precedence, so we will need to use parentheses to override the default precedence assigned by the parser, making an expression such as this one possible:

  "object oriented" and ("design" or "programming")

The TEXTSRCH Expression Analyzer

All of what just preceded leads us to our first installment. The first set of functions that we will develop will parse such query expressions and prepare them for a retrieval pass at the data base. Listing One, page 164, is textsrch.h. This listing is one that will grow over the months. As we add features, we will add macros, prototypes, and such to textsrch.h. This month it contains what we need for expression analysis.

Listing Two, page 164, is express.c, the code that parses our expressions. It begins with a modified BNF that describes the query language. As you can see, it isn't very complex. A program that wants to parse an expression calls the lexical_scan function in this module. The function returns a NULL if it finds an error. If the expression is correct, the function returns a pointer to the parsed expression, which is represented in the form of tokens suitable for processing by an expression interpreter.

The lexical_scan function calls the expression function to check the expression for syntax and to convert it into tokens. The syntax check uses a recursive descent parser. That is, if the expression function finds a left parentheses in the stream, it calls itself to validate the expression that is in parentheses, and then, when it returns, it expects to find a right parenthesis. If it finds the unary NOT operator, it calls itself to check the expression that follows the operator. If it finds a word or phrase -- a phrase in our context consists of a group of words surrounded by double quotes -- it looks at the next element in the expression. If that element is the binary AND or OR operator, the expression function calls itself to check the expression to the right of the operator.

While validating the expression, the program converts it into tokens. Each language element is a one-character token, and the tokens are written into an array in the order in which they occur. When the token is a word or phrase, which is assigned the OPERAND token, an associated string contains the value of the word or phrase.

After the expression is found to be correct and the tokens are built, the lexical_scan function calls the postfix function. Its purpose is to convert the token stream from infix notation to postfix notation. An expression in infix notation uses defined operator precedence with parentheses to override the defaults. This is the notation we use when we compose the query. It is also the notation used by the C language. The postfix notation (also called Reverse Polish Notation) eliminates the parentheses and represents precedence by the proximity of tokens to one another. Users of SNOBOL, Forth, and TI scientific calculators will recognize the notation.

An infix expression places a binary operator between the operands as shown here:

     this and that

The postfix version of this expression pushes the operands on a stack followed by the operator as shown here:

     <and>
    that
    this

(In these examples, I've used the angle brackets to set operators apart from the operands on the stack.)

A parenthesized infix expression such as this:

     (this and that) or what

is represented on the postfix stack this way:

     <or>
    what
    <and>
    that
    this

Evaluation of a postfix expression is a stack operation. The top element is popped. If the element is an operand, we use the operand to derive the value of the expression. If the element is a unary operator, the next element is popped and evaluated, the unary operator is applied to it, and the result is pushed. If the element is a binary operator, the next two operators are popped, the binary operator is applied to them, and the result is pushed. Then the procedure repeats until the element popped is a single result. Each pop is a recursive repeat of the entire procedure. In our case the pushed results are true or false values within the data base document depending on the search results. Each time we pop a word or phrase, we see if it is in the data base and, if so, in which files. The result of that search is pushed. We'll get further into the details of expression evaluation next month.

Conversion from infix notation to postfix notation is a simple matter. The procedure is described in detail in Fundamentals of Data Structures, Horowitz and Sahni, 1976, Computer Science Press, on pages 91 - 97. We implement this procedure in the functions postfix, isp, icp, and poststack.

Listing Three, page 167, is testexpr.c, a program to test our expression analyzer and display the results. It is a throwaway program not to be used in the project beyond its purpose as a test and demonstration tool. You type your expression into the gets function. Use words and phrases with interspersed operators. The valid operators are AND, OR, and NOT. The purpose of the query is to express a pattern of words and phrases that do or do not appear in the text data base. The TEXTSRCH program will find files that match the query expression. Typical expressions are these:

     (fortran or cobol) and not pascal
    fortran or (cobol and not pascal)

The testexpr program will report a syntax error by displaying a pointer symbol below the expression element where it found the error. If the expression is correct, the program will display the postfix stack in this format:

Enter the search expression: fortran or (cobol and not pascal)

  Enter the search expression:
fortran or (cobol and not pascal)

   Token
------------------------------
   <or>
   <and>
   <not>
   pascal
   cobol
   fortran
------------------------------
   Enter the search expression:

Press the Enter key with no expression to terminate the program. Next month we'll discuss how we will index the data base and how that index will deliver our query results.

Book Report: Object-Oriented Program Design With Examples in C++

Are you looking for a stocking stuffer for yourself this season? There is a new book called Object-Oriented Program Design With Examples in C++, by Mark Mullins. The book is more about object-oriented programming and design than about C++, but it uses C++ for examples and is an indispensable addition to your library if you are trying to figure out OOP and C++.

Mullin takes a unique approach to these subjects. Instead of presenting just another enhancement to the AT&T C++ reference documents and Stroustrup's book, he teaches the basics of object-oriented design and uses C++ to demonstrate the lessons. Follow this book from beginning to end, and you will have a manageable grasp on object-oriented design and programming by the time you're done.

Mullin's book also teaches valuable lessons about the general approach a programmer uses in the design of software systems, object oriented or not, and he does it with a writing style that informs and often entertains the reader. Every now and then he slips in a dry joke that goes by almost unnoticed.

To show you how to design an object-oriented system, Mullin undertakes the development of an integrated information management system for a fictional company. This is the first extensive treatment of OOP I've seen that uses real, believable examples of things that a programmer might encounter in the real world. Most writers build contrived analogies that attempt to relate abstract examples of toasters, fruit, or other silly non-automata to the mysterious things that the object-oriented paradigm calls objects. Mullin uses inventory, sales and personnel database records to demonstrate the development of a class hierarchy to support his hypothetical design. These are things we can all associate with.

Perhaps you've never worked on an inventory system, but you can readily see what one might need to do. If there is a warehouse full of stuff and you need to locate specific items, fill orders for items, and replace what you sell, then you need an inventory system. I've worked on several. The smallest was for a video tape store; the largest was for spare parts for the Space Shuttle. Their respective functional requirements were similar. The inventory management folks call it "material requirements planning." George Carlin calls it "keeping track of your stuff." Mullin takes this mundane application, one that everyone will recognize, and designs an object-oriented software system to support it.

Mullin traces the design in the manner in which it might actually occur. The false starts, the refinement of requirements, the incremental development of an object-oriented data model, he presents them all in the sequence in which you would encounter them during a real-design project. As he proceeds, Mullin tells you how you should be thinking about the objects and their place in the class hierarchy.

Mullin's design sometimes tends to build his data base by using views that seem -- at least to me -- to be unnatural. Object-oriented design involves building base classes and derived classes in a class hierarchy. Because every entity in his data base has a name and address, Mullin starts out with a new class named "Entity" that contains name and address members. Then he derives everything else from that class. This is not the natural view of things. We are not necessarily always subordinate to our common denominators. Because we all have addresses does not mean that we are subordinate to those addresses. I think we are seeing the tendency of the object-oriented paradigm to bend our perspective in inappropriate directions. The solution to a problem should resemble the problem, and names and addresses are components of their owners, not the other way around. I would have created the Entity class with its name and address members and then included an instance of that class in each of the classes that require it. The functional result would be the same but we would not be forced to perceive our data base from an illogical perspective just to adhere to hierarchical dogma. But then, I do not know enough about object-oriented design to write a book. Not yet, at least.

My only other criticism of this book is that Mullin and his editors do not seem to know that the word "data" is a plural noun. But then, many of my colleagues here at DDJ don't know that either, and I forget sometimes, too. That criticism aside, this is a very good book. OOP and C++ have been begging for literary treatments this good, and I'm happy to see it finally happening.

The ANSI Corner: Preprocessing and Strings

The ANSI standard C specification has introduced something the committee calls "stringizing," not a real word, an abomination actually, but a new concept in preprocessing, nonetheless. (If K&R can give us "initializer," also not a word, I guess X3J11 similarly presumed that they could extend English while they extended C.)

The trouble with the ANSI document is that it often fails to provide any basis for a particular extension to classic C. In too many cases the committee leaves it to us to figure out why they included a particular feature. They will describe the feature and perhaps provide examples of how a conforming compiler must behave, but they do not adequately provide the motive behind it all. The rationale document that accompanies the specification is supposed to fill these information gaps but too often it does not or tries and fails. Such is the case with the preprocessor's # operator on #define macro parameter substitutions, the so-called "stringizing" feature. They tried to explain it, but I do not understand the explanation. If you do not already know how you might use the # feature, you might not be able to guess its purpose from the description in the specification even when you know what it does. What we can do is look at what it does and see if this solution solves some problem we might have. The # operator turns a parameter into a string. That's all it does. Here is an example:

   #define str(x) # x
   printf(str(HELLO));

This macro expands the hello parameter into a string so that the printf call looks like this after the substitution:

     printf("HELLO");

The preprocessor performs any other translations on the parameters before it builds the string as shown here:

   #define HELLO goodbye
   #define str(x) # x
   printf(str(HELLO));

This sequence converts to this:

     printf("goodbye");

Why would you want to do any of that? If you know you want HELLO to be "HELLO" or goodbye to be "goodbye," why not just code it that way in the first place? Let's try to dream up a circumstance where this feature might have some use. One possibility that comes to mind is in the realm of debugging. You can use the # operator to create a TRACE macro that traces selected expressions in your program on the console. You can turn the TRACE on and off with a global compile-time variable. Here is the macro:

  #ifdef DEBUGGING
        #define TRACE(x)(printf("\n"# x),(x))
  #else
        #define TRACE(x) (x)
  #endif

Suppose your program has an expression that you want to trace on the console. If this is the expression:

     b = strlen ("12345");

To trace this expression, you can recode it with the TRACE macro like this:

     b = TRACE(strlen ("12345"));

Now whenever the expression is executed with the DEBUGGING global variable defined, the expression's code is displayed on the console, as well as being evaluated because the statement is preprocessed to this sequence:

     b = (printf("\ n" "strlen (\ "12345 \")"),
        (strlen("12345")));

Notice that the two components of the macro expansion, the printf and the strlen calls are comma separated. This guarantees that the strlen, which is the rightmost expression, returns the value required when the expression is evaluated. Notice also that the two are surrounded by parentheses. If they were not, we would get the wrong value in the "b" integer because the assignment operator has a higher precedence than the comma operator.

Next month the ANSI Corner will explore the ## token pasting operator. We will, that is, if I can figure out a useful application for token pasting.

_C PROGRAMMING COLUMN_ by Al Stevens

[LISTING ONE]

<a name="0275_000b">

/* ----------- textsrch.h ---------- */

#define MXTOKS 25       /*  maximum number of tokens */

#define OK    0
#define ERROR !OK

struct postfix  {
    char pfix;      /* tokens in postfix notation */
    char *pfixop;   /* operand strings            */
};

extern struct postfix pftokens[];
extern int xp_offset;

/* --------- expression token values ---------- */
#define TERM     0
#define OPERAND 'O'
#define AND     '&'
#define OR      '|'
#define OPEN    '('
#define CLOSE   ')'
#define NOT     '!'
#define QUOTE   '"'

/* ---------- textsrch prototypes ---------- */
struct postfix *lexical_scan(char *expr);







<a name="0275_000c"><a name="0275_000c">
<a name="0275_000d">
[LISTING TWO]
<a name="0275_000d">

/* ---------- express.c ----------- */

#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include "textsrch.h"

/*
 * Parse a search expression into a valid postfix token stream.
 * The input expression has this form:
 *   <expr>   := <ident>
 *               <ident> <op> <expr>
 *               NOT <expr>
 *               (<expr>)
 *   <op>     := AND
 *               OR
 *   <ident>  := <character>
 *               <character> <ident>
 *               "<phrase>"
 *   <phrase> := <ident>
 *               <ident> <space> <phrase>
 */

#define iswhite(c)      (c < 33 && c > 0)
#define isparen(c)      (c == OPEN || c == CLOSE)
#define isop(c)         (c == AND || c == OR)
#define ischaracter(c)  (!isparen(c) && \
                        !isop(c)     && \
                        c != TERM    && \
                        !iswhite(c)  && \
                        c != QUOTE)

/* ----------- prototypes --------------- */
static int getword(char *expr);
static int gettoken(char *expr);
static int expression(char *expr);
static void postfix(void);
static int isp(int tok);
static int icp(int tok);
static void poststack(void);

int xp_offset = 0;              /* offset into the expression */
static char word[50];           /* word from the expression   */

static char tokens[MXTOKS+1];   /* tokens in infix notation   */
static char *operands[MXTOKS];  /* operand strings            */
static int token_ptr = 0;

static char stack[MXTOKS];      /* stack for tokens           */
static char *stopr[MXTOKS];     /* operand strings            */
static int top = 0;

struct postfix pftokens[MXTOKS];
static int pf = 0;

/* ------ analyze the expression for valid syntax ----------*/
struct postfix *lexical_scan(char *expr)
{
    token_ptr = xp_offset = 0;
    if (expression(expr) == ERROR)
        return NULL;
    else if (gettoken(expr) != TERM)
        return NULL;
    postfix();
    return pftokens;
}

/* ---------- analyze an element of the expression ---------- */
static int expression(char *expr)
{
    int tok;

    tok = gettoken(expr);
    switch (tok)    {
        case OPEN:
            if (expression(expr) == ERROR)
                return ERROR;
            if ((tok = gettoken(expr)) != CLOSE)
                return ERROR;
            break;
        case NOT:
            if (expression(expr) == ERROR)
                return ERROR;
            break;
        case OPERAND:
            break;
        default:
            return ERROR;
    }
    tok = gettoken(expr);
    switch (tok)    {
        case TERM:
            return OK;
        case AND:
        case OR:
            return expression(expr);
        case CLOSE:
            --token_ptr;
            --xp_offset;
            return OK;
        default:
            break;
    }
    return ERROR;
}

/* ------- extract the next token from the expression ------- */
static int gettoken(char *expr)
{
    char tok;

    operands[token_ptr] = 0;
    if ((tok = getword(expr))== OPERAND)    {
        operands[token_ptr] = malloc(strlen(word) + 1);
        strcpy(operands[token_ptr], word);
    }
    tokens[token_ptr++] = tok;
    return tok;
}

/* ------- extract a word, operator, parenthesis,
           or terminator from the expression ------------- */
static int getword(char *expr)
{
    int w = 0;

    /* ------- bypass white space -------- */
    while (iswhite(expr[xp_offset]))
        xp_offset++;
    switch (expr[xp_offset])    {
        case OPEN:
        case CLOSE:
            return expr[xp_offset++];
        case TERM:
            return TERM;
        case QUOTE:
            while (expr[++xp_offset] != QUOTE)  {
                if (w == 50)
                    return ERROR;
                word[w++] = tolower(expr[xp_offset]);
            }
            xp_offset++;
            word[w] = '\0';
            return OPERAND;
        default:
            while (ischaracter(expr[xp_offset]))    {
                if (w == 50)
                    return ERROR;
                word[w++] = tolower(expr[xp_offset]);
                xp_offset++;
            }
            word[w] = '\0';
            if (strcmp(word, "and") == 0)
                return AND;
            else if (strcmp(word, "or") == 0)
                return OR;
            else if (strcmp(word, "not") == 0)
                return NOT;
            return OPERAND;
    }
}

/* - convert the expression from infix to postfix notation - */
static void postfix(void)
{
    char tok = '*';

    top = token_ptr = pf = 0;
    stack[top] = '*';
    while (tok != TERM) {
        switch (tok = tokens[token_ptr])    {
            case OPERAND:
                pftokens[pf].pfix = tok;
                pftokens[pf].pfixop = operands[token_ptr];
                pf++;
                break;
            case NOT:
            case OPEN:
            case AND:
            case OR:
                while (isp(stack[top]) >= icp(tok))
                    poststack();
                stack[++top] = tok;
                break;
            case CLOSE:
                while (stack[top] != OPEN)
                    poststack();
                --top;
                break;
            case TERM:
                while (top)
                    poststack();
                pftokens[pf++].pfix = tok;
                break;
        }
        token_ptr++;
    }
}

static int isp(int tok)
{
    return ((tok == OPEN) ?  0 :
            (tok == '*')  ? -1 :
            (tok == NOT)  ?  2 :
                             1  );
}

static int icp(int tok)
{
    return ((tok == OPEN) ?  4 :
            (tok == NOT)  ?  2 :
                             1  );
}

/* --- remove a token from the stack, put it into postfix --- */
static void poststack(void)
{
    pftokens[pf].pfix = stack[top];
    pftokens[pf].pfixop = stopr[top];
    --top;
    pf++;
}





<a name="0275_000e"><a name="0275_000e">
<a name="0275_000f">
[LISTING THREE]
<a name="0275_000f">

/* ----------- testexpr.c ------------- */

/*
 * A program to test the TEXTSRCH expression analyzer
 */

#include <stdio.h>
#include <process.h>
#include <string.h>
#include "textsrch.h"

static void disp_token(struct postfix *pf);

void main(void)
{
    char expr[80];

    do  {
        /* ----- read the expression from the console ------ */
        printf("\nEnter the search expression:\n");
        gets(expr);
        if (*expr)  {
            /* --- scan for errors and convert to postfix --- */
            if (lexical_scan(expr) == NULL) {
                while(xp_offset--)
                    putchar(' ');
                putchar('^');
                printf("\nSyntax Error");
                exit(1);
            }

            /* ------ display the postfix tokens ------ */
            printf("\nToken");
            printf("\n--------------");
            disp_token(pftokens);
            printf("\n--------------");
        }
    } while (*expr);
}

static void disp_token(struct postfix *pf)
{
    if (pf->pfix != TERM)   {
        disp_token(pf+1);
        printf("\n %s", pf->pfix == AND  ? "<and>" :
                        pf->pfix == OR   ? "<or>"  :
                        pf->pfix == NOT  ? "<not>" :
                        pf->pfixop);
    }
}












Copyright © 1989, Dr. Dobb's Journal


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.