Awk As a C Code Generator


AUG90: AWK AS A C CODE GENERATOR

The unique features of this special-purpose language can greatly improve development time

This article contains the following executables: BALDWIN.LST

Wahhab has more than 20 years' experience in software development and is currently the owner of Baldwin Software Services, a small company specializing in helping large companies apply new technology and improve their software development process. He can be reached at 1011 Union St., Manchester, NH 03104.


The AWK language is one of the gifts that the Unix environment has given to us. Named for the initials of its developers (Alfred Aho, Peter Weinberger, and Brian Kernighan), AWK was originally designed to be a simple utility for "quick and dirty" programming tasks. Since then, it has gradually evolved into a powerful tool capable of performing many time-saving tasks for programmers.

This article will introduce the AWK language and present an AWK application that functions as a C code generator. While the specific problem solved here may not be immediately useful to you, the principles applied to solve the problem will give you ideas about how you, too, can use AWK to handle your labor-intensive tasks.

The AWK Language

The AWK language is an example of a special-purpose language. Special-purpose languages are not designed to solve every programming problem; instead, they greatly ease the development effort within the domain for which they are intended. AWK was designed to be a tool for writing programs that read one or more sequential files (or standard input) and then write a file to standard output. While AWK can be used for developing programs that handle other tasks, it excels at tasks of this sort.

AWK saves you time because it makes some assumptions and performs many actions without the need for you to ask that it do so. An AWK program consists of a sequence of pattern-action statements (as described shortly) and function definitions. Essentially, AWK executes the following cycle:

    1. Read a line from standard input.

    2. Parse the line into fields. The special variable $0 is assigned the whole line (without the trailing newline). The special variable $1 holds the first field, and $n holds the nth field.

    3. Set other variables as follows: Set NF to the number of fields, NR (number of records) to the line number, FNR to the line number within the current input file, and FILENAME to the name of the current input file. Thus, $NF holds the last field of the line.

    4. Process the record according to each of the pattern-action statements in order.
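
The cycle above can be observed directly. The following is a minimal sketch, assuming a POSIX shell with an awk command on the path; the two input lines and the variable name cycle_out are made up for illustration.

```shell
# Hypothetical two-line input fed to awk on standard input.
# For each record, report its number, its field count, and its
# first and last fields.
cycle_out=$(printf 'alpha beta gamma\none two\n' |
    awk '{ print NR ": " NF " fields, first=" $1 ", last=" $NF }')
echo "$cycle_out"
```

For this input, the first record reports three fields ending in gamma, and the second reports two fields ending in two.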

A pattern-action statement takes the form: pattern {action}

The statement can be read in this way: If the input line matches the pattern (or condition), then perform the action. Either the pattern or the action can be omitted. If the pattern is omitted, the action is performed for every input line. If the action is omitted, the default action is to print the input line.
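
The two defaults are easiest to see side by side. This sketch assumes a POSIX shell and an awk on the path; the three names in the input and the shell variable names are invented for the example.

```shell
# Hypothetical three-line input.
input=$(printf 'Fred Smith\nJane Doe\nAlfred Jones\n')

# Pattern only: the default action prints each matching line.
# Matching is case-sensitive, so "Alfred" does not match /Fred/.
matched=$(printf '%s\n' "$input" | awk '/Fred/')

# Action only: the action runs for every input line.
firsts=$(printf '%s\n' "$input" | awk '{ print $1 }')

# Pattern and action together.
both=$(printf '%s\n' "$input" | awk '/Fred/ { print $2 }')

printf '%s\n---\n%s\n---\n%s\n' "$matched" "$firsts" "$both"
```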

The syntax of AWK is very similar to that of C, except that in AWK a newline can replace C's ubiquitous semicolon. Also, variables can be used in AWK without being declared, and they are initialized either to zero or to a null string, depending on how they are used.

A crucial feature of AWK that is not present in C is the ability to perform regular expression matching. Regular expression matching is used by the Unix grep utility, and has been adopted by several text editors, including Brief. AWK compares a pattern string (usually enclosed between slashes) against another string (which, by default, is the input line), and then returns true or false depending on whether the pattern is matched. (For more details, see the accompanying text box entitled "Regular Expressions.")

At this point, without explaining the command syntax in detail, we can explore some small but still useful AWK programs. For example, /Fred/ prints all lines in the specified input that contain the string "Fred". The following one-line program prints its input file with line numbers: {print FNR, $0}

If alignment of the input is necessary, then the printf statement familiar to all C programmers can be used: {printf("%4d %s\n", FNR, $0)}

The comma between the two parameters in the print statement causes an output field separator (a space by default) to be placed between the fields. If the space is not desired, two strings can be concatenated in AWK by simply naming them one after another (that is, print FNR $0). Note that printf requires an explicit newline, while print, like its Basic counterpart, appends a newline automatically.
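
The difference between the comma, plain concatenation, and printf can be checked with a one-line input. This is a small sketch assuming a POSIX shell and awk; OFS is AWK's built-in output field separator variable, and the shell variable names are arbitrary.

```shell
# Three ways of writing the same two values.
sep_out=$(printf 'hello\n' | awk '{
    print FNR, $0                 # comma: values joined by OFS
    print FNR $0                  # juxtaposition: plain concatenation
    printf("%4d %s\n", FNR, $0)   # printf needs an explicit newline
}')

# Changing OFS changes what the comma inserts.
dash_out=$(printf 'hello\n' | awk 'BEGIN { OFS = "-" } { print FNR, $0 }')

printf '%s\n%s\n' "$sep_out" "$dash_out"
```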

Two special patterns, BEGIN and END, match once before the first input record has been read and once after the last input record has been read, respectively. For instance, Example 1 shows how to total a series of numbers.

Example 1: A simple AWK program to total a series of numbers.

       {total += $1}
  END  {print total}  #  Total for all input

The += operator, when used as shown in Example 1, is equivalent to total = total + $1. Also note that the pound sign (#) indicates a comment.

These programs are typically run by first saving them as a file (say, TOTAL.AWK) and then invoking the file by the command line: AWK -f TOTAL.AWK DATA1.DAT DATA2.DAT

DATA1.DAT and DATA2.DAT serve as the input files (DATA*.DAT could also be used). The output goes to standard output and is typically redirected into a file by using >. Short programs can be passed directly as a parameter to AWK: AWK '{print FNR, $0}' prog.c > progc.lst

This last method is more useful under Unix than under MS-DOS, because MS-DOS restricts both the length of the command line and the use of < or > in a parameter.

Note that AWK is traditionally implemented as an interpreter, although an AWK compiler for DOS now exists (see the text box entitled "AWK for DOS"). Most AWK programs are small and I/O bound, so the quick debug cycle of an interpreter generally outweighs the speed benefits of a compiler.

AWK contains many useful built-in functions for handling strings. For instance, the program in Example 2 performs global substitutions on the input file, changing British spelling to American. Other string-handling routines include sub, which performs a single substitution; length and substr, which return the length and substrings, respectively; and index, which finds the index of the first occurrence of one string as a substring of another.

Example 2: A built-in AWK function for global substitution.

  {gsub(/aluminium/, "aluminum")}
  {gsub(/colour/, "color")}
  {print}                          # write out the modified line
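
The other string functions named above can be tried on a fixed string. The sketch below assumes a POSIX shell and awk; the sample string and the variable name str_out are made up for the example.

```shell
# Exercise length, substr, index and sub on one sample string.
str_out=$(awk 'BEGIN {
    s = "the colour of aluminium"
    print length(s)             # number of characters in s
    print substr(s, 5, 6)       # 6 characters starting at position 5
    print index(s, "of")        # position of the first "of"
    sub(/colour/, "color", s)   # replace the first match only
    print s
}')
echo "$str_out"
```

Positions in AWK strings start at 1, so substr(s, 5, 6) picks out the word "colour" and index reports 12 for "of".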

AWK is unusual in that all variables are treated either as strings or as numbers, depending on how they are used. Arrays in AWK are associative arrays whose subscripts are strings, rather than numbers. This powerful facility, also found in other languages such as SNOBOL and REXX, allows the construction in memory of a structure that is similar to a keyed file. Thus, the program in Example 3 produces a list of each text word used in the input file along with the number of times that the text word occurs.

Example 3: Using strings rather than integers as array subscripts

       # Program to count word usage in input.
       # eliminate all non-letters except space
       {    gsub(/[^A-Za-z ]/, "")         }

       # add words to associative array
       {    for (i = 1; i <= NF; i++)
                 ++word[$i] }

       # print each array entry
  END  {    for (j in word)
                 print j, word[j]  }

I have found AWK to be very useful during the process of converting a large system over from one database manager, such as Informix's C-ISAM, to another. C-ISAM stores data internally in an operating system-independent format. This means that each field must be converted to and from this format when records are read or written. I needed to convert many different record types, each with many fields, and the offsets for each field had to be calculated exactly. I used the AWK program in Listing One (see page 116) to generate the C code required.

In this case, my input files were C header files that contained record structures. In turn, some of these record structures contained other structures or arrays of structures. My output was C source code to move the data from these records as pointed to by a pointer called p. A bit of sample input and output are shown in Figure 1 and Figure 2, respectively.

Figure 1: Sample input for Listing One

  struct rec1 {
     long r1_id_no;
     char r1_name[51];
     int rc_value;

  };
  struct rec2 {
     long r2_id_no;
     int r2_seq;
     char r2_code;
     struct {
        char r2_state_cd[3];
        long r2_state_eff_dt;
     } r2_st_range[51];
     struct {
        int r2_value;
        char r2_use_cd[3];
     } r2_not_array;
     struct {
        char r2_work_cd[3];
        long r2_work_dt;
     } r2_work_list[6];
     long r2_the_end;
  };

Figure 2: C code generated by the AWK program in Listing One

  /* rec1 */
        stlong(p.rec1->r1_id_no , inf_rec + 0);
        stchar(p.rec1->r1_name , inf_rec + 4);
        stint(p.rec1->rc_value , inf_rec + 54);
  /* rec2 */
        stlong(p.rec2->r2_id_no , inf_rec + 0);
        stint(p.rec2->r2_seq , inf_rec + 4);
        stchar(p.rec2->r2_code , inf_rec + 6);
        for (i = 0; i < 51; i++) {
            stchar(p.rec2->r2_st_range[i].r2_state_cd , inf_rec + i * 6 + 7);
            stlong(p.rec2->r2_st_range[i].r2_state_eff_dt , inf_rec + i * 6 + 9);
        }
        stint(p.rec2->r2_not_array.r2_value , inf_rec + 313);
        stchar(p.rec2->r2_not_array.r2_use_cd , inf_rec + 315);
        for (i = 0; i < 6; i++) {
            stchar(p.rec2->r2_work_list[i].r2_work_cd , inf_rec + i * 6 + 317);
            stlong(p.rec2->r2_work_list[i].r2_work_dt , inf_rec + i * 6 + 319);
        }
        stlong(p.rec2->r2_the_end , inf_rec + 353);

All of my header files had the same basic format, so I didn't try to make the AWK program more general. For example, the opening struct statement and the closing brace for each record must start in column 1, while all other lines must be indented. (After all, this was a one-shot program, designed to be used once and then thrown away.)

The first line in Listing One changes AWK's default input field separator from its usual value of white space to any mix of spaces, tabs, or open brackets. A quirk of AWK is that when you modify the input field separator, the leading spaces on a line are treated as a separator, so the first field is empty. If you use the default, the leading spaces are ignored. The field names (which are indented) therefore show up as the second field, or as $2, rather than $1.
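
The quirk is easy to reproduce. This sketch assumes a POSIX shell and awk, and reuses the FS setting from Listing One on one indented line from Figure 1; the shell variable names are arbitrary.

```shell
line='     long r1_id_no;'

# Default separator: leading blanks are ignored, so $1 is "long".
default_fs=$(printf '%s\n' "$line" | awk '{ print NF ":" $1 }')

# Separator from Listing One: the leading blanks count as a separator,
# so $1 is empty and the type name moves to $2.
custom_fs=$(printf '%s\n' "$line" |
    awk 'BEGIN { FS = "[ \t[]+" } { print NF ":" $1 ":" $2 }')

printf '%s\n%s\n' "$default_fs" "$custom_fs"
```

The second result shows one more field than the first, with nothing between the first two colons.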

The next statement of the program deals with input lines such as: struct sample_rec {, and resets offset, which is the running byte offset within the record (the counter j is reset separately, when a nested structure opens). At the same time, the record name is saved for use in the emit function described in the next paragraph.

The bulk of the program deals with the individual fields. If the field occurs within a structure inside of the main record structure, the information in the field must be saved until the structure name and the number of occurrences are encountered. The function setup does this, using AWK arrays to hold the information. In the normal case, the emit function is called. This function drops any trailing semicolon from the variable name (from a line such as int field;), and then prints the line of C code required. The print statement in emit uses the C operator ? to print the second parameter required by strings. A simpler but longer approach would require the use of an extra variable as shown in Example 4.

Example 4: Alternative to using the ? operator

  if (type == "string")
       last_part = ", " l ");"
  else
       last_part = ");"
  print "\t" type "(p." rec "->" varname \
       ", inf_rec + " offset last_part

AWK does not have function pointers. I wrote the f function to invoke the appropriate function and to pass on the remaining parameters.

The last remaining task occurs when the closing brace to a structure within a structure is found. If NF > 3, then this is an array of structures, and a C language for loop is printed. The field separator includes the open bracket but not the close bracket, so in a line such as } inner_struct[5]; the field $4 is set to 5];. The trick $4 + 0 forces this to a numeric value.
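
The coercion can be checked on the sample line from the paragraph above. This is a sketch assuming a POSIX shell and awk, with the FS value taken from Listing One; the variable name coerce_out is arbitrary.

```shell
# Print the field count, the raw fourth field, and its numeric value.
coerce_out=$(printf '     } inner_struct[5];\n' |
    awk 'BEGIN { FS = "[ \t[]+" } { print NF, $4, $4 + 0 }')
echo "$coerce_out"
```

When a string is used in arithmetic, AWK takes its leading numeric prefix, so the trailing ]; is dropped.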

The AWK for loop uses the temporary variable k and then calls the function emit2, which is a close copy of emit, except that it calculates an offset from inf_rec. When emit2 has spit out one line for each variable in the inner structure, it prints the closing brace to the C for loop, and calculates an updated offset into the C-ISAM record.

This AWK program handles the process of moving data from the C structure to the C-ISAM record. I wrote an almost identical program to handle the reverse process of moving the data from the C-ISAM record to the C structure. The only problem that had to be addressed differently in the second program was how to handle slack bytes. Different C compilers place slack bytes in various places to encourage numbers and structures to be word-aligned. Based on the conditions that your particular compiler and settings require, the addition of a line such as this will force the offset to a word boundary: offset += offset % 2
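
The effect of that adjustment on a few offsets can be tabulated as follows. This is a sketch assuming a POSIX shell and awk, and assuming two-byte word alignment as in the line above; the variable names are illustrative.

```shell
# Round each offset up to the next even value.
align_out=$(awk 'BEGIN {
    for (offset = 5; offset <= 8; offset++) {
        aligned = offset
        aligned += aligned % 2      # odd offsets move up by one byte
        print offset " -> " aligned
    }
}')
echo "$align_out"
```

Even offsets are left alone, because their remainder is zero. A compiler that aligns on four-byte boundaries would need a different expression.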

While this program has obvious weaknesses (which I would remedy if it were to be run by others), it still serves as a good example of how to use a utility language such as AWK to perform time-consuming chores that could take days if done manually, and that would even take much longer to program in a conventional language such as C. In fact, having a powerful tool such as AWK handy will lead you to do many tasks you might not undertake otherwise! Now that an AWK compiler is available (see "AWK for DOS"), you may even find yourself using AWK for distribution programs.

Regular Expressions

Regular expressions use a pattern string of characters, enclosed in slashes, to describe a pattern that can be compared to a target string. The simplest example of a pattern string is just a string of characters. For example, /red/ matches "a red ball" and "Alfredo", but not "tuna fish". Certain metacharacters have special meaning in the pattern string. The period refers to any single character, so /r.d/ matches "red" and "rod", but not "road" or "rd". To match a period or any other metacharacter literally, precede it with a backslash. Special characters can be matched by using C's escape characters: \t represents a tab and \b represents a backspace, while \101 represents an octal value (65, or the character 'A').

A circumflex matches the beginning of a string while the dollar sign matches the end, so /^red$/ matches a line consisting solely of the word "red". An asterisk matches zero or more of the preceding pattern. For instance, /l*ama/ matches "ama", "lama", "llama", "lllama", and so on. A plus matches one or more, while a question mark matches zero or one of the preceding pattern. Also, characters may be grouped with parentheses. Thus, /M(iss)+ippi/ matches "Missippi", as well as "Mississippi". Brackets can be used to indicate a character that is one of the enclosed characters, so that [a-zA-Z] matches any letter, while [13579] matches an odd digit. Thus, /[A-Z][a-z]*/ matches any word with an initial capital. A circumflex that is the first character after the open bracket matches any character except for those in the brackets, so /^[^aeiou]+$/ matches strings with no lower-case vowels.
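
Several of these patterns can be tried from the command line. The sketch below assumes a POSIX shell and awk; matches is a hypothetical helper written for this example, not part of AWK, and it passes the pattern in through awk's -v option and tests it with the ~ operator.

```shell
# matches REGEX STRING: hypothetical helper that prints yes or no.
matches() {
    printf '%s\n' "$2" |
        awk -v re="$1" '{ r = ($0 ~ re) ? "yes" : "no"; print r }'
}

matches 'red' 'Alfredo'               # plain substring
matches 'r.d' 'road'                  # period is exactly one character
matches '^red$' 'a red ball'          # anchored at both ends
matches 'l*ama' 'ama'                 # zero or more l characters
matches 'M(iss)+ippi' 'Mississippi'   # grouped repetition
matches '^[^aeiou]+$' 'rhythm'        # no lower-case vowels
```

Patterns that contain backslashes are better written as /.../ literals inside the AWK program, because -v applies its own escape processing to the value.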

Regular expressions are a valuable and powerful tool. Unfortunately, the MS-DOS programs that use them implement them differently. The Microsoft Editor does not support parentheses or the plus symbol; the Brief editor uses braces in lieu of parentheses, the @ to match zero or more occurrences, and ? as a single character, and has other differences as well. Different MS-DOS implementations of grep, a command-line text search facility based on the Unix utility, use slightly differing characters. Nonetheless, regular expressions are a powerful tool for matching text -- once you become familiar with them, you will wonder how you ever lived without them.

-- W.B.


AWK for DOS

Two commercial versions of AWK are available under MS-DOS and OS/2. Mortice Kern Systems, Waterloo, Ont., Canada, includes both small and large model versions of AWK in its MKS Toolkit (with and without 8087 support), and also sells AWK separately, offering both DOS and OS/2 versions. This package includes a tutorial written by MKS, as well as a copy of the book, The AWK Programming Language by Aho, Kernighan, and Weinberger (Addison-Wesley, 1988). The OS/2 version opens the possibility of holding enormous arrays in memory.

Sage Software (Beaverton, Oreg.), which recently acquired Polytron, provides the same book in its versions of the PolyAwk program for MS-DOS and OS/2. PolyAwk includes several useful extensions, such as the ability to hold associative arrays in sorted order, the provision of true multi-dimensional arrays, and the availability of new functions including getkey, toupper, and tolower.

Sage also offers a developer's toolkit that includes both an interpreter and a compiler. By default, the compiler, AWKC, produces a program with an extension of .AE. This program must be run by a run-time program, called AWKI. However, using the -xe flag on the AWKC command produces a stand-alone executable program. The ability to write stand-alone programs using AWK is a great benefit if you wish to distribute your software to others -- they do not need a copy of the interpreter, and they cannot read or modify your source code.

-- W.B.

_AWK AS A C CODE GENERATOR_ by Wahhab Baldwin

[LISTING ONE]

#    This program reads a C structure and produces C code
#    to write the record to a C-ISAM work area.
#    There are limitations on the input format:  see sample

BEGIN {     FS = "[ \t[]+"              # Override default
      }                                 #    field separators
$1 == "struct" && $3 == "{" {           # Opening record struct
            rec = $2
            print "/*  " $2 "  */"
            offset = 0
       }
$2 == "struct" && $3 == "{" {           # Struct within record
            type = "struct"
            j = 0
       }
$2 == "long"   {    f(type == "struct", "stlong",  $3, 4)}
$2 == "int"    {    f(type == "struct", "stint",   $3, 2)}
$2 == "float"  {    f(type == "struct", "stfloat", $3, 4)}
$2 == "double" {    f(type == "struct", "stdbl",   $3, 8)}
$2 == "char" && NF > 3 {                # String
                    f(type == "struct", "stchar",  $3, $4 - 1)}
$2 == "char" && NF == 3 {               # Single character
                    f(type == "struct", "stchar",  $3, 1)}
$2 == "}" && NF > 3 {                   # Array of structs
            type = ""
            print "\tfor (i = 0; i < " $4 + 0 "; i++) {"
            for (k = 0; k < j; k++) {
                gsub(";", "", name[k])
                temp = $3 "[i]." name[k]
                emit2("\t" stype[k], temp, flen[k])
            }
            print "\t}"
            offset += ($4 - 1) * slen
            slen = 0
        }

$2 == "}" && NF == 3 {                  # Named struct
            type = ""
            for (k = 0; k < j; k++)
                emit(stype[k], $3 "." name[k], flen[k])
            slen = 0
        }

function f(bool, str, x, y) {
           if (bool)
                setup(str, x, y)
            else
                emit(str, x, y)
        }

function setup(type, varname, l) {      # Save field data in array
            name[j] = varname
            stype[j] = type
            flen[j++] = l
            slen += l
        }

function emit(type, varname, l) {       # Print C code for field
            gsub(";", "", varname)
            print "\t" type "(p." rec "->" varname,
                ", inf_rec + " offset \
                (type == "ststring" ? ", " l ");" : ");")
            offset += l
        }

function emit2(type, varname, l) {      # Print C code for field in struct
            gsub(";", "", varname)
            print "\t" type "(p." rec "->" varname,
                ", inf_rec + i * " slen, "+",
                offset (type ~ /string/ ? ", " l ");": ");")
            offset += l
        }
