GNU Awk: This is Not Your Father's Awk


Awk, the little language loved by so many, has been around in one form or another since the days of V7 UNIX, circa 1978. Its data- and text-processing capabilities, expressed in a concise, highly expressive syntax, are the epitome of what DSLs strive for and were, in fact, the inspiration for general-purpose scripting languages, particularly Perl.

The original language has evolved considerably. In particular, the GNU version — gawk — is far more powerful than standard Awk. In this article, I provide some background on the development of the Awk language, focusing particularly on gawk. I also take a high-level look at the new features available in gawk 4.1, which was released in May of this year.

History of Awk

Awk was originally developed at Bell Laboratories by Al Aho, Peter Weinberger, and Brian Kernighan: It took its name from their initials. It was first exposed to the world with V7 UNIX around 1978. Awk was intended for writing very small programs (a few lines only) to perform simple data filtering and manipulation tasks. Despite this, the language offered some unique facilities, along with most of the major features needed for serious programming:

  • Powerful regular expression matching
  • String manipulation and numeric facilities, including a handful of the most useful mathematical functions
  • Associative arrays
  • Flow control: if/else, while, and for loops
  • A built-in printf statement and shell-like I/O redirection
  • The pattern-action paradigm for expressing what needs to be done, along with a built-in input loop that automatically reads records and splits them into fields.

It is perhaps the last item that gives Awk its expressiveness. Around 1985, based on user demand, the authors beefed up the language by adding:

  • C-compatible operator precedence
  • A do-while loop
  • Regular-expression-based field splitting
  • Multidimensional array syntax
  • User-defined functions. (This is perhaps the most important addition.)
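As a rough illustration of how these pieces fit together, the sketch below uses only standard Awk: the built-in input loop, two pattern-action rules, an associative array, and a user-defined function. The input data and the function name fmt_line() are made up for the example.

```shell
# Sum the second field per key in the first field, using only standard Awk.
# The input data and the function name fmt_line() are hypothetical.
result=$(printf 'apples 3\npears 5\napples 4\n# a comment\n' | awk '
function fmt_line(name, n) {        # user-defined function (1985 addition)
    return sprintf("%s=%d", name, n)
}
/^#/    { next }                    # pattern-action rule: skip comment records
NF == 2 { total[$1] += $2 }         # associative array keyed by field 1
END {
    # Keys are printed in a fixed order because "for (k in total)"
    # traversal order is unspecified in standard Awk.
    print fmt_line("apples", total["apples"])
    print fmt_line("pears", total["pears"])
}')
echo "$result"
```

Note that there is no explicit read loop: Awk reads each record, splits it into fields, and runs every rule whose pattern matches.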

This version was released with UNIX System V Release 3.2, but due to the nature and cost of UNIX licenses, did not become widely available. In late 1987, the authors published a book on this version of Awk that, more than 25 years later, is still valuable reading.

When the book came out, I bought it because I'd been wanting to learn Awk. Having an interest in programming language design and development, and being then single (thus with lots of spare time), I looked to see if the GNU project had a version of awk that I could play with.

Indeed they did. But alas, it was a clone of the original awk, not of the new version. In addition, it was buggy and slow. So I joined forces with the main GNU awk volunteer, David Trueman, to update awk. We worked together for several years until he had to bow out. Since 1994, I have been the official maintainer, working with a pleasant and pleasantly sized team of codevelopers.

Around 1991, the IEEE POSIX committee started working on a command language and utilities standard, which included Awk. Gawk complies with POSIX, except in the few cases where it does not make sense to do so, and even then an option enables full blind compatibility.

For the first ten years of my involvement with gawk, it was usually used alongside UNIX awk. Gawk offered the advantages of being generally faster and also of having fewer arbitrary limits; often, I had users who would push quantities of data through gawk that would cause UNIX awk to roll over and die. (UNIX awk has considerably improved since then, however.)

Things changed when GNU/Linux started to take off, as gawk became the only version of awk available on the system. Today, I believe that gawk is the most widely used Awk implementation. In the rest of this article, I will show what kinds of interesting things you can now do with gawk.

Gawk Features

Given the natural tendency of programmers to want to change, extend, and add features, together with the long-lived nature of this project, it is perhaps not surprising that gawk has developed a multitude of additional features not present in standard awk.

These features can be grouped into several categories:

  • Support facilities:
    • Statement count profiling.
    • Awk-level debugging with a debugger that is similar to GDB, the GNU debugger. This first became available with gawk 4.0 (released in 2011) as a separate executable, but it is now built into the regular gawk executable.
    • Many additional command-line options for controlling all the possibilities. All GNU-style long options also have short options, for use in #! (shebang) scripts.
  • Extensibility:
    • File inclusion with the @include statement (new in 4.0).
    • Loading of extension functions written in C or C++ that can be called from Awk code with the @load statement or the -l (lowercase "L") command-line option (new in 4.1).
  • New built-in functions:
    • String functions, such as the gensub() generalized substitution function, the patsplit() function to use regular expressions to define the contents to be split out, and extensions to the standard close(), length(), match(), and split() functions.
    • Array sorting functions: asort() and asorti().
    • Bit-manipulation functions: lshift(), rshift(), and(), or(), xor(), and compl().
    • Time-stamp functions: mktime(), systime(), and strftime().
    • Translation functions: bindtextdomain(), dcgettext(), dcngettext(), and facilities for internationalization based upon GNU Gettext.
    • A type function: isarray().
  • Language extensions:
    • Regular-expression-based record splitting.
    • Field splitting based on regular expressions specifying the field contents instead of the field separators.
    • Two-way I/O to coprocesses and TCP/IP socket communication, using an extended I/O syntax.
    • True multidimensional arrays, and control over array traversal order with for loops.
    • BEGINFILE and ENDFILE special patterns.
    • The nextfile keyword and the switch statement.
    • Indirect function calls.
    • Indirect control of variables through the SYMTAB array.

The details on all of these features (and additional minor ones) are provided in the gawk documentation, which is available online and in printed form as a 490-page book containing 16 chapters, four appendices, a glossary, and many sample programs.

What's New in gawk 4.1

Most of the changes in the recently released gawk 4.1 relate to the internals. In particular, there is now just one executable instead of three. This considerably reduces the installation footprint and simplifies maintenance and documentation.

The most notable change at the Awk language level is the ability to use the GNU MPFR and GMP libraries for arbitrary-precision integer and floating-point arithmetic. This requires either the -M command-line option or setting the new variable PREC to indicate the precision to use. If you select either option, gawk switches to using arbitrary precision for all of its numerical calculations. 

The new feature in gawk 4.1 that I think is the most important is the addition of a defined and documented extension mechanism.  The new mechanism defines a syntax for loading extensions (the @load keyword and/or the -l option) and an API for extension functions written in C or C++ to use when communicating with gawk. Extension functions act like user-defined functions callable from Awk programs, but they are written in a different language. Their reason for being is to provide access from Awk programs to external operating system facilities and libraries.

While gawk has had a minimally documented extension mechanism for many years, using it required understanding the source code data structures and some of the internal functions. It offered no compatibility from release to release, even at the source code level, much less at the binary level. I had been wanting to rewrite this mechanism for a long time. The new implementation provides the potential for Awk programs to do anything that can be done from C or C++.  The simplest example is an extension that provides chdir(), so that Awk programs can finally change their working directory!

Besides letting you write functions in C or C++, the new API provides hooks into gawk's I/O redirection so that you can provide your own filenames for input, output, and two-way I/O. You can also register functions to be called before gawk exits.

Many of the additional functions that gawk has could, in fact, be implemented using the new API.  They remain built-in for backwards compatibility and to avoid the overhead of having to load them every time gawk starts up.

Gawk comes with several sample extensions. In addition, the gawkextlib project provides several more, the most notable of which is an extension that parses XML files and presents them to Awk programs in a familiar manner.

As one small example of what you might do, consider the existing stat and fts extensions that, when combined with multidimensional arrays, let you walk a file hierarchy and do just about anything you want with the returned information. Using these extensions, writing a simple version of the UNIX du utility could be accomplished in a few hundred lines of Awk, instead of several thousand lines of C. Furthermore, it would be portable to GNU/Linux, Windows, and Mac OS X, without any changes.

Conclusion

There's a lot of great stuff in gawk, much more than most people realize. In particular, the profiler and debugger provide necessary tools for larger-scale development. The extension mechanism opens up many new frontiers for doing things in Awk that were not possible before. If you haven't looked at Awk in a while, give it another go. I expect you just might like what you see.


Arnold Robbins has been the principal maintainer of gawk since 1994.

Resources

The recommended gawk build for Windows

