GNU Awk: This is Not Your Father's Awk


Awk, the little language loved by so many, has been around in one form or another since the days of V7 UNIX, circa 1978. Its data- and text-processing capabilities, expressed in a concise, highly expressive syntax, are the epitome of what DSLs strive for and were, in fact, the inspiration for general-purpose scripting languages, particularly Perl.

The original language has evolved considerably. In particular, the GNU version, gawk, is far more powerful than standard Awk. In this article, I provide some background on the development of the Awk language, focusing particularly on gawk. I also take a high-level look at the new features available in gawk 4.1, which was released in May 2013.

History of Awk

Awk was originally developed at Bell Laboratories by Al Aho, Peter Weinberger, and Brian Kernighan: It took its name from their initials. It was first exposed to the world with V7 UNIX around 1978. Awk was intended for writing very small programs (a few lines only) to perform simple data filtering and manipulation tasks. Despite this, the language offered some unique facilities, along with most of the major features needed for serious programming:

  • Powerful regular expression matching
  • String manipulation and numeric facilities, including a handful of the most useful mathematical functions
  • Associative arrays
  • Flow control: if/else, while, and for loops
  • A built-in printf statement and shell-like I/O redirection
  • The pattern-action paradigm for expressing what needs to be done, along with a built-in input loop that automatically reads records and splits them into fields.

It is perhaps the last item that gives Awk its expressiveness. Around 1985, based on user demand, the authors beefed up the language by adding:

  • C-compatible operator precedence
  • A do-while loop
  • Regular-expression-based field splitting
  • Multidimensional array syntax
  • User-defined functions. (This is perhaps the most important addition.)
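To make the two lists concrete, here is a minimal sketch that uses only standard Awk: pattern-action rules, an associative array, and a user-defined function. The function names sum_by_name and report, and the two-column "name amount" input, are invented for illustration and do not come from the article.

```shell
#!/bin/sh
# Sum the second field per name using only features of standard (POSIX) awk.
# The names sum_by_name and report, and the input layout, are illustrative.
sum_by_name() {
    awk '
    # User-defined function, one of the mid-1980s additions.
    function report(who, sum) {
        printf "%s %.2f\n", who, sum
    }

    /^#/    { next }                # pattern-action rule: skip comment lines
    NF == 2 { total[$1] += $2 }     # associative array indexed by the name

    END {
        # "for (name in total)" visits keys in an unspecified order,
        # so the output is piped through sort below.
        for (name in total)
            report(name, total[name])
    }' "$@" | sort
}

printf '%s\n' '# sample data' 'alice 10.50' 'bob 3' 'alice 4.25' | sum_by_name
# prints:
#   alice 14.75
#   bob 3.00
```

The built-in input loop does the reading and field splitting, so the program itself consists only of the rules and the final report.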

This version was released with UNIX System V Release 3.2, but due to the nature and cost of UNIX licenses, did not become widely available. In late 1987, the authors published a book on this version of Awk that, more than 25 years later, is still valuable reading.

When the book came out, I bought it because I'd been wanting to learn Awk. Having an interest in programming language design and development, and being then single (thus with lots of spare time), I looked to see if the GNU project had a version of awk that I could play with.

Indeed they did. But alas, it was a clone of the original awk, not of the new version. In addition, it was buggy and slow. So I joined forces with the main GNU awk volunteer, David Trueman, to update awk. We worked together for several years until he had to bow out. Since 1994, I have been the official maintainer, working with a pleasant and pleasantly sized team of codevelopers.

Around 1991, the IEEE POSIX committee started working on a command language and utilities standard, which included Awk. Gawk complies with POSIX, except in the few cases where it does not make sense to do so, and even then an option enables full blind compatibility.

For the first ten years of my involvement with gawk, it was usually used alongside UNIX awk. Gawk offered the advantages of being generally faster and also of having fewer arbitrary limits; often, I had users who would push quantities of data through gawk that would cause UNIX awk to roll over and die. (UNIX awk has considerably improved since then, however.)

Things changed when GNU/Linux started to take off, as gawk became the only version of awk available on the system. Today, I believe that gawk is the most widely used Awk implementation. In the rest of this article, I will show what kinds of interesting things you can now do with gawk.

Gawk Features

Given the natural tendency of programmers to want to change, extend, and add features, together with the long-lived nature of this project, it is perhaps not surprising that gawk has developed a multitude of additional features not present in standard awk.

These features can be grouped into several categories:

  • Support facilities:
    • Statement count profiling.
    • Awk-level debugging with a debugger that is similar to GDB, the GNU debugger. This first became available with gawk 4.0 (released in 2011) as a separate executable, but it is now built into the regular gawk executable.
    • Many additional command-line options for controlling all the possibilities. All GNU-style long options also have short options, for use in #! (shebang) scripts.
  • Extensibility:
    • File inclusion with the @include statement (new in 4.0).
    • Loading of extension functions written in C or C++ that can be called from Awk code with the @load statement or the -l (lowercase "L") command-line option (new in 4.1).
  • New built-in functions:
    • String functions, such as the gensub() generalized substitution function, the patsplit() function to use regular expressions to define the contents to be split out, and extensions to the standard close(), length(), match(), and split() functions.
    • Array sorting functions: asort() and asorti().
    • Bit-manipulation functions: lshift(), rshift(), and(), or(), xor(), and compl().
    • Time-stamp functions: mktime(), systime(), and strftime().
    • Translation functions: bindtextdomain(), dcgettext(), dcngettext(), and facilities for internationalization based upon GNU Gettext.
    • A type function: isarray().
  • Language extensions:
    • Regular-expression-based record splitting.
    • Field splitting based on regular expressions specifying the field contents instead of the field separators.
    • Two-way I/O to coprocesses and TCP/IP socket communication, using an extended I/O syntax.
    • True multidimensional arrays, and control over array traversal order with for loops.
    • BEGINFILE and ENDFILE special patterns.
    • The nextfile keyword and the switch statement.
    • Indirect function calls.
    • Indirect control of variables through the SYMTAB array.
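As a small illustration of two items from this list, the sketch below counts words using gensub() and controls the order of a for loop over an array through PROCINFO["sorted_in"]. It needs gawk 4.0 or later rather than another awk, and the name top_words is made up for this example.

```shell
#!/bin/sh
# Word-frequency count using two gawk-only features from the list above.
# Requires gawk; top_words is an illustrative name.
top_words() {
    gawk '
    {
        # gensub() returns the edited copy of $0 rather than modifying it;
        # here it drops every character that is not a letter or a space.
        line = tolower(gensub(/[^[:alpha:] ]/, "", "g"))
        n = split(line, words)
        for (i = 1; i <= n; i++)
            count[words[i]]++
    }
    END {
        # Ask gawk to traverse the array by value, largest count first.
        PROCINFO["sorted_in"] = "@val_num_desc"
        for (w in count)
            print count[w], w
    }' "$@"
}

# Example: printf '%s\n' 'B, a b!' 'c b A.' | top_words
```

For the example input in the final comment, the function prints "3 b", "2 a", and "1 c" on separate lines. In standard Awk the same report would need an external sort, because the traversal order of "for (w in count)" is unspecified.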

The details on all of these features (and additional minor ones) are provided in the gawk documentation, which is available online and in printed form as a 490-page book containing 16 chapters, four appendices, a glossary, and many sample programs.

What's New in gawk 4.1

Most of the changes in the recently released gawk 4.1 relate to the internals. In particular, there is now just one executable instead of three. This considerably reduces the installation footprint and simplifies maintenance and documentation.

The most notable change at the Awk language level is the ability to use the GNU MPFR and GMP libraries for arbitrary-precision integer and floating-point arithmetic. This requires either the -M command-line option or setting the new variable PREC to indicate the precision to use. If you select either option, gawk switches to using arbitrary precision for all of its numerical calculations. 
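A quick way to see the effect of -M is a factorial, which outgrows a double very quickly. This is only a sketch: it assumes a gawk binary that was built with MPFR and GMP support, and the function name factorial is invented for the example.

```shell
#!/bin/sh
# Exact factorials with arbitrary-precision integers.
# Requires a gawk built with MPFR/GMP support; factorial is an illustrative name.
factorial() {
    gawk -M -v n="$1" '
    BEGIN {
        f = 1
        for (i = 2; i <= n; i++)
            f *= i          # stays an exact integer under -M
        print f
    }'
}

# Example: factorial 25
```

With -M, "factorial 25" prints 15511210043330985984000000, which is exact to all 26 digits. Without -M the product is held in a double, which carries only about 16 significant decimal digits, so the trailing digits can no longer be trusted.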

The new feature in gawk 4.1 that I think is the most important is the addition of a defined and documented extension mechanism. The new mechanism defines a syntax for loading extensions (the @load keyword and/or the -l option) and an API for extension functions written in C or C++ to use when communicating with gawk. Extension functions act like user-defined functions callable from Awk programs, but they are written in a different language. Their reason for being is to provide access from Awk programs to external operating system facilities and libraries.

While gawk has had a minimally documented extension mechanism for many years, using it required understanding the source code data structures and some of the internal functions. It offered no compatibility from release to release, even at the source code level, much less at the binary level. I had been wanting to rewrite this mechanism for a long time. The new implementation provides the potential for Awk programs to do anything that can be done from C or C++. The simplest example is an extension that provides chdir(), so that Awk programs can finally change their working directory!
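From the Awk side, using such an extension takes very little code. The sketch below loads the filefuncs sample extension that ships with gawk, which provides chdir(), and then asks a child process where it is. It assumes gawk 4.1 or later with the bundled extensions installed; the name cwd_after_chdir is hypothetical.

```shell
#!/bin/sh
# Change directory from inside an Awk program via the filefuncs sample extension.
# Requires gawk 4.1 or later with its bundled extensions installed;
# cwd_after_chdir is an illustrative name.
cwd_after_chdir() {
    # -l filefuncs has the same effect as putting  @load "filefuncs"  in the program.
    gawk -l filefuncs -v dir="$1" '
    BEGIN {
        if (chdir(dir) < 0) {
            printf "chdir(%s) failed: %s\n", dir, ERRNO > "/dev/stderr"
            exit 1
        }
        # Child processes inherit the new working directory.
        "pwd" | getline cwd
        close("pwd")
        print cwd
    }'
}

# Example: cwd_after_chdir /tmp
```

chdir() returns zero on success and a negative value on failure, in which case it sets ERRNO, so the check at the top of the BEGIN rule is the usual calling pattern.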

Besides letting you write functions in C or C++, the new API provides hooks into gawk's I/O redirection so that you can provide your own filenames for input, output, and two-way I/O. You can also register functions to be called before gawk exits.

Many of the additional functions that gawk has could, in fact, be implemented using the new API. They remain built-in for backwards compatibility and to avoid the overhead of having to load them every time gawk starts up.

Gawk comes with several sample extensions. In addition, the gawkextlib project provides several more, the most notable of which is an extension that parses XML files and presents them to Awk programs in a familiar manner.

As one small example of what you might do, consider the existing stat and fts extensions that, when combined with multidimensional arrays, let you walk a file hierarchy and do just about anything you want with the returned information. Using these extensions, writing a simple version of the UNIX du utility could be accomplished in a few hundred lines of Awk, instead of several thousand lines of C. Furthermore, it would be portable to GNU/Linux, Windows, and Mac OS X, without any changes.

Conclusion

There's a lot of great stuff in gawk — much more than most people realize. In particular, the profiler and debugger provide necessary tools for larger scale development. The extension mechanism opens up many new frontiers for doing things in Awk that were not possible before. If you haven't looked at Awk in a while, give it another go. I expect you just might like what you see.


Arnold Robbins has been the principal maintainer of gawk since 1994.

Resources

The recommended gawk build for Windows



Comments:

ubm_techweb_disqus_sso_-4124b309ed8223fe99361b41fa82ae3d
2013-07-17T14:17:54

Thanks, Arnold. I appreciate your work, and suspect there are a lot more that do too. I love the article and up-to-date focus on such an old tool. It inspires one to augment current knowledge and use. I for one fail to find awk burdensome, dated, or inadequate worthy of callous dismissal. Many things need doing in countless scenarios. One's frame of reference is rarely indicative of the broad spectrum of opportunities and pertinent paths to solutions. awk is awesomely relevant. That's not to say other tools are not, and it does not challenge other tools' relevance. Whatever makes me productive is relevant. awk does that. I've been using it since the early 90's. Is that bad? Nope. I have food on the table, more than my share of comforts, and enjoy my work. Awk empowers me in ways other tools do not. It's just one tool in my belt, but it will be kept at the ready for years to come. People who don't use things like hammers and screwdrivers tend to refrain from asking why they are still relevant... for me... awk is such an essential tool that I would miss it with great frequency if it were lost to me. Associative arrays and the simplicity of command-line use are two biggees, though I feel that naming one or two particularly valuable features does great injustice to the rest.


ubm_techweb_disqus_sso_-929e8af566e8b3bbbe3deaee8fff3288
2013-07-17T05:41:31

Although I still tend to use awk myself for absolutely minimal command lines, I usually have to notice, that when things get just a little more complicated, I have to replace awk with perl and see, that the command line gets just a tiny bit longer, but does what I want and I have all the flexibility I need. The same goes for sed and even egrep. As soon as control characters are involved, most egrep and sed implementations fail and when the amount of data to process is a little bit bigger, I notice that perl is way faster than comparable tools. At the end, I almost always regret I did not use perl from the start.
My conclusion is, that awk is only relevant in an environment, in which you don't have perl available.


AndrewBinstock
2013-07-16T22:26:03

Answering from my own experience: I really like being able to do simple operations quickly with a single line of code that I can place on the command line. I can write and run a lot of transforms really easily that way. (To your second question: I don't know Python and don't really want to learn it just to solve this kind of problem.)


ubm_techweb_disqus_sso_-d530f9e8cbcdf7b43ada2f6519b213cb
2013-07-16T21:57:55

Awk works great in configuration scripts. Also the code to process the input stream tends to be shorter than those written in Perl and even Python.

Use the right tool for the job. What's that saying? "If all you have is a hammer, everything looks like nails"


ubm_techweb_disqus_sso_-eb143e6d5f6d149c968e253a5efe5875
2013-07-16T21:51:51

> Why is awk still relevant?

Because when you know Awk, Perl, Python, sed, ... you may still find that awk's once-unique way of processing text, and its particular brevity vs readability, might still best fit a part of a solution.

I still have Awk in my bag of tricks even though I use other scripting languages. (It can't all be down to "first love" :-)


ubm_techweb_disqus_sso_-ccb0938ee08648c3676059bf8bafe950
2013-07-16T21:27:34

Why is awk still relevant? Back in the day, there was shell scripting or else you coded in C. The former was pretty limited. The latter was Real Work. Awk was a wonderful midpoint. But then came Perl, hugely powerful relatively, but messy. And then came Python, which was "Perl-equivalent" but cleaner (less obscure punctuation, nice non-afterthought exception handling). So where's Awk now? Why wouldn't I just use Python?

