Awk, the little language loved by so many, has been around in one form or another since the days of V7 UNIX, circa 1978. Its data- and text-processing capabilities, expressed in a concise, highly expressive syntax, are the epitome of what DSLs strive for and were, in fact, the inspiration for general-purpose scripting languages, particularly Perl.
The original language has evolved considerably. In particular, the GNU version gawk is far more powerful than standard Awk. In this article, I provide some background on the development of the Awk language, focusing particularly on gawk. I also take a high-level look at the new features available in gawk 4.1, which was released in May of this year.
History of Awk
Awk was originally developed at Bell Laboratories by Al Aho, Peter Weinberger, and Brian Kernighan: It took its name from their initials. It was first exposed to the world with V7 UNIX around 1978. Awk was intended for writing very small programs (a few lines only) to perform simple data filtering and manipulation tasks. Despite this, the language offered some unique facilities, along with most of the major features needed for serious programming:
- Powerful regular expression matching
- String manipulation and numeric facilities, including a handful of the most useful mathematical functions
- Associative arrays
- Flow control:
- A built-in printf statement and shell-like I/O redirection
- The pattern-action paradigm for expressing what needs to be done, along with a built-in input loop that automatically reads records and splits them into fields.
It is perhaps the last item that gives Awk its expressiveness. Around 1985, based on user demand, the authors beefed up the language by adding:
- C-compatible operator precedence
- Regular-expression-based field splitting
- Multidimensional array syntax
- User-defined functions. (This is perhaps the most important addition.)
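A minimal sketch of how these pieces fit together might look like the following. It combines a pattern-action rule, associative arrays, printf, and a user-defined function. The sample input and the names bar, count, and bytes are invented for the illustration; the program uses only standard features, so it should run under any POSIX awk.

```shell
# Summarize made-up "user bytes" records, skipping comment lines.
report=$(printf '%s\n' '# user bytes' 'alice 300' 'bob 120' 'alice 50' |
awk '
# User-defined function: build a bar of n stars (s is a local variable).
function bar(n,    s) {
    while (n-- > 0)
        s = s "*"
    return s
}

# Pattern-action rule: the action runs only for records that do not
# start with "#". The built-in input loop has already split $1 and $2.
!/^#/ { count[$1]++; bytes[$1] += $2 }   # associative arrays keyed by user

END {
    # Walk a fixed list of names so the output order is predictable.
    n = split("alice bob", users, " ")
    for (i = 1; i <= n; i++)
        printf "%s %d %s\n", users[i], bytes[users[i]], bar(count[users[i]])
}')
echo "$report"
```

With the sample input shown, this prints "alice 350 **" followed by "bob 120 *": total bytes per user, plus one star per record seen.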
This version was released with UNIX System V Release 3.1, but because of the nature and cost of UNIX licenses, it did not become widely available. In late 1987, the authors published a book on this version of Awk that, more than 25 years later, is still valuable reading.
When the book came out, I bought it because I'd been wanting to learn Awk. Having an interest in programming language design and development, and being then single (thus with lots of spare time), I looked to see if the GNU project had a version of awk that I could play with.
Indeed they did. But alas, it was a clone of the original awk, not of the new version. In addition, it was buggy and slow. So I joined forces with the main GNU awk volunteer, David Trueman, to update awk. We worked together for several years until he had to bow out. Since 1994, I have been the official maintainer, working with a pleasant and pleasantly sized team of codevelopers.
Around 1991, the IEEE POSIX committee started working on a command language and utilities standard, which included Awk. Gawk complies with POSIX, except in the few cases where it does not make sense to do so, and even then an option enables full blind compatibility.
For the first ten years of my involvement with gawk, it was usually used alongside UNIX awk. Gawk offered the advantages of being generally faster and also of having fewer arbitrary limits; often, I had users who would push quantities of data through gawk that would cause UNIX awk to roll over and die. (UNIX awk has considerably improved since then, however.)
Things changed when GNU/Linux started to take off, as gawk became the only version of awk available on the system. Today, I believe that gawk is the most widely used Awk implementation. In the rest of this article, I will show what kinds of interesting things you can now do with gawk.
Given the natural tendency of programmers to want to change, extend, and add features, together with the long-lived nature of this project, it is perhaps not surprising that gawk has developed a multitude of additional features not present in standard awk.
These features can be grouped into several categories:
- Support facilities:
  - Statement count profiling.
  - Awk-level debugging with a debugger that is similar to GDB, the GNU debugger. This first became available with gawk 4.0 (released in 2011) as a separate executable, but it is now built into the regular gawk executable.
  - Many additional command-line options for controlling all the possibilities. All GNU-style long options also have short options, for use in
  - File inclusion with the @include statement (new in 4.0).
  - Loading of extension functions written in C or C++ that can be called from Awk code with the @load statement or the -l (lowercase "L") command-line option (new in 4.1).
- New built-in functions:
  - String functions, such as the gensub() generalized substitution function, the patsplit() function to use regular expressions to define the contents to be split out, and extensions to the standard
  - Array sorting functions:
  - Bit-manipulation functions:
  - Time-stamp functions:
  - Translation functions: dcngettext(), and facilities for internationalization based upon GNU
  - A type function:
- Language extensions:
  - Regular-expression-based record splitting.
  - Field splitting based on regular expressions specifying the field contents instead of the field separators.
  - Two-way I/O to coprocesses and TCP/IP socket communication, using an extended I/O syntax.
  - True multidimensional arrays, and control over array traversal order with
  - nextfile keyword and the
  - Indirect function calls
  - Indirect control of variables through the
The details on all of these features (and additional minor ones) are provided in the gawk documentation, which is available online and in printed form as a 490-page book containing 16 chapters, four appendices, a glossary, and many sample programs.
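As a small taste of the string and field-splitting extensions listed above, here is a hedged sketch. The gensub() and patsplit() functions are named in the list; FPAT is, as far as I know, the gawk variable behind field splitting by content, and it is not named in the text above. The sample strings are invented, and the example assumes gawk 4.0 or later is installed as gawk.

```shell
out=$(gawk '
BEGIN {
    # gensub() returns the modified string instead of changing its target,
    # and the replacement can refer to parenthesized groups as \\1, \\2.
    # The "g" flag asks for every match to be replaced.
    print gensub(/([a-z]+)=([0-9]+)/, "\\2:\\1", "g", "x=1 y=22")

    # patsplit() describes the pieces to keep rather than the separators.
    n = patsplit("id42,ref7;tag1999", nums, /[0-9]+/)
    for (i = 1; i <= n; i++)
        printf "%s%s", nums[i], (i < n ? " " : "\n")
}')
echo "$out"

# FPAT describes what a field looks like: a run of non-commas or a
# double-quoted string, so the comma inside the quotes does not split.
fields=$(printf '%s\n' 'a,"b,c",d' |
    gawk -v FPAT='([^,]+)|("[^"]+")' '{ print NF }')
echo "$fields"
```

The first program prints "1:x 22:y" and then "42 7 1999"; the FPAT example reports 3 fields, where plain comma splitting would report 4.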
What's New in gawk 4.1
Most of the changes in the recently released gawk 4.1 relate to the internals. In particular, there is now just one executable instead of three. This considerably reduces the installation footprint and simplifies maintenance and documentation.
The most notable change at the Awk language level is the ability to use the GNU MPFR and GMP libraries for arbitrary-precision integer and floating-point arithmetic. This requires either the -M command-line option or setting the new variable PREC to indicate the precision to use. If you select either option, gawk switches to using arbitrary precision for all of its numerical calculations.
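The difference is easiest to see with an integer that does not fit in a double. The sketch below assumes a gawk binary built with MPFR and GMP support; the variable names plain, exact, and third are invented for the example, and the PREC value of 200 bits is an arbitrary choice.

```shell
# Default arithmetic uses C doubles, which cannot represent 2^53 + 1.
plain=$(gawk 'BEGIN { print 2^53 + 1 }')

# With -M, integer arithmetic is carried out exactly.
exact=$(gawk -M 'BEGIN { print 2^53 + 1 }')

# PREC sets the working precision (in bits) for floating-point values.
third=$(gawk -M -v PREC=200 'BEGIN { printf "%.40f\n", 1/3 }')

echo "double: $plain"
echo "bignum: $exact"
echo "1/3:    $third"
```

On a suitably built gawk, the first result is 9007199254740992 and the second is 9007199254740993; the third line shows more correct digits of 1/3 than a double could hold.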
The new feature in gawk 4.1 that I think is the most important is the addition of a defined and documented extension mechanism. The new mechanism defines a syntax for loading extensions (the @load keyword and/or the -l option) and an API for extension functions written in C or C++ to use when communicating with gawk. Extension functions act like user-defined functions callable from Awk programs, but they are written in a different language. Their reason for being is to provide access from Awk programs to external operating system facilities and libraries.
While gawk has had a minimally documented extension mechanism for many years, using it required understanding the source code data structures and some of the internal functions. It offered no compatibility from release to release, even at the source code level, much less at the binary level. I had been wanting to rewrite this mechanism for a long time. The new implementation provides the potential for Awk programs to do anything that can be done from C or C++. The simplest example is an extension that provides chdir(), so that Awk programs can finally change their working directory!
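From the Awk side, using such an extension might look like the sketch below. It assumes the filefuncs sample extension that ships with gawk 4.1 and later is installed where gawk can find it; the extension name and the choice of "/" as the target directory are my assumptions for the illustration, not something the text above specifies.

```shell
cwd=$(gawk '
# Load the sample extension that provides chdir().
@load "filefuncs"

BEGIN {
    # chdir() returns 0 on success and a negative value on failure,
    # in which case ERRNO describes the problem.
    if (chdir("/") < 0) {
        print "chdir failed: " ERRNO > "/dev/stderr"
        exit 1
    }
    # A child process inherits the new working directory.
    "pwd" | getline dir
    close("pwd")
    print dir
}')
echo "$cwd"
```

The same extension can be loaded from the command line instead, with gawk -l filefuncs, which leaves the Awk source free of the @load line.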
Besides letting you write functions in C or C++, the new API provides hooks into gawk's I/O redirection so that you can provide your own filenames for input, output, and two-way I/O. You can also register functions to be called before gawk exits.
Many of the additional functions that gawk has could, in fact, be implemented using the new API. They remain built-in for backwards compatibility and to avoid the overhead of having to load them every time gawk starts up.
Gawk comes with several sample extensions. In addition, the gawkextlib project provides several more, the most notable of which is an extension that parses XML files and presents them to Awk programs in a familiar manner.
As one small example of what you might do, consider the existing fts extensions that, when combined with multidimensional arrays, let you walk a file hierarchy and do just about anything you want with the returned information. Using these extensions, writing a simple version of the UNIX du utility could be accomplished in a few hundred lines of Awk, instead of several thousand lines of C. Furthermore, it would be portable to GNU/Linux, Windows, and Mac OS X, without any changes.
There's a lot of great stuff in gawk, much more than most people realize. In particular, the profiler and debugger provide necessary tools for larger-scale development. The extension mechanism opens up many new frontiers for doing things in Awk that were not possible before. If you haven't looked at Awk in a while, give it another go. I expect you just might like what you see.
Arnold Robbins has been the principal maintainer of gawk since 1994.