Examining the TAWK Compiler



Dr. Dobb's Journal May 1997: Examining the TAWK Compiler

Jim is a lead programmer/analyst for a financial software development firm specializing in Windows 95/NT programming. He can be contacted at jimbo@radiks.net.


The AWK programming language was created to simplify information processing in the UNIX environment. AWK's popularity stems from its ability to perform powerful data-processing chores with fewer lines of code than traditional programming languages like C++.

AWK is not a machine-specific language and does not provide methods to interface with the computer at the system level. AWK is dependent on external utilities that can be invoked for environment-specific processing.

The UNIX programming environment is well known for its filter programs, which write output to the standard output device. Often, these programs obtain their input from the standard input device. Because they use the standard I/O devices, the output of one filter program can be redirected to serve as input to another program (often using the pipe symbol "|").

For example, if you wanted to find all files in the current directory whose names contain the word "toad," you could simply pipe the output of the directory list program (ls) to the grep search utility this way: ls|grep "toad". The output of ls serves as input to grep.

AWK was created as a language that could facilitate filter programming. AWK automatically extracts input from the standard input device (or from a file specified on the command line).

An AWK program consists of a declared pattern with a counterpart action. The pattern is usually an expression that performs a comparison upon the input data. If this pattern expression yields a Boolean True condition, the counterpart action is executed.
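The pattern/action model can be sketched outside AWK as well. The following is a minimal Python analog, not TAWK code: the function name run_rules and the sample data are hypothetical, and the whitespace field splitting only approximates AWK's default behavior.

```python
import io
import re

def run_rules(lines, rules):
    """Apply (pattern, action) pairs to each input record, AWK-style."""
    output = []
    for record in lines:
        record = record.rstrip("\n")
        fields = record.split()          # whitespace-separated fields, like $1, $2, ...
        for pattern, action in rules:
            if pattern(record, fields):  # the pattern yields True or False
                output.append(action(record, fields))
    return output

# One rule: for every record containing "toad", emit the first field.
rules = [
    (lambda rec, f: re.search(r"toad", rec) is not None,
     lambda rec, f: f[0]),
]

sample = io.StringIO("toad.txt 120\nfrog.txt 88\nhorned_toad.doc 4096\n")
print(run_rules(sample, rules))
```

The caller never opens a file or reads a record explicitly; the loop in run_rules plays the role of AWK's automatic input processing.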

AWK was designed with a philosophy of keeping the language simple, even if it meant sacrificing run-time performance. Users are not responsible for opening files, reading records, or closing files.

Thompson Automation's TAWK

Thompson Automation's TAWK compiler family is a set of compilers for various operating systems (UNIX, OS/2, Windows 95/NT, Solaris) that are based on the AWK language. The version of TAWK I'll examine here is TAWK 5.0 for Windows 95/NT.

TAWK, however, implements many extensions to the language and development environment. These extensions allow TAWK to function as a general-purpose programming language. Many features found in traditional programming languages such as C++ and Pascal can be found in TAWK, while TAWK retains many features unique to AWK-based languages.

TAWK can function as a compiler or an interpreter. Original implementations of AWK were pure interpreters. They read AWK source code from a file and interpreted the specified source code at run time. TAWK is packaged with an interpreter called AWKW, which reads TAWK scripts and carries out the commands issued in source code. TAWK is also packaged with a compiler called AWKCW, which reads TAWK scripts and generates stand-alone executable files from them.

The differences between the two modes of operation are primarily issues of available disk space -- source-code scripts often occupy fewer bytes of disk space than executable code.

The TAWK compiler will scan for a file called AWKW.CFG, which should contain your preferred default compile options so that they needn't be specified on the AWKCW command line.

TAWK users can enjoy the benefits of using separate source-code modules. TAWK is able to bind subroutine libraries to a main program in both the interpreter and the compiler. In the command line awkw -f DEMO.AWK -f WINDOWS.AWK, TAWK loads both specified scripts, coupling them into a single logical script. The compiler can create an executable file (DEMO.EXE) from the same pair of scripts with the command line awkcw -xe -z DEMO.AWK WINDOWS.AWK.

Because of a bug in the incremental compilation feature in TAWK for 32-bit Windows, Thompson Automation has recommended the use of the -z flag when compiling to avoid both compile-time and run-time bugs. The -z flag forces each separate TAWK file to be recompiled (see http://www.tasoft.com/~thompson/ for more information).

A supplemental program called MKAWKLIB.EXE provides an easy method of using bundled subroutine libraries. MKAWKLIB adds entries to a special database used by TAWK when it is unable to find functions or global variables specified in a given file.

Suppose you have a function library called MYFUNC.AWK that contains a special function called my_func(). The command line mkawklib MYFUNC.AWK exposes this function to all future TAWK scripts.

The only drawback to using TAWK libraries is that all variable and function names are stored in the compiled executable file, including symbolic constants that may never be referenced. This makes the executable file larger than necessary.

Compiled TAWK programs inherit the standard TAWK command-line processor unless the -eo option is specified during compilation. When using this option, command-line arguments normally processed by TAWK (-F, -v, -w, --) are placed in the ARGV[] array. The argument count is held in the ARGC variable. ARGV[0] contains the fully normalized path and filename of the executing program.

A counterpart option, -ee, causes TAWK to suppress filename expansion into the ARGV[] array. If your TAWK program processes a file using the automatic input method, TAWK will expect a filename as ARGV[1]. If no filename is specified, TAWK can read the input from stdin, but it will treat any parameter passed in ARGV[1] as a file and will try to open it. To avoid this, you can write the whole TAWK program in the BEGIN section, allowing unhindered access to the command line. The options -eo and -ee can be combined as -eoe. The example finger.awk program uses the -eo option to gather command-line arguments for the specified Internet account ID; see Listing One. (Executable versions of finger.awk and other programs presented here are available electronically; see "Availability," page 3.)

The TAWK compiler is capable of generating two different kinds of executable programs. To create a stand-alone executable file, you must specify the -xe compile switch. The executable file generated when this switch has not been specified will rely on a run-time support program called AWKR50W.EXE. This program is similar to the BRUNxx.EXE programs used by Microsoft Quick Basic in the DOS environment. Compiled TAWK programs that are dependent on this module occupy significantly smaller portions of disk space than their stand-alone counterparts. A typical "Hello world!" stand-alone TAWK program occupies over 200 KB of disk space.

TAWK in a DLL

With the release of Version 5.0, Thompson Automation bundled a special version of TAWK in a DLL. This DLL (AWKRW.DLL) can be invoked by external programs to provide TAWK functionality in non-TAWK environments such as Visual Basic. This DLL may not be redistributed. It is for use only by the licensed TAWK user.

AWKRW.DLL exports functions that allow client programs to invoke TAWK actions, invoke TAWK functions, manipulate TAWK variables, and feed command-line arguments. In its current state, the DLL cannot be used for building distributable applications.

Low-Level Features

TAWK sports a robust set of features for low-level system access. Direct access to memory can be attained by using addressof(), peek(), and poke(). The addressof() function returns the address of the specified TAWK variable. Because of TAWK's internal memory-management scheme, the specified variable is locked into place until all references to the address returned by addressof() have been removed.

peek() and poke() operate in a manner similar to their counterparts in BASIC. A single byte from the specified address is returned by peek(). poke() is used to store a byte at the specified address. Bitwise operators are implemented as a series of TAWK functions. and(), or(), and xor() all accept two 32-bit integer expressions as operands. not() accepts a single 32-bit integer expression. All of the bitwise functions return a 32-bit integer as a result.

The shiftl() and shiftr() functions perform left/right bit-shifting operations, respectively. Each function accepts two parameters. The first parameter is an integer expression; the second parameter specifies the number of bits to shift.
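To make the 32-bit behavior concrete, here is a minimal Python sketch of function-style bitwise operators. It is an analog rather than TAWK code: the trailing underscores exist only because and, or, and not are Python keywords, and treating values as unsigned with a logical right shift is an assumption, since the article does not say how TAWK handles the sign bit.

```python
MASK32 = 0xFFFFFFFF  # keep every result within 32 bits

def and_(a, b):
    return (a & b) & MASK32

def or_(a, b):
    return (a | b) & MASK32

def xor_(a, b):
    return (a ^ b) & MASK32

def not_(a):
    return ~a & MASK32

def shiftl(a, n):
    # Bits shifted past bit 31 are discarded.
    return (a << n) & MASK32

def shiftr(a, n):
    # Logical (zero-filling) right shift; an assumption, see the lead-in.
    return (a & MASK32) >> n

print(hex(and_(0xF0F0, 0x0FF0)), hex(shiftl(1, 31)), hex(not_(0)))
```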

An error in the documentation might lead you to believe that port-level I/O can be accomplished in the 32-bit Windows version of TAWK. The documentation specifies that the port I/O functions -- inp(), inpw(), outp(), and outpw() -- do nothing under OS/2 and UNIX. In my tests, these functions did not appear to work in the 32-bit Windows environment either.

TAWK utilizes the associative array construct found in languages such as SNOBOL or Perl. An associative array is an array that uses a string as an index rather than an integer.

The for keyword is used to iterate through all items in an associative array. Traditional AWK usually causes the associative array to be traversed in an ascending-sorted sequence by the indexing string. TAWK lets you define how the associative array will be sorted by setting various values in the built-in SORTTYPE variable. Table 1 lists the combinations of values for SORTTYPE, and defines how they affect array sorting. The program asort.awk (Listing Two) is a sort program that illustrates the usage of associative arrays. It also illustrates the usage of the SORTTYPE variable to change sort order.
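The effect of choosing a traversal order can be sketched in Python, where a dictionary stands in for the associative array. This is an analog of what Listing Two does, not TAWK code: the function name sorted_lines and its two keyword flags are hypothetical stand-ins for the case-insensitive and descending settings that the listing adds to SORTTYPE.

```python
from collections import Counter

def sorted_lines(lines, ignore_case=False, descending=False):
    """Count duplicate lines in a string-keyed table, then emit them in key order."""
    counts = Counter(lines)                    # like ar[$0]++ for every record
    key = str.lower if ignore_case else None   # None means plain character-code order
    out = []
    for line in sorted(counts, key=key, reverse=descending):
        out.extend([line] * counts[line])      # repeat each line as often as it occurred
    return out

data = ["pear", "Zebra", "apple", "pear"]
print(sorted_lines(data))
print(sorted_lines(data, ignore_case=True))
print(sorted_lines(data, descending=True))
```

Keeping the counts in the table and sorting only at traversal time mirrors the listing, where the ordering is a property of the loop rather than of the stored data.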

File I/O and Directory Processing Functions

File I/O in traditional AWK has always been a bit clumsy. Files are never explicitly opened in traditional AWK. Rather, file I/O is performed by a set of I/O operators -- >, >>, and <.

The > and >> operators are used to send output to a file. The single > writes to a file, while the >> pair appends output to a file. The file name is specified on the right side of the appropriate operator. The < symbol works in a similar fashion. It, however, is used to deliver input to the AWK program from a file. Traditional AWK doesn't provide methods of reading fixed-size chunks of data, nor does it provide methods of seeking to particular items in a file. Traditional AWK is limited to sequential file access.

TAWK provides a set of functions similar to those in the standard C library for file manipulation. Table 2 lists these function names and describes their uses.

The functions fseek() and ftell() are used in setting a logical file position and in determining a logical file position, respectively. These functions, coupled with fread() and fwrite(), provide random access to files.
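As an illustration of random access built from seek, tell, read, and write, here is a short Python sketch. It is not TAWK code: the fixed record layout (a 4-byte integer followed by a 16-byte name) and the helper names are assumptions chosen for the example.

```python
import struct
import tempfile

# Hypothetical fixed-size record: little-endian 4-byte id + 16-byte name (20 bytes).
RECORD = struct.Struct("<i16s")

def write_record(f, index, rec_id, name):
    f.seek(index * RECORD.size)            # position by record number
    f.write(RECORD.pack(rec_id, name.encode("ascii")))

def read_record(f, index):
    f.seek(index * RECORD.size)
    rec_id, raw = RECORD.unpack(f.read(RECORD.size))
    return rec_id, raw.rstrip(b"\0").decode("ascii")

with tempfile.TemporaryFile() as f:
    for i, name in enumerate(["alpha", "beta", "gamma"]):
        write_record(f, i, 100 + i, name)
    write_record(f, 1, 101, "BETA")        # overwrite the middle record in place
    print(read_record(f, 1), "position now", f.tell())
```

Because every record has the same size, the byte offset of record n is simply n times the record size, which is what makes seeking to an arbitrary record possible.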

flock() and funlock() are important functions in multiuser environments. flock() tells the operating system to prevent other processes from accessing a series of bytes in a given file. This allows the locking application to have exclusive control over those bytes. funlock() frees the lock, allowing these bytes to be utilized by any other process.

In addition, TAWK provides a series of directory-processing functions. The dirlist() function is used to capture a list of filenames from a specified directory into an associative array. The general format of the dirlist() function is dirlist(dirname, x), where dirname is a string denoting the specified directory and x is the associative array to be filled with the filenames. If the specified directory cannot be found, dirlist() returns False; otherwise, it returns True.

Since dirlist() simply returns an array of filenames, other functions are necessary to obtain specific information about a file. The functions filetime(), filemode(), and filesize() can be used to determine more information about a particular file. The filetime() function allows a TAWK program to determine the creation time, last modification time, or last access time of a given file. The filemode() function provides a means of determining the access attributes of a file. It can determine if a file has hidden, system, archive, or read-only attributes. filesize() determines the size in bytes of the specified file.
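The same two-step pattern (list the names, then query each file) can be sketched in Python. This is an analog, not TAWK code: dir_info is a hypothetical helper, and returning None for a missing directory is only meant to echo the False result described for dirlist().

```python
import os
import tempfile
import time

def dir_info(dirname):
    """Return {filename: (size_in_bytes, modification_time)}, or None if the directory is missing."""
    try:
        names = os.listdir(dirname)
    except (FileNotFoundError, NotADirectoryError):
        return None
    info = {}
    for name in names:
        st = os.stat(os.path.join(dirname, name))
        info[name] = (st.st_size, st.st_mtime)
    return info

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "hello.txt"), "w") as f:
        f.write("hello")
    for name, (size, mtime) in sorted(dir_info(d).items()):
        print(name, size, time.ctime(mtime))
```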

The Windows Interface

External DLL functions can be called by TAWK programs by first declaring the external function, then by calling it as though it were a native TAWK function. TAWK automatically coerces numeric data to the type specified in the declaration. The only time an additional step is required is when a function needs a pointer to a binary data structure. If that happens, pack() must first be used to prepare the binary structure before using the function. The SOCK_ADDR structure needed in the call to the Winsock _connect function (Listing One) was prepared using a call to pack().
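To show what such a packed structure looks like, the sketch below builds the same 16-byte socket address in Python with the struct module. It is an analog of the pack() call in Listing One, not TAWK code: make_sockaddr_in is a hypothetical helper, and the field order (little-endian family, big-endian port, four address bytes, eight padding bytes) is read off the listing's format string.

```python
import socket
import struct

def make_sockaddr_in(ip, port, family=socket.AF_INET):
    """Pack an IPv4 socket address into 16 bytes, following the layout in Listing One."""
    return (struct.pack("<H", int(family))   # 16-bit address family, little-endian
            + struct.pack(">H", port)        # 16-bit port, big-endian (network order)
            + socket.inet_aton(ip)           # 4 address bytes, already in network order
            + b"\0" * 8)                     # 8 bytes of zero padding

addr = make_sockaddr_in("127.0.0.1", 79)
print(len(addr), addr.hex())
```

Mixing byte orders within one structure is the reason a packing step is needed at all: the port must be in network order regardless of the host's native order.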

The built-in variable "DLLS" allows a program to specify the directory path used to search for all DLLs.

The Windows DEMO.AWK program provided with the TAWK release dynamically builds a GUI Windows application. The dialog, menu, and all other resources are constructed dynamically in the program. No resources are embedded in the executable file. Implementing a GUI application takes a little more effort with TAWK than with contemporary development environments such as Visual C++ or Delphi.

Many Windows API functions communicate with the given application by means of callback functions. A callback function is an application-hosted function that is exposed to the Windows operating system. This exposed function is called by Windows to deliver or retrieve data.

Two TAWK functions -- registercallback() and unregistercallback() -- facilitate usage of Windows callback functions. Each function accepts a single string parameter that names a TAWK function. The value returned from registercallback() can be passed as a function pointer to a given Win32 function as a callback address.

TAWK allows three callback functions to operate concurrently. unregistercallback() is used to prevent the specified function from being used as a callback function. The TAWK callback function must accept exactly four 32-bit integers as parameters. The program timer.awk (Listing Three) implements a callback function to service Windows timer messages.

The example programs provided with TAWK that utilize callbacks are all dialog programs that rely on the dialog processor's internal message pump to retrieve and dispatch messages from the application's message queue. In many cases, you'll need to perform this operation using a PeekMessage() loop similar to that in TIMER.AWK.

Internet Applications with Winsock

TAWK provides the source library SOCKET.AWK, which provides function prototypes, constants, and helper functions to facilitate TCP/IP programming.

TAWK provides two methods of communicating with a socket: an implementation of the send/receive functions from WSOCK32.DLL, and an implementation that references a socket through file descriptors. (The second method, however, did not work. At the time of this writing, Thompson Automation is investigating the nature of the problem.)

The native WSOCK32.DLL functions perform as desired. Listing One is source code for a simple Internet finger client. The finger protocol is a simple protocol that many Internet hosts provide as a means of identifying a specific user of their system. For example, the output of Example 1(a) would look something like Example 1(b).

My implementation of finger tries to connect to the host specified in the user's ID (that is, the name after the "@" symbol). After the connection to port 79 (the standard finger port) is successful, the program passes the constructed query string to the host and displays the results. When the host stops sending data, the finger program shuts down and closes the socket. The program then terminates. The WSAStartup() and WSACleanup() Winsock functions are called automatically in the INIT and TERM rules in the SOCKET.AWK library provided with TAWK.
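The flow just described (find the host, connect to port 79, send the query, read until the host stops sending) can be sketched in Python with the standard socket module. This is an analog of Listing One, not a translation of its TAWK calls: split_target and finger are hypothetical names, and the 15-second timeout is only loosely modeled on the listing's retry loop.

```python
import socket
import sys

def split_target(args):
    """Return (query, hostname) from arguments such as ['-l', 'user@host']."""
    hostname = ""
    for arg in args:
        if "@" in arg:
            hostname = arg.split("@", 1)[1]   # text after the first "@"
    return " ".join(args), hostname

def finger(args, port=79, timeout=15.0):
    """Send the query to the finger port and return whatever the host replies."""
    query, hostname = split_target(args)
    if not hostname:
        raise ValueError("no user@host argument given")
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        sock.sendall((query + "\r\n").encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(128)             # an empty result means the host closed the connection
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")

# Only contact a host when arguments are supplied on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    print(finger(sys.argv[1:]))
```

As in the listing, the whole argument string is sent to the host unchanged; whether a given finger server accepts that form of query depends on the server.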

The popularity of Internet services has created a niche software market for TCP/IP programs. TAWK's inherent text-processing power, coupled with a Winsock interface, makes it an attractive Internet scripting tool.

Conclusion

TAWK is a potent programming tool. You could certainly build GUI programs with TAWK, but I would not recommend it for that purpose. Rather, I see TAWK as a powerful scripting tool that can be used to quickly implement batch-oriented software.

For More Information

Thompson Automation Software
5616 SW Jefferson
Portland OR 97221
503-224-1639
http://www.tasoft.com/~thompson/

DDJ

Listing One

# Jim Lawless -- jimbo@radiks.net
# FINGER.AWK -- This program implements a simple finger client.
# Syntax: finger id     ( finger jimbo@radiks.net )
# or finger -l id  ( finger -l jimbo@radiks.net )
# To compile: awkcw -eo -xe -z finger.awk socket.awk


BEGIN {
   if( ARGC < 2 )
      syntax();
# Construct one long string out of the command-line parameters.
# Find the parameter with an "@" symbol and extract the hostname from it.
   for(i=1;i<ARGC;i++) {
      n=index(ARGV[i],"@");
      if(n) {
         hostname=substr(ARGV[i],n+1,
            length(ARGV[i])-n);
      }
      if( length(q) )
         q = q " " ARGV[i];
      else
         q = ARGV[i];
   }
   if( hostname == "" )
      syntax();
# Create a socket.
   sock=_socket( AF_INET, SOCK_STREAM, IPPROTO_TCP );
# Connect the socket to our host.
   host=gethostaddr(hostname);
   print "Connecting to host : " \
      hostname " (" host ")." ;
   sock_addr=pack("@<s @>s @<l @x @x @x @x @x @x @x @x",
      PF_INET,79,inet_addr(host));
   flag=_connect( sock,sock_addr,length(sock_addr));
# If the connection has been established, send the string q to the host.
   if( flag==0) {
      b=strdup("\0",128);
      q=q "\r\n";
      _send(sock,q,length(q),0);
# Now, wait for a reply...no more than 15 seconds.
     i=0;
     retry_count=0;
     while(!i) {
        i=_recv(sock,b,128,0);
        if(!i) {
           retry_count++;
           sleep(1);
           if(retry_count>15) {
              break;
           }
        }
     }
     while(1) {
# Make sure we didn't time out on the last operation!
        if( retry_count>15) {
           break;
        }
# Print the returned data block.
        if(i!=0) {
           for(j=1;j<=i;j++) {
              printf("%c",substr(b,j,1));
           }
# Get more data.
           i=_recv(sock,b,128,0);
        }
# If we're out of data, stop.
        else {
           break;
        }
     }
# disable sends and receives.
     _shutdown(sock,2);
     _closesocket(sock);
  }
}
# Get a long integer from a given address in memory.
function peekl(addr)
{
   local p=0;
   local i;
   for(i=0;i<4;i++) {
      p*=256;
      p+=peek(addr+3-i);
   }
   return(p);
}
# Get the host address, given a host name (this returns a string).
function gethostaddr(hostname)
{
   local ret;
   local p;
   local i;
   local s;
   local realname;
  
   ret = _gethostbyname(hostname)
   if (ret == 0)  {
      return("");
   }
# get the host address here!
   p=peekl(ret+12);
   p=peekl(p);


   s="";
   s=peek(p) "." peek(p+1) "." peek(p+2) "." peek(p+3);
   return(s);
}
# Get a null-terminated string from an address in memory.
function getstring(addr)
{
   local ar
   unpack("@a",addr,ar)
   return ar[1]
}
function syntax()
{
   print "\n" \
   "Syntax:\n" \
   "   finger id     ( finger jimbo@radiks.net )\n" \
   "or\n" \
   "   finger -l id  ( finger -l jimbo@radiks.net )\n";
   exit();
}


Listing Two

# Jim Lawless -- jimbo@radiks.net
# ASORT.AWK -- This program will utilize TAWK's associative-array
# capabilities implemented as a simple file-sorting program
# Syntax: asort infile [ options ] 
# Options: /i or /I, Ignore case-sensitivity; /d or /D, Descending sequence
# To compile: awkcw -eoe -xe -z asort.awk


BEGIN {
   if( ARGC < 2 ) {
      print "Syntax:\n" \
            "   asort infile [ options ] \n\n" \
            "Options:\n" \
            "   /i or /I   Ignore case-sensitivity\n" \
            "   /d or /D   Descending sequence\n" ;
      exit();
   }
# Set SORTTYPE to default ASCII / Ascending
   SORTTYPE=2;
# Check command-line and alter SORTTYPE as necessary
   for(i=1;i<ARGC;i++) {
      if( ARGV[i]=="/i" || ARGV[i]=="/I") {
         SORTTYPE+=4;
      }
      if( ARGV[i]=="/d" || ARGV[i]=="/D") {
         SORTTYPE+=8;
      }
   }
}
# Main input rule. Increment a table entry using whole line as a key. Counter
# built in ar array indicates how many occurrences of specified line exist.
{ ar[$0]++ }


# Now, we're done with input. Get values out of the array and display them.
END {
   for(i in ar) {
      j=ar[i];
      while(j--) {
         print(i);
      }
   }
}


Listing Three

# Jim Lawless -- jimbo@radiks.net
# TIMER.AWK -- This program implements a Windows TimerProc callback function.
# Declare externals.  


extern winapi int SetTimer(int hwnd,int id,int milli, void *);
# KillTimer takes a window handle and a timer ID, matching the Win32 API.
extern winapi int KillTimer(int hwnd,int id);
extern winapi int name "PeekMessageA" PeekMessage(void *,int,int,int,int);
extern winapi int TranslateMessage(void *);
extern winapi int name "DispatchMessageA" DispatchMessage(void *);
# Use a global variable to control the timer.
global counter;

BEGIN {
   counter=0;
# Register the callback.
   proc=registercallback("timerproc");
# Create a timer that will activate every 1000 milliseconds ( once per second )
   timer_id=SetTimer(0,0,1000,proc);
   PM_REMOVE=1;
# Create a buffer to hold the 28-byte MSG structure
   msg=strdup("\0",28);
   for(;;) {
# let Windows breathe!
      if( PeekMessage(msg,0,0,0,PM_REMOVE)) {
         TranslateMessage(msg);
         DispatchMessage(msg);
      }
# See if we've iterated 10 times.
      if( counter >= 10 )
         break;
   }
# Stop the timer
   KillTimer(0,timer_id);
# Free up the callback
   unregistercallback("timerproc");
}
# This function will be invoked by Windows
function timerproc( hwnd, message, wparam, lparam)
{
   counter++;
   print "In timerproc...iteration number " counter;
}



Copyright © 1997, Dr. Dobb's Journal

