Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Tools

Examining the Cocktail Toolbox


MAR96: Examining the Cocktail Toolbox

Rod, an engineer with Boeing aircraft, can be contacted at [email protected].


Half a century ago, General Motors and others began convincing railroads to use diesel instead of steam locomotives. Maintaining steam locomotives, they claimed, required a tremendous amount of labor, shop facilities, and supplies, not to mention having valuable capital assets off-line during the maintenance process. Steam fan that I am, it pains me to admit they were right--diesels were an improvement, at least when it came to running a transportation business.

In particular, one marketing tactic diesel manufacturers used was to put a demonstrator diesel locomotive on the tracks. In one case (or so the story goes), the demonstrator wore out its wheels in just three months. Skeptics immediately jumped on this unprecedented wear rate as a reason to reject the newfangled machine. Closer investigation revealed that the locomotive ran 87,000 miles in those three months, which was about the normal life of a set of wheels. However, steam locomotives could never approach that mileage in such a relatively short time. In short, while steam engines were sitting in the yards being oiled, greased, and ash-cleaned, diesels were on the tracks wearing off their wheels.

Using the Cocktail tool package for translator development makes me feel like that demo diesel in that the work leaves me worn out. But when I look back at what I have produced in a short time, I realize that a large amount of relatively routine work has been done for me, allowing me to concentrate on the real algorithmic meat of the task. The result is high mileage, which more than compensates for the rapid wear.

What is Cocktail?

Cocktail is a collection of tools for producing compilers, translators, and similar tools. It was originally developed at the German National Research Center's subsidiary at Karlsruhe. Cocktail contains implementations for six special-purpose languages that generate most of the parts of a translator.

Cocktail is available for both MS-DOS and UNIX. To run the binaries under DOS, you'll need an 80386/486, at least 2 MB of memory, and the go32 DOS extender from DJ Delorie. To compile the sources, you'll need the gcc GNU C compiler (Delorie's GCC port to DOS), the GNU make (gmake), and the ar and ranlib archive-handling tools. For more information, see text box on page 82. A commercial implementation (including support) is available from Josef Grosch ([email protected]), whose company provides support for the free version, along with translator-development services.

Two of the languages included with Cocktail, rex and lalr, are approximate functional equivalents of the ubiquitous lex and yacc. They generate lexical scanners and LALR parsers, respectively. However, they generate much faster code than lex and yacc and have valuable additional features. Also, they integrate smoothly with the other languages.

Another language, ell, is actually nearly identical to lalr. However, its generator produces top-down parsers instead of bottom-up parsers. This somewhat limits the grammars it can parse, but it also yields faster parsers and makes some kinds of semantic processing easier.

The ast language supports specifying abstract syntax trees (ASTs). Many translator programs build ASTs internally, and there is a very large amount of relatively boring code to be written to support them. Ast generates it for you.

The ag language generates evaluator programs for attribute grammars (AGs). AGs are a very powerful formal system for specifying static semantic analysis. An AG defines, in a declarative style, how to compute information about the nodes of a tree. The ag language is a superset of ast, which it uses to define the AST on which it works. The same generator program, cg, implements both ast and ag.

Finally, the puma language handles term-rewrite systems, another kind of powerful formal system. They define how a tree can be transformed into another tree, using patterns and replacements.

The generators produce source code ("target code") in either Modula-2 or C, which is then compiled as usual. Handwritten target code can be embedded in many places in the input to the generators. This is a powerful escape mechanism for handling things that can't be expressed in the relevant formal system.

Ast and Ag

Ast is a language for defining a tree grammar for an AST; lalr is for defining a string grammar for a concrete syntax. Although concrete and abstract syntaxes are considerably different in style, the developers of Cocktail recognized that the same language could be used to define either style.

The cg generator has a function that takes a concrete grammar and its related semantic actions, written in the ast and ag languages, and converts it into lalr. It can also emit rex code for most of a scanner specification. I find this the easiest way to use Cocktail, as most of the specification is written in a single notation.

I will use a primitive programming language called "Lox" for illustrative purposes. Lox has a PRINT statement, an IF with only a THEN clause, a compound statement for combining other statements, and simple expressions involving integer literals, addition, multiplication, and parentheses.

The first 50 lines of Listing One show the concrete syntax in ast notation. I've used the convention that all grammar symbols in the concrete syntax start with "Cs". Before I adopted this and other, similar conventions, I found myself confusing grammar symbols with attributes of the concrete and abstract syntax.

A grammar symbol followed by an equal sign and more symbols to its right is a production. For example, in line 27, a PRINT statement, denoted by grammar symbol CsPrintStmt, consists of the literal characters 'PRINT' followed by an expression, denoted by CsExpr, which is defined by the production beginning on line 36.

Alternative right-hand sides are enclosed between "<" and ">" and separated by commas. Each alternative can have its own name ahead of an equal sign. For example, a statement (CsStmt, line 26) can be either a PRINT statement, an IF statement (CsIfStmt, line 28) or a compound statement (CsCompoundStmt, line 32). The names for the alternatives are optional in ast, and they are not necessary for defining the syntax. However, they will be needed later to build an AST.

Rex

Together, cg and lalr can generate an LALR parser. In addition, from the same input, cg generates the boring but voluminous part of a scanner specification in the language rex, namely, all the tokens of fixed spelling such as "BEGIN" and ";".

The rest of the scanner specification must be handwritten; see Listing Two. For my simple language, the only token of variable spelling is an integer literal, specified in line 47 as "Digit +." Digit is defined in line 39 as any digit, and the "+" is the familiar "zero or more occurrences of" operator of regular expressions.

When a literal is recognized, lines 48 and 49 call the function LiteralValue to compute its integer value and store this in attribute CaValue. An attribute is a field attached to a node of a tree that is not part of the tree structure itself. I use the convention that attributes of symbols in the concrete syntax start with "Ca".

Line 50 returns the recognized grammar symbol (TokLiteral, in this case) from the scanner. The material between braces on these three lines is all target code, handwritten in Modula-2, which rex copies almost verbatim into the generated scanner.

Function LiteralValue is defined in lines 16 through 34. This also is target code, as the braces around it show. It converts the characters of the literal into an integer value.

Building an Abstract Syntax Tree

Cg can also be used to define and build an abstract syntax tree (AST). Listing Three contains an abstract syntax for Lox. Although the notation is the same as for the concrete syntax, the style is quite different. All the delimiting tokens, such as IF, "(", and the like, are absent from the abstract syntax.

The concrete syntax describes the structure of input in string form. There, the delimiters are needed to determine what construct is next and where its subcomponents begin and end. The abstract syntax, in contrast, defines a tree rather than a string. Each node in the tree contains a node operator that tells exactly what construct it is (making delimiter symbols redundant).

You can think of this in object-oriented terms. A grammar symbol such as AsExpr on line 27 is like an abstract class. (This is a different meaning of "abstract.") That is, objects (or tree nodes) of type AsExpr are never actually created. Instead, only nodes of type AsSumExpr, AsProductExpr, and AsLiteral are created. When it is used on the right side of a rule, AsExpr can refer to any of these; see line 19. My convention is that "As" begins the names of abstract-syntax grammar symbols.

Conceptually, there is a tree even for a concrete syntax, usually called a "derivation tree.'' This form of tree isn't the most desirable for further processing. So parsers usually just discover the derivation tree one step at a time but never actually construct it in memory.

Given only a concrete syntax, cg and lalr generate a parser that doesn't produce any record of what it has done. You must add definitions that tell the parser what to produce. I have done this in a way that builds an AST, according to the grammar in Listing Three, while parsing according to the grammar in Listing One.

The construction works by attaching attributes to the nodes of the derivation tree and computing their values during parsing. In general, the values of attributes of tree nodes can depend on other attributes of the same node or attributes of its parent, children, or siblings. Thus, information can flow all over the tree in all directions.

In cases where a tree is actually built, cg can infer the tree-traversal order necessary to evaluate a set of attributes and generate a set of procedures that traverse in this order while computing the attributes. You need to specify only the local rules. This is the real power of attribute grammars and evaluator generators like cg.

However, we are evaluating attributes of the concrete grammar--whose tree is never built--during parsing, which is done bottom-up. So we are limited to attribute values that depend only on attributes of the same node and its immediate children.

Fortunately, this is usually adequate for building an AST. The Build module, starting at line 51 of Listing One, does this. Lines 71 through 73 declare that every symbol in the concrete syntax has an attribute named CaAst of type tAst, which is a pointer to an AST node. CaAst will hold the root of the AST that corresponds to this concrete grammar symbol.

The attribute declaration just adds CaAst to the previously declared properties of the grammar symbols CsProgram, CsStmt, and so on. All of their descendent symbols, such as CsPrintStmt, CsIfStmt, CsSumExpr, and the like will inherit this attribute.

The property SYNTHESIZED tells cg to copy the value of this attribute by default from a child node if no explicit rule is given to compute it.

Rules for computing the attributes CaAst of the concrete symbols begin at line 77. Take line 112: The function mAsIfStmt is one of a complete set of functions generated by cg from the AST definition of Listing Three. It allocates and constructs an AST node whose operator is AsIfStmt and whose children are supplied as parameters. These are just the AST subtrees for the expression and statement used to build the IF statement.

The notation CsExpr : CaAst in line 115 defines the attribute CaAst of the child node whose node type is CsExpr. This is target code, as the braces show, but it is not quite Modula-2.

Cg recognizes CsExpr : CaAst as a reference to an attribute and replaces it with the appropriate expression to access the actual attribute. Cg will have stored this in some temporary place. Also, cg notes that CaAst of the parent CsIfStmt node depends on CaAst of the CsExpr child node. It uses this information to determine traversal orders.

In a more complicated language, other attributes would collect information to be stored into AST nodes whose construction is postponed. The linking of AST nodes also could be more complicated. For example, I have a C parser and AST builder which inverts the inside-out structure of declarators to match the right-side-out structure of type specifiers and function formal-parameter lists.

Notice the parameter to the mAsLiteral constructor in line 146. Although AsLiteral has no children, it has attribute AaInitValue, declared in line 34 of Listing Three. mAsLiteral sets the value of AaInitValue. The constructor call just uses the value of attribute CaValue.

An Attribute Evaluator

The Evaluate module, beginning at line 37 of Listing Three, defines an attribute evaluator that works on the AST after it is already constructed. I have written this to compute the values of Lox expressions, which contain only constants.

Line 46 declares that every AsExpr node has an additional attribute--AaValue--that AsSumExpr, AsProductExpr, and AsLiteral will inherit.

The property OUTPUT tells cg that the value of this attribute should be explicitly stored in the tree, because it is expected to be used later. Attribute AaInitValue has the property INPUT, which means it should have been computed prior to attribute evaluation, during AST building. Cg will optimize the storage of attributes that are neither INPUT nor OUTPUT by keeping them in local variables of the evaluator, rather than in the tree nodes.

The rest of the rules in Listing Three tell how to compute AaValue for the various subclass nodes of AsExpr. Line 61 just copies from AaInitValue to AaValue of the AsLiteral node. This seeming redundancy gives every AsExpr node a consistent attribute (AaValue), while the AaInitValue is unique to the AsLiteral node and is computed at a different time.

Cg doesn't know that I intend never to create a node with operator AsExpr, so it insists on a rule to compute its AaValue attribute. Line 63 provides this.

This attribute evaluator happens to work strictly bottom-up, so I could have done this work at AST build time. In a more- complicated language, I might need an attribute evaluator that traverses the tree in complex ways.

Tree Transformations

Puma generates procedures that match patterns in trees. Once a pattern is matched, a new tree can be constructed and used in many ways: for checking type compatibility, computing result types of functions, or transforming a tree.

Listing Four is a simple tree transformer that folds the IF statement in Lox. Since all expressions are evaluated, it is possible to decide whether the THEN clause should be executed. If so, the IF statement can be replaced by just the THEN clause.

To avoid deallocation, I copy all tree nodes that could possibly have changes in their subtrees. Expressions can't have any subordinate changes, so I reuse them in place.

Most of the functions and rules in Listing Four simply make copies of AST nodes. The pattern at line 42 is interesting: It matches an AsIfStmt whose first child is an AsExpr whose first component (attribute AaValue) has value zero.

The pattern also binds local identifier ThenStmt, the second child of the AsIfStmt. ThenStmt is used in the pattern replacement in line 46. This is just a Fold done on the THEN clause.

Odds and Ends

In addition to the generators I've just described, Cocktail includes other useful components. All host-platform-dependent functions are collected in a single library. These routines can be modified as necessary to make all the Cocktail tools run on a particular machine. I have installed (by recompiling) Cocktail on two flavors of UNIX, without change to the portability library.

There also is a modest library of functions likely to be useful in a translator program, including the generators themselves, which use the library. Simple examples are manipulation and storage of variable-length strings and hashed symbol-table mapping of variable-length strings to a compact series of integers.

There are translator programs between lex and rex and between yacc and lalr. Mtc is a good Modula-2-to-C translator that produces amazingly readable code for a source-to-source translator. I have used mtc to bootstrap fairly large bodies of Modula-2 code to a platform for which I didn't have a handy Modula-2 compiler.

There also is a fair-sized collection of example scanners and parsers for various programming languages. The commercial version has an expanded set, including difficult languages such as Fortran.

For the examples in this article, 300 lines of handwritten input code to the various Cocktail tools generated 3700 lines of Modula-2 code. This ratio is consistent with the ratio I've achieved on realistic-sized projects. Some of this might not be used in a particular application and some is bigger than handwritten code, but there is a high gain to be had.

Using Cocktail means writing very high-density code. There is a lot of meat in a small space. It is tiring, but it gives so much leverage, you can afford to take more breaks and still finish a project much faster. It's like a diesel running 87,000 miles in three months. But for some reason, I don't have the same "steam fan's" nostalgia for the old system of handwriting translator code.

For More Information

Delorie's GCC port to DOS is available at grape.ecs.clarkson.edu in the /pub/djgcc directory. Free versions of Cocktail (with source and documentation) are available at ftp://ftp.ira.uka.de:/pub/programming/cocktail/ and ftp://ftp.unistuttgart.de/pub/unix/programming/compilerbau/ (including a DOS version using DJGPP).

Listing One

 1     
 2     /* Concrete syntax */ 
 3     
 4     PARSER Parser 
 5     
 6     RULE
 7     
 8       START = CsProgram . 
 9     
10     /* Terminal symbol: */ 
11     
12       TokLiteral : 
13         [ CaValue : CARDINAL ] 
14         { CaValue := 0 ; } . 
15             
16     /* Main grammar: */ 
17     
18       CsProgram = CsStmt .
19     
20       CsStmtPlus
21         = < CsStmtPlusLast = CsStmt .
22             CsStmtPlusMore 
23               = CsStmt ';' CsStmtPlus . 
24           > .
25     
26       CsStmt
27         = < CsPrintStmt 
28               = 'PRINT' CsExpr .
29             CsIfStmt
30               = 'IF' CsExpr
31                 'THEN' CsStmt .
32             CsCompoundStmt  
33               = 'BEGIN' CsStmtPlus 'END' . 
34           > . 
35     
36       CsExpr
37         = < CsSimpleExpr = CsTerm .
38             CsSumExpr = CsExpr '+' CsTerm . 
39           > . 
40     
41       CsTerm
42         = < CsSimpleTerm = CsFactor .
43             CsProductTerm = CsTerm '*' CsFactor .
44           > . 
45     
46       CsFactor
47         = < CsLiteral = TokLiteral .
48             CsParenthesizedExpr = '(' CsExpr ')' . 
49           > .
50     
51     MODULE Build
52     
53     /* Actions to build AST. */ 
54     
55     PARSER
56     
57     GLOBAL
58       { FROM Ast IMPORT 
59           AstRoot , tAst 
60           , mAsProgram 
61           , mAsStmtStarNone , mAsStmtStarAnother 
62           , mAsPrintStmt , mAsIfStmt
63           , mAsCompoundStmt , mAsSumExpr
64           , mAsProductExpr , mAsLiteral ;
65       }
66     
67     PROPERTY /* Disable implicit INPUT */
68     
69     DECLARE 
70     
71       CsProgram , CsStmt , CsStmtPlus 
72       , CsExpr , CsTerm , CsFactor
73         = [ CaAst : tAst SYNTHESIZED ] .
74     
75       /* Storing Final tree. */ 
76     
77       CsProgram
78         = { CaAst
79               := { CaAst 
80                      := mAsProgram
81                          ( CsStmt : CaAst ) ; 
82                    AstRoot := CaAst ;
83                  } ; 
84           } . 
85     
86       /* Linking statments into lists. */
87     
88       CsStmtPlusLast
89         = { CaAst
90               := mAsStmtStarAnother
91                    ( mAsStmtStarNone ( ) 
92                    , CsStmt : CaAst
93                    ) ;
94           } . 
95     
96       CsStmtPlusMore
97         = { CaAst
98               := mAsStmtStarAnother
99                    ( CsStmtPlus : CaAst
100                    , CsStmt : CaAst
101                    ) ;
102           } . 
103     
104       /* Building statement nodes. */ 
105     
106       CsPrintStmt
107         = { CaAst
108               := mAsPrintStmt
109                    ( CsExpr : CaAst ) ;
110           } .  
111                    
112       CsIfStmt
113         = { CaAst
114               := mAsIfStmt
115                    ( CsExpr : CaAst 
116                    , CsStmt : CaAst
117                    ) ;
118           } .  
119                    
120       CsCompoundStmt
121         = { CaAst
122               := mAsCompoundStmt
123                    ( CsStmtPlus : CaAst ) ;
124           } .  
125     
126       /* Building expression nodes. */
127                    
128       CsSumExpr
129         = { CaAst
130               := mAsSumExpr
131                    ( CsExpr : CaAst
132                    , CsTerm : CaAst
133                    ) ;
134           } . 
135     
136       CsProductTerm
137         = { CaAst
138               := mAsProductExpr
139                    ( CsTerm : CaAst
140                    , CsFactor : CaAst
141                    ) ;
142           } . 
143     
144       CsLiteral
145         = { CaAst
146               := mAsLiteral
147                    ( TokLiteral : CaValue ) ;
148           } . 
149     
150     END Build 
151     

Listing Two

 1     
 2     /* Handwritten part of scanner spec. */
 3     
 4     SCANNER Scanner 
 5     
 6     EXPORT 
 7       { FROM Positions IMPORT tPosition ; 
 8         INSERT tScanAttribute 
 9       } 
10     
11     GLOBAL
12       { INSERT ErrorAttribute } 
13     
14     LOCAL
15     
16       { PROCEDURE LiteralValue ( ) : CARDINAL 
17     
18         ; VAR I , Value : CARDINAL ; 
19         ; VAR String : Strings . tString ; 
20         ; BEGIN
21             GetWord ( String ) 
22           ; Value := 0 
23           ; FOR I := 1 
24               TO Strings . Length ( String ) 
25             DO
26               Value 
27                 := Value * 10 
28                    + ORD ( Strings . Char 
29                              ( String , I )
30                          )
31                      - ORD ( '0' )
32             END 
33           ; RETURN Value 
34           END LiteralValue ; 
35        }
36     
37     DEFINE
38      
39       Digit = {0-9} .
40     
41     START Literal 
42     
43     RULE
44     
45       INSERT RULES #STD# 
46     
47       #STD# Digit + 
48          : { Attribute . TokLiteral . CaValue
49                := LiteralValue ( ) ;
50              RETURN TokLiteral ;
51            } 
52     

Listing Three

 1     
 2     /* Abstract syntax */ 
 3     
 4     TREE Ast VIEW Ast 
 5     
 6     RULE
 7     
 8       AsProgram = AsStmt .
 9     
10       AsStmtStar
11         = < AsStmtStarNone = . 
12             AsStmtStarAnother
13               = Next : AsStmtStar 
14                 Stmt : AsStmt . 
15           > . 
16     
17       AsStmt 
18         = < AsPrintStmt 
19               = AsExpr .
20             AsIfStmt
21               = AsExpr AsStmt .
22             AsCompoundStmt  
23               = AsStmtStar . 
24           > . 
25     
26       AsExpr
27         = < AsSumExpr 
28               = Left : AsExpr 
29                 Right : AsExpr . 
30             AsProductExpr 
31               = Left : AsExpr 
32                 Right : AsExpr .
33             AsLiteral 
34               = [ AaInitValue : INTEGER ] . 
35           > . 
36     
37     MODULE Evaluate 
38     
39     EVAL Eval
40     
41     PROPERTY  
42     
43     DECLARE 
44     
45       AsExpr 
46         = [ AaValue : INTEGER OUTPUT ] . 
47     
48       AsSumExpr
49         = { AaValue
50               := Left : AaValue
51                  + Right : AaValue ; 
52           } .  
53     
54       AsProductExpr
55         = { AaValue
56               := Left : AaValue
57                  * Right : AaValue ; 
58           } . 
59     
60       AsLiteral
61         = { AaValue :- AaInitValue ; } .
62     
63       AsExpr
64         = { AaValue := 0 ; } . 
65     
66     END Eval 
67     

Listing Four

 1     
 2     /* Transformation specification. */
 3     
 4     TRAFO Trans 
 5     
 6     TREE Ast 
 7     
 8     PUBLIC FoldProgram  
 9     
10     EXTERN Ast , ReleaseAst 
11     
12     GLOBAL
13       { FROM Ast IMPORT ReleaseAst ; } 
14     
15     FUNCTION FoldProgram  
16       ( AsProgram ) AsProgram 
17     
18       AsProgram ( Stmt )
19         RETURN 
20           AsProgram ( FoldStmt ( Stmt ) ) ; . 
21     
22     FUNCTION FoldStmtStar 
23       ( AsStmtStar ) AsStmtStar
24     
25       AsStmtStarAnother ( Next , Stmt ) 
26         RETURN 
27           AsStmtStarAnother
28             ( FoldStmtStar ( Next ) 
29             , FoldStmt ( Stmt ) 
30             ) ; . 
31     
32       Tree RETURN Tree ; . 
33     
34     FUNCTION FoldStmt  
35       ( AsStmt ) AsStmt 
36     
37       AsCompoundStmt ( Stmts ) 
38         RETURN 
39           AsCompoundStmt 
40             ( FoldStmtStar ( Stmts ) ) ; . 
41     
42       AsIfStmt 
43         ( AsExpr ( { 0 } , .. ) 
44         , ThenStmt 
45         )
46         RETURN FoldStmt ( ThenStmt ) ; . 
47     
48       AsIfStmt ( Expr , Stmt ) 
49         RETURN 
50           AsIfStmt 
51             ( Expr  
52             , FoldStmt ( Stmt ) 
53             ) ; . 
54     
55       Tree RETURN Tree ; . 
56     


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.