A CGI program that parses templates
Jay, a webmaster in Ohio, can be reached at email@example.com.
My organization's web site has two pages -- a calendar of events and the obligatory list of links to related sites -- that are updated with new information e-mailed by visitors. While the volume of e-mail requesting these updates has not been huge, I am often too busy to update the pages promptly.
The obvious solution is to allow users to update these pages themselves, using a web form and Common Gateway Interface (CGI) program that updates the HTML page automatically, so that changes appear instantly. Rather than create a unique CGI program for updating each particular page, I wrote a more general, template-driven CGI application that can easily be extended to support updates of any page. In this article, I'll present an application called "dynadd" that uses the template language and parsing techniques I developed.
As Figure 1 illustrates, dynadd (available electronically; see "Availability," page 3) involves four files (not counting the program itself):
- An HTML form for the users to supply updates. It includes hidden parameters specifying the names of the remaining files.
- A vertical-bar-delimited ASCII file that holds the data supplied by users.
- An output-file template. This, of course, is the key to the application. Most of the template consists of generic HTML and plain text that will be copied verbatim to the output file. Additionally, there are some special pseudotags and pseudoentities to control where and in what format data from the data file is inserted in the fixed text.
- An HTML output file created by dynadd from the template file and the raw data.
Listings One, Two, and Three are examples of the template file (dyndemo.htx), data file (dyndemo.dat), and input form (dyndemo.html), respectively. UNIX users may have to change the file permissions to permit updates to the data file. If your security does not permit web users to create new files -- and it normally should not -- you must create a writeable dummy output HTML file before the first run.
Once the output page is created, it is accessed like any ordinary HTML file. I could have implemented a program to generate the page on-the-fly every time it is accessed, but this would have been less efficient, since the page will most likely be read more often than it is updated.
The template file is ordinary HTML with the addition of some special markup tags and entities. Any text (including markup) that the program does not recognize is simply passed through to the output file. The program makes no attempt to process or validate the HTML. This is done quite deliberately: It means that any legal HTML, including future tags and extensions, will process successfully. Of course illegal HTML will also pass, but there are plenty of other ways to validate your HTML.
The additional information resembles HTML tags and entities. I deliberately made this look like "real" HTML because it seemed like an elegant solution. It has the drawback of possibly conflicting with future HTML enhancements. In the case of such a conflict, the constructs would be interpreted as dynadd constructs and not as HTML, because dynadd processes the file before a web browser ever sees it.
Specifically, dynadd uses two pairs of pseudotags, <detail> and <if>, and their corresponding terminators, </detail> and </if>.
<detail> and </detail> identify blocks of text containing markers that identify where the data from the data file should go. Everything between these two tags is processed once for each record in the data file.
Fields from the data file are included by using the field name as if it were an entity; that is, preceding the name of the field with an ampersand, and ending it with a nonalphanumeric character. If this character is a semicolon, the semicolon is discarded; any other terminator is retained as part of the text. Any additional text is passed through unchanged. Note that the text can include markup, and fields can be included in markup, so you can say things like "<a href=&someurl>".
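The terminator rule can be sketched in a few lines of Python. This is only an illustration of the rule as described, not dynadd's own code; the function name substitute_fields and the use of a dictionary for the current record are assumptions made for the example.

```python
import re

# "&name" where name is alphanumeric; group 2 captures an optional ";".
# Any other terminator is left outside the match, so it stays in the text.
_FIELD_REF = re.compile(r"&([A-Za-z0-9]+)(;?)")

def substitute_fields(text, record):
    """Replace &name references in text with values from record (a dict).

    Names that are not fields of the record are passed through unchanged,
    so genuine HTML entities such as &amp; survive.
    """
    def replace(match):
        name = match.group(1)
        if name not in record:
            return match.group(0)   # not a field: copy as-is, ";" included
        return record[name]         # a field: the ";" (if any) is discarded
    return _FIELD_REF.sub(replace, text)

record = {"url": "http://www.ddj.com", "title": "Dr Dobb's Journal"}
line = substitute_fields('<a href="&url">&title;</a>', record)
```

Here the quotation mark after &url is kept because it is not a semicolon, while the semicolon after &title is dropped.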
The <if> pseudotag allows you to define conditional constructs. The word "if" is followed by a logical expression. If the expression evaluates to false, any text up to the corresponding </if> is discarded; otherwise, the encapsulated text is processed normally. You may nest <if>s.
<if>s can be used to exclude records that fail some criteria. For example, in the calendar of events, I exclude events with past dates. <if>s can also be used to eliminate headings when the corresponding field is blank, or to vary text depending on the data for the particular record.
The only operations that may be used in <if> expressions in this version of the program are string comparisons and the logical operators "and" and "or." Other operators could be added if needed. There is no way to include the result of an expression in the output as text. Such a feature could easily be added, but again, I haven't needed it, so I haven't bothered.
My original version used conventional mathematical symbols like "<" and ">" for relations, but HTML uses these symbols to identify markup. So, instead, I resorted to the Fortran relational operators gt (greater than), le (less than or equal), eq (equal), and so on. All operators must have a space before and after. (You can also use "=" for equality, but it must be surrounded by spaces.) As a special case, an expression consisting of a single field name is considered true if the field is not empty, and false if it is empty. This conveniently handles the most common requirement: suppressing headings when a field is omitted.
The program allows more than one <detail> section in a single template, which causes the data file to be processed more than once. You can use this feature, combined with <if>s, to exclude entire records or to break the data into sections. That is, follow the first occurrence of <detail> with an <if> that will only pass records wanted for the first section; follow the second <detail> with an <if> that will only pass records for the second section; and so on. This is inefficient in that it causes the same records to be read multiple times, but for a modest-sized data file, it is perfectly adequate, and it provides the flexibility of including the same record in multiple sections if desired.
Dynadd's parser has three levels:
- The "document level" copies HTML to the output, and stores <detail> text in a buffer.
- The "detail level" parses the <detail> text, reads the data file, and fills in the file data where appropriate.
- The "expression level" parses the logical expressions contained within <if> tags.
In both the document and detail levels, you need to distinguish between three basic conditions: fixed text, tag, or entity. One buffer stores tag text and another buffer stores entity text. The pointers to these buffers double as flags: NULL means off and non-NULL means on. These pointer flags then describe the three possible conditions. A non-NULL entity pointer indicates that the current token is an entity. A non-NULL tag pointer shows that we are parsing a tag. When both are NULL, we are parsing fixed text. Technically, it is legal to have entities inside tags, so it is possible for both to be active, in which case the entity takes precedence.
When dynadd reads "<", indicating the start of a markup tag, it adds the subsequent text to the tag buffer until it sees a terminating ">". At that point, dynadd checks the tag buffer to see if the tag is a pseudotag; if not, then it is just passed to the output.
When dynadd sees an "&", it adds the subsequent text to the entity buffer. As with the markup tag, dynadd then checks the entity buffer to see if it is a pseudoentity. If it is, dynadd copies the value of that field for this record to the output; otherwise, it just passes it through to the output.
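A rough Python sketch of this scanning scheme follows. It is not the program's actual code: the function name scan, the PSEUDOTAGS set, and the choice to merely collect pseudotags rather than act on them are assumptions for illustration. Python's None stands in for the NULL pointer flags.

```python
PSEUDOTAGS = {"detail", "/detail", "if", "/if"}

def scan(source, fields):
    """Copy source to the output, expanding &field references.

    Returns (output_text, pseudotags_seen). A full implementation would act
    on the pseudotags; this sketch only records them.
    """
    out = []        # text destined for the output file
    seen = []       # pseudotags recognized along the way
    tag = None      # tag buffer; None plays the role of a NULL pointer
    entity = None   # entity buffer; None means no entity in progress

    def emit(s):
        # While a tag is open, text accumulates in the tag buffer instead.
        (tag if tag is not None else out).append(s)

    def end_entity(terminator):
        """Close the entity; return True if the terminator still needs handling."""
        nonlocal entity
        name = "".join(entity)
        entity = None
        if name in fields:
            emit(fields[name])
            return terminator != ";"    # a ";" after a field is discarded
        emit("&" + name)                # unknown entity: pass through
        return True

    for ch in source:
        if entity is not None:          # the entity state takes precedence
            if ch.isalnum():
                entity.append(ch)
                continue
            if not end_entity(ch):
                continue
        if ch == "&":
            entity = []
        elif ch == "<" and tag is None:
            tag = []
        elif ch == ">" and tag is not None:
            body = "".join(tag)
            tag = None
            words = body.split()
            if words and words[0].lower() in PSEUDOTAGS:
                seen.append(body)
            else:
                out.append("<" + body + ">")   # ordinary HTML: copy through
        else:
            emit(ch)
    if entity is not None:              # input ended inside an entity
        end_entity("")
    if tag is not None:                 # unterminated tag: pass through
        out.append("<" + "".join(tag))
    return "".join(out), seen

text, tags = scan('<detail><li><a href="&url">&title;</a> &amp; more</detail>',
                  {"url": "http://www.ddj.com", "title": "DDJ"})
```

In this sketch, a field that appears inside a tag is expanded into the tag buffer, so the completed tag reaches the output with the value already filled in.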
The expression parser works by converting the incoming expression to Shilop (reverse Polish). In normal "algebraic" notation, you write an operand, followed by an operator and another operand, like "3+7". In Shilop, you write "3,7+". This format is easier to parse.
All the operators are binary, which simplifies the parsing scheme. Since dynadd deals primarily with string operands in evaluating logical operations, a null string is considered false, and everything else is true. dynadd uses "Y" when it has to create a True string.
This implementation uses two stacks: one for values and one for operators. First, it collects a token. If it's a literal, it pushes it onto the top of the value stack. If it's a variable, it finds its value and pushes that on the value stack. In this program, you do almost no arithmetic, so everything is stored as text strings. Extending the parser to handle other data types is not particularly difficult: The main catch is dealing with incompatible data types, such as the expression if "hello" == 7.
If the token is an operator, the parser pops an operator off the operator stack and compares its precedence with that of the current operator. If the popped operator has higher precedence, the parser pops two values from the value stack, applies the popped operator to them, pushes the result onto the value stack, and then pushes the current operator onto the operator stack. If the current operator has higher precedence, the popped operator is pushed back onto the stack, and then the current operator is pushed on top of it. If the operator stack is empty, the current operator is simply pushed. When the end of the expression is reached, anything remaining on the operator stack is popped and processed, leaving the final result on the value stack.
Consider the expression in Example 1 where the expression parser proceeds as follows: year is a variable, so it finds its value and pushes it onto the value stack. ge is an operator. Since the stack is empty, ge is pushed onto the operator stack. 1997, a literal, is pushed onto the value stack.
The next token, and, is an operator with lower precedence than ge, the operator on top of the operator stack. So ge is popped along with the values of year and 1997. The comparison is evaluated and its result is pushed onto the value stack. and is then pushed onto the operator stack.
event is then read and its value is pushed onto the value stack. Now, the current operator, eq, has higher precedence than and, the operator on the top of the stack. So eq is pushed and nothing is popped or evaluated. convention is pushed onto the value stack, and nothing is left in the expression, so everything remaining on the value and operator stacks is popped and processed. This gives you the final result.
If you ever find that you don't have enough values on the stack, then there must have been a syntax error in the input. Ditto if you have more than one value on the value stack when you're done.
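The two-stack scheme can be sketched in Python as follows. This is an illustration under stated assumptions rather than dynadd's code: the name evaluate, the precedence numbers (comparisons above "and", "and" above "or"), the rule that a bare word is a variable only if the record has a field by that name, and the use of a loop that also reduces operators of equal precedence are all choices made for the example.

```python
import re

TRUE, FALSE = "Y", ""   # the article's convention: an empty string is false

# Binary operators: name -> (precedence, function of two strings).
OPERATORS = {
    "or":  (1, lambda a, b: a != "" or b != ""),
    "and": (2, lambda a, b: a != "" and b != ""),
    "eq":  (3, lambda a, b: a == b),
    "=":   (3, lambda a, b: a == b),
    "ne":  (3, lambda a, b: a != b),
    "lt":  (3, lambda a, b: a < b),
    "le":  (3, lambda a, b: a <= b),
    "gt":  (3, lambda a, b: a > b),
    "ge":  (3, lambda a, b: a >= b),
}

def evaluate(expression, record):
    """Evaluate an <if> expression against one record; return True or False."""
    values, operators = [], []

    def apply_top():
        # Too few values for an operator means a syntax error in the input.
        if len(values) < 2:
            raise ValueError("syntax error: missing operand")
        right, left = values.pop(), values.pop()
        func = OPERATORS[operators.pop()][1]
        values.append(TRUE if func(left, right) else FALSE)

    # Tokens are quoted strings or runs of non-blank characters.
    for token in re.findall(r'"[^"]*"|\S+', expression):
        if token in OPERATORS:
            precedence = OPERATORS[token][0]
            # Reduce while the stacked operator binds at least as tightly.
            while operators and OPERATORS[operators[-1]][0] >= precedence:
                apply_top()
            operators.append(token)
        elif token.startswith('"'):
            values.append(token.strip('"'))            # quoted literal
        else:
            values.append(record.get(token, token))    # field value or bare literal
    while operators:
        apply_top()
    if len(values) != 1:
        raise ValueError("syntax error: wrong number of operands")
    return values[0] != FALSE

record = {"year": "1997", "event": "convention", "submit": ""}
result = evaluate("year ge 1997 and event eq convention", record)
```

With this record, the walk-through expression evaluates to true, and an expression consisting only of the empty submit field evaluates to false. All comparisons are string comparisons, so numeric fields compare correctly only when they have the same number of digits.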
Tying the Pieces Together
Somehow the program must be able to put fields from the add-entry form into the proper place in the data file, and then retrieve these fields correctly when processing the template. In my first version, the fields were simply numbered, with the number reflecting the order in which they were stored in the data file. That is, field &1 was the first field in the record, field &2 was the second, and so on. This was simple, but it quickly proved difficult to work with.
Now, I use a slightly more sophisticated scheme. The data file uses a special parameter record, labeled with #fields=, that lists the field names. When the add-entry form is processed, the form field names are matched against the field names in the data file parameter record. The fields are then stored in the order specified by the parameter record, which is not necessarily the order in which they occur on the form. When the template is processed, field names are again matched to the parameter record, and the appropriate field is extracted from the data record. Note that this means that new fields may be added without upsetting any existing data or the order in which users see the fields -- provided that they are added to the end of the data record. Fields on either the add-entry form or the output page may be freely rearranged.
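A minimal Python sketch of this name-matching scheme is shown here. The function names are hypothetical, the parameter record is assumed to be spelled #fields= as in the data file listing, and the sketch ignores any bookkeeping columns the real data file may carry in each record.

```python
def read_field_names(lines):
    """Return the field names from the first '#fields=' parameter record."""
    prefix = "#fields="
    for line in lines:
        if line.startswith(prefix):
            return line[len(prefix):].rstrip("\n").split("|")
    raise ValueError("no #fields= parameter record found")

def form_to_record(form, field_names):
    """Order submitted form values as the parameter record dictates."""
    return "|".join(form.get(name, "") for name in field_names)

def record_to_fields(line, field_names):
    """Map one data record back to a name -> value dictionary.

    Records written before a field was appended are shorter than the
    current field list; their missing trailing fields read as empty.
    """
    values = line.rstrip("\n").split("|")
    values += [""] * (len(field_names) - len(values))
    return dict(zip(field_names, values))

names = read_field_names(["#limit=3\n", "#fields=url|title|submit|subj\n"])
stored = form_to_record(
    {"title": "DDJ", "subj": "compute", "url": "http://www.ddj.com"}, names)
old = record_to_fields("http://www.ddj.com|DDJ\n", names)
```

Because lookup is by name, the order of the keys in the submitted form does not matter, and the short record at the end still yields an entry (an empty one) for every current field.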
Providing ways for visitors to update data on your site without intervention is asking for trouble. For example, you have no assurance that the forms users submit are those you supplied -- they may have copied one of yours, modified it, and returned their own version. Nor do you have assurances that users will not input "illegal" text. You must take these problems into account, and make your application extremely robust. I took numerous steps to plug as many holes as possible, although there may be some that I missed.
To make the program flexible, it must be able to access different templates and data files at different locations. These and other parameters must be specified in the HTML add-entry form, either as command-line parameters or as hidden fields. However, you don't want to allow users to read from or write to arbitrary files simply by modifying these parameters.
To prevent this, the only parameter on the add-entry form is the base name of the data file; all other necessary parameters are located within the data file. Thus, only files within the HTML document tree can be accessed; for example, there is no way to specify the password file. All associated files use the same base filename with specific extensions designating the file's purpose; these extensions are hardcoded into dynadd, so only files with these extensions can be accessed.
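One way to picture this restriction is the Python sketch below. It is an assumption-laden illustration, not the program's code: the document root path, the function name resolve_files, the exact set of permitted characters, and the .htm extension for the output file are all invented or inferred for the example.

```python
import posixpath
import re

# Extensions are fixed in the program, never taken from the form.
EXTENSIONS = {"template": ".htx", "data": ".dat", "output": ".htm"}
DOCUMENT_ROOT = "/var/www/html"     # hypothetical document tree

# Plain path segments only: no dots, no leading slash, no empty segments.
_SAFE_BASE = re.compile(r"[A-Za-z0-9_-]+(/[A-Za-z0-9_-]+)*")

def resolve_files(base_name, root=DOCUMENT_ROOT):
    """Map the base name from the form to the files the program may touch."""
    if not _SAFE_BASE.fullmatch(base_name):
        raise ValueError("illegal base name: %r" % base_name)
    return {kind: posixpath.join(root, base_name + ext)
            for kind, ext in EXTENSIONS.items()}

files = resolve_files("test/dyndemo")
```

A base name such as ../../etc/passwd is rejected outright, and even an accepted name can only ever resolve to files ending in one of the hardcoded extensions.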
To prevent "bad" HTML, I simply disallowed all markup tags from user input, by removing all "<" and ">" characters and everything in between. In the future, I will enhance the program to allow harmless markup, like bolding and italics, and ensure that all tags are properly terminated.
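As a rough Python illustration of this filter (the function name strip_markup is made up for the example), a first pass removes complete tags and a second pass removes any unpaired angle brackets that remain:

```python
import re

def strip_markup(text):
    """Remove every <...> run, then any stray angle bracket left over."""
    without_tags = re.sub(r"<[^<>]*>", "", text)
    return without_tags.replace("<", "").replace(">", "")

clean = strip_markup("Visit <b>my</b> site <script>alert(1)</script>")
```

Note that only the tags themselves are removed; text that sat between an opening and closing tag, such as the script body above, remains as inert plain text.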
The data file assigns some characters special meaning: A newline is an end-of-record mark, vertical bars act as field separators, and pound signs specify parameter records. To prevent users from using these characters and causing parsing errors, I replace these with nonreserved characters (pound signs with asterisks, vertical bars with slashes, and newlines with spaces).
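The three substitutions map naturally onto a translation table. A small Python sketch, with a hypothetical function name:

```python
# Reserved data-file characters and their harmless stand-ins.
_RESERVED = str.maketrans({"#": "*", "|": "/", "\n": " "})

def neutralize_reserved(value):
    """Replace characters that have special meaning in the data file."""
    return value.translate(_RESERVED)

safe = neutralize_reserved("C# | F#\nrocks")
```

After this step a submitted value can no longer start a parameter record, split a field, or end a record early.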
It would not take any great technical sophistication for malicious users to fill the form with obscenities or some other hostile message, then click the submit button a few hundred times. Even if readers were content to simply ignore the offensive material, they would have to sort through pages and pages of junk to find the real data. More sophisticated users could also try to execute malicious code on a server by overflowing the buffers in my application with carefully crafted strings.
Consequently, I impose certain limits. First, I limit and enforce the maximum length of a single entry, which prevents this buffer overflow problem. Second, I set a limit on the number of entries that can be added to the file from any single remote computer. CGI makes the name of the client's machine available through the REMOTE_HOST environment variable. Note that the limit must be set high enough so that a client machine with legitimate multiple inputs is not likely to go over the limit.
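The two limits might be checked along the following lines. This Python sketch rests on several assumptions: the names client_host and may_add, the 500-character cap, the fallback to REMOTE_ADDR when the server supplies no host name, and the idea that each stored record remembers the host that submitted it.

```python
MAX_ENTRY_LENGTH = 500      # assumed cap on one submitted entry

def client_host(environ):
    """Identify the client from the CGI environment.

    REMOTE_HOST is the variable named in the article; falling back to
    REMOTE_ADDR is an assumption for servers that skip DNS lookups.
    """
    return environ.get("REMOTE_HOST") or environ.get("REMOTE_ADDR", "")

def may_add(entry, remote_host, existing_hosts, limit,
            max_length=MAX_ENTRY_LENGTH):
    """Return (ok, reason) for a proposed new entry.

    existing_hosts lists the submitting host of every record on file.
    """
    if len(entry) > max_length:
        return False, "entry too long"
    if existing_hosts.count(remote_host) >= limit:
        return False, "too many entries from this host"
    return True, ""

hosts = ["abc.org", "abc.org", "abc.org", "cmh.infinet.com"]
verdict = may_add("http://example.org|Example", "abc.org", hosts, limit=3)
```

In a CGI program the environment argument would normally be os.environ; a plain dictionary works the same way for testing.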
Listing One
<html><title>Dynadd Demo -- A Link Page</title><h1>Dynadd Demo -- A Link Page</h1>
</p>
<hr>
Return to <a href="index.html">home page</a>
</p>
<hr>
Add a link to this page using <a href="dyndemoa.htm">this handy form!</a><p>
</p>
<i>Probably some introductory text here, maybe disclaimers about not taking responsibility for the content of links people added themselves, etc. The headings are, of course, intended only for illustration.</i><p>
</p>
<h2>Computers</h2>
<ul>
</p>
<detail><if subj = "compute" and xdate ge xtoday>
<li><a href="&url">&title</a>
<if submit><br>
submitted by &submit
</if></if></detail>
</ul>
</p>
<h2>Government</h2>
<ul>
<detail><if subj = "gov" and xdate ge xtoday>
<li><a href="&url">&title</a>
<if submit><br>
submitted by &submit
</if></if></detail>
</ul>
</p>
<h2>Miscellaneous</h2>
<ul>
<detail><if subj = "misc" and xdate ge xtoday>
<li><a href="&url">&title</a>
<if submit><br>
submitted by &submit
</if></if></detail>
</ul>
</p>
<hr>
Copyright 1997 by me<p>
</html>
Listing Two
#limit=3
#fields=url|title|submit|subj|xmonth|xday|xyear
cmh.infinet.com|http://hoohoo.ncsa.uiuc.edu|NCSA Web Server Documentation||compute|
abc.org|http://www.ddj.com|Dr Dobb's Journal||compute
abc.org|http://www.wpafb.af.mil|Wright-Patterson Air Force Base|Mary Jones|gov
abc.org|http://www.dole96.com|Dole for President||gov|Nov|4|1996
day-p005.infinet.com|http://www.cs.cmu.edu/books.html| On-Line Books||misc|Jun|30|1997
Listing Three
<html><title>Dynadd demo: Add to link page</title><h1>Dynadd demo: Add to link page</h1>
<hr>
Return to <a href="dyndemo.htm">link page</a> | <a href="index.html">home</a>
<hr>
Use this form to add to our link page!<p>
</p>
<form action="/cgi-bin/dynadd/test/dyndemo" method=post>
<table>
<tr><td align=right>URL<br>
<td><input name=url size=40>
<tr><td align=right>Page title<br>
<td><input name=title size=40>
<tr><td align=right>Submitter's name
<td><input name=submit size=40>
<tr><td align=right valign=top>Heading you belong under<br>
<td><input type=radio name=subj value=compute>Computers<br>
<input type=radio name=subj value=gov>Government<br>
<input type=radio name=subj value=misc>Miscellaneous<br>
<tr><td align=right>If this URL will no longer be valid after a certain date, enter expiration date here
<td><select name=xmonth>
<option value="">
<option value="Jan">Jan <option value="Feb">Feb <option value="Mar">Mar <option value="Apr">Apr
<option value="May">May <option value="Jun">Jun <option value="Jul">Jul <option value="Aug">Aug
<option value="Sep">Sep <option value="Oct">Oct <option value="Nov">Nov <option value="Dec">Dec
</select>
<select name=xday>
<option value="">
<option value="1">1 <option value="2">2 <option value="3">3 <option value="4">4
<option value="5">5 <option value="6">6 <option value="7">7 <option value="8">8
<option value="9">9 <option value="10">10 <option value="11">11 <option value="12">12
<option value="13">13 <option value="14">14 <option value="15">15 <option value="16">16
<option value="17">17 <option value="18">18 <option value="19">19 <option value="20">20
<option value="21">21 <option value="22">22 <option value="23">23 <option value="24">24
<option value="25">25 <option value="26">26 <option value="27">27 <option value="28">28
<option value="29">29 <option value="30">30 <option value="31">31
</select>
<select name=xyear>
<option value="">
<option value=1997>1997
<option value=1998>1998
</select><br>
</table>
</p>
<p align=center>
<input type=submit value=" Add ">
<input type=reset value="Reset">
<p>
</form>
<hr>
Copyright 1997 by me<p>
</html>
Copyright © 1997, Dr. Dobb's Journal