Text-Embedded Programming and Preprocessing with Starfish


Vlado is an assistant professor of Computer Science at Dalhousie University, where his research focuses on text mining and natural language processing. His home page is http://www.cs.dal.ca/~vlado, and he can be contacted at vlado@cs.dal.ca.


Starfish is an open source, Perl-based unifying framework for macro preprocessing and text-embedded programming. It illustrates an elegant and simple methodology based on regular expression rewriting, which is relatively painless to implement in Perl. We also discuss some practical aspects of this implementation.

Text-Embedded Programming

You are probably already aware of the notion of text-embedded programming: snippets of source code are embedded in a document, and during processing, the snippets are replaced with the result of their evaluation. The technique is mainly used for generating HTML documents, and there are several examples of programming languages or frameworks using it, such as PHP, ASP, and JSP.

The code snippets are distinguished from the surrounding text with starting and ending string delimiters, which act as escape sequences that toggle on and off the code processing. Typical string delimiters are "<?" and "?>" or "<?php" and "?>" in PHP, "<%" and "%>" in ASP, and "<?" and "!>" in ePerl. During processing, the text other than code snippets is left intact, while the code snippets are evaluated and the evaluation results are used to replace the snippets. For example, in PHP, we could prepare an HTML document such as:

<html><head><title>PHP Test</title></head>
<body>
<?php echo '<p>Hello World</p>'; ?>
</body></html>

and after processing it with the PHP interpreter, the output would come out as:

<html><head><title>PHP Test</title></head>
<body>
<p>Hello World</p>
</body></html>

Embedding the code in this way is sometimes called "escaping" because a starting delimiter such as "<?" serves as an escape sequence, triggering special processing of the snippet. Another kind of escaping, referred to as "advanced" escaping in PHP, is illustrated with the following example:

Good <?php if ($hour < 12) { ?> Morning
<?php } else { ?> Evening <?php } ?>

I will refer to this as "inverted escaping." Inverted escaping can be interpreted in the following way: The complete text is a piece of code in which the plain text is embedded between "?>" and "<?php" delimiters, and each such piece of plain text is translated into an echo "string"; statement. That is, any fragment of the form:

?> external text <?

should be interpreted as the statement

echo " external text ";

An implicit delimiter "?>" is assumed at the beginning of the text and an implicit "<?php" is assumed at the end of text. Although it is relatively easy to implement, I don't use inverted escaping in Starfish because its benefits are not so clear. Inverted escaping fails to embody the principle that each snippet should be a well-defined block of code. Also, Perl offers a plethora of string delimiting options such as q/.../ and <<'EOT', which can be used instead of inverted escaping to include larger chunks of text.
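
To make the interpretation above concrete, here is a minimal sketch of an inverted-escaping translator. It is not part of Starfish (which, as noted, does not use inverted escaping), and it uses Perl as the host language instead of PHP. The names invert, run_inverted, and echo, and the variables $O and $hour, are assumptions made for this illustration only.

```perl
use strict;
use warnings;

our ($O, $hour);                 # output buffer and a sample template variable
sub echo { $O .= join '', @_ }   # illustrative echo: append to the buffer

# Hypothetical translator: turn a template with <?php ... ?> snippets into
# one piece of Perl code in which every stretch of plain text becomes an
# echo('...') statement.
sub invert {
    my ($template) = @_;
    my @parts = split /<\?php(.*?)\?>/s, $template;   # text, code, text, ...
    my $code = '';
    for my $i (0 .. $#parts) {
        if ($i % 2) {                   # odd positions hold code
            $code .= $parts[$i];
        }
        elsif (length $parts[$i]) {     # even positions hold plain text
            (my $quoted = $parts[$i]) =~ s/([\\'])/\\$1/g;
            $code .= " echo('$quoted'); ";
        }
    }
    return $code;
}

# Evaluate the translated code and return what it echoed.
sub run_inverted {
    my ($template) = @_;
    local $O = '';
    eval invert($template);
    die "template failed: $@" if $@;
    return $O;
}

my $template =
    'Good <?php if ($hour < 12) { ?>Morning<?php } else { ?>Evening<?php } ?>';
$hour = 9;
my $morning = run_inverted($template);   # "Good Morning"
$hour = 20;
my $evening = run_inverted($template);   # "Good Evening"
print "$morning\n$evening\n";
```

Note how the snippets `if (...) {` and `} else {` are not complete blocks on their own; they only make sense once the whole text has been turned inside out, which is the objection raised above.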

Related Work

Let us now consider Perl-based embedded programming. Having the text-embedded capability should not be considered a main characteristic of a programming language, but it should be an orthogonal framework that allows several programming languages as options.

Knowing Perl, its string-processing capabilities, and its ability to execute source code at runtime (the eval function), it is clear that this should be easy to implement. In 1998, at the time when I started thinking about needing a system like this, I found one that partially implemented what I needed: ePerl.
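
As a rough illustration of how little is needed, the following sketch implements a bare-bones "replace" processor with one regular-expression substitution and eval. It is not Starfish's code: process_replace is a hypothetical name, and the output variable $O and the echo helper are simply modeled on the names Starfish uses.

```perl
use strict;
use warnings;

our $O;                          # output variable, named after Starfish's $O
sub echo { $O .= join '', @_ }   # illustrative echo that appends to $O

# Hypothetical helper: evaluate every <? ... !> snippet in $text and replace
# it with whatever the snippet appended to $O.
sub process_replace {
    my ($text) = @_;
    $text =~ s{<\?(.*?)!>}{
        my $code = $1;
        local $O = '';
        eval $code;              # run the snippet as Perl code
        die "snippet failed: $@" if $@;
        $O;
    }gse;
    return $text;
}

my $in  = "<body>\n<? echo '<p>Hello World</p>'; !>\n</body>\n";
my $out = process_replace($in);
print $out;
```

The /s modifier lets a snippet span several lines, and the non-greedy (.*?) stops at the first closing delimiter, so several snippets in one file are handled independently.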

The language ePerl was developed by Ralf S. Engelschall in the period from 1996 to 1998. It is an embedded Perl language in the sense that we described. But ultimately, there were two reasons why I decided it did not fit my needs: (1) it seemed to be too heavyweight, and (2) it did not support the "update" mode. I'll explain the "update" mode in the next section, but what do I mean by "too heavyweight"? The language ePerl is a package of 195 KB, created by modifying the Perl source code and requiring compilation during installation. If a Perl script can provide the same functionality and require only Perl to be installed, it would be a more convenient solution. For better reusability, including a module would be an even better solution.

Others noted the heavyweight nature of ePerl, as well. For example, David Ljung Madison developed an "ePerl hack," which is a Perl script of some 1400 lines that has functionality similar to ePerl.

Text::Template, by Mark Jason Dominus, is another Perl module with similar functionality. An interesting and probably independent similarity is that Starfish uses $O as the output variable, while $OUT is used in Text::Template. The default embedded code delimiters in Text::Template are "{" and "}", with an additional condition that braces have to be properly nested. (In other words, {{{"abc"}}} is a valid snippet with delimiters.) The module allows the user to change the default delimiters to other alternative delimiters.

The well-known Perl module HTML::Mason by Jonathan Swartz, Dave Rolsky, and Ken Williams can also be seen as an embedded Perl system, but it is a larger system with the design objective being a high-performance, dynamic web-site authoring system.

Starfish is a "lighter weight" system than ePerl or Mason, but it is, in a certain sense, more flexible than Text::Template and the ePerl hack, so I believe it deserves survival among the pack. Since I followed some ePerl design parameters in the beginning, I called it SLePerl, as an abbreviation for "Something Like ePerl." I changed it to Starfish a bit later.

Update Mode

The two main novelties introduced in Starfish are the update mode and the flexibility provided by the hook-evaluation mechanism. The code-embedded systems that we have described in the previous section process text by replacing the snippets with the evaluation results—something that we will refer to as the "replace" mode of operation. This mode is used either to channel the results to standard output or to generate another file. However, there are many situations where instead of having an input and output file, it is more convenient to update a file by running a processor on it. In addition to this, we want to have the flexibility of specifying new escape sequences and have different "evaluators" for them.

Here are two examples to better explain the need for the update mode:

The first example is about writing Makefiles, the recipe files describing what commands need to be run to bring a set of files up to date. For example, C source files need to be selectively compiled and linked, LaTeX files need to be processed, figures produced, and so on. Make, the program used to interpret Makefiles, has a macro facility to help us in maintaining all dependencies, but it is not wise to rely too much on it. Long ago, I used to marvel at all the features of dmake, a particular version of make, only to realize later that dmake is only commonly found on DEC systems, and all my wonderfully crafted dmakefiles did not work in other environments. If you really want to make sure that your makefile is portable, you had better give up on fancy stuff, roll up your sleeves, and do some typing. But this will not solve the problem of maintainability. Later I learned about Imake, but after some pains, I concluded that using a C preprocessor on top of a Makefile may be as bad an idea as could ever be. Perl provides MakeMaker, which might be a good idea, but it would be nice to have the same system handling HTML, LaTeX, procmail, plain text files, PostScript, Java sources, C, Python, and even Perl itself.

For example, to specify that a package includes all files with the extension .java in the current directory, and to have a java-to-class rule for each of them, we can use the following Perl code:

#<? @javafiles = map { s/\.java$//; $_ } <*.java>;
#   echo 'all: '.join('.class ', @javafiles).".class\n";
#   echo join('', map { "$_.class: $_.java; javac $_.java\n" } @javafiles);
#!>

We need to comment the code out so that make does not get confused. After processing the makefile with Starfish, the output is appended after the code:

#+
all: t1.class t2.class t3.class
t1.class: t1.java; javac t1.java
t2.class: t2.java; javac t2.java
t3.class: t3.java; javac t3.java

#-

The output is delimited with strings "#+" and "#-" so that after several runs of Starfish, the output code does not get replicated but replaced with new code.
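
The following is a minimal sketch of this update cycle for "#"-commented files, under simplifying assumptions: it is not Starfish's implementation, process_update is a hypothetical name, and only the single delimiter pair "#<?" and "!>" is recognized. It first drops the output of any previous run and then regenerates it, so running it twice gives the same file as running it once.

```perl
use strict;
use warnings;

our $O;                          # output variable, named after Starfish's $O
sub echo { $O .= join '', @_ }   # illustrative echo that appends to $O

# Hypothetical update-mode pass for '#'-commented files.
sub process_update {
    my ($text) = @_;
    # 1. Drop output left by a previous run (everything from "#+" to "#-").
    $text =~ s{\n\#\+\n.*?\n\#-(?=\n|\z)}{}gs;
    # 2. Evaluate each snippet, keep it, and append fresh delimited output.
    $text =~ s{(\#<\?(.*?)!>)}{
        my ($snippet, $code) = ($1, $2);
        $code =~ s/^\#//mg;      # strip the comment armor
        local $O = '';
        eval $code;
        die "snippet failed: $@" if $@;
        "$snippet\n#+\n$O\n#-";
    }gse;
    return $text;
}

my $makefile = <<'EOT';
#<? echo "all: t1.class\n";
#!>
clean: ; rm -f *.class
EOT

my $once  = process_update($makefile);
my $twice = process_update($once);
print $once;
print $once eq $twice ? "stable\n" : "changed\n";
```

Keeping the snippet in the file is what distinguishes this from the replace mode: the generated lines can always be thrown away and rebuilt from the code that sits just above them.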

Here's another example of the need for update mode, involving C-macro-style inclusion of Java code based on a global variable $Debug:

public static int main(String[] args) {
		 //<? echo "       ".(defined $Debug ?
		 //qq[System.out.println("Debug version");] :
		 //qq[System.out.println("Release version");]);
		 //!>//+
		 System.out.println("Release version");//-
		 return 0;
	 }
}

or an HTML file with an automatic "Last update:" label:

<!--<? echo "Last update: ".file_modification_date() !>-->
<!-- + -->
Last update: May 2, 2005
<!-- - -->

The file_modification_date function is smart in the sense that if we keep starfish-ing a file every day, the date of the last modification will remain the same (unless something else causes Starfish to change the updates).

We see that besides the issue of having several escape sequences, or hooks, we need to handle different styles of source code for different commenting.

Implementation

Starfish uses a conceptually simple approach of "hooks" (or triggers) and evaluators. For example, the delimiters "<?" and "!>" represent a hook, which is associated with an evaluator that will evaluate the code in between and produce the result that will replace the code. In the update mode, the code will be replaced with something like:

<? code !>
#+
..out
#-

The result is found in the content of a special variable $O. So, one can produce the evaluation result by putting explicit content into this variable, or use an implicit and more elegant way of using the echo command. (I thought about using the command print, but it is not advisable to change the behavior of this built-in Perl command.)

The previous code between delimiters "#+" and "#-" is removed by having another hook ("#+", "#-") and an evaluator that simply removes the embedding. For this reason, we usually need at least two hooks and evaluators. More precisely, a hook-evaluator pair in Starfish is a hook element consisting of:

  • The escape sequences (begin and end).
  • The evaluator function.
  • An optional code preparation function to remove comments before evaluating a snippet.

These elements are set in a different way for files of a different style (or type). Currently, other options include HTML, TeX and LaTeX, Java, makefile, ps, and Perl. A user can add arbitrary hooks. For example, this text is written using Starfish on plain text. Since the standard escape sequences '<?' and '!>' are frequently used, the following snippet is applied at the beginning of the file:

<? echo "DO NOT EDIT!  GENERATED!\n";
$Star->rmHook('<?','!'.'>');
$Star->rmHook('<?starfish','?'.'>');
$Star->addHook('<?'.'new', '!'.'>', 'default');
$Star->addHook('#'.'ignore ', "\n", 'ignore');
!>

Since the replace mode is used, the first command produces a warning not to edit the output file. The next two commands remove two standard hooks, and a new hook is introduced with the default evaluator. In order to comment out some lines, the hook with '#'.'ignore' is introduced (some hooks are not given literally on purpose).

The variable $Star is a special variable used to refer to the Starfish object processing the current file. An empty string can be used as a begin or end delimiter, in which case it will match the current or the final position in a file. For example,

$Star->addHook('','','ignore');

will cause the rest of the file to be removed in the replace mode, while in the update mode it will be copied without searching for hooks.

While scanning a file, Starfish finds matching pieces according to the list of hooks, and it chooses the left-most, shortest one. If two hooks have equal begin delimiters and both matching end delimiters can be found, the one defined later is chosen. We saw that hooks can be specified using two special keywords, default and ignore. The evaluator default is the default Starfish evaluator that takes code, removes "armor" comments, evaluates it, and either replaces it with the contents of $O, or appends the contents of $O to it. The ignore evaluator will remove the snippet in the replace mode, or leave it as it is in the update mode. Any other string in addHook is interpreted as a code snippet that can make use of the following names:

  • $self, the current Starfish object,
  • $p, the recognized begin delimiter,
  • $s, the recognized end delimiter, and
  • $_, the snippet.

The snippet is replaced with $p.$_.$s in the update mode or with $_ in the replace mode.
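
The scanning rule can be sketched as follows. This is an illustration under assumptions, not Starfish's code: the hook table @hooks, its field names, and the scan function are hypothetical, hooks are plain strings instead of regular expressions, and the tie-breaking is one reading of the rule above (left-most first, then shortest, with exact ties going to the hook defined later).

```perl
use strict;
use warnings;

# Hypothetical hook table: begin delimiter, end delimiter, and an evaluator
# that receives ($p, $snippet, $s) and returns the replacement text.
my @hooks = (
    { begin => '<?',       end => '!>', run => sub { uc $_[1] } },
    { begin => '#ignore ', end => "\n", run => sub { '' } },
);

sub scan {
    my ($text) = @_;
    my ($out, $pos) = ('', 0);
    while (1) {
        my $best;
        for my $hook (@hooks) {
            my $start = index($text, $hook->{begin}, $pos);
            next if $start < 0;
            my $end = index($text, $hook->{end},
                            $start + length $hook->{begin});
            next if $end < 0;
            my $stop = $end + length $hook->{end};
            # Prefer left-most, then shortest, then the hook defined later.
            $best = { hook => $hook, start => $start, stop => $stop }
                if !$best
                || $start < $best->{start}
                || ($start == $best->{start} && $stop <= $best->{stop});
        }
        last unless $best;
        my $hook = $best->{hook};
        # Copy the untouched text before the match.
        $out .= substr($text, $pos, $best->{start} - $pos);
        # Hand the text between the delimiters to the evaluator.
        my $from    = $best->{start} + length $hook->{begin};
        my $snippet = substr($text, $from,
                             $best->{stop} - length($hook->{end}) - $from);
        $out .= $hook->{run}->($hook->{begin}, $snippet, $hook->{end});
        $pos = $best->{stop};
    }
    return $out . substr($text, $pos);
}

my $result = scan("x<?abc!>y\n#ignore note\nz\n");
print $result;   # prints "xABCy" and "z" on two lines
```

Here the first hook stands in for an evaluator in the replace mode (it upper-cases the snippet instead of evaluating it), and the second behaves like the ignore evaluator in the replace mode.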

Regular expressions can also be used as hooks, as is currently done in the Python style, and I plan to further extend their usage in Starfish.

Starfish can be used as a module or as a program (i.e., script) called starfish. The starfish program can be used within CGIs to preprocess files in the update mode or replace mode (with the meta source typically given extension .sfish). I found it very useful within the emacs editor, where I save a file, run starfish in the update mode on it, and reopen it again at the same spot. This can be automated using emacs macros and tied to one keystroke. Thanks to Jesse Rusak, I now know that the Mac OS X operating system offers system services, which can conveniently be used to process a region of text and replace it with the output. It seems that Starfish would fit nicely with this framework.

Macros, Folding, and Unfolding

The ability to specify arbitrary hooks and evaluators can be used to define simple nonparametric macros, or even parametric macros. For example, an evaluator of a hook with the begin string "macro(" and the end string ")" could produce an expansion dependent on the parameters between parentheses.
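
A parametric macro of this kind might look like the sketch below. It does not use the Starfish API; the macro table %macros, the macro names greet and twice, and the function expand_macros are all made up for the illustration, and arguments are assumed to be separated by commas and to contain no parentheses.

```perl
use strict;
use warnings;

# Hypothetical macro table: each macro is a function of its parameters.
my %macros = (
    greet => sub { "Hello, $_[0]!" },
    twice => sub { "$_[0] $_[0]" },
);

# Treat "macro(" and ")" as a hook whose evaluator looks up the first
# parameter as the macro name and passes the remaining ones to it.
sub expand_macros {
    my ($text) = @_;
    $text =~ s{macro\((.*?)\)}{
        my $inside = $1;
        my ($name, @args) = split /\s*,\s*/, $inside;
        defined $name && exists $macros{$name}
            ? $macros{$name}->(@args)
            : "macro($inside)";      # unknown macros are left untouched
    }ge;
    return $text;
}

my $expanded = expand_macros("macro(greet, World) and macro(unknown, x)");
print "$expanded\n";   # Hello, World! and macro(unknown, x)
```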

This approach is experimentally used in Starfish for source-code folding and unfolding. Folding and unfolding are normally editor features, and are a way of hiding regions of text during editing and viewing them only as short labels, usually single lines. Starfish is used to implement this feature in Java mode in the following way.

Macros, in this sense equivalent to folding/unfolding mode of operation, are activated with:

$Star->defineMacros();

The field $Star->{HideMacros} is used to specify the folding or unfolding mode of operation. Unfolding, or macro expanding, is achieved by running "starfish *.sfish" for example, while folding, that is, macro de-expansion, is achieved by running:

starfish -e='$Star->{HideMacros}=1' *.sfish

In Java mode, a fold, or macro, may be assigned to a block of code in the following way:

//m!define macro name
...code...
//m!end

A new line is mandatory after //m!end. After running starfish in macro mode, this definition will disappear and it will be appended as a macro "auxdefine" at the end of file in the following form:

//auxdefine macro name
...code...
//endauxdefine

A user is not supposed to edit this part of the file.

If we want to define a macro and expand it in the same place (instead of disappearing), it is done in the following way:

//m!defe macro name
...
//m!end

A macro is used in the following way:

//m!expand macro name

This line will be unchanged in the HideMacros mode of operation, and in the expanded mode it will be expanded into:

//m!expanded macro name
...
//m!end

If we want to expand a macro even in hide mode (i.e., HideMacros mode), it is achieved with the line:

//m!fexpand macro name

which is expanded into:

//m!fexpanded macro name
...
//m!end

It is sometimes useful to redefine and override previous macro definitions. Redefining a macro with a plain define is treated as an error, since it may have been done by mistake, so another hook is used:

//m!newdefe macro name
...
//m!end

For folding and unfolding to work, it is necessary that starfish process text in more than one pass. In its default mode of operation, starfish processes text and produces the output. In the update mode, the output file is not written if the new contents are equal to the previous contents. This is why the last-update example shown before works. Using the field $self->{Loops} in the Starfish object, we can control how many times processing will be repeated. The variable $self->{CurrentLoop} contains the information about the current loop.
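
The "write only if changed" behavior is easy to sketch. The helper below is an assumption for illustration (write_if_changed is a hypothetical name, not a Starfish function); it shows why a file whose generated content is unchanged keeps its old modification time.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: write $new to $path only when it differs from the
# current contents. Returns 1 if the file was written, 0 otherwise.
sub write_if_changed {
    my ($path, $new) = @_;
    if (-e $path) {
        open my $in, '<', $path or die "cannot read $path: $!";
        local $/;                       # slurp the whole file
        my $old = <$in>;
        close $in;
        return 0 if defined $old && $old eq $new;
    }
    open my $out, '>', $path or die "cannot write $path: $!";
    print {$out} $new;
    close $out or die "cannot close $path: $!";
    return 1;
}

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/page.html";
my @written = map { write_if_changed($file, $_) } ("v1\n", "v1\n", "v2\n");
print "@written\n";   # 1 0 1
```

The second call finds identical contents and leaves the file alone, so its modification date is not touched.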

Multipass processing of text has other potential advantages. First, in this way starfish can be used as a spreadsheet that calculates various expressions and updates cross-dependencies using multiple passes. Another advantage is based on using hook-evaluator pairs in the regular-expression replacement style. Computation by regular-expression replacement is based on a sequence of regex replacements (or substitutions), which are applied on text until the text does not change. It can be shown that this is a full computation model equivalent to the Turing machine. As an example, Christophe Blaess has constructively demonstrated that the sed utility, which relies basically on two variables and regular expression substitutions, is equivalent to the Turing machine.
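
As a small illustration of computation by repeated replacement, the sketch below applies a list of substitution rules until the text stops changing. The function rewrite_to_fixpoint and the sample rule are assumptions made for this example; with the single rule "ba" becomes "ab", the process sorts a string of a's and b's. Termination depends on the rules, so an arbitrary rule set may loop forever.

```perl
use strict;
use warnings;

# Hypothetical helper: apply each [pattern, replacement] rule globally,
# and repeat the whole sequence until a full pass changes nothing.
sub rewrite_to_fixpoint {
    my ($text, @rules) = @_;
    my $changed = 1;
    while ($changed) {
        $changed = 0;
        for my $rule (@rules) {
            my ($pattern, $replacement) = @$rule;
            $changed = 1 if $text =~ s/$pattern/$replacement/g;
        }
    }
    return $text;
}

# One rule, "ba" -> "ab", moves every a in front of every b.
my $sorted = rewrite_to_fixpoint('babba', [ qr/ba/, 'ab' ]);
print "$sorted\n";   # aabbb
```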

Directory-Wise Configuration with starfish.conf

I use Starfish frequently in configurations that are distributed over a directory tree. Typically, it is a web site with a set of HTML files. The files usually use a common configuration that could be stored in the ancestor directories, and may be modified in the local directory. The standard way to use this kind of configuration is to use files named starfish.conf, which are Perl source files, and to use the predefined function read_starfish_conf(). The function checks whether there is a starfish.conf file in the current directory; if one exists, it looks for the file in the parent directory, and if one exists there, in the grandparent directory, and so on. Once it fails to find a starfish.conf file, it executes (i.e., requires) each starfish.conf file that it found, in top-down order, each one in its own directory.
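
The directory walk can be sketched as below. This is not the code of read_starfish_conf(); conf_chain is a hypothetical function that only collects the configuration files in the order in which they would be executed, and the demonstration builds a throwaway directory tree. A real implementation would then change into each directory and require the file found there.

```perl
use strict;
use warnings;
use Cwd qw(abs_path);
use File::Basename qw(dirname);
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: climb from $dir while each directory contains a
# starfish.conf, and return the files found in top-down order.
sub conf_chain {
    my ($dir) = @_;
    $dir = abs_path($dir);
    my @chain;
    while (1) {
        my $conf = File::Spec->catfile($dir, 'starfish.conf');
        last unless -f $conf;           # stop at the first directory without one
        unshift @chain, $conf;          # ancestors go to the front
        my $parent = dirname($dir);
        last if $parent eq $dir;        # reached the filesystem root
        $dir = $parent;
    }
    return @chain;
}

# Demonstration tree: root/site/starfish.conf and root/site/docs/starfish.conf.
my $root = tempdir(CLEANUP => 1);
my $site = File::Spec->catdir($root, 'site');
my $docs = File::Spec->catdir($site, 'docs');
make_path($docs);
for my $d ($site, $docs) {
    my $conf = File::Spec->catfile($d, 'starfish.conf');
    open my $fh, '>', $conf or die "cannot write $conf: $!";
    print {$fh} "1;\n";
    close $fh;
}

my @chain = conf_chain($docs);
print "$_\n" for @chain;
```

Because the walk stops at the first directory without a starfish.conf, a configuration file further up the tree is ignored unless every directory in between has one as well.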

Starfish has some useful command-line options such as:

  • -e='code' for specifying initial Perl code,
  • -replace for replace mode,
  • -o=file for specifying the output file ('-' for the standard output), and
  • -mode for specifying the permission mode for the output file,

and some useful utility functions such as: getfile, putfile, appendfile, read_records, and htmlquote.

Conclusion

Starfish is a Perl package that comes with a module and a script named starfish. It is a general-purpose text-embedded programming and macro-preprocessing program with a novel update mode and a very flexible syntax. It supports different styles of source files, including HTML, Java, TeX, LaTeX, and Makefile, with an open possibility of extending the list. The hook-and-evaluator mechanism can be used to perform text folding and unfolding.

TPJ

