Automatic File Conversions with Perl


Dr. Dobb's Journal February 1998: Automatic File Conversions with Perl

Make lpr easier to use

Tim is a senior technical editor with DDJ. He can be reached at [email protected].


The BSD printer-management system is actually fairly simple. The user program, lpr, accepts a user request and relays the file with control information to the printer daemon, lpd. The daemon program runs a print-filter program to convert the data for the desired output device and feeds the result to the printer. This simple system allows for a great deal of flexibility, but it's not always the easiest system to use.

One problem, for instance, is that the daemon must select the correct print filter. It uses the control information from lpr to look up a program in a "printcap" file. This approach requires the user to specify the required conversion to lpr. However, inexperienced users may not know what kind of conversion is required; most simply memorize that certain options are required for certain extensions. It also restricts the kind of data that can be printed; both lpr and lpd must be recompiled to change the available options.

One way around this problem is to make a "smart" print filter that recognizes the type of data it is being fed and invokes the appropriate conversion. A smart print filter makes lpr easier to use, since specifying the conversion is no longer necessary. It also removes the restriction on the number of file types; the formats that can be printed are restricted only by the print filter's ability to determine the format.

Implementing a smart print filter involves three issues:

Determining the type of the file. Fortunately, most complex files have a "magic number" near the beginning of the file that can be used to identify the type. For example, PostScript files always begin with %! and TeX DVI files always begin with the two bytes 247, 2. This approach is used by the common file command, and you can refer to /etc/magic for a list of other magic numbers.

Doing the actual conversion. Fortunately, there are a variety of conversion programs freely available from many sources.

Effectively dealing with multiple conversions. For example, if you have an HP LaserJet printer, you might want to use DVIPS to convert TeX DVI files into PostScript, then use the free GhostScript program to convert PostScript into a printer file. Since this last step will be common to many different files, it would be nice to isolate it. That way, if you later replace your aging HP LaserJet with a PostScript-capable printer, you need only change one line.

The nature and number of conversions may vary dramatically depending on the precise file format. For example, if you have a PostScript printer, then printing a PostScript file requires no conversion at all, while printing a compressed TeX DVI file will require two conversions: one to uncompress the data, and another to convert the DVI file to PostScript.
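The first of these issues, recognizing a file from its magic number, can be sketched in a few lines of Perl. This is only an illustration: the sub names classify and classify_file and the type labels they return are invented here, and the patterns are limited to the magic numbers mentioned in this article.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Guess a file's type from its leading bytes, much as the file command does.
# The type names returned here are labels invented for this sketch.
sub classify {
    my ($head) = @_;
    return 'empty'      if length($head) == 0;
    return 'postscript' if $head =~ /^\004?%!/;    # "%!", optionally after Ctrl-D
    return 'gzip'       if $head =~ /^\037\213/;   # octal 037, 213
    return 'compress'   if $head =~ /^\037\235/;   # octal 037, 235
    return 'dvi'        if $head =~ /^\367\002/;   # decimal 247, 2
    return 'text';                                 # fallback for anything else
}

# Read only the first block of a file and classify it.
sub classify_file {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot open $path: $!\n";
    binmode($in);
    my $head = '';
    my $count = sysread($in, $head, 16384);
    die "Cannot read $path: $!\n" unless defined $count;
    close($in);
    return classify($head);
}

print classify("%!PS-Adobe-3.0\n"), "\n";      # prints "postscript"
print classify("\367\002rest of DVI"), "\n";   # prints "dvi"
```

Only the start of the file is inspected, so the cost of classification does not grow with the size of the print job.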

My solution to this problem is the Perl program in Listing One. Perl is a good fit because it has sophisticated pattern-matching facilities that can easily identify the file type from the initial part of the file. I could have used the output of the file program instead of checking for magic numbers directly, but that would have meant hard-coding the output strings from file, which seemed no better than hard-coding the patterns that identify the file type directly.

The innovative aspect of my program, which I call PrintConvert, is how it handles multiple conversions. Most conversions feed the data into a two-stage pipeline, consisting of a conversion program and another copy of PrintConvert. This allows PrintConvert to evaluate the output of each conversion stage separately, and makes it very easy to handle compressed input. For example, suppose you have a compressed text file and type lpr textfile.Z. The lpd daemon will invoke PrintConvert as the default print filter, which will then incrementally build the pipeline shown in Figure 1. At each step, a new copy of PrintConvert runs to classify the next step of the conversion.
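To make the incremental build-up concrete, the following sketch walks the same chain for a compressed text file. It is a simplified model rather than how PrintConvert works internally: the %next_stage table and the stages_for sub are invented for illustration, and the table assumes the uncompressed data turns out to be plain text. The real program keeps no such table, because each new copy rediscovers the type from the data it receives.

```perl
use strict;
use warnings;

# Illustrative table: for each type, the converter to run and the type
# that converter is assumed to emit. Types with no entry (PostScript
# here) need no further conversion.
my %next_stage = (
    compress => [ 'uncompress',  'text'       ],
    text     => [ 'lptops -ntr', 'postscript' ],
);

# Return the list of converters that would end up in the pipeline.
sub stages_for {
    my ($type) = @_;
    my @stages;
    while (my $step = $next_stage{$type}) {
        my ($command, $output_type) = @$step;
        push @stages, $command;
        $type = $output_type;
    }
    return @stages;
}

# prints "uncompress | lptops -ntr"
print join(' | ', stages_for('compress')), "\n";
```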

Of course, you'll probably want to add more data types. Adding a new data type will usually require adding a new elsif clause, similar to Example 1. As you can see here, you need a regular expression to match the beginning of the file. In Example 1, the expression /^\037\213/ is a regular expression (/.../) that matches the two octal bytes 037, 213 that appear at the beginning (^) of the file. Note that the $_ variable (the default for a match like this) only holds the first part of the file; if you have a file that can only be recognized by inspecting the end of the file data, you'll need to do some additional work.
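As a hedged illustration of adding a type, the sketch below places a clause for bzip2 data next to the gzip clause from Listing One. The bzip2 clause is an assumption for illustration and is not part of the original program. To keep the sketch testable, the hypothetical command_for sub returns the command string instead of calling PrintTo, and $self stands in for "$0 @ARGV".

```perl
use strict;
use warnings;

# Return the command PrintConvert would open for data starting with $head.
# $self stands in for "$0 @ARGV", the program re-invoking itself.
sub command_for {
    my ($head, $self) = @_;
    if ($head =~ /^\037\213/) {            # GZIP (as in Listing One)
        return "|gunzip | $self";
    } elsif ($head =~ /^BZh[1-9]/) {       # bzip2: "BZh" plus a block-size digit
        return "|bunzip2 | $self";         #   assumed clause, not in the original
    } else {                               # Unrecognized: treat as text
        return "|lptops -ntr | $self";
    }
}

print command_for("BZh91AY&SY", "PrintConvert"), "\n";  # prints "|bunzip2 | PrintConvert"
```

Because the decompressed output is handed to another copy of the program, the new clause needs to know nothing about what the archive contains.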

PrintConvert also takes advantage of Perl's flexible file handling, which makes opening a program (precede the name with "|") or a standard I/O path (">-" is stdout) just as easy as opening a disk file.
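A small sketch of that uniformity, assuming a hypothetical helper named send_to: the same open call handles standard output, a disk file, and a program, depending only on the leading characters of the destination string.

```perl
use strict;
use warnings;

# Write $data to a destination given in Perl's two-argument open syntax:
#   ">-"        standard output
#   ">path"     a disk file
#   "|command"  the standard input of a program
sub send_to {
    my ($destination, $data) = @_;
    open(my $out, $destination) or die "Cannot open $destination: $!\n";
    print {$out} $data;
    close($out) or die "Cannot close $destination: $!\n";
}

send_to(">-", "to standard output\n");
send_to("|tr a-z A-Z", "to a program\n");
```

The two-argument form of open interprets those leading characters itself, so it should only be given strings the program builds, never a name taken directly from user input.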

Since Perl is a complete and sophisticated programming language, you could add more complex processing to your conversions. For example, you might have a condition that recognizes PBM graphics files and either scales them or converts them, depending on whether or not that particular file fits on a page. This rule has a nice side effect: If you convert all other graphics files to PBM, they'll get scaled also.
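One way such a rule might look is sketched below. Everything specific in it is an assumption: the page size in pixels, the sub names pbm_size and pbm_command, and the choice of the Netpbm tools pnmscale and pnmtops as the scaler and converter. The header pattern also ignores the comment lines that PBM headers may contain.

```perl
use strict;
use warnings;

# Assumed printable area in pixels: 8 x 10.5 inches at 300 dpi.
my ($page_width, $page_height) = (2400, 3150);

# Return (width, height) from a PBM header, or an empty list if $head is
# not PBM. Header comments ("# ...") are not handled in this sketch.
sub pbm_size {
    my ($head) = @_;
    return ($1, $2) if $head =~ /^P[14]\s+(\d+)\s+(\d+)\s/;
    return;
}

# Choose a pipeline: scale oversized bitmaps first, convert the rest directly.
# $self stands in for "$0 @ARGV".
sub pbm_command {
    my ($head, $self) = @_;
    my ($width, $height) = pbm_size($head) or return;
    if ($width > $page_width || $height > $page_height) {
        return "|pnmscale -xysize $page_width $page_height | pnmtops | $self";
    }
    return "|pnmtops | $self";
}

print pbm_command("P4\n3000 100\n", "PrintConvert"), "\n";
print pbm_command("P4\n640 480\n", "PrintConvert"), "\n";
```

In this sketch the decision needs only the width and height from the header, which sit within the first block the filter has already read.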

There are many other improvements you can make to this basic program. For example, it could generate burst pages or log printer usage. The final output could use GhostScript or a more sophisticated printer-management system. You could even integrate other printer-management functions, such as displaying the user name on the printer's LCD front-panel display.
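Usage logging, for instance, could start from something like the following sketch. It assumes the BSD convention in which lpd passes the login and host names to the filter as "-n login -h host"; the sub names job_owner and log_job, the log format, and the log file location are all hypothetical.

```perl
use strict;
use warnings;

# Pull the login and host names out of the filter's argument list.
# Assumes "-n login -h host" arrive as separate words; other options
# are skipped.
sub job_owner {
    my @args = @_;
    my %owner = (login => 'unknown', host => 'unknown');
    while (@args) {
        my $arg = shift @args;
        if    ($arg eq '-n' && @args) { $owner{login} = shift @args; }
        elsif ($arg eq '-h' && @args) { $owner{host}  = shift @args; }
    }
    return %owner;
}

# Append one line per job to a usage log (the path is the caller's choice).
sub log_job {
    my ($logfile, $bytes, @args) = @_;
    my %owner = job_owner(@args);
    open(my $log, '>>', $logfile) or die "Cannot append to $logfile: $!\n";
    print {$log} scalar(localtime), " $owner{login}\@$owner{host} $bytes bytes\n";
    close($log);
}
```

Since every copy of PrintConvert receives the same arguments, a call such as log_job would belong in a single branch, for example the PostScript one, so that each job is recorded once.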

DDJ

Listing One

#!/usr/bin/perl
# PrintConvert -- convert any file type into printer output


# Read the initial section of the file
$blocksize=16384;
sysread(STDIN,$_,$blocksize);
$first_segment = $_;
if (length($_) == 0) { exit(0); }


# Now, use those initial bytes to determine the file type and
# appropriate handling.
# Note that "$0 @ARGV" is the current program and options.  Most
# formats are fed through some conversion program and then into
# another copy of this program for further consideration.


if (/^\004?%!/) {                       # PostScript, possibly preceded by ^D
    &PrintTo(">-");                     #   just dump it to STDOUT
    print STDOUT "\004";                #   append Ctrl-D
} elsif (/^\037\213/) {                 # GZIP
    &PrintTo("|gunzip | $0 @ARGV");
} elsif (/^\037\235/) {                 # Unix Compress
    &PrintTo("|uncompress | $0 @ARGV");
} elsif (/^\367\002/) {                 # TeX DVI format
    &PrintTo(">/tmp/PrintConvert.tmp.$$");     # Save into temporary file
    exec "dvips -q -f </tmp/PrintConvert.tmp.$$ | $0 @ARGV";
} elsif (/^\115\115/) {                 # TIFF file, big-endian ("MM")
    &PrintTo("|fax2ps | $0 @ARGV");
} elsif (/^\111\111/) {                 # TIFF file, little-endian ("II")
    &PrintTo("|fax2ps | $0 @ARGV");
} elsif (/^\314\000\206/) {             # FreeBSD executable (don't print)
    print STDERR "Executable file not printed!\n";
} else {                                # Unrecognized text file
    &PrintTo("|lptops -ntr | $0 @ARGV");              # Use lptops
}

# `Print' data to the named destination
sub PrintTo {
    open(OUT,"@_") || die "PrintConvert: cannot open @_: $!\n"; # Open the file or program
    syswrite(OUT,$first_segment,length($first_segment)); # Write first segment
    while(sysread(STDIN,$_,$blocksize)) {    # Now copy the rest
        syswrite(OUT,$_,length($_));
    }
    close(OUT);
}

Copyright © 1998, Dr. Dobb's Journal

