Dr. Dobb's is part of the Informa Tech Division of Informa PLC


Convincing Pascal to Read Non-Pascal Files


MAY88: STRUCTURED PROGRAMMING

Sometimes a feature of a language is merely a defect put in a favorable light. It all depends on what you're trying to accomplish. Pascal, for example, insists that all files be bound to a data or record type: very noble from the standpoint of preserving the purity of strong typing but often an obstacle when trying to process files formatted in languages other than Pascal.

Specifically, the kinds of files I'm talking about are self-describing tables such as those generated by dBASE, Reflex, and other database programs. Typically, such files begin with several data structures describing the contents, followed by any number of fixed-format data records. It's easy to mix record types with C and assembly language and even BASIC, all of which support free-form files. You have to convince Pascal to do it, though, and the trickery for doing so is the subject of this month's column.

I'll also respond to a reader's complaints about Turbo Pascal 4.0.

Creating a Table File

Rather than covering a specific vendor's table format, I've developed a simple model for this article that is typical of these files in general. What makes it simple is not the file structure itself but instead the number of options. Data records can consist of only two field types: integers and character arrays. The thrust here is the principles, and there's no sense muddying the waters with a number of options that you can figure out for yourself.

The typical table file begins with a fixed-length preamble (256 bytes in this case) containing a header record and field descriptors. The header record is a fixed structure that contains several fields giving basic information about the contents. In this case, the header record is 24 bytes long and contains the fields shown in Table 1.

Signature is an invariant value written at a fixed place to identify the file as belonging to the application. If some other value appears in that position, the file doesn't follow the rules given here and can't be processed. The value of signature for this application is 19364 (4BA4h), and it appears in the first 2 bytes of the file.

The nrecs field tells you how many data records the file contains. Tablename is a packed array of ten characters giving the name of the table; not all vendors have an analogous field in the header record. Reclen gives the length of each data record in bytes. Datastart expresses an offset, with respect to the start of the file, to the first data record. This is a long (32 bits on a PC) integer to correspond with the usage of fseek/ftell in C. Turbo Pascal 4.0 and Microsoft Pascal similarly use a long integer for their SEEK procedures. The last two fields, descrsize and ndescr, have to do with the next part of the file preamble.
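As a rough illustration of that layout, here is a minimal C sketch that decodes the 24-byte header from a raw byte buffer. The byte offsets are derived from the header struct in the listings, not from a published specification, and the low-byte-first order is an assumption that matches the PC. The names Header, get_u16, get_u32, and decode_header are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { HEADER_SIZE = 24, SIGNATURE = 19364 };   /* 19364 = 4BA4h */

/* Hypothetical in-memory form of the header; not the on-disk layout. */
typedef struct {
    uint16_t signature;
    uint16_t nrecs;
    char     tablename[11];   /* 10 bytes from the file plus a terminator */
    int16_t  reclen;
    int32_t  datastart;
    int16_t  descrsize;
    int16_t  ndescr;
} Header;

/* Read a 16-bit value stored low byte first (assumed PC byte order). */
static uint16_t get_u16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a 32-bit value stored low byte first. */
static uint32_t get_u32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the header; returns 0 on success, -1 if the signature is wrong. */
static int decode_header(const unsigned char buf[HEADER_SIZE], Header *h)
{
    assert(buf != NULL && h != NULL);
    h->signature = get_u16(buf + 0);
    if (h->signature != SIGNATURE)
        return -1;
    h->nrecs = get_u16(buf + 2);
    memcpy(h->tablename, buf + 4, 10);
    h->tablename[10] = '\0';
    h->reclen    = (int16_t)get_u16(buf + 14);
    h->datastart = (int32_t)get_u32(buf + 16);
    h->descrsize = (int16_t)get_u16(buf + 20);
    h->ndescr    = (int16_t)get_u16(buf + 22);
    return 0;
}
```

Decoding byte by byte like this sidesteps any question of how a particular compiler sizes or pads its structures, at the cost of spelling out every offset.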

A data record consists of one or more fields, each of which has three attributes: a name, a data type, and a length. These constitute a field descriptor, which has the form shown in Table 2. Each data field has one descriptor; hence there are ndescr field descriptors of descrsize bytes each following the header record. In the programs that accompany this article, for example, there are two fields (ndescr = 2), so there are two descriptors.

Thus the preamble consists of a fixed header record followed by a variable number of descriptor records, each of which has a fixed format. Taken together, they describe the data content of the file. The unused portion of the preamble is filled with uninitialized garbage.

The data itself begins at byte offset header.datastart from the beginning of the file. Each record corresponds to a row in the data table and each field to a column. The descriptor records describe the columns, and the length of any given record is the sum of all flen fields in the descriptors. There are header.nrecs records. The file is thus a self-describing entity, and the program's job is to interpret the descriptions in order to extract the data. Figure 1 shows the format of a simple table.
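The arithmetic in that paragraph can be sketched in a few lines of C. FieldDesc, record_length, and record_offset are hypothetical names introduced for illustration; only the rule itself (record length is the sum of the flen values, and data starts at datastart) comes from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory descriptor holding the three attributes. */
typedef struct {
    char name[21];
    int  ftype;     /* 0 = integer, 1 = character array */
    int  flen;      /* field length in bytes */
} FieldDesc;

/* Record length is the sum of the flen values of all descriptors. */
static long record_length(const FieldDesc *d, int ndescr)
{
    long total = 0;
    int  i;

    for (i = 0; i < ndescr; i++)
        total += d[i].flen;
    return total;
}

/* Byte offset of record k (counting from 0) from the start of the file. */
static long record_offset(long datastart, long reclen, long k)
{
    assert(k >= 0);
    return datastart + k * reclen;
}
```

For the sample table, the two descriptors give a record length of 20 + 2 = 22 bytes, so the fourth record (k = 3) would begin at 256 + 3 * 22 = 322.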

Listing One is a generic C program (MKTABLE.C) that creates a table with the preamble described here. The program then requests data entry and writes out the data records you type in response, saving them in a file called database.xyz. To build an adequate table, enter at least three or four records. Terminate data entry by pressing Return when the program asks for a name. The program then updates the header record to reflect the number of data records entered and closes the file. This is a vastly simplified version of a database management package, but it generates a complete table of the same sort that flows out of dBASE and other table-oriented database products.

Translating Strings

The references to pac (a packed array of characters) in Figure 1 and Table 1 point up a fundamental difference between Pascal and lower-level languages such as C and assembly language. Although not defined in the academic standard, the STRING type is supported by most real-world Pascal compilers. It's really a stretched PAC, the difference being that element 0 contains the string length, with the first valid character at element 1. Thus, at the data level, a string containing the word Pascal looks something like <6>Pascal, where <6> stands for a single byte holding the value 6. If the string were declared as:

VAR strng : STRING [10];

the last 4 bytes would contain garbage. The compiler inserts string-handling routines that pay attention to the length byte.

C handles string data differently, and most assembly-language programmers use the same convention as C. It's so common, in fact, that it has a name: ASCIIZ. In ASCIIZ, the 0th element contains the first character of the string. There is no length indicator; instead, end-of-string is signified by ASCII value 0, or CHR (0) in Pascal notation. This is merely a packed array of characters with a special end sentinel. The C term for it is a null-terminated string.

The asciiz function in Listing Two transforms ASCIIZ strings into Pascal strings. Because it's possible that a string might exceed its maximum length, the function also takes the max parameter, which limits the number of characters in the result; it's either max characters or everything up to the null terminator, whichever comes first.
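To make the two conventions concrete, here is a small C sketch of the same conversion seen from the other side: it takes a null-terminated field of at most max bytes and builds a length-prefixed buffer of the kind Pascal's STRING uses. This is an illustration only; asciiz_to_pstring is a hypothetical name and not part of the listings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Convert a null-terminated field of at most max bytes into a
   Pascal-style string: dst[0] holds the length and dst[1..len] hold
   the characters. dst needs room for max + 1 bytes, and max may not
   exceed 255 because the length has to fit in one byte. */
static void asciiz_to_pstring(const char *src, size_t max, unsigned char *dst)
{
    size_t len = 0;

    assert(max <= 255);
    while (len < max && src[len] != '\0') {   /* stop at max or at the null */
        dst[1 + len] = (unsigned char)src[len];
        len++;
    }
    dst[0] = (unsigned char)len;
}
```

The loop stops at whichever limit it meets first, so anything that happens to sit after the terminator in a fixed-length field never reaches the result.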

This string-translation routine and the liberal use of variant records are two keys to convincing Pascal to read non-Pascal tables. The third is processing the file on a byte-by-byte basis. Here's how it works.

Processing the File

The program declares the table as FILE OF BYTE and opens it. The first step reads the 24-byte header record into the stream variant of the headrec structured variable. This is necessary because the file is of type BYTE; all file reads are done in the same way. Access to the data elements is via the other variants, as in the next step, which checks signature. Execution continues if the signature is correct and ends with a message otherwise.
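For readers who think in C, the nearest equivalent of the variant record is a union. The sketch below overlays the header fields on a 24-byte stream in the same spirit. It assumes the compiler lays out the fixed-width fields with no extra padding, which holds on common platforms but is not guaranteed, so the code checks the size at compile time. HeaderOverlay and signature_ok are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The field view and the byte view share the same storage. */
typedef union {
    struct {
        uint16_t signature;
        uint16_t nrecs;
        char     tablename[10];
        int16_t  reclen;
        int32_t  datastart;
        int16_t  descrsize;
        int16_t  ndescr;
    } f;
    unsigned char stream[24];     /* what actually comes off the file */
} HeaderOverlay;

/* Compile-time check: fails to compile if the overlay is not 24 bytes. */
typedef char header_overlay_size_check[sizeof(HeaderOverlay) == 24 ? 1 : -1];

/* Returns 1 if the bytes in the stream carry the application signature.
   Assumes the file's byte order matches the host's, as it does on a PC. */
static int signature_ok(const HeaderOverlay *h)
{
    assert(h != NULL);
    return h->f.signature == 19364;
}
```

Bytes are read into stream, and the values are then taken from the f member, just as the Pascal program fills header.stream and tests header.signature.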

Procedure showHeaderInfo lists information from the header record. Note that the table name is drawn from the second variant of the headrec record. Why? Because the asciiz function expects an argument of type pac, which is 20 bytes long, whereas the real tablename field is only 10 characters long. This is a trick to prevent the compiler from choking on a mismatched type.

The getDescriptors procedure, called by showHeaderInfo, reads the field descriptors that follow the header record. The program assumes a maximum of ten fields for the table when it declares the field variable, which is an array of fieldrec structures. The header.ndescr variable governs the actual number read from the file. ShowHeaderInfo uses the descriptors to display information about the fields. The showData procedure uses them more extensively.

Table 1: Fields of the header record

Name          Type

signature     word
nrecs         word
tablename     pac [10]
reclen        integer
datastart     longint
descrsize     integer
ndescr        integer


Before calling showData, however, the program first calls the Pascal SEEK procedure. The purpose is to reposition the file pointer to the start of the data records, which is past the unused portion of the preamble. ShowData can now read and process the table's data contents sequentially.

A couple of local variant record types provide the means for fetching integers and ASCIIZ strings. Again, the stream component furnishes type compatibility with the file. Because the host processor and not the compiler establishes the format of the integer type, an integer taken from the file as two consecutive bytes can be plucked directly from the variant without translation. The character field is accessible via a call to asciiz.

Table 2: A data record's field descriptors

Name          Type

fname         pac [20]
ftype         integer
flen          integer


Figure 1: Format of a simple table

Header             $4BA4               (=signature)
                   4                   (=nrecs)
                   Age list            (=tablename)
                   22                  (=reclen)
                   256                 (=datastart)
                   24                  (=descrsize)
                   2                   (=ndescr)
Descriptor#1       NAME                (=fname)
                   1                   (=ftype {pac [20]})
                   20                  (=flen)
Descriptor#2       AGE                 (=fname)
                   0                   (=ftype {integer})
                   2                   (=flen)
Rest of preamble   (garbage filler)

Data records       Ken Barker, 46
(datastart)        Tim Madden, 38
                   John Joyner, 42
                   Jim Hull, 59

A pair of nested loops controls the reading and display of data fields. The outer loop repeats for the number of records in the file, as given by header.nrecs. The inner loop processes individual fields, stepping through the array of descriptors in order to determine what to read next from the file. Because it's loop-driven, showData can process any number of records consisting of any number of integer and ASCIIZ fields in any order without modification. Additional data types would require the appropriate structure definitions and expansion of the CASE statement.
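The inner loop's logic can be sketched in C as follows. The function works on one record already held in memory and lets the descriptors decide how each run of bytes is interpreted. FieldDesc, format_record, and the "NAME=value;" output form are hypothetical choices made for the example, and integers are assumed to be stored low byte first, as on the PC.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { FT_INTEGER = 0, FT_CHAR = 1 };   /* ftype codes used by the table */

typedef struct {
    char name[21];
    int  ftype;
    int  flen;
} FieldDesc;

/* Format one record as "NAME=value;" pairs, driven only by the
   descriptors. Returns the number of bytes consumed from rec, or -1
   if a field type is unknown or the output buffer is too small. */
static int format_record(const unsigned char *rec, const FieldDesc *d,
                         int ndescr, char *out, size_t outsize)
{
    size_t used = 0;
    int    pos = 0, i, n;

    assert(outsize > 0);
    out[0] = '\0';
    for (i = 0; i < ndescr; i++) {
        switch (d[i].ftype) {
        case FT_INTEGER: {
            /* two bytes, low byte first, then sign-extend */
            int v = rec[pos] | (rec[pos + 1] << 8);
            if (v >= 32768)
                v -= 65536;
            n = snprintf(out + used, outsize - used, "%s=%d;", d[i].name, v);
            break;
        }
        case FT_CHAR:
            /* at most flen characters, stopping early at a null */
            n = snprintf(out + used, outsize - used, "%s=%.*s;",
                         d[i].name, d[i].flen, (const char *)(rec + pos));
            break;
        default:
            return -1;          /* a new type needs a new case here */
        }
        if (n < 0 || (size_t)n >= outsize - used)
            return -1;
        used += (size_t)n;
        pos  += d[i].flen;      /* advance to the next field */
    }
    return pos;
}
```

Because nothing in the loop names a particular field, swapping the order of the descriptors or adding more of them changes the output without touching the code.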

This is not a complete table system, of course, because it lacks date, floating-point, and Boolean types. Also, it's not compatible with any existing database package's file format. Given the specifications for a file and the techniques presented here, however, you should be able to write a Pascal program that reads non-Pascal files with header records.

Turbo Pascal 4.0 Flames

The mail the other day brought a letter from Charles Linett, who heads up the Computer Science Staff at the Census Bureau. Charles and his folks use Turbo Pascal for communications programs of 15,000+ lines, and he's not amused by Version 4.0. Here's part of what he has to say:

"I see two rather large defects bordering on the semicalamitous for our type of work.

    1. There are no overlays (as there were in Version 3.0). Borland has solved that problem in two suave ways, however. First, the company told us that if we used overlays, then we needed only to rewrite our programs (thanks, fellas, you're a big help). Second, Borland found the part of the documentation least likely to be read (a file called Q&A) and wrote that it recognized the need and was working on something intelligent; I can only hope it gets it done before those awaiting the feature die of old age.

    2. The manual is awful and (worse yet) has almost no chance of improvement. It requires so many additions and corrections that what is called for is a new manual altogether. If 20 pages need changing in a real manual, you send the customer those 20 pages and let him or her stick them in the book. This cannot be done with the 4.0 manual because it's glued together in one big lump."

No quarrel with the first point, Charles; I'll get back to it in a minute. As for the second, probably a lot of us aren't crazy about a bound book as a manual. And that one in particular is too thick; you can't spread it out on the desk for reference without either breaking the spine or putting barbells on it for paperweights. But the adjective awful is kinda harsh. Versions 2.0 and 3.0 had bound manuals, too, and Borland isn't the only company whose docs come this way.

Nobody's ever told me this, but I suspect the purpose of bound docs is to discourage pirates from photocopying the manual. Maybe if the world was a more honest place, vendors such as Borland wouldn't resort to defensive tactics. Piracy is just another name for theft.

Yeah, I don't like the manual either, but it's a whole bunch better than its predecessor in terms of both quality and content. And manual corrections conveyed via READ.ME files are hardly a Philippe Kahn innovation.

Now for the overlay fiasco. No doubt about it, Borland shot itself in the foot by dropping overlays. Probably it figured it could get away with it because Version 4.0's .EXE files break through the infamous 64K barrier of Version 3.0's .COM files. Somebody should have surveyed the user community before Borland yanked the rug from under it.

But there's an alternative for Charles and anybody else who got abandoned. It's a product called Overlay Manager 4.0 from TurboPower Software (3109 Scotts Valley Dr., Ste. 122, Scotts Valley, CA 95066; 408-438-8608). Costing $45, this is an interactive program that lets you break a compiled .EXE file of any size (up to about 1 Mbyte) into any number of overlays. For truly enormous programs, there's another utility in the package that effects chaining. The slim 30-page manual is excellent and so's the quality of the software; TurboPower produces good stuff. Highly recommended if you need overlays.

[LISTING 1-2]

[MKTABLE.C]

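/* MKTABLE.C: Makes a data table with header record */
/* Assumes a 16-bit int and a 32-bit long, as on PC compilers */

#include <stdio.h>
#include <stdlib.h>                           /* for atoi () */
#include <string.h>
#define  SIG  19364            /* application file signature */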

typedef struct {
  char      fname [20];
  int       ftype, flen;
} DESCR;

struct {                           /* header record for file */
  unsigned  signature;
  int       nrecs;
  char      tablename [10];
  int       reclen;
  long      datastart;
  int       descrsize;
  int       ndescr;
} header;

struct {                             /* data record for file */
  char      name [20];
  int       age;
} data;

int main (void)
{
FILE  *fp;
char  line [80];                    /* keyboard input buffer */
DESCR descr;

  fp = fopen ("database.xyz", "wb");  /* create file, binary */
  if (fp == NULL) {
    perror ("database.xyz");
    return 1;
  }

  header.signature = SIG;               /* initialize header */
  header.nrecs = 0;
  strcpy (header.tablename, "Age list");
  header.reclen = sizeof data;
  header.datastart = 256L;
  header.descrsize = sizeof (descr);
  header.ndescr = 2;
  fwrite (&header, sizeof header, 1, fp);   /* write to file */

  strcpy (descr.fname, "NAME");     /* initialize descriptor */
  descr.ftype = 1;
  descr.flen  = 20;
  fwrite (&descr, sizeof (descr), 1, fp);   /* write to file */

  strcpy (descr.fname, "AGE");                /* ditto above */
  descr.ftype = 0;
  descr.flen  = 2;
  fwrite (&descr, sizeof (descr), 1, fp);

  fseek (fp, 256L, SEEK_SET);       /* skip rest of preamble */

  for (;;) {                                 /* capture data */
    printf ("\nName? ");
    if (fgets (line, sizeof line, stdin) == NULL)
      break;
    line [strcspn (line, "\r\n")] = '\0';   /* strip newline */
    if (line [0] == '\0')            /* continue until blank */
      break;
    strncpy (data.name, line, sizeof data.name - 1);
    data.name [sizeof data.name - 1] = '\0';  /* fit to field */
    printf ("Age?  ");
    if (fgets (line, sizeof line, stdin) == NULL)
      break;
    data.age = atoi (line);
    fwrite (&data, sizeof data, 1, fp);    /* write record */
    header.nrecs += 1;                     /* count record */
  }

  fseek (fp, 0L, SEEK_SET);           /* go to start of file */
  fwrite (&header, sizeof header, 1, fp);   /* update header */
  fclose (fp);                                 /* close file */
  return 0;
}


[NONPAS.PAS]

PROGRAM nonpas;

  { Reads a non-Pascal database table with a header record }
  { and some number of fixed-length data records           }

CONST signature = 19364;                     { application signature }
      divider = '---------------------------------------------------';

TYPE  s20            = STRING [20];
      pac            = PACKED ARRAY [1..20] OF CHAR;

      headrec = RECORD CASE tag : INTEGER OF
      1: (signature  : WORD;               { This is the real layout }
          nrecs      : WORD;                        { # data records }
          placeholdr : PACKED ARRAY [1..10] OF CHAR;    { table name }
          reclen     : INTEGER;                 { data record length }
          datastart  : LONGINT;               { file offset for data }
          descrsize  : INTEGER;              { field descriptor size }
          ndescr     : INTEGER);          { number of fields per rec }
      2: (dummy1,
          dummy2     : WORD;
          tablename  : pac);                  { To fool typechecking }
      3: (stream     : PACKED ARRAY [1..24] OF BYTE);
      END;

      fieldrec = RECORD CASE tag : INTEGER OF
      1: (fname      : pac;
          ftype      : INTEGER;
          flen       : INTEGER);
      2: (stream     : PACKED ARRAY [1..24] OF BYTE);
      END;

VAR   header   : headrec;
      field    : ARRAY [1..10] OF fieldrec;            { descriptors }
      n        : INTEGER;
      table    : FILE OF BYTE;
{ --------------------------- }

FUNCTION asciiz (max : INTEGER; VAR strng : pac) : s20;

    { Returns a Pascal string from a null-terminated string
        that is <= max bytes long }

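VAR   i      : INTEGER;
      done   : BOOLEAN;
      result : STRING [20];

BEGIN
  result := '';
  i := 1;
  done := FALSE;
  WHILE (i <= max) AND NOT done DO
    IF strng [i] = CHR (0) THEN
      done := TRUE                        { stop at the terminator }
    ELSE BEGIN
      result := result + strng [i];
      i := i + 1;
    END;
  asciiz := result;
END;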
{ --------------------------- }

PROCEDURE getDescriptors;

    { Reads field descriptors from header record }

VAR   c, d : INTEGER;

BEGIN
  FOR d := 1 to header.ndescr DO
    FOR c := 1 TO header.descrsize DO
      READ (table, field [d].stream [c]);
END;
{ --------------------------- }

PROCEDURE showHeaderInfo;

    { List information about the file format }

VAR   d : INTEGER;

BEGIN
  WRITELN (divider);
  WRITELN ('Table name is ',
           asciiz (10, header.tablename));
  WRITELN ('Table contains ', header.nrecs, ' records');
  WRITELN ('Data record length in bytes is ',
           header.reclen);
  WRITELN ('Each record contains ', header.ndescr, ' fields:');
  getDescriptors;
  FOR d := 1 TO header.ndescr DO BEGIN
    WRITELN ('  Field name:    ', asciiz (20, field [d].fname));
    WRITE   ('  Data type:     ');
    CASE field [d].ftype OF
      0: WRITELN ('Integer');
      1: WRITELN ('Character');
    END;
    WRITELN ('  Length:        ', field [d].flen);
    WRITELN;
  END;
  WRITELN ('Data records follow:');
  WRITELN;
END;
{ --------------------------- }

PROCEDURE showData;

      { List contents of each data record by fieldname }

TYPE  int = RECORD CASE tag : INTEGER OF
        1: (number : INTEGER);
        2: (stream : PACKED ARRAY [1..2] OF BYTE);
      END;

TYPE  charfield = RECORD CASE tag : INTEGER OF
        1: (bf : PACKED ARRAY [1..20] OF BYTE);
        2: (cf : pac);
      END;

VAR   rec, descr, n : INTEGER;
      intfield      : int;                      { integer data field }
      chfield       : charfield;              { character data field }

BEGIN
  FOR rec := 1 TO header.nrecs DO                  { For each record }
    FOR descr := 1 TO header.ndescr DO BEGIN        { For each field }
      WRITE (asciiz (20, field [descr].fname));          { Show name }
      FOR n := LENGTH (asciiz (20, field [descr].fname)) TO 25 DO
        WRITE (' ');                              { cosmetic spacing }
      CASE field [descr].ftype OF
        0: BEGIN
             FOR n := 1 TO 2 DO
               READ (table, intfield.stream [n]);    { get int field }
             WRITELN (intfield.number);
           END;
        1: BEGIN
             FOR n := 1 TO field [descr].flen DO
               READ (table, chfield.bf [n]);   { get character field }
             WRITELN (asciiz (20, chfield.cf));
           END;
      END;
    END;
END;
{ --------------------------- }

BEGIN
  ASSIGN (table, 'DATABASE.XYZ');                       { open table }
  RESET (table);
  FOR n := 1 TO 24 DO                           { read header record }
    READ (table, header.stream [n]);
  IF signature <> header.signature THEN
    WRITELN ('File not in proper format. Program ended.')
  ELSE
    BEGIN
      showHeaderInfo;                     { Show info about the file }
      SEEK (table, header.datastart);          { go to start of data }
      showData;                            { List each record's data }
    END;
  CLOSE (table);
END.









