Dr. Dobb's | Structured Programming

Structured Programming

Earlier this year, Kent showed how to handle huge chunks of data. This month, he revisits the subject of large data arrays.

October 01, 1988
URL:http://www.drdobbs.com/tools/structured-programming/184408013

OCT88: STRUCTURED PROGRAMMING

Back in March, I ran an installment of this column dealing with huge arrays: matrices that exceed the limit of the IBM PC's 64K memory segments. March is now quite a while ago, but I'm still getting letters about that column. It's generated more commentary--all positive and some pleading for help--than everything else I've done here combined. Clearly, it's a topic of great concern and more needs to be said about it. So here goes.

If you missed the March column, I'll briefly summarize the problem and the solution presented there.

An array is a collection of similar data items arranged such that they have a common identifier. Individual elements are accessed by using a subscript; for example, Arr[n] refers to the nth element of an array called Arr. The compiler inserts code that does the arithmetic necessary to convert the subscript into an offset from the start of the array, thus locating the subscripted element.

The problem is that, in the 80x86 architecture, the largest possible offset from an address is 65,535 bytes, or 64K, which is one memory segment. This means that the maximum number of elements in any given array is 64K DIV size (element). Thus an array consisting of 16-bit integers is constrained to a maximum of 32,768 elements, and an array of 8-byte real numbers can hold up to 8,192 items. That's plenty for most applications, but it rules out the PC for many scientific and engineering calculations --unless you can find a way around the segmentation barrier.

That's what the March column did. The technique presented there is to treat each row of a two-dimensional matrix as a separate array--conceptually, as a horizontal list in which each element maps to a matrix "column." This scheme allows each row of the matrix to be as large as 64K. The glue that fastens this collection of rows together is a each indicating a row. You can then follow a pointer into a row and access an individual column with Pascal notation such as P^[r].col[n] which indicates the nth column of the rth row.

This is ugly notation, but it rather elegantly solves an even uglier problem. It's not the only solution, though, so let's discuss a couple of others.

The Basic Solution

Before you laugh too hard at the notion of using Basic, you'd better look at what's happened to the language lately. A lot of folks still use Basic, which means there is a demand, which in turn spawns innovation. The new breed of Basic compilers have abolished old bugaboos such as line numbers and spaghetti logic, replacing those nasty GOTOs with constructs that encourage disciplined, structured programming techniques. A well-written Basic program these days looks a lot like Pascal and in most ways it's just as powerful. And (here is the punch line) Microsoft's QuickBasic 4.0 directly supports arrays greater than 64K.

This feature alone makes Quick-Basic worthy of serious consideration for giant number-crunching problems. More than one computer-hacking PhD in physics has told me he has switched to QuickBasic because of its limitless no-hassle arrays. To this add a congenial integrated environment, support of user-defined aggregate data types, multiline IFs, linkable libraries, call-by-name subroutines, and structured GOTOless code, and Basic becomes a real programming language.

Listing One, on page 136, shows HUGEMATS.BAS, written with Quick-Basic. This program is a direct translation of the Pascal program that appeared in the March column (pages 86 - 88). Because no special routines or structures are required to implement huge arrays in Quick-Basic, the Basic version is half the length of the Pascal original: 58 lines vs. 108.

The $DYNAMIC metacommand in the program heading tells the Quick-Basic compiler to use the heap for the arrays, rather than allocating static memory for them. This is required when using huge arrays. So is the /AH switch on the compiler command line. These are the only extras needed. Note that, in all other ways, working with huge arrays is identical to working with normal arrays.

I should point out, though, that the syntactic convenience of huge arrays in QuickBasic has a penalty. The compiler is obviously doing some stuff with mirrors and smoke in order to manage arrays limited only by available memory. This hidden trickery shows up in decreased performance. The QuickBasic version of the program runs on my 8-MHz AT clone in 25.66 seconds; it takes Turbo Pascal, Version 4.0, 5.49 seconds to do exactly the same work. This is a 4.7-to-1 performance ratio, which is pretty poor. The scales tip the other way on .EXE size, though less dramatically: 3,965 bytes for Basic and 4,992 for Pascal. Some of this difference is explained by the fact that QuickBasic doesn't build a Ctrl-Break handler into the .EXE as Turbo Pascal does.

So the message is that if you've got a lot of huge-array crunching to do, QuickBasic offers an easier way to do it than my method in Pascal, but you pay a heavy price in performance for that convenience.

The Disk-Based Array Solution

The Pascal and Basic solutions are both limited by one insuperable barrier: the amount of memory available to the program. Free memory is almost infinitely variable, being proscribed by the physical amount of RAM, the demands of resident and running software and the DOS version, the heap range requested in the .EXE header, the size of the stack, and other factors. Therefore it's impossible to know in advance if the target machine will have enough memory to accommodate even one large array, let alone several.

I hate even to bring it up because the performance penalty is horrendous, but the one way of being reasonably certain of having enough space for huge arrays is to put them on a hard disk. A large disk with a lot of unallocated space can give you the ability to manipulate truly enormous arrays. For example, 30 Mbytes of free disk space will accommodate 3.9 million real (8-byte) numbers, or three arrays of 1,144 X 1,145 elements each. The same space can contain 15.7 million integers in three arrays each of 2,289 x 2,290 elements. We're not encroaching on Cray territory yet (and certainly not in throughput), but we're getting warm.

The secret lies in devising an algorithm that maps Virtual subscripts into random-access disk records. This is fairly easy to do. The first step is to ensure that the array's data elements are stored to disk in columnwise order. That is, go left to right across the first row, then across the second row, and so on. Let's use a simple 3 X 3 matrix as an example:

     R0   R1   R2

     R3   R4   R5

     R6   R7   R8

where R0 becomes disk record 0, R1 is record 1, and so forth. The disk file thus becomes a one-dimensional list of the nine elements R0. . R8. If the data type of the array is REAL, you can declare the file in Pascal as:

VAR ArrayFile : FILE OF REAL;

You can achieve random access to these elements via virtual two-dimensional subscripts, where R0 has the subscripts 0 and 0, R3 is at 1 and 0, and so on. The conversion expression is:

(RowNbr * NbrOfCols) + ColNbr

This yields a logical record number that you can use with Turbo Pascal's Seek procedure to position the disk's read/write head on the desired element. For example, you might declare:

VAR LogRec : LONGINT;

and then, somewhere in the program, locate an array element with the statements:

LogRec:= (row*3) + col; Seek (ArrayFile, LogRec);

You can then read the element into a variable of type REAL. If that variable is the value returned by a functIon, you can use the function directly in an expression, and it will look much like a normal subscripted variable.

As an example, say you define such a function to read a value from array file A as:

FUNCTION A (row, col : WORD)      :REAL;

and a similar function to read array file B. You can then write statements such as:

Sum := A (r, c) + B (r, c);

This is only slightly different from the pure array-handling expression:

Sum := A [r, c] + B [r, c];

and it definitely furthers the cause of readability.

(Note that Modula-2 and C don't have typed files as Pascal does, so the offset used in seek operations is not a submultiple of the file's data type but instead is a byte offset. You can, however, effect logical record numbers through multiplication or division by the size of the data type.)

Although, this approach is workable and opens the doors to arrays of virtually unlimited size, it has a serious drawback. Doing a disk seek for every array variable creates an unacceptable amount of head movement. Array-intensive programs that use it spend most of their time waiting for the disk. DISKARR.PAS in Listing Two, page 136, originally accessed each disk-based array element individually; adding two 500 x 500 arrays of REAL took 28 minutes. After I reworked it to its present form using a row-buffering algorithm, the program ran in 5.55 minutes. Reducing head movement thus achieved a 5-to-1 performance improvement.

The row-buffering algorithm derives from the observation that most array operations proceed in columnwise order. That is, you keep accessing successive columns in the same row until you've reached the rightmost, and then you move to the next row and repeat. A secondary observation is that it's more efficient to read/write a large block of disk data than small amounts.

Consequently, you can use a variant of disk caching to store, in memory arrays, the rows of the disk-based arrays you're currently working on. In DISKARR (Listing Two), the disk arrays are 500 X 500 matrices of type REAL. Thus the buffer for the current row from any matrix is defined as:

TYPE ArrayRow = ARRAY [0..499]      OF REAL;

and you have three variables of type ArrayRow, one for each matrix. The array files themselves are defined as:

TYPE RowFile = FILE OF ArrayRow;

The logical record number for any given ArrayRow is therefore numerically the same as the row number, which simplifies file I/O management:

Seek (ArrayFile, LONGINT (row));

followed by a read or write as appropriate.

Further simplifying row buffer management is a buffer control block, defined as:

TYPE BuffCtlBlock = RECORD DCurrentRow : WORD; DIsModified : BOOLEAN; END;

A control block exists for each buffer. The CurrentRow field contains the row number of the record presently occupying the buffer. The IsModified field is normally FALSE but becomes TRUE if you write new data into the buffer.

A routine that furnishes or accepts disk-based array values can compare the parametric row with the CurrentRow field. If different, it knows that it must fetch the new row from the disk-based array. But before doing so, it can check the IsModified field to find out if the current row has been changed; if TRUE, it must save the current row in the disk file's corresponding logical record before fetching the new row. On the other hand, if the parametric row and CurrentRow are the same, no disk activity occurs.

In Listing Two, functions A, B, and C, and also the Write To C procedure, all check and update the associated buffer control block as required. The control blocks are a slight overkill within the context of this particular program, but the combination of function C and procedure WriteTo C illustrates their application in more complex programs using disk-based arrays.

The elegance of the row-buffering algorithm as implemented by subroutines is illustrated by the main work-performing loop of the program. It adds the two arrays A and B to form the output matrix C with:

FOR row : = 0 TO maxRow DO BEGIN      FOR col:= 0 TO maxCol DO
           WriteToC (row, col, (A (row, col) + DB (row, col)));
           END;

The calls to A, B, and WriteTo C all manage row changing as appropriate.

Special functions not shown in Listing Two are necessary when an array operation doesn't follow the usual columnwise order. A case in point is matrix multiplication, which proceeds rowwise in the multiplicand and columnwise in the multiplier. For that you need a routine capable of fetching the multiplicand's row values and loading them into a buffer. Because matnx multiplication requires that there be the same number of rows in the multiplicand as there are columns in the multiplier, you can use a variable of type ArrayRow for the row values within the given column. A local variable of type ArrayRow can serve as the file input buffer, and from there you can pluck the target column values and stuff them into successive elements of the row-wise array. Having done that, you simply summarize the products of successive rowwise and columnwise multiplications and do a WriteToC with the final series product. It sounds complicated, but it's really quite straightforward. And it allows you to multiply matrices of virtually unlimited size, subject to the usual rules of matrix multiplication.

The DISKARR program in Listing Two creates test files for two-dimensional matrix addition and leaves them on the disk as files ARRAY.A, ARRAY.B, and ARRAY.C. If you run c PROGRAMMING the program a second time and the input files (A and B) are still the same size, the program assumes that you want to use them again, so it bypasses the file-creation step. This saves execution time. Each file occupies 1,500,000 bytes, for a total of some 4.3 Mbytes. Make sure you erase these files, lest you clutter your hard disk with useless test data.

The general approach to disk-based arrays can also be applied to EMS-based arrays without a great deal of reworking. EMS is faster, but unless you're selling turnkey systems, it's safer to assume that a user has a hard disk than to assume that he or she has enough EMS to handle truly enormous arrays.

Conclusion

Programmers of systems such as the 680x0-based Mac II and SPARC-based Sun workstations don't have to inveigle their way around the arbitrary 64K segments foisted on us by IBM and Intel. For arrays that might conceivably fit within a reasonably large memory array on the PC, QuickBasic or the method suggested in the March column furnish a workable solution.

But DRAM chips are currently in short supply and command a high price, and the PC and DOS in general comprise a limited platform from the standpoint of memory. For very large arrays, the disk-based method is probably the best solution. It doesn't furnish dazzling throughput, but it will do the job. The only practical limit is the amount of free disk space.

Huge arrays are clearly a topic of great concern to the DDJ readership. If you have other solutions or issues, let me hear about them.

_STRUCTURED PROGRAMMING_ by Kent Porter

[LISTING ONE]



' Program HUGEMATS.BAS
' Demo program to add two huge matrices > 64K, giving a third
' Written using Microsoft QuickBasic 4.00B
' Kent Porter, DDJ, October 1988

DEFINT A-Z                        ' All variables are integers
DECLARE SUB acquire (D())         ' Subroutine prototype
REM $DYNAMIC                      ' Use heap for arrays

' Constants
CONST maxRows = 250               ' Rows in matrices
CONST maxCols = 300               '  and columns

' Define arrays
OPTION BASE 1                     ' 1 is lowest subscript
DIM A(maxRows, maxCols)
DIM B(maxRows, maxCols)
DIM C(maxRows, maxCols)

' ----------------------------------------------------------------
' Main program follows
  CLS                             ' Clear screen
  size& = maxRows * 2
  size& = size& * maxCols         ' Array size as long int
  PRINT "Size of each array is"; size&; "bytes"

  PRINT "Setting up Array A"
  Acquire A()

  PRINT "Setting up Array B"
  Acquire B()

  PRINT "Adding arrays"
  FOR col = 1 TO maxCols
    FOR row = 1 TO maxRows
      C(row, col) = A(row, col) + B(row, col)
    NEXT row
  NEXT col

  PRINT "Proof:"
  PRINT "A(1, 1) + B(1,1) = C(1, 1) = ";
  PRINT A(1, 1); " + "; B(1, 1); " = "; C(1, 1)
  C = maxCols
  r = maxRows
  PRINT "A(max, max) + B(max, max) = C(max, max) = ";
  PRINT A(r, C); " + "; B(r, C); " = "; C(r, C)
' -----------------------------------------------------------

SUB Acquire (D())
  ' Load data into array 'D'

  FOR row = 1 TO maxRows
    FOR col = 1 TO maxCols
      D(row, col) = (row * 10) + col    ' Generate test data
    NEXT col
  NEXT row
END SUB

[LISTING TWO]




PROGRAM DiskArr;
(* Illustrates disk-based arrays, adding two 500 x 500 arrays *)
(*   of REAL to yield a third.                                *)
(* Requires 4.5MB of disk space                               *)
(* Turbo Pascal 4.0                                           *)
(* Kent Porter, DDJ, October 1988                              *)

USES CRT, DOS;

CONST  maxRow = 499;
       maxCol = 499;
       Yes    = TRUE;
       No     = FALSE;

TYPE   ArrayRow = ARRAY [0..MaxCol] OF REAL;    (* Row buffer *)
       RowFile  = FILE OF ArrayRow;              (* File type *)
       BuffCtlBlock = RECORD      (* Row buffer control block *)
         CurrentRow : WORD;
         IsModified : BOOLEAN;
       END;

VAR    ArrA, ArrB, ArrC  : RowFile;
       RowA, RowB, RowC  : ArrayRow;
       BufA, BufB, BufC  : BuffCtlBlock;
       BufSize           : WORD;
       row, col, nCols   : WORD;

(* ---------------------------------------------------------- *)

PROCEDURE Acquire (VAR arr  : RowFile;
                   VAR cb   : BuffCtlBlock;
                   VAR buf  : ArrayRow;
                       name : String);

  (* Load data into disk array 'arr'                          *)
  (* If the file already exists, simply open it               *)
  (* Upon return, row 0 is loaded into the buffer             *)

VAR   r, c, nread : WORD;
      newfile     : BOOLEAN;

BEGIN
  cb.CurrentRow := 0;      (* Initialize buffer control block *)
  cb.IsModified := No;
  NewFile       := Yes;    (* Assume we have to make new file *)

  Assign (arr, name);
  {$I-}
  Reset (arr);                        (* Does the file exist? *)
  {$I+}
  IF IOResult = 0 THEN                 (* File already exists *)
    IF FileSize (arr) = maxRow+1 THEN        (* If right size *)
      NewFile := No;                (* then use existing file *)

  (* If we have to create a new file *)
  IF NewFile THEN BEGIN
    Rewrite (arr);                         (* Create the file *)
    FOR r := 0 TO maxRow DO BEGIN
      Gotoxy (1, WhereY-1); Writeln ('Row ',r:3); (* Show row *)
      FOR c := 0 TO maxCol DO
        Buf [c] := ((row * nCols) + c) * 1.0;    (* Test data *)
      Write (arr, buf);                 (* Write out full row *)
    END;
    Writeln;
  END;

  Seek (arr, 0);                         (* Go to top of file *)
  Read (arr, buf);                         (* Get first block *)
END;
(* -------------------------- *)

FUNCTION A (row, col : WORD) : REAL;

  (* Return indicated element from Array A *)

BEGIN
  IF row <> BufA.CurrentRow THEN BEGIN     (* Reading new row *)
    IF BufA.IsModified THEN BEGIN     (* Save row if modified *)
      Seek (ArrA, LONGINT (BufA.CurrentRow));
      Write (ArrA, RowA);
    END;
    Seek (ArrA, LONGINT (row));                (* Get new row *)
    Read (ArrA, RowA);
    BufA.IsModified := No; BufA.CurrentRow := row;
  END;
  A := RowA [col];                      (* Return the element *)
END;
(* -------------------------- *)

FUNCTION B (row, col : WORD) : REAL;

  (* Same as A, but from ArrB *)

BEGIN
  IF row <> BufB.CurrentRow THEN BEGIN
    IF BufB.IsModified THEN BEGIN
      Seek (ArrB, LONGINT (BufB.CurrentRow));
      Write (ArrB, RowB);
    END;
    Seek (ArrB, LONGINT (row));
    Read (ArrB, RowB);
    BufB.IsModified := No; BufB.CurrentRow := row;
  END;
  B := RowB [col];
END;
(* -------------------------- *)

FUNCTION C (row, col : WORD) : REAL;

  (* Same as A, but from ArrC *)

BEGIN
  IF row <> BufC.CurrentRow THEN BEGIN
    IF BufC.IsModified THEN BEGIN
      Seek (ArrC, LONGINT (BufC.CurrentRow));
      Write (ArrC, RowC);
    END;
    Seek (ArrC, LONGINT (row));
    Read (ArrC, RowC);
    BufC.IsModified := No; BufC.CurrentRow := row;
  END;
  C := RowC [col];
END;
(* -------------------------- *)

PROCEDURE WriteToC (row, col : WORD; val : REAL);

  (* Write val to C [row, col] *)

BEGIN
  IF row <> BufC.CurrentRow THEN BEGIN        (* If a new row *)
    IF BufC.IsModified THEN BEGIN          (* and old changed *)
      Seek (ArrC, LONGINT (BufC.CurrentRow));     (* save old *)
      Write (ArrC, RowC);
    END;
    Seek (ArrC, LONGINT (row));           (* then get new row *)
    Read (ArrC, RowC);
    BufC.CurrentRow := row;
  END;
  RowC [col] := val;                       (* and write to it *)
  BufC.IsModified := Yes;
END;
(* -------------------------- *)

BEGIN   (* Body of main program *)
  ClrScr;
  Writeln ('*** Disk Array Processor ***');
  nCols := MaxCol + 1;
  BufSize := SizeOf (ArrayRow);

  (* Create output array file and fill with zeros *)
  Assign (ArrC, 'ARRAY.C');
  Rewrite (ArrC);
  Writeln ('Initializing output array'); Writeln;
  FOR col := 0 TO maxCol DO
    RowC [col] := 0.0;
  FOR row := 0 TO maxRow DO BEGIN
    Gotoxy (1, WhereY-1); Writeln ('Row ', row:3);
    Write (ArrC, RowC);
  END;
  Seek (ArrC, 0); Read (ArrC, RowC);
  BufC.CurrentRow := 0; BufC.IsModified := No;

  (* Get the test data into A and B *)
  Gotoxy (1, WhereY-1); Writeln ('Setting up Array A');
  Acquire (ArrA, BufA, RowA, 'ARRAY.A');
  Gotoxy (1, WhereY-1); Writeln ('Setting up Array B');
  Acquire (ArrB, BufB, RowB, 'ARRAY.B');

  (* Add A and B, giving C *)
  Gotoxy (1, WhereY-1); ClrEol; Writeln ('Adding arrays');
  FOR row := 0 TO maxRow DO BEGIN
    Gotoxy (1, WhereY);
    Write ('Row ', row:3);
    FOR col := 0 TO maxCol DO
      WriteToC (row, col, (A (row, col) + B (row, col)));
  END;

  (* Display proof that it worked *)
  Gotoxy (1, WhereY); Writeln ('Addition completed');
  Writeln ('Proof:');
  Write   ('A (1, 1) + B (1, 1) = C (1, 1) = ');
  Writeln (A (1, 1):6:0, ' + ',
           B (1, 1):6:0, ' = ',
           C (1, 1):6:0);

  Write   ('A (maxRow, maxCol) + B (maxRow, maxCol) = ');
  Writeln ('C (maxRow, maxCol) = ');
  Writeln (A (maxRow, maxCol):6:0, ' + ',
           B (maxRow, maxCol):6:0, ' = ',
           C (maxRow, maxCol):6:0);
  Close (ArrC);
END.