Margination and Project Gutenberg

The goal of Project Gutenberg is to distribute a million books electronically by 2010. To that end, project organizers have defined some unusual requirements for line margins.


May 01, 2000
URL:http://www.drdobbs.com/architecture-and-design/margination-and-project-gutenberg/184404098


Setting margins for electronic documents

William is a senior consultant with Alphatech Corp. He can be contacted at [email protected].


In 1971, Michael Hart took advantage of a grant of computer time at the University of Illinois to build a repository for free copies of books. Bearing in mind the huge difference Johannes Gutenberg made to the world when he began printing copies of the Bible for general distribution, Hart started typing in books with expired copyrights. The first "book" was the United States Declaration of Independence. Since that time, Project Gutenberg (http://www.gutenberg.net/) has been posting ASCII files free for download, with the goal of distributing 1 million titles by January 2010.

While Hart is a prodigious typist with a fondness for books, he couldn't possibly type in a world-class book collection by himself. Volunteers soon appeared to help type in books, scan material, and edit manuscripts. There are now books in a wide variety of languages, and even output from the Human Genome Project. In fact, well over 2000 e-texts are available online, with hundreds more in the process of being prepared for public use. Volunteers are bringing in e-texts at unprecedented rates, as high as a book a day. Bearing in mind that typing in a book can take as long as six months of dedicated work, this is extraordinary.

As one of those volunteers, I have been typing in books with expired copyrights so that they can be distributed freely on the Internet. The books are stored in straight ASCII so that they can be viewed on the widest variety of platforms. In an effort to make the books as readable as possible, Project Gutenberg has defined some unusual requirements for line margins that are reminiscent of the era of handwritten and hand-printed books. This is where the algorithm I present here comes in handy.

Requirements and Specifications

When it comes to setting the margins for Project Gutenberg books, there are a number of requirements: Every line must meet a minimum length, no line may exceed a maximum length, and each line should come as close as possible to an optimal length (65 characters for the test file used here), with little variation from line to line.

Once the requirements had been identified and the specifications written, the next step was to look at how this type of margin could be generated. Clearly, a hard margin (used in virtually all word processors) doesn't work, because it fails to ensure that the minimum line length is met.

Selecting an appropriate length for any given line involves two simultaneous, interdependent operations. First, the line itself needs to approach the optimal length. Second, the other lines in the paragraph must be considered so that they, too, can approach the optimal length. Because this process is by nature forward looking, a forward-looking algorithm makes sense. On the other hand, an algorithm that uses back propagation could also work, so a recursive algorithm is worth considering. Finally, it turns out that the problem becomes chaotic for any paragraph longer than four lines, so the algorithm must also cope with that behavior.

Forward-Looking Algorithm

A forward-looking algorithm looks ahead a step and examines whether the current position is optimal or the next step is closer to optimal. The algorithm iterates in this manner until it is as close to the optimal configuration as it can be. Listing One presents the heart of the forward-looking algorithm.

The forward-looking component is in the second WHILE clause, which looks ahead to see if adding an additional word to the given line will bring it closer to optimum than the current size of the line.

This is a fairly swift algorithm that reads through the file only once. Given the minimum and maximum line sizes, it is possible to calculate the worst-case number of extra words read by the look-ahead, which adds to the processing time for the entire algorithm. This is given by the formula:

(MAXIMUM_LINE_SIZE - MINIMUM_LINE_SIZE) / AVERAGE_WORD_LENGTH

The text of this article, used here as the test file, has an average word length of around five characters. Thus, three words (including spaces) would be examined, on average, by the look-ahead. If the average line is 65 characters, this imposes an overhead of about 23 percent.
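To make the look-ahead concrete, here is a minimal C sketch of the forward-looking fill, following the logic of Listing One. The function name, the word-length-array representation, and the limit values are all hypothetical, not taken from forward.c:

```c
#include <stdlib.h>   /* abs */

/* Hypothetical limits; the article's production values are not given. */
#define MINIMAL_LINE_SIZE 55
#define OPTIMAL_LINE_SIZE 65
#define MAXIMUM_LINE_SIZE 75

/* Greedy forward-looking fill.  words[] holds individual word lengths,
 * n is the word count, and *next indexes the first unplaced word.
 * Builds one line starting at *next, advances *next past it, and
 * returns the line's size.  Each word after the first costs one extra
 * character for the separating space. */
static int fill_line(const int words[], int n, int *next)
{
    int size = 0, last = 0;

    /* Grow the line until the minimum length is met. */
    while (*next < n && size < MINIMAL_LINE_SIZE) {
        last = (size ? 1 : 0) + words[*next];
        size += last;
        (*next)++;
    }
    /* Look ahead one word: keep adding while the longer line would be
     * closer to the optimum than the current one. */
    while (*next < n &&
           abs(size - OPTIMAL_LINE_SIZE) >
           abs(size + 1 + words[*next] - OPTIMAL_LINE_SIZE)) {
        last = 1 + words[*next];
        size += last;
        (*next)++;
    }
    /* If the maximum was overshot, drop the last word again. */
    if (size > MAXIMUM_LINE_SIZE && last > 0) {
        size -= last;
        (*next)--;
    }
    return size;
}
```

With five-character words, the minimum loop fills to 59 characters and the look-ahead takes one more word to land exactly on the 65-character optimum.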

Back-Propagation Algorithm

An algorithm that uses back propagation sends information from the line currently being examined back to a previous line, generally via recursion. The question at hand is: What information should be passed backward? The intuitive answer is the length of the subsequent line. This lets the current line examine a variety of possible sizes and adjust itself according to what would be best for the two lines together. Listing Two is the heart of the back-propagation algorithm.

In the IF clause, the results of the back propagation are incorporated into the analysis of which line size should be used. Again, using the analysis just given for this article, three words on average are examined for each back propagation on each line. To evaluate the impact on performance, it is necessary to know the size of the paragraph. Each line in the paragraph makes three back-propagation calls; those calls, in turn, each make three more, and so on. The performance impact of this algorithm therefore depends on the number of lines in a paragraph.

At a minimum, each paragraph will be read three times for each line in the paragraph; that is a 300 percent increase in reading. Generally, the impact will be much larger, up to the worst case of 3^(paragraph_line_count-1) passes. This exponential increase in reads as the paragraph grows could be very detrimental to performance (for example, a 10-line paragraph incurs 3^9 = 19,683 passes in the worst case). Perhaps more importantly, each pass looks at the paragraph's lines again, which compounds the problem: The true worst-case cost is paragraph_line_count × 3^(paragraph_line_count-1) reads. I coded this algorithm and tried to run it against the test file, but after 10 hours I gave up on it. The performance criteria were simply not met.
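The exponential blow-up is easy to tabulate with a short helper. The branching factor of 3 is simply the average number of candidate breaks for this article's test file, and the function itself is a hypothetical sketch:

```c
/* Worst-case cost of the naive back-propagation.  Each line examines
 * three candidate breaks on average (for this article's test file),
 * and each candidate recurses into the rest of the paragraph, so a
 * paragraph of `lines` lines costs 3^(lines-1) passes in the worst
 * case -- and lines * 3^(lines-1) total line reads. */
static long worst_case_passes(int lines)
{
    long passes = 1;
    for (int i = 1; i < lines; i++)
        passes *= 3;
    return passes;
}
```

For a 10-line paragraph, this yields 19,683 passes; multiplying by the line count gives the 196,830-read true worst case.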

There is often a trade-off between performance and resource usage, and this algorithm is no exception: A simple modification (see Listing Three) can dramatically reduce the performance impact. By building an array that contains the calculated values for lines beginning with (potentially) each word in the paragraph, it is possible to reduce both the number of reads and the amount of recursion required to process each paragraph. This array (global to the function, and therefore available at any level of recursion) can be queried before recursing to see if an optimal value has already been calculated.

This simple change dramatically reduces the number of reads of the original data. A quick analysis would suggest that the number of reads equals the number of words. The problem, however, is that every logical line (except the first and last, which have three each) has six words, on average, that can fall on either the current line or another line, depending on the optimization. Thus, the total number of reads is bounded by 6×(number_of_lines-1). While this represents an average 600 percent increase in reads over the forward-looking algorithm, it is dramatically smaller than the cost of the first back-propagation algorithm.
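Listing Three's array technique can be sketched in C as follows. For simplicity, this hypothetical version indexes the cache by starting word alone, rather than by line and word as in the listing; this is equivalent for a sketch, because a line's best size depends only on where it starts. All names and limit values here are assumptions:

```c
#include <stdlib.h>   /* abs */
#include <string.h>   /* memset */

/* Hypothetical limits and capacity; the article's real values differ. */
#define MINIMAL_LINE_SIZE 55
#define OPTIMAL_LINE_SIZE 65
#define MAXIMUM_LINE_SIZE 75
#define MAX_WORDS 256

/* memo[w] caches the best size already found for a line that starts at
 * word w, so each starting word is solved at most once per paragraph. */
static int memo[MAX_WORDS];

/* Best size for the line starting at word `first`: try every break
 * between the minimum and maximum, recursing (with memoization) for
 * the following line, and keep the break that minimizes the combined
 * signed deviation of the two lines from the optimum. */
static int best_line(const int words[], int n, int first)
{
    if (first >= n)
        return OPTIMAL_LINE_SIZE;        /* no next line: no penalty */
    if (memo[first] >= 0)
        return memo[first];              /* already calculated */

    int size = 0, w = first;
    while (w < n && size < MINIMAL_LINE_SIZE)
        size += (size ? 1 : 0) + words[w++];

    int best_size = size;
    int next_res  = best_line(words, n, w);
    int best_cost = abs((size - OPTIMAL_LINE_SIZE) +
                        (next_res - OPTIMAL_LINE_SIZE));

    while (w < n && size + 1 + words[w] <= MAXIMUM_LINE_SIZE) {
        size += 1 + words[w++];
        next_res = best_line(words, n, w);
        int cost = abs((size - OPTIMAL_LINE_SIZE) +
                       (next_res - OPTIMAL_LINE_SIZE));
        if (cost < best_cost) {
            best_cost = cost;
            best_size = size;
        }
    }
    return memo[first] = best_size;
}

/* Reset the cache, then solve the paragraph from its first word. */
static int margin_paragraph(const int words[], int n)
{
    memset(memo, -1, sizeof memo);       /* all-ones bytes == -1 ints */
    return best_line(words, n, 0);
}
```

Because each starting word is solved at most once, the recursion does work proportional to the number of words rather than exponential in the number of lines.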

Implementation

To test the validity of the algorithms and evaluate their speed on a UNIX and on a DOS/Windows platform, some form of implementation was necessary. Since testing was constrained to the aforementioned file, parameters were set to the minimum values that would allow the file to be processed. In a practical implementation, such as the one to be provided to Project Gutenberg, the parameters would either need to be more dynamic (set on the fly) or much larger, to handle the worst case. The code could also be optimized further, although an effort was made to keep throughput consistent and the code easy for the reader to follow.

On the UNIX box, I used the C compiler provided by the manufacturer; on the DOS box, I used djgpp (available at http://www.delorie.com/djgpp/) to compile the same source code. Table 1 provides a performance comparison of the two packages.

The forward-looking algorithm was implemented as forward.c (available electronically; see "Resource Center," page 5). Some details, such as the need to remove carriage returns from existing documents, inserting double spaces at the end of sentences that also end a line, and allowing portions of the document to go unprocessed (valuable for a table of contents, index, or poetry), consume a fair amount of the code. This portion of the code carries through to the implementations of the other algorithms. As expected, the algorithm is fast (see Table 1), and it produced a surprisingly good average line length of 65.29 (excluding lines that end a paragraph and, therefore, cannot be optimized). The standard deviation, however, is about 10 characters on either side of the average, which is not ideal (the specification called for 5). In any case, the algorithm looks very promising.

There are some problems, however, with the forward-looking algorithm: It optimizes each line in isolation, with no way to trade length between lines, and the resulting 10-character standard deviation misses the 5-character specification.

Conclusion

To complete this program and make it truly useful, the parameters that are currently hard coded to size the arrays could be read from a file, with the arrays built dynamically, either from information gleaned in a preprocessing step or set directly by the user. In addition, the program should report statistics showing the user how successful the optimization effort was. Everybody likes to know how well a compression tool compressed a file, even though, in most cases, they don't really need to know; the same applies to this application. Also, the code would probably benefit from several more iterations of optimization to tighten the slow sections and generate more consistent results across platforms. Finally, because the program's performance already far exceeds the specification, the arrays could be replaced with sparse matrices to reduce memory overhead.

DDJ

Listing One

WHILE (CURRENT_LINE_SIZE < MINIMAL_LINE_SIZE)
  Add a Word to the Line

WHILE (ABS(CURRENT_LINE_SIZE - OPTIMAL_LINE_SIZE) > 
                ABS(CURRENT_LINE_SIZE + NEXT_WORD_SIZE - OPTIMAL_LINE_SIZE))
  Add a Word to the Line

IF (CURRENT_LINE_SIZE > MAXIMUM_LINE_SIZE)
  Remove the last word from the line

RETURN (CURRENT_LINE_SIZE)


Listing Two

WHILE (CURRENT_LINE_SIZE < MINIMAL_LINE_SIZE)
  Add a Word to the Line

NEXT_LINE_RESULT = 0
BEST_LINE_RESULT = 0
BEST_CURRENT_LINE = CURRENT_LINE_SIZE

WHILE (CURRENT_LINE_SIZE <= MAXIMUM_LINE_SIZE)
  NEXT_LINE_RESULT = BACK_PROPAGATE(NEXT_WORD)
  IF (ABS((CURRENT_LINE_SIZE - OPTIMAL_LINE_SIZE) + 
                  (NEXT_LINE_RESULT - OPTIMAL_LINE_SIZE)) < 
                  ABS((BEST_CURRENT_LINE - OPTIMAL_LINE_SIZE) + 
                  (BEST_LINE_RESULT - OPTIMAL_LINE_SIZE)))
    BEST_LINE_RESULT = NEXT_LINE_RESULT
    BEST_CURRENT_LINE = CURRENT_LINE_SIZE
  Add a Word to the Line
RETURN(BEST_CURRENT_LINE)


Listing Three

WHILE (CURRENT_LINE_SIZE < MINIMAL_LINE_SIZE)
  Add a Word to the Line

NEXT_LINE_RESULT = 0
BEST_LINE_RESULT = 0
BEST_CURRENT_LINE = CURRENT_LINE_SIZE

WHILE (CURRENT_LINE_SIZE <= MAXIMUM_LINE_SIZE)
  IF LINE_VALUES[LINE_NUMBER + 1][NEXT_WORD] is not calculated
    INCREMENT LINE_NUMBER
    NEXT_LINE_RESULT = BACK_PROPAGATE(NEXT_WORD)
    LINE_VALUES[LINE_NUMBER][NEXT_WORD] = NEXT_LINE_RESULT
    DECREMENT LINE_NUMBER
  ELSE
    NEXT_LINE_RESULT = LINE_VALUES[LINE_NUMBER + 1][NEXT_WORD]
  IF (ABS((CURRENT_LINE_SIZE - OPTIMAL_LINE_SIZE) + 
                       (NEXT_LINE_RESULT - OPTIMAL_LINE_SIZE)) < 
                       ABS((BEST_CURRENT_LINE - OPTIMAL_LINE_SIZE) + 
                       (BEST_LINE_RESULT - OPTIMAL_LINE_SIZE)))
    BEST_LINE_RESULT = NEXT_LINE_RESULT
    BEST_CURRENT_LINE = CURRENT_LINE_SIZE
  Add a Word to the Line

RETURN(BEST_CURRENT_LINE)



Table 1: DOS and UNIX performance comparison. (1) Pentium Pro 200, SCSI, Micron Millennium Pro2. (2) Average of 10 runs in DOS/Windows, requesting the time before and after the 10 runs. (3) Minicomputer. (4) Average of 10 runs using the UNIX "time" utility.
