Dr. Dobb's | Margination and Project Gutenberg

Margination and Project Gutenberg

The goal of Project Gutenberg is to distribute a million books electronically by the end of the year. To that end, project organizers have identified some unusual requirements on the margins of the lines.

May 01, 2000
URL:http://www.drdobbs.com/architecture-and-design/margination-and-project-gutenberg/184404098

May00: Margination and Project Gutenberg

Setting margins for electronic documents

William is a senior consultant with Alphatech Corp. He can be contacted at [email protected].

In 1971, Michael Hart took advantage of a grant in computer time at the University of Illinois to build a repository for free copies of books. Bearing in mind the huge difference to the world made by Johannes Gutenberg when he began printing copies of the Bible for general distribution, Hart started typing-in books with expired copyrights. The first "book" was the United States of America's "Declaration of Independence." Since that time, Project Gutenberg (http://www .gutenberg.net/) has been posting ASCII files free for download, with the goal to distribute 1 million titles by January, 2010.

While Hart is a prodigious typist with a fondness for books, he couldn't possibly type in a world-class book collection by himself. Volunteers soon appeared to help type-in books, scan material, and edit manuscripts. There are now books in a wide variety of languages and even the output of Human DNA effort. In fact, there are now well over 2000 e-texts available online and hundreds more in the process of being made available for public use. Volunteers are bringing in e-texts at unprecedented rates as high as a book a day. Bearing in mind that typing-in a book can take as long as six months of dedicated work, this is extraordinary.

As one of those volunteers, I have been typing-in books with outdated copyrights so that they can be distributed freely on the Internet. The books are stored in straight ASCII so that they can be viewed on the widest variety of platforms. In an effort to make the books as readable as possible, Project Gutenberg has defined some unusual requirements on the margins of the lines which are reminiscent of the era of handwritten and handprinted books. This is where the algorithm I present here comes in handy.

Requirements and Specifications

When it comes to setting the margins for Project Gutenberg books, there are a number of requirements:

Each line must be between 55 and 75 characters long. Optimally, lines should be 65 characters long and preferably within the range of 60 to 70 characters. While this sounds simple, there is no word processor in the marketplace (free or commercial) that meets these interesting requirements. In addition, each line must end with a CR/LF typical of text documents. Most volunteers eyeball the lines and attempt to bring them within the margin requirements. Needless to say, this does not work well. Some volunteers set margins at 70 and hope the minimal length is met. Too many volunteers don't meet the margin requirements at all, therefore requiring additional work from the editing team.
Project Gutenberg could benefit from a program that would implement these margins on existing works and on future works. The program must process an entire book in a short amount of time (under five minutes), as this is not the only editing function that must be completed on a book before it is released to the Web.
The program should run at least under UNIX and DOS, and should be portable to other platforms with minimum effort. This implies a command-line interface to minimize the need to port a user interface between dissimilar desktop applications (for instance, X-Windows and Microsoft Windows).
The program should be written in ANSI-C for portability. While ANSI-C is not guaranteed to run on any platform it encounters, it is quite likely that it will be portable to a wide variety of platforms with no changes. Certainly ANSI-C programs can run under both UNIX and DOS.
The program should optimize lines to 65 characters. This should be done bearing in mind that the minimum and maximum lengths of a line still need to be respected.
The program should process a book of 35,000 words in five minutes or less. Okay, 35,000 is an arbitrary number not based on any averages from Project Gutenberg. However, it just so happens that one of the books I typed-in was almost exactly 35,000 words and it was about the average for books I have typed-in. This gave me a reasonable test sample and was a reasonable representation of the requirement.

Once the requirements had been identified and the specifications specified, the next step was to look at how this type of margin could be generated. Clearly, a hard margin (used in virtually all word processors) doesn't work because it fails to ensure that the minimum line length is met.

Selecting an appropriate line length for any given line involves two simultaneous, interdependent operations. First, the line length needs to approach the optimal length itself. Second, other lines in the paragraph must be considered so that they, too, can approach the optimal length. Because this process is by nature forward looking, a forward-looking algorithm makes sense. On the other hand, an algorithm that uses back propagation could also be used, so a recursive algorithm should be considered. Finally, it turns out that the problem devolves into a chaotic nature on any paragraph longer than four lines, so an algorithm that considers chaos should be considered as well.

Forward-Looking Algorithm

A forward-looking algorithm looks ahead a step and examines whether the current position is optimal or the next step is closer to optimal. The algorithm iterates in this manner until it is as close to the optimal configuration as it can be. Listing One presents the heart of the forward-looking algorithm.

The forward-looking component is in the second WHILE clause, which looks ahead to see if adding an additional word to the given line will bring it closer to optimum than the current size of the line.

This would be a fairly swift algorithm that would read through the file once. Keeping in mind the minimal and maximal line size, it is possible to calculate a worst-case number of words read by the look ahead that add to the processing time for the entire algorithm. This is given by the formula:

(MAXIMUM_LINE_SIZE - MINIMUM_LINE_ SIZE)/AVERAGE_WORD_LENGTH

In this article, for example, the average word length is around five characters. Thus, three words (including spaces) would be examined (on average) by the algorithm. If the average line is 65 characters, then this imposes an overhead of about 23 percent.

Back-Propagation Algorithm

An algorithm that uses back propagation sends information from the current line being examined back to a previous line. This is generally done using recursion. The question at hand is what information should be passed backwards? The intuitive answer is that the length of the subsequent line should be passed back. This allows the current line to examine a variety of possible line sizes and adjust itself according to what would be best for the two lines together. Listing Two is the heart of the back-propagation algorithm.

In the IF clause, the results of the back propagation are incorporated into the analysis of which line size should be used. Again, using the analysis just mentioned for this article, this would mean that three words on average would be used for each back propagation on each line. To evaluate the impact on performance from this, it is necessary to know the size of the paragraph. For each line in the paragraph, three back-propagation function calls will be made. Those lines, in turn, will make three back-propagation calls, and so on. Thus, the performance impact of this algorithm is dependent on the number of lines in a paragraph. At a minimum, each paragraph will be read three times for each line in the paragraph; that is a 300 percent increase in reading. Generally, the impact will be much larger to the worst-case 3^{paragraph_line_count-1}. This exponential increase in reads, as the size of the paragraph increases, could be very detrimental to performance (for example, a 10-line paragraph incurs 19,683 reads in the worst case). Perhaps more importantly, this algorithm looks at the paragraph multiple times. This compounds the performance problem, as the true worst-case cost is paragraph_line_ count!×3^{paragraph_line_count-1}. I coded this algorithm and tried to run it against the test file, but after 10 hours I gave up on it. The performance criteria were simply not met.

There is often a trade-off between performance and resource usage. This algorithm is no exception and a simple modification (see Listing Three) can dramatically reduce the performance impact. By building an array that contains the calculated values for lines beginning with (potentially) each word in the paragraph, it is possible to reduce the number of times reads are done and the amount of recursion required to process each paragraph. This array (global to the function, so available at any level of recursion) can be queried before ordering recursion to see if an optimal value has already been calculated.

This simple change dramatically reduces the number of reads of the original data. A quick analysis would suggest that the number of reads would be equal to the number of words. The problem, however, is that there are six words (on average) on every logical line (except the first and last, where there are three each) that can be either on the current line or another line depending on the optimization. Thus, the total number of reads are limited by 6×(number_of_lines-1). While this represents an average 600 percent increase in reads over the forward-looking algorithm above, it is dramatically smaller than the performance impact of the first back-propagation algorithm.

Implementation

To test the validity of the algorithms and evaluate the speed with which they perform on a UNIX and on a DOS/Windows platform, it was necessary to do some form of implementation. Since the testing to be done with this implementation was to be constrained to the aforementioned file, parameters were set to minimum values that would allow the file to be processed. In a practical implementation, such as will be provided to Project Gutenberg, the parameters would either need to be more dynamic (set on the fly) or much larger to handle the worst case. In addition, the code may be subject to additional optimization, although efforts were made to optimize the code for at least consistent throughput and ease of comprehension to the reader.

On the UNIX box, I used the C compiler provided by the manufacturer; on the DOS box, I used djgpp (available at http:// www.delorie.com/djgpp/) to compile the same source code. Table 1 provides a performance comparison of the two packages.

The forward-looking algorithm was implemented as forward.c (available electronically; see "Resource Center," page 5) Some details, such as the need to remove carriage returns from existing documents, inserting double spaces at the end of sentences that also end a line, and allowing portions of the document to go unprocessed (valuable for a table of contents, index, or poetry), consume a fair amount of the code. This portion of the code is carried through the implementation of the other algorithms. As expected, this algorithm is fast, as seen in Table 1, and resulted in a surprising average line length of 65.29 (excluding lines that end a paragraph and cannot, therefore, be optimized). The standard deviation is about 10 characters on either side of the average, which is not ideal (5 characters was in the specification). In any case, the algorithm looks very promising.

There are some problems, however, with the forward-looking algorithm:

Any line that ends with a word 21 or more characters in length ends up short. Now, this may seem unusual, and, indeed, it is, but it is far from impossible. Historical books in general use a much larger vocabulary than contemporary novels and use a fair amount of hyphenation. In a book on which I am currently working the following "word" appears: "Therefore-possibly."
This "word" is 21 characters in length and could, conceivably, end a line. In the forward-looking algorithm, there is no way to correct this problem.
On some occasions, moving one line slightly away from the optimum position allows many other follow-on lines to fall into optimum positions, thus increasing the optimization on the paragraph. Because the forward-looking algorithm attempts to optimize each line individually, the overall optimization of the paragraph doesn't come into play.
The back-propagation algorithm was implemented as backward.c (available electronically). This algorithm is faster than expected in the analysis. A quick review of Table 1 shows that the performance impact of the back propagation algorithm is about 280 percent in the UNIX environment and 426 percent in the DOS environment. Both are below the average worst-case impact of 600 percent predicted and the difference between the two can probably be attributed to memory management, which is generally better in the UNIX environment. The really startling result is the dramatic improvement in standard deviation down to well below the specification.
The dramatic improvements in optimization seen in the back-propagation algorithm are understandable:
As noted, movement of one line slightly away from the optimum position allows many other follow-on lines to fall into optimum positions.
Exhaustively looking for the best configuration dramatically increases the amount of information available to the program and allows the global optimum to be found.

Conclusion

To complete this program, and make it truly useful, the parameters that are currently hard coded to build the sizes of the arrays could be read from a file and the arrays built dynamically either from information gleaned in a preprocessing step or changed directly by the user. In addition, statistics should be generated from the program showing the user how successful the optimization effort was. Everybody likes to know to what degree a compression tool compressed a file, even though they don't really need to know the performance of the program in most cases. The same applies to this application. Also, the code would probably benefit from several more iterations of optimization that could tighten the sections that are slow and generate more consistent results across platforms. Finally, the arrays could be replaced with sparse matrices in an effort to reduce overhead because the current performance of the program far exceeds the specification.

DDJ

Listing One

WHILE (CURRENT_LINE_SIZE < MINIMAL_LINE_SIZE)
  Add a Word to the Line

WHILE (ABS(CURRENT_LINE_SIZE - OPTIMAL_LINE_SIZE) > 
                ABS(CURRENT_LINE_SIZE + NEXT_WORD_SIZE - OPTIMAL_LINE_SIZE))
  Add a Word to the Line

IF (CURRENT_LINE_SIZE > MAXIMUM_LINE_SIZE)
  Remove the last word from the line

RETURN (CURRENT_LINE_SIZE)

Back to Article

Listing Two

WHILE (CURRENT_LINE_SIZE < MINIMAL_LINE_SIZE)
  Add a Word to the Line

NEXT_LINE_RESULT = 0
BEST_LINE_RESULT = 0
BEST_CURRENT_LINE = CURRENT_LINE_SIZE

WHILE (CURRENT_LINE_SIZE <= MAXIMUM_LINE_SIZE)
  NEXT_LINE_RESULT = BACK_PROPAGATE(NEXT_WORD)
  IF (ABS((CURRENT_LINE_SIZE - OPTIMAL_LINE_SIZE) + 
                  (NEXT_LINE_RESULT - OPTIMAL_LINE_SIZE)) < 
                  ABS((BEST_CURRENT_LINE - OPTIMAL_LINE_SIZE) + 
                  (BEST_LINE_RESULT - OPTIMAL_LINE_SIZE))
    BEST_LINE_RESULT = NEXT_LINE_RESULT
    BEST_CURRENT_LINE = CURRENT_LINE_SIZE
  Add a Word to the Line
RETURN(BEST_CURRENT_LINE)

Back to Article

Listing Three

WHILE (CURRENT_LINE_SIZE < MINIMAL_LINE_SIZE)
  Add a Word to the Line

NEXT_LINE_RESULT = 0
BEST_LINE_RESULT = 0
BEST_CURRENT_LINE = CURRENT_LINE_SIZE

WHILE (CURRENT_LINE_SIZE <= MAXIMUM_LINE_SIZE)
  IF LINE_VALUES[NEXT_LINE][NEXT_WORD] is not calculated
    INCREMENT LINE_NUMBER
    NEXT_LINE_RESULT = BACK_PROPAGATE(NEXT_WORD)
    LINE_VALUES[LINE_NUMBER][NEXT_WORD] = NEXT_LINE_RESULT
    DECREMENT LINE_NUMBER
  ELSE
    NEXT_LINE_RESULT = LINE_VALUES[NEXT_LINE][NEXT_WORD]
  IF (ABS((CURRENT_LINE_SIZE - OPTIMAL_LINE_SIZE) + 
                       (NEXT_LINE_RESULT - OPTIMAL_LINE_SIZE)) < 
                       ABS((BEST_CURRENT_LINE - OPTIMAL_LINE_SIZE) + 
                       (BEST_LINE_RESULT - OPTIMAL_LINE_SIZE))
    BEST_LINE_RESULT = NEXT_LINE_RESULT
    BEST_CURRENT_LINE = CURRENT_LINE_SIZE
  Add a Word to the Line

RETURN(BEST_CURRENT_LINE)

Back to Article

May00: Margination and Project Gutenberg

Table 1: DOS and UNIX performance comparison. ¹Pentium Pro200, SCSI, Micron Millennium Pro2. ²Average of 10 runs in DOS/Windows requesting the time before and after the 10 runs. ³Minicomputer. ⁴Average of 10 runs using UNIX "time" utility.

Setting margins for electronic documents

Requirements and Specifications

Forward-Looking Algorithm

Back-Propagation Algorithm

Implementation

Conclusion

Listing One

Listing Two

Listing Three

Table 1: DOS and UNIX performance comparison. 1Pentium Pro200, SCSI, Micron Millennium Pro2. 2Average of 10 runs in DOS/Windows requesting the time before and after the 10 runs. 3Minicomputer. 4Average of 10 runs using UNIX "time" utility.

Table 1: DOS and UNIX performance comparison. ¹Pentium Pro200, SCSI, Micron Millennium Pro2. ²Average of 10 runs in DOS/Windows requesting the time before and after the 10 runs. ³Minicomputer. ⁴Average of 10 runs using UNIX "time" utility.