Reformatting Text Using Pattern Matching


April, 2004: Reformatting Text Using Pattern Matching

Julius is a freelance network consultant in the Philippines. He can be contacted at jcduque@lycos.com.


Whenever I write documents, I spend a considerable amount of time just reformatting lines of text so that they fit within a 70-character-wide column. I do this because I don't like reading long lines that wrap across the screen.

Because adjusting lines of text by hand is time consuming, I decided to put my Perl skills to good use by automating this cumbersome task. Being very good at text manipulation, Perl is the ideal tool for this job. Sure, there's already the fmt command, but it's available only on UNIX/Linux systems. Besides, it can only output left-justified lines.

The full source code of my Perl script, called pretty, can be found in Listing 1. Examples 1 through 6 show all the possible output formats when the input file, gettysburg.txt, which contains Lincoln's Gettysburg Address, is reformatted. Both the source for pretty and gettysburg.txt are available for download at http://www.tpj.com/source/.

Note: pretty expects unformatted paragraphs to be separated by at least two consecutive newlines.

Anatomy of pretty

Lines 9-12 of pretty declare the switches available to the user. The --width switch, which takes a mandatory integer value (=i), is used to control the line width of output lines. The decision to make --width take a mandatory value is just a matter of preference.

The options --help, --left, --right, --centered, --both, and --newline don't take any value; they're activated if explicitly specified on the command line.

--left and --indent are the only options that have default values. Lines are printed left-justified, regardless of whether --left (or its short form -l) is specified, unless overridden by --right, --centered, or --both. The switch --indent (or -i), whose integer argument specifies the amount of indention at the start of a paragraph, defaults to a value of 0 (no indention) if it is not explicitly specified (line 15).

--centered (or -c) places each line of text at an equal distance from the left and right margins. --right (or its short form -r) outputs lines that are right-justified, while --both (or -b) produces lines that are both left-justified and right-justified. To put empty lines between paragraphs, specify --newline (or -n).

To aid the user, the function syntax() is called if --help is specified or if --width is omitted (line 14).
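
The switch declarations described above can be tried in isolation. The following is a minimal sketch, not the code from Listing 1: it assumes Getopt::Long's GetOptionsFromArray so that the switches can be passed in as an ordinary list, and the helper name parse_switches and the %opt hash are hypothetical.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: parse a list of switches declared the way pretty
# declares them, and return the results as a hash.
sub parse_switches {
    my @args = @_;
    my %opt  = (indent => 0);            # --indent defaults to 0
    GetOptionsFromArray(
        \@args,
        'width=i'  => \$opt{width},      # =i : mandatory integer value
        'indent:i' => \$opt{indent},     # :i : optional integer value
        'help'     => \$opt{help},       # the rest take no value
        'left'     => \$opt{left},
        'right'    => \$opt{right},
        'centered' => \$opt{centered},
        'both'     => \$opt{both},
        'newline'  => \$opt{newline},
    ) or die "unrecognized switch\n";
    return %opt;
}

my %opt = parse_switches('--width=64', '--right', '-i', '4');
print "width=$opt{width} indent=$opt{indent} right=$opt{right}\n";
```

The short forms such as -i and -r work here because Getopt::Long accepts unique abbreviations of option names by default.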

Parsing a Paragraph

Normally, Perl reads a chunk of data one line at a time—a "line" being a string of characters terminated by a newline (\n). Since we want to reformat paragraphs that span multiple lines, we need to change the meaning of "line." We tell Perl to parse paragraphs instead of single lines. Now, the meaning of a "line" becomes "a string of characters delimited by two or more consecutive newlines." This is, essentially, the meaning of a "paragraph."

Fortunately, Perl offers a special variable that can be set to change the meaning of a "line." That variable is $/, the input record separator. You may set it to a multicharacter string to match a multicharacter delimiter. Be warned, though, that choosing a new value for $/ can be tricky when reformatting paragraphs. For instance, the obvious delimiter, \n\n, is wrong: if Perl sees three consecutive newlines, it will assume that the third newline belongs to the next paragraph. Meanwhile, if $/ is set to undef, Perl will swallow the whole input, from the first character to the last. The correct approach is to set $/ to an empty string, "". This tells Perl to treat any run of two or more consecutive newlines as a single paragraph terminator; see line 17.
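
To see paragraph mode at work without creating a file, here is a small sketch that reads from an in-memory filehandle. The helper name read_paragraphs is hypothetical and is not part of pretty.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into paragraphs using paragraph mode.
sub read_paragraphs {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    local $/ = "";           # paragraph mode: a run of blank lines ends a record
    my @paragraphs = <$fh>;
    close $fh;
    chomp @paragraphs;       # in paragraph mode, chomp removes all trailing newlines
    return @paragraphs;
}

# Four consecutive newlines still separate just two paragraphs.
my @p = read_paragraphs("one\ntwo\n\n\n\nthree\n");
print scalar(@p), " paragraphs\n";
```

Using local inside the sub keeps the change to $/ from leaking into the rest of the program.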

The Basic Idea

Lines 19-23 illustrate the core idea of pretty. On line 19, the use of the loop while (<>) enables pretty to act as a filter. With this loop, you can use the script like this:

  cat file1.txt file2.txt file3.txt | ./pretty --width=64 --right 

as well as like this:

  ./pretty --width=64 --right file1.txt file2.txt file3.txt 

where file1.txt, file2.txt, and file3.txt are input files.

The paragraph read by Perl on line 19 is implicitly loaded into another special variable, $_. On line 20, the split function implicitly acts on $_ and strips off all whitespace (spaces, tabs, and newlines), storing only the chunks of nonwhitespace characters in an array, @linein. On line 21, the function printpar() takes @linein as its argument and prints out the formatted paragraph. Last, on line 22, a newline is printed to separate two consecutive paragraphs if --newline is specified.
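
The behavior of a bare split is easy to check on its own. In this sketch, the helper name words_of is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: break one paragraph into its words.
sub words_of {
    local $_ = shift;
    return split;    # same as split(' ', $_); leading whitespace is skipped
}

my @words = words_of("  Four score\tand\nseven years ago ");
print join("|", @words), "\n";    # prints Four|score|and|seven|years|ago
```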

Printing a Line

The function printpar() performs the actual work of reformatting paragraphs (starting on line 25). When this function is called on line 21, the argument, @linein, is copied into another variable, @par (line 27). From now on, the paragraph read by Perl will be manipulated through this new variable. The variable $firstline (on lines 28, 31, and 36-39) is relevant only if the option --indent is specified. See the section "Indention" for more details.

The logic of printpar() is as follows:

  1. We make use of a temporary line buffer that is, at most, as long as the line width ($width) specified by the user. We call this buffer $buffer. The unit of length is 1 character.
  2. We maintain a running total of characters read so far. Store this running total in the variable $charcount. Initially, this is set to 0. $charcount must not exceed $width.
  3. Extract an element (a nonwhitespace chunk) from @par one at a time and take note of the element's length. Insert the element into $buffer. Increment $charcount by an amount equal to the extracted element's length. Then, insert a single space into $buffer to serve as a word separator. Increment $charcount by 1 to account for this single space. In doing so, realize that the last inserted element in $buffer is always a single space.
  4. Repeat Step 3 until $charcount either exceeds or equals $width. Note that $charcount may exceed or equal $width by the insertion of an extracted element into $buffer, even before the mandatory word separator is added to $buffer.
  5. If Step 3 terminates because $charcount either exceeds or equals $width, discard the last single space inserted in $buffer. Decrement $charcount by 1. Jump to Step 7 if the new value of $charcount becomes less than or equal to $width; otherwise, proceed to Step 6.
  6. If, after decrementing by 1 (in Step 5), $charcount still exceeds the $width, requeue the last, extracted nonwhitespace element back into @par and update $charcount by subtracting from it the length of the excess element. Note that there is a word separator inserted into $buffer just before the returned element was inserted into $buffer. Delete this extra space also and decrement $charcount by 1.
  7. Transfer the contents of $buffer to another variable, $lineout, and print it.
  8. Repeat Steps 1-7 until there are no more elements in @par.
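
The steps above can be approximated by the sketch below. It is a simplification under stated assumptions rather than the code in Listing 1: it tests whether a word fits before appending it, so nothing ever has to be requeued, and the helper name fill_lines is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: pack words into lines no longer than $width.
# Unlike the eight steps, it checks the fit *before* adding a word.
sub fill_lines {
    my ($width, @words) = @_;
    my (@lines, $buffer);
    for my $word (@words) {
        die "word \"$word\" is longer than $width characters\n"
            if length($word) > $width;
        if (!defined $buffer) {
            $buffer = $word;                  # first word on the line
        } elsif (length($buffer) + 1 + length($word) <= $width) {
            $buffer .= " $word";              # separator plus word still fits
        } else {
            push @lines, $buffer;             # line is full: emit it
            $buffer = $word;                  # start the next line
        }
    }
    push @lines, $buffer if defined $buffer;  # flush the last line
    return @lines;
}

my @words = split ' ', "Four score and seven years ago our fathers brought forth";
print "$_\n" for fill_lines(20, @words);
```

Either way the invariant is the same: a line holds as many whole words as fit within the width, joined by single spaces.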

Lines 30-67, except those lines that contain the variable $firstline, show us how to implement the eight steps above. Lines containing $firstline matter only when --indent is specified. Let's ignore these lines for the moment; we'll get to them when we discuss indention later.

On line 32, we use two temporary variables: $buffer (to hold the line to be printed) and $word (to hold the extracted element from @par). On line 33, $wordlen holds the length of the element just extracted, and $charcount is the number of characters in $buffer so far. On line 34, we use a temporary variable, $linewidth, to hold the value of $width, the line width specified by the user on the command line. We're doing this because we don't want to alter $width directly.

On lines 30 and 41, scalar @par is tested repeatedly. This is necessary because the function scalar returns the number of elements in an array, and we are constantly removing elements from @par. We must make sure, then, that it still has elements to be extracted; we stop extracting if it is already empty, even if $charcount has not yet reached the line width.

When $buffer is ready to be printed out, we transfer its contents to another variable, $lineout, for final printing (lines 60 and 102).

In case the line width is too short to accommodate even a single word, I have provided an easy way out (lines 46-50). The solution is to use a wider line width.

Indention

pretty does not produce indented paragraphs by default. But you can alter this behavior by specifying the --indent (or -i) switch, with an integer as argument. This integer tells pretty to pad that many spaces at the start of a line. Note, however, that indention takes effect only on the first line of each paragraph.

Line 12 declares --indent as a switch that takes an optional integer value (:i). Line 15 initializes --indent with a value of 0, in case it is not specified.

To help pretty distinguish the first line of a paragraph from the rest of the lines, I make use of a flag, the $firstline variable, on line 28. Before pretty starts scanning for input lines (on line 30), $firstline is initially set to 0, corresponding to a false value. When scanning begins, $firstline is incremented by 1 (its value now becomes true), signifying that pretty has just found the first line of the paragraph (line 31).

On lines 36-39, we check the value of $firstline. If it is equal to 1, then we know that we are currently dealing with the first line of the paragraph.

On line 37, $linewidth is decreased by an amount equal to the integer specified as an argument to --indent. On line 38, we print out that many leading spaces to serve as indention. We then fill up the remaining portion of the first line with texts, whose accumulated lengths must be less than or equal to the value of the updated $linewidth.

On the next iteration, $firstline is no longer equal to 1. Hence, the value of $linewidth on line 34 remains unchanged. This also means that initial padding of spaces will no longer occur. Examples 5 and 6 present sample outputs using indention.
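
The effect of the first-line flag can be sketched as follows. This is an illustration under assumptions, not the code from Listing 1: the helper name fill_indented is hypothetical, and it returns the lines instead of printing them.

```perl
use strict;
use warnings;

# Hypothetical helper: fill lines to $width, indenting only the first one.
sub fill_indented {
    my ($width, $indent, @words) = @_;
    my @lines;
    my $firstline = 1;
    while (@words) {
        # The first line has less room because of the leading spaces.
        my $room   = $firstline ? $width - $indent : $width;
        my $buffer = shift @words;
        die "\"$buffer\" does not fit in $room characters\n"
            if length($buffer) > $room;
        while (@words and length($buffer) + 1 + length($words[0]) <= $room) {
            $buffer .= " " . shift @words;
        }
        push @lines, ($firstline ? " " x $indent : "") . $buffer;
        $firstline = 0;      # every later line uses the full width
    }
    return @lines;
}

print "$_\n" for fill_indented(12, 4, qw(aa bb cc dd ee ff));
```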

Left-Justified Output

Undoubtedly, this is the simplest paragraph format to implement: just print $lineout as is (line 102).

Right-Justified Output

For a right-justified format, we must know how many spaces to pad the left portion of $lineout. This amount of space is simply the difference between the specified line width and the length of $lineout. On line 69, we use the variable $spaces_to_fill to hold this value. On line 75, we print this many leading spaces before printing $lineout.

Centered Output

The centered format is very similar to the right-justified format, only this time, we divide the value of $spaces_to_fill by 2. If the result has a fractional part, take only the integer part (line 72), and print this many leading spaces (line 73).
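
Both padding rules, right-justified and centered, fit in a few lines each. In this sketch the helper names right_justify and center are hypothetical, and the line is assumed to have been filled already:

```perl
use strict;
use warnings;

# Hypothetical helpers: pad an already-filled line to a given width.
sub right_justify {
    my ($line, $width) = @_;
    my $spaces_to_fill = $width - length $line;
    $spaces_to_fill = 0 if $spaces_to_fill < 0;    # line already too long
    return " " x $spaces_to_fill . $line;
}

sub center {
    my ($line, $width) = @_;
    my $spaces_to_fill = $width - length $line;
    $spaces_to_fill = 0 if $spaces_to_fill < 0;
    return " " x int($spaces_to_fill / 2) . $line; # drop any fractional part
}

print "[", right_justify("four score", 16), "]\n";
print "[", center("four score", 16), "]\n";
```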

Left- and Right-Justified Output

This is the hardest part to implement. The trick is to take $lineout and modify it by adding extra spaces between nonwhitespaces. But how do we distribute the spaces evenly within the line? I'll illustrate my own solution using examples.

Suppose that we want the line width to be 39 characters long, and the string to be printed is:

fathers•brought•forth•on•this

where • represents a single space.

The line above is 29 characters long (including 4 embedded spaces), 10 spaces short of being both left- and right-justified. My solution consists of two steps:

  1. Scanning from the left, look for the first occurrence of a single space, and replace it with a double space.

fathers••brought•forth•on•this

  2. Reverse the string.

siht•no•htrof•thguorb••srehtaf

Consider this as one round.

Since the new string is not yet 39 characters long, do another round:

  1. Scanning from the left, look for the first occurrence of a single space, and replace it with a double space.

siht••no•htrof•thguorb••srehtaf

  2. Reverse the string.

fathers••brought•forth•on••this

Notice that we have now evenly distributed two "filler" single spaces, one on the left and the other on the right. We repeat the two steps over and over until the string becomes 39 characters long.

We are now ready to formulate our strategy:

  1. Scanning from the left, look for the first occurrence of a single space and replace it with a double space. Reverse the string.
  2. Repeat Step 1 until the required line width is reached.

If we run out of single spaces to replace, rephrase Step 1 as "Scanning from the left, look for the first occurrence of a double space, and replace it with a triple space. Reverse the string." Repeat the two steps.

If we run out of double spaces to replace, rephrase Step 1 as "Scanning from the left, look for the first occurrence of a triple space, and replace it with a quadruple space. Reverse the string." Repeat Steps 1 and 2, and so on. Get the picture?
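
The strategy can be sketched as a standalone function. This is an illustration under assumptions, not the code from Listing 1: the helper name justify is hypothetical, it uses a lookahead where the listing captures the following word, and it tracks the orientation of the string with a flag.

```perl
use strict;
use warnings;

# Hypothetical helper: stretch $line to exactly $width characters by
# widening gaps alternately from the left and from the right.
sub justify {
    my ($line, $width) = @_;
    return $line unless $line =~ /\S +\S/;  # a single word cannot be stretched
    my $reps     = 1;                       # gap size currently searched for
    my $reversed = 0;                       # is $line currently reversed?
    while (length($line) < $width) {
        # Widen the first gap of exactly $reps spaces by one space.
        if ($line =~ s/(\S {$reps})(?=\S)/$1 /) {
            $line     = reverse $line;
            $reversed = !$reversed;
        } else {
            $reps++;                        # no gaps of this size are left
        }
    }
    return $reversed ? scalar reverse($line) : $line;
}

my $out = justify("fathers brought forth on this", 39);
print "[$out] is ", length($out), " characters long\n";
```

Because each round adds exactly one space, the loop stops as soon as the string reaches the requested width.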

Lines 83-92 show us how to implement the strategy. The actual work of replacing spaces is found on lines 85-86. On line 81, we make use of a counter, $reps, that tracks the kind of spaces (single, double, triple, and so on) to look for. Initially, $reps is set to 1, meaning we begin our search for single spaces.

If the line width is not yet reached (line 83), replace the first single space found with a double space (lines 85-86). If you fail to see the single space in there, here's a closer view of lines 85-86 (• stands for a single space):

if ($tempbuf =~ /(\S+•{$reps})(\S+)/) { 
    $tempbuf =~ s/(\S+•{$reps})(\S+)/$1•$2/; 

When $reps is 1, lines 85-86 become:

if ($tempbuf =~ /(\S+•{1})(\S+)/) { 
    $tempbuf =~ s/(\S+•{1})(\S+)/$1•$2/; 

When searching for double spaces ($reps = 2), the two lines are equivalent to:

if ($tempbuf =~ /(\S+•{2})(\S+)/) { 
    $tempbuf =~ s/(\S+•{2})(\S+)/$1•$2/; 

or, to put it in another way:

if ($tempbuf =~ /(\S+••)(\S+)/) { 
    $tempbuf =~ s/(\S+••)(\S+)/$1•$2/; 

When given the pattern \S+•{n}, Perl will look for the presence of exactly n consecutive spaces preceded by one or more nonwhitespaces. And if the pattern is enclosed in parentheses, Perl will remember the substring that matches this pattern. The remembered substrings can be accessed via the special variables, $1, $2, $3, etc.

As an example, testing the string fathers•brought•forth for the pattern (\S+•{1})(\S+), we find that the substring fathers• matches the first subpattern, (\S+•{1}); thus, fathers• gets assigned to $1. Likewise, the substring brought matches the second subpattern, (\S+), and gets assigned to $2. The replacement, $1•$2, puts one extra space between the two, and so the original string, fathers•brought•forth, now becomes fathers••brought•forth.
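
The same match can be reproduced at the command line with ordinary spaces in place of the • marker. In this sketch the helper name widen_first_gap is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: show the captures, then widen the first gap of
# exactly $reps spaces by one space.
sub widen_first_gap {
    my ($text, $reps) = @_;
    if ($text =~ /(\S+ {$reps})(\S+)/) {
        print "\$1 = [$1], \$2 = [$2]\n";
    }
    $text =~ s/(\S+ {$reps})(\S+)/$1 $2/;
    return $text;
}

# Prints the captures, then [fathers  brought forth]
print "[", widen_first_gap("fathers brought forth", 1), "]\n";
```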

What makes this implementation so challenging is that it is not readily apparent that we need to include the nonwhitespaces in the search. If we naively search for spaces alone, the pattern always matches the first space in the string, even one inside a gap that has already been widened, so the gaps in the middle of the line are never touched. The result is that only the gaps nearest the beginning and end of the line grow. It's important to realize that the nonwhitespaces serve as reference points in the substitution.
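
One reading of the naive approach, matching a space with no surrounding words as reference points, can be tried directly. The helper name naive_justify is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: the naive strategy, which matches spaces only.
sub naive_justify {
    my ($line, $width) = @_;
    my $reversed = 0;
    while (length($line) < $width and $line =~ s/ /  /) {
        $line     = reverse $line;
        $reversed = !$reversed;
    }
    return $reversed ? scalar reverse($line) : $line;
}

# The outer gaps grow to three spaces; the middle gap stays at one.
print "[", naive_justify("a b c d", 11), "]\n";   # prints [a   b c   d]
```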

Let's now turn our attention to line 84. I've only included this line to make pretty as idiot-proof as possible. Line 84 handles a special case. Here's the scenario: Suppose we want a line to be 12 characters wide, and we have the string, democratic•institutions. We can see immediately that only the word, democratic, can fit on the line buffer. With only one word, it's impossible to make the line both left- and right-justified.

Line 85 will repeatedly search for at least one whitespace between two nonwhitespaces. Finding only one nonwhitespace on the line, line 85 will fail, and two things will happen:

  1. $tempbuf will never be updated (line 86). As a consequence, there will be an infinite loop on line 83.
  2. Since the if condition on line 85 fails, the alternative else condition increments $reps continuously (line 90).

Soon enough, Perl will complain and will generate error messages similar to the following:

 Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE 
in m/(\S+ { <-- HERE 32767})(\S+)/ (#1) 
   (F) There is currently a limit to the size of the min and max 
values of the {min,max} construct. The <-- HERE shows in the 
regular expression about where the problem was discovered. See 
perlre. 

The infinite loop on line 83 was averted because $reps exceeded the maximum value of the {min,max} regular expression construct.

The fix, therefore, is to check if $tempbuf has any embedded spaces. If there is a space in it, then we are sure that there are at least two words in $tempbuf. Break out of the while loop as soon as possible if no space is found—this means that $tempbuf contains only one word. In doing so, we print the one-word line as is, making that line left-justified.

Finally, recall our previous example:

fathers•brought•forth•on•this

After the first substitution, the string is reversed, and becomes

siht•no•htrof•thguorb••srehtaf

We see here that, for an odd number of substitutions, the last call to reverse is unnecessary. So, we employ another variable, $replacements_made, to keep track of how many spaces have been inserted. $replacements_made, initially set to 0 (line 78), is incremented every time a substitution is made (line 87). So, on lines 95-98, we check $replacements_made to know whether to undo the last reverse. If $replacements_made is odd, we reverse the string one more time (line 98); otherwise, we leave the string alone (line 96).

Sample Outputs

Examples 1-6 show some example output using pretty. Any of the lines at the top of these examples will produce the lines below them. Line width is set to 64 characters. Indentions are set to four single spaces. The input file is gettysburg.txt (available with the source code for this article at http://www.tpj.com/source/).

TPJ



Listing 1

1  #!/usr/local/bin/perl 
2  use diagnostics; 
3  use strict; 
4  use warnings; 
5  use Getopt::Long; 
6   
7  my ($width, $help, $left, $centered, $right, $both); 
8  my ($indent, $newline); 
9  GetOptions("width=i" => \$width, "help" => \$help, 
10    "left" => \$left, "centered" => \$centered, 
11    "right" => \$right, "both" => \$both, 
12    "indent:i" => \$indent, "newline" => \$newline); 
13   
14  syntax() if ($help or !$width); 
15  $indent = 0 if (!$indent); 
16   
17  local $/ = ""; 
18   
19  while (<>) { 
20    my @linein = split; 
21    printpar(@linein); 
22    print "\n" if ($newline); 
23  } 
24   
25  sub printpar 
26  { 
27    my (@par) = @_; 
28    my $firstline = 0; 
29   
30    while (scalar @par) { 
31      $firstline++; 
32      my ($buffer, $word); 
33      my ($charcount, $wordlen) = (0, 0); 
34      my $linewidth = $width; 
35   
36      if ($firstline == 1) { 
37        $linewidth -= $indent; 
38        print " " x $indent; 
39      } 
40   
41      while (($charcount < $linewidth) and (scalar @par)) { 
42        $word = shift @par; 
43        $buffer .= $word; 
44        $wordlen = length($word); 
45   
46        if ($wordlen > $linewidth) { 
47          print "\nERROR: The word \"$word\""; 
48          print " ($wordlen chars) cannot be accommodated\n"; 
49          exit 1; 
50        } 
51   
52        $charcount += $wordlen; 
53        $buffer .= " "; 
54        $charcount++; 
55      } 
56   
57      chop $buffer; 
58      $charcount--; 
59   
60      my $lineout = $buffer; 
61   
62      if ($charcount > $linewidth) { 
63        unshift(@par, $word); 
64        $charcount -= $wordlen; 
65        $charcount--; 
66        $lineout = substr $buffer, 0, $charcount; 
67      } 
68   
69      my $spaces_to_fill = $linewidth - $charcount; 
70   
71      if ($centered) { 
72        my $leftfill = int($spaces_to_fill/2); 
73        print " " x $leftfill; 
74      } elsif ($right) { 
75        print " " x $spaces_to_fill; 
76      } elsif ($both) { 
77        my $tempbuf = $lineout; 
78        my $replacements_made = 0; 
79   
80        if (scalar @par) { 
81          my $reps = 1; 
82   
83          while (length($tempbuf) < $linewidth) { 
84            last if ($tempbuf !~ /\s/); 
85            if ($tempbuf =~ /(\S+ {$reps})(\S+)/) { 
86              $tempbuf =~ s/(\S+ {$reps})(\S+)/$1 $2/; 
87              $replacements_made++; 
88              $tempbuf = reverse $tempbuf; 
89            } else { 
90              $reps++; 
91            } 
92          }  # while 
93        } 
94   
95        if ($replacements_made % 2 == 0) { 
96          $lineout = $tempbuf; 
97        } else { 
98          $lineout = reverse $tempbuf; 
99        } 
100      } 
101   
102      print "$lineout\n"; 
103    } 
104  } 
105   
106  sub syntax 
107  { 
108    print "Options:\n"; 
109    print "--width=n (or -w=n or -w n)   Line width is n chars "; 
110    print "long\n"; 
111    print "--left (or -l)                Left-justified"; 
112    print " (default)\n"; 
113    print "--right (or -r)               Right-justified\n"; 
114    print "--centered (or -c)            Centered\n"; 
115    print "--both (or -b)                Both left- and\n"; 
116    print "                                right-justified\n"; 
117    print "--indent=n (or -i=n or -i n)  Leave n spaces for "; 
118    print "initial\n"; 
119    print "                                indention (defaults "; 
120    print "to 0)\n"; 
121    print "--newline (or -n)             Output an empty line \n"; 
122    print "                                between "; 
123    print "paragraphs\n"; 
124    exit 0; 
125  }