Web Development

All the news that's fit to print — in Perl!

By Ray Snow, June 01, 2001

How would you go about organizing thousands of daily news stories from hundreds of Internet-based sources? Ray turned to Perl's pattern-matching capabilities to cull, tag, sort, and present all the news that's fit.

Jun01: An Information Assembly Line in Perl

Ray is principal engineer and manager at NewsEdge. He can be contacted at ray [email protected].

For more than 10 years, NewsEdge (http://www.newsedge.com/) has been supplying organizations and web sites with sharply focused news. Drawn from 2000 sources and organized into 2000 standard topics, NewsEdge Review Topic stories are used by millions of people at over 1450 organizations. Web administrators see our news as a direct way to increase traffic and encourage repeated visits to intranet portals and public Internet sites.

Every day, stories are culled, tagged, and sorted by software, then presented to a team of 40 editorial reviewers. Each reviewer is an expert in one or more fields and typically scans thousands of stories per day, selecting, ranking, and organizing them into appropriate categories. An automated system then marks up the stories in SGML, HTML, or XML, packages them into feeds, and distributes them via the Internet. Customers receive stories only from topics and sources they choose.

In 1998, however, the packaging system (named "MakeFeed") began running out of steam. Built incrementally as the business grew, the system — made up of hundreds of C, C++, and UNIX shell files — could barely keep up with the demands of 250 customers. Consequently, we launched a software-development project to replace MakeFeed and we chose Perl as the implementation language. To illustrate how we use Perl in MakeFeed, I'll examine five specific problems. I chose these to show you how the programs communicate and to introduce you to Perl pattern matching.

The Assembly Line

Our first problem was to reduce the large number of programs. Although news feeds differ in their final appearances, their constructions have much in common: Topic-story pairings must be identified, story text must be marked up, and feed directories must be created. These commonalities led to the concept of an assembly line of Perl processes linked together. Unlike a real assembly line, MakeFeed passes only the feed names from stage to stage, not the feeds themselves. But the metaphor of an assembly line proved useful in explaining the system to its users, so the term stuck.

The line consists of seven stages, each a Perl program, connected together in a UNIX pipeline. The Standard Output of one stage is connected to the Standard Input of another. Feeds are assigned unique names. The first stage, FeedPump, places the name of each feed onto the assembly line by pumping it out of its Standard Output. Each remaining stage then performs one major processing step and forwards the feed names to the next. The assembly-line stages are:

FeedPump, which orders the feeds and schedules their production.
BuildTrees, which builds topic/story hierarchies.
MarkUp, which marks up stories in SGML, HTML, or XML.
Index, which organizes story web pages by topic.
CheckFeed, which checks feeds and stories for correctness.
Aggregate, which aggregates web pages and ancillary files via ZIP or tar.
SendFeed, which transfers finished feeds to Internet-accessible FTP pickup sites.

Listing One shows main, the Bourne shell script that creates and starts the assembly line. Lines 5 through 11 run the seven stages. Each stage accepts the assembly-line name, main, as a parameter. FeedPump also accepts a second parameter, in this case All, which tells it which feeds to pump. The 2>> on each line redirects and appends Standard Error to the corresponding log file; the |\ at the end of every line but the last is a UNIX pipe symbol followed by the UNIX line-continuation backslash (\) character. Together these convert lines 5 through 11 into a single UNIX command. Including main itself, this results in eight separate UNIX processes.
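To make the shape of a single stage concrete, here is a minimal sketch of what one link in such a pipeline might look like. It is not the MakeFeed source: the names run_stage and process_feed are hypothetical, and in-memory filehandles stand in for the Standard Input, Standard Output, and Standard Error that the shell script would connect.

```perl
use strict;
use warnings;

# Hypothetical per-feed work; a real stage would build trees, mark up, and so on.
sub process_feed {
    my ($line_name, $feed) = @_;
    return "$line_name: processed $feed";
}

# Read feed names from $in, perform one step, log to $log,
# and forward each feed name to the next stage through $out.
sub run_stage {
    my ($line_name, $in, $out, $log) = @_;
    my $count = 0;
    while (my $feed = <$in>) {
        chomp $feed;
        next if $feed eq '';
        print {$log} process_feed($line_name, $feed), "\n";
        print {$out} "$feed\n";
        $count++;
    }
    return $count;
}

# Demonstration: in-memory filehandles stand in for the pipes and the log file.
my $input = "feedA\nfeedB\n";
open my $in,  '<', \$input        or die $!;
open my $out, '>', \my $forwarded or die $!;
open my $log, '>', \my $logged    or die $!;
my $n = run_stage('main', $in, $out, $log);
close $_ for $in, $out, $log;
print "forwarded $n feed names:\n$forwarded";
```

In a real stage the last three arguments would simply be \*STDIN, \*STDOUT, and \*STDERR, which is what lets the shell do all of the plumbing.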

Bottlenecks and Substages

Having created a pipeline, we then encountered bottlenecks. (This is the second problem.) Some feeds are bigger than others and monopolize a stage while their successors wait. To permit concurrency, we create child UNIX processes, represented in Figures 1 and 2 by the smaller circles. Parent stages distribute feed names to their children by creating pairs of UNIX pipes. For each child, one output pipe in the parent is connected to the child's Standard Input and a corresponding input pipe to the child's Standard Output. This lets the child think it is part of the main data flow, when in reality its parent is selectively diverting the flow to it. This magic is performed via the UNIX select() system service.
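The pipe-pair arrangement can be sketched with two standard Perl modules, IPC::Open2 and IO::Select (a wrapper around select()). This is a simplified illustration under stated assumptions, not the MakeFeed code: spawn_child and dispatch are hypothetical names, the child is a stand-in that merely tags and echoes each feed name, and the parent hands out all names before collecting results, which is only safe for small amounts of data.

```perl
use strict;
use warnings;
use IPC::Open2;
use IO::Select;

# Start one child whose Standard Input and Standard Output are pipes held
# by the parent. The child is a stand-in: it tags each feed name it reads.
sub spawn_child {
    my ($tag) = @_;
    my $code = '$| = 1; my $t = shift; while (<STDIN>) { chomp; print "$t:$_\n" }';
    my $pid  = open2(my $from_child, my $to_child, $^X, '-e', $code, $tag);
    return { pid => $pid, from => $from_child, to => $to_child, buf => '' };
}

sub dispatch {
    my ($feeds, $n_children) = @_;
    my @kids = map { spawn_child("child$_") } 1 .. $n_children;

    # Hand out feed names round-robin, then close the pipes so children see EOF.
    my $i = 0;
    for my $feed (@$feeds) {
        my $kid = $kids[ $i++ % @kids ];
        print { $kid->{to} } "$feed\n";
    }
    close $_->{to} for @kids;

    # select(), via IO::Select, reports whichever child has output ready.
    my %by_fh = map { ( $_->{from} => $_ ) } @kids;
    my $sel   = IO::Select->new( map { $_->{from} } @kids );
    my @done;
    while ( $sel->count ) {
        for my $fh ( $sel->can_read ) {
            my $kid = $by_fh{$fh};
            my $n = sysread( $fh, $kid->{buf}, 4096, length $kid->{buf} );
            if ( !$n ) {    # 0 means EOF, undef means error: child is finished
                $sel->remove($fh);
                close $fh;
                next;
            }
            # Peel complete lines off the front of this child's buffer.
            push @done, $1 while $kid->{buf} =~ s/^(.*)\n//;
        }
    }
    waitpid $_->{pid}, 0 for @kids;
    return @done;
}

my @finished = dispatch( [qw(big-feed small-1 small-2 small-3)], 2 );
print "$_\n" for sort @finished;
```

Reading with sysread into a per-child buffer, rather than with the buffered <$fh> operator, keeps Perl's own input buffering from hiding data that select() cannot see.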

What a parent cannot do to a feed in a generic way is done instead by one of its children. For instance the siblings, MarkUpHtml and MarkUpXml, tag their feeds in the appropriate mark up dialects. On the other hand, identical twins such as SendFeed-1 and SendFeed-2, two instances of the same program, simply divide up their labor (see Figure 1). In all cases when a big feed is assigned to one child, its siblings are free to handle smaller feeds. As a result small feeds can pass bigger ones and don't have to wait.

Substages provide another benefit — modularity. Originally, MakeFeed handled only SGML and HTML. To add XML we simply added XML-specific substages to BuildTrees, MarkUp, and Index; see Figure 1. This year we plan to add WML (Wireless Markup Language) and expect it will be just as straightforward.

A Sample Input Story

In addition to its UNIX I/O behavior, we chose Perl for its string handling. Here I will again focus on MarkUp and its children, and on two problem areas: extracting metadata and cleaning up raw story text. But first, I present an example of a story file as it is presented to MakeFeed.

Figure 3 is a story doctored to contain some instructive problems. The extended ASCII character set, ISO Latin-1, is used. Each line of text ends in a carriage-return/line-feed (cr/lf). I added the line numbers and their following periods (.) and spaces. Line 6 occupies four physical lines only because the text has wrapped around the right margin of the paragraph. Editorial metadata has been added by other NewsEdge software upstream of MakeFeed. Lines 1 through 5 and 10 through 12 contain this metadata, which is delimited by a period (.) at the front and a cr/lf at the end.

Invisible Characters

Figure 3 contains many characters you cannot see, such as the cr/lf pair that terminates each line. But there are others. For instance, the story headline in line 3 begins not with the word "Wallenberg," but with three invisible ASCII BEL characters. This is a holdover from the time when news stories were typically sent to Teletype machines. Ringing the bell on this device served to call attention to a hot story. There is a series of 32 invisible characters at the beginning of the ASCII set, traditionally called "control characters." The problem of spotting and eliminating such characters is compounded by the fact that, depending on the software being used, some of them may be acceptable: Perl, for example, considers space, tab, line-feed, carriage-return, and form-feed legitimate and refers to them collectively as "whitespace."
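When hunting for such characters it helps to make them visible first. The following sketch, with the hypothetical helper name show_controls, rewrites every control character other than tab, line-feed, and carriage-return as a printable \xNN escape so that it shows up in a log or on a terminal.

```perl
use strict;
use warnings;

# Return a copy of $text in which each ASCII control character (0 through 31),
# except tab, line-feed, and carriage-return, appears as a visible \xNN escape.
sub show_controls {
    my ($text) = @_;
    $text =~ s/([\x00-\x08\x0b\x0c\x0e-\x1f])/sprintf('\\x%02x', ord $1)/ge;
    return $text;
}

my $headline = "\a\a\aWallenberg monument inaugurated outside UN";
print show_controls($headline), "\n";
```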

This kind of annoyance, while it may seem minor to programmers outside the news industry, occupies much of our time. Today's electronic news systems were forged incrementally in earlier days, giving rise to such character anomalies. There are many others.

Special Characters in SGML

Technically, HTML and XML are dialects of SGML, and as such inherit four special metacharacters: less than (<), greater than (>), ampersand (&), and quotation mark ("). Less than and greater than are used to delimit elements, sometimes called "tags." Quotation marks are used to delimit element attribute values, such as the font size in a <FONT> tag. Ampersand together with semicolon (;) is used to delimit character entities, which are used in place of the four special characters. These are respectively "&lt;", "&gt;", "&amp;", and "&quot;", and are known as "the SGML Standard Entities." Except when quoting verbatim text as in an XML CDATA section, you must be careful to substitute entity for character in every case. This is called "escaping the characters." Managing this is surprisingly difficult. For instance, it will not do to simply replace all ampersands with "&amp;": The story, having previously been handled by other editorial systems, may already contain escaped characters. A blind ampersand replacement could then result in incorrect double entities, like "&amp;amp;".
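The double-entity problem is easy to reproduce. In this small illustration the story fragment is made up; it simply contains one ampersand that an upstream system has already escaped and one that it has not.

```perl
use strict;
use warnings;

# A made-up fragment in which an upstream system already escaped one ampersand.
my $story = 'R&amp;D budgets & "hot" stocks';

# Blind replacement: every ampersand is escaped, including the one that
# already begins an entity.
(my $blind = $story) =~ s/&/&amp;/g;

print "$blind\n";
```

The bare ampersand comes out correctly, but the existing "&amp;" has been turned into "&amp;amp;", which a browser would display as the literal text "&amp;".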

It's worse than that. The four Standard Entities are not alone. Any contiguous series of characters beginning with ampersand and ending with semicolon is potentially a legal entity. For example, in HTML "&nbsp;" is an entity for non-breaking space, which unlike its low-numbered ASCII cousin, space, is not eliminated or merged with other whitespace by a browser. That's not all; nonbreaking space, and all other legitimate single characters, can be represented in at least two other ways as numeric character entities, like "&#160;" in decimal, and "&#xA0;" in hexadecimal notation. (All three representations appear in line 8 of Figure 3.)

In addition, depending on the precise character set used to mark up the story, it may be necessary to replace all high-ASCII characters (Latin-1 characters numbered above 127 decimal) with corresponding entities. This may be the only way, short of a CDATA section, to display the text legibly in many browsers. For instance, we once had an XML parser crash because of the N-Tilde character (numbered 241) in the word "jalapeño."
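One way to perform that replacement is sketched below. The helper name encode_high_chars is hypothetical, and the sketch assumes the text is held as single-byte ISO Latin-1 characters, as in the sample story.

```perl
use strict;
use warnings;

# Replace every character numbered 128 through 255 with a decimal numeric
# character entity. Assumes $text holds single-byte ISO Latin-1 characters.
sub encode_high_chars {
    my ($text) = @_;
    $text =~ s/([\x80-\xff])/'&#' . ord($1) . ';'/ge;
    return $text;
}

print encode_high_chars("jalape\xf1o"), "\n";
```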

Perl Special Characters

Like SGML, Perl also has special characters, but they are not unfamiliar to UNIX programmers. The pound sign (#) is used to delimit Perl comments. Period (.), question mark (?), plus sign (+), braces ({}), and brackets ([]) are seven of many metacharacters in Perl regular expressions. Dollar sign ($) is used to delimit the name of any scalar quantity, including a large set of Perl global variables. Of these, one in particular, dollar underscore ($_), which contains the default input and pattern-searching character string, is convenient because its implicit use shortens many Perl commands. Four examples of this will be shown. Braces are used in some places where a scalar name may be difficult to identify. So "$num" and "${num}" are equivalent.

In Perl regular expressions, backslash-s (\s) represents a single whitespace character. Multiple consecutive whitespace characters are signified by backslash-s-plus (\s+). Characters may be specified by their octal or hexadecimal ASCII numbers, as in \007 and \x07 for BEL. A set of alternate choices for matching is enclosed in brackets. So [&<>"] matches the four SGML special characters. Period (.) represents one arbitrary character. Period-plus (.+) means one or more arbitrary characters. Finally, parentheses, while useful for grouping, in the correct context also cause their contents to be remembered by Perl. So the expression, "(.+)", matches one or more characters and stores them in the automatic Perl variable, Dollar-one ("$1"). A successful match of "(.+) announces (.+)," as in "Peter announces Paul," causes the variables $1 and $2 to contain "Peter" and "Paul," respectively.
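These building blocks can be tried out in a few lines. The sample strings other than "Peter announces Paul" are made up for the illustration.

```perl
use strict;
use warnings;

# Parentheses capture: $1 and $2 hold the two remembered substrings.
my ($who, $whom);
if ( "Peter announces Paul" =~ /(.+) announces (.+)/ ) {
    ($who, $whom) = ($1, $2);
}

# A bracketed set matches any one of its members; here, the four SGML specials.
my $specials = () = 'a < b & "c" > d' =~ /[&<>"]/g;

# \x07 is BEL; \s+ is a run of one or more whitespace characters.
my $bells = () = "\a\a\aWallenberg" =~ /\x07/g;
my @words = split /\s+/, "culled,  tagged,\tsorted";

print "$who / $whom / $specials specials / $bells bells / ",
      scalar(@words), " words\n";
```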

A useful technique in Perl is to interpolate the value of a scalar variable into a character string. For example, if $num contains 16, then the command print "Replaced $num Ampersands.\n" causes the text "Replaced 16 Ampersands." to be sent to Standard Output, followed by a line-feed. As in many scripting languages, Perl recognizes backslash-n (\n) as the new-line character, typically line-feed under UNIX. Here backslash (\) is used to escape the n so Perl won't interpret it as just another n in the text. Multiple escaping backslashes may be needed in expressions destined to be parsed by Perl more than once. For example, if we are to search for the string ".begin" stored in a variable, $tag, to be interpolated into a match command, the leading period must be escaped twice. Therefore, we initialize $tag to "\\.begin".
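The effect of the double backslash can be checked directly. In this sketch the variable $unsafe and the test strings are made up to show what goes wrong when the period is left unescaped.

```perl
use strict;
use warnings;

my $num     = 16;
my $message = "Replaced $num Ampersands.\n";    # $num is interpolated

# The double-quoted string "\\.begin" holds the seven characters \.begin,
# so the period is still escaped when the regular expression is compiled.
my $tag    = "\\.begin";
my $unsafe = ".begin";       # here the period matches any character

my $escaped_hits_tag   = ".begin (text)" =~ /$tag/    ? 1 : 0;
my $escaped_hits_other = "xbegin (text)" =~ /$tag/    ? 1 : 0;
my $unsafe_hits_other  = "xbegin (text)" =~ /$unsafe/ ? 1 : 0;

print $message;
print "escaped: $escaped_hits_tag $escaped_hits_other; ",
      "unescaped: $unsafe_hits_other\n";
```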

If all this character talk seems idle, take a good look again at Figure 3, and start thinking about parsing it in Perl. That's where we're going next.

Extracting the Headline

I'm now ready to examine the third problem: How to extract the headline from the story in Figure 3. If its entire text is loaded into $_, then Listing Two does the job of extracting the headline. Lines 1 and 2 create and initialize two scalar variables to the text, ".begin (header)" and ".begin (text)". Line 4 treats the entire story as a single string and searches for a substring containing the two tags. Between the tags it hopes to find whitespace, followed by an arbitrary amount of text, followed by more whitespace. Table 1 lists the components of the match command.

The Perl match command returns either a True or False value. The exclamation-point (!) in line 4 causes the if test to succeed if the match command fails, so if there is no match, the error message on line 6 is output to Standard Error. On the other hand, if the match succeeds, all of the text between the two tags, excluding any leading or trailing whitespace, is captured in the temporary local variable, $1, and then saved by assigning it in line 10 to the global variable named $headline. If the story is the one shown in Figure 3, then $headline contains the three invisible BEL characters followed by "Wallenberg monument inaugurated outside UN."
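The same extraction can be written as a small function. This sketch swaps in a different technique for the escaping: instead of doubled backslashes it uses Perl's \Q...\E quoting, which escapes the period and parentheses at the point of interpolation. The name extract_headline and the story fragment are assumptions made for the illustration; the real layout is the one in Figure 3.

```perl
use strict;
use warnings;

# Return the text between the two tags, or undef if the tags are not found.
sub extract_headline {
    my ($story) = @_;
    my ($open, $close) = ('.begin (header)', '.begin (text)');
    # \Q...\E quotes the period and parentheses inside the interpolated tags.
    return $story =~ /\Q$open\E\s+(.+?)\s+\Q$close\E/ ? $1 : undef;
}

# Made-up story fragment with cr/lf line endings and three leading BELs.
my $story = ".begin (header)\r\n"
          . "\a\a\aWallenberg monument inaugurated outside UN\r\n"
          . ".begin (text)\r\nBody of the story.\r\n";

my $headline = extract_headline($story);
print defined $headline ? "Found a headline.\n" : "Can't find headline.\n";
```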

Deleting Low-Numbered ASCII Characters

The fourth problem is how to get rid of the BEL characters. I will attack this more generally. Again, if the Perl variable, $_, is used to hold the entire story text, then Listing Three deletes ASCII control characters that may be embedded invisibly. Line 1 contains a Perl substitute command, which returns the number of substitutions that were successfully performed. Table 2 lists the components of the substitute command.

By replacing the ASCII characters whose decimal values are 0 through 7, 11, 12, and 14 through 31 with nothing, the substitute command causes their deletion. Only backspace, tab, line-feed, and carriage return are allowed to remain.
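For a pure character-class deletion like this one, Perl's transliteration operator is an alternative to the substitute command. The sketch below is that alternative, not the listing's method: tr with the d modifier deletes the listed characters and returns how many it found. The name strip_controls and the sample text are hypothetical.

```perl
use strict;
use warnings;

# Delete control characters from a copy of $text; return the copy and the count.
# Backspace, tab, line-feed, and carriage-return (8, 9, 10, 13) are kept.
sub strip_controls {
    my ($text) = @_;
    my $deleted = ( $text =~ tr/\x00-\x07\x0b\x0c\x0e-\x1f//d );
    return ($text, $deleted);
}

my ($clean, $deleted) = strip_controls("\a\a\aWallenberg\tmonument\r\n");
print STDERR "Deleted $deleted ASCII control characters.\n" if $deleted > 0;
```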

Replacing Special HTML Characters

And now the final problem — handling metacharacters. The sample story in Figure 3 contains special characters in lines 6 and 8. Line 6 contains a single ampersand (&), while line 8 contains two less-than signs, two greater-than signs, four HTML character entities, "&#160;", "&#xa0;", and two instances of "&nbsp;", each of which represents a nonbreaking space. Each of the four contains a leading ampersand that should not be escaped.

Once more, if the Perl variable, $_, is used, then Listing Four escapes only the single ampersand (in its line 11) and the less-thans and greater-thans (in its line 16). Lines 1 through 4 create a Perl hash (an associative array) named "tbl." A hash allows the fast lookup of values (on the right side of the equal signs) via their corresponding keys (on the left side of the equal signs, enclosed in braces). In this case, the keys are the SGML special characters, and their corresponding values are the appropriate Standard Entities.

In lines 6 through 8, I create regular expressions to match the insides of decimal, hexadecimal, and general character entities, respectively. Backslash d (\d) represents one decimal digit. Braces ({ and }) indicate minimal and maximal repetitions. So "\d{1,3}" matches 1, 2, or 3 decimal digits. Line 9 brings the expressions in lines 6 through 8 together creating a Perl regular expression that matches any one of them. (The pipe symbol (|) indicates alternation. So "a|b|c" indicates one and only one of the choices — a, b, or c.)

In line 11, you have the first of two Perl substitute commands. Why two? To avoid the erroneous double escaping of ampersands. It is apparent, by looking at the replacement strings in lines 11 and 16, that you are handling ampersand first, and then separately handling less than, greater than, and quotation mark. Table 3 lists the components of the substitute command in line 11. (I'll postpone an explanation of the match target for the time being.) The replacement string is the value of the tbl Perl hash in line 3: The Standard Entity, "&amp;". So all substitutions will result in this value. Line 16 contains the second Perl substitute command. Table 4 lists its components.

Notice the match string, "([<>"])". This is a regular expression consisting of a set of characters to be matched. They are, respectively, less than, greater than, and quotation mark. Square brackets are used to delimit such a set. But then the entire regular expression is enclosed in parentheses. Recall that this means Perl will remember the matched text and store it in the automatic variable, $1. This is very convenient for our purpose because the replacement string, $tbl{$1}, is just the value in the Perl hash tbl that corresponds to $1. When $1 is less than, the replacement string will be "&lt;", when it is greater than, "&gt;", and when it's quotation mark, "&quot;" — exactly as you desire. Here the hash semantics of Perl do the magic by working hand-in-hand with the substitute command.
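The hash-driven replacement can be seen in isolation in a few lines. The sample sentence is made up, and the hash here holds only the three characters handled by this second substitution.

```perl
use strict;
use warnings;

my %tbl = (
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
);

my $text = 'He said "x < y" and "y > x"';

# Each matched character is remembered in $1 and looked up in the hash.
(my $escaped = $text) =~ s/([<>"])/$tbl{$1}/g;

print "$escaped\n";
```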

Look-Ahead Matching

So what is the meaning of the match target &(?!${choices};) in line 11? In Perl, an expression of the form x(?!y), where x is a character string and y is a regular expression, is called a "zero-width negative look-ahead assertion." Such an assertion matches an occurrence of x that is not immediately followed by y. So, for example, andy(?!hardy) matches andy followed by anything but hardy, including nothing at all.
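A short experiment shows both halves of the name: "negative," because the match fails only when hardy follows, and "zero-width," because the text examined by the look-ahead is not consumed. The test names are made up.

```perl
use strict;
use warnings;

my @names = ('andyhardy', 'andywarhol', 'andy');
my @hits  = grep { /andy(?!hardy)/ } @names;

# The look-ahead is zero-width: only "andy" itself is replaced.
(my $shout = 'andywarhol') =~ s/andy(?!hardy)/ANDY/;

print join(', ', @hits), " / $shout\n";
```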

The value of the variable, $choices, which was set in line 9, is the regular expression that tells Perl what to avoid. It is interpolated into the match command, and as a result legal decimal, hexadecimal, and general character entities are avoided while everything else is accepted. The result of executing lines 11 through 16 is first to escape all appropriate ampersands and then to escape all less thans, greater thans, and quotation marks. As a final result, line 8 in Figure 3 is replaced with the string,

&lt;&lt;News&#160;Supplier&#xa0;X&nbsp;—&nbsp;11-09-97&gt;&gt;

which displays correctly in all browsers as:

<<News Supplier X — 11-09-97>>
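Putting the two substitutions together gives a compact escaping function. This is a sketch of the approach rather than the production code: the name escape_sgml and the test strings are hypothetical, and the entity pattern is precompiled with qr//. Note the (?:...) group, which makes the trailing semicolon apply to whichever of the three alternatives matches.

```perl
use strict;
use warnings;

my %tbl = ( '<' => '&lt;', '>' => '&gt;', '&' => '&amp;', '"' => '&quot;' );

# Insides of decimal, hexadecimal, and general entities, grouped with (?:...)
# so that the trailing semicolon applies to whichever alternative matches.
my $entity = qr/(?:#\d{1,3}|#x[0-9A-Fa-f]{1,2}|[0-9A-Za-z]{1,6});/;

sub escape_sgml {
    my ($text) = @_;
    $text =~ s/&(?!$entity)/$tbl{'&'}/g;    # ampersands that start no entity
    $text =~ s/([<>"])/$tbl{$1}/g;          # then the other three specials
    return $text;
}

print escape_sgml('<<Fish & Chips&nbsp;&#160;&#xa0;>>'), "\n";
```

Ampersands must be handled first. If the second substitution ran first, the ampersands in the freshly inserted "&lt;" and "&gt;" would still be skipped by the look-ahead, but keeping the order fixed makes the intent obvious and matches the two-step description above.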

Conclusion

These five solutions typify the kind of analysis we do to produce validly marked up web pages containing the news. Object-oriented methodologies and new technologies like XML, and its recently standardized sublanguage, NITF (News Industry Text Format), will reduce this burden. Until then, but probably even after, Perl will be in our toolbox.

DDJ

Listing One

 1. #!/bin/sh
 2. # Assembly line "main" for MakeFeed Version 3.1:
 3. # -------------------------------------------------------
 4.
 5.     FeedPump    main All 2>> main.FeedPump.log   |\
 6.     BuildTrees  main     2>> main.BuildTrees.log |\
 7.     MarkUp      main     2>> main.MarkUp.log     |\
 8.     Index       main     2>> main.Index.log      |\
 9.     CheckFeed   main     2>> main.CheckFeed.log  |\
10.     Aggregate   main     2>> main.Aggregate.log  |\
11.     SendFeed    main     2>> main.SendFeed.log


Listing Two

 
 1. $tag01 = "\\.begin \\(header\\)";          # Escape Period & Parentheses.
 2. $tag02 = "\\.begin \\(text\\)";            # Here too.
 3.
 4. if (! m/${tag01}\s+(.+?)\s+${tag02}/m )    # Look for a match.
 5.     {
 6.     print STDERR "Can't find headline.\n"; # If not found, error.
 7.     }
 8. else
 9.     {
10.     $headline = $1;                    # If so, $1 contains the headline.
11.     }


Listing Three

 1. my $num =  s/[\x00-\x07\x0b\x0c\x0e-\x1f]//g;
 2. if ( $num > 0 )
 3.    {
 4.    print STDERR "Replaced $num ASCII Control Characters.\n";
 5.    }


Listing Four

 1. $tbl{"<"}  = "&lt;"   ;
 2. $tbl{">"}  = "&gt;"   ;
 3. $tbl{"&"}  = "&amp;"  ;
 4. $tbl{"\""} = "&quot;" ;
 5.
 6. $dec_char_ent = "#\\d{1,3}";             # Like in "&#160;".
 7. $hex_char_ent = "#x[0-9A-Fa-f]{1,2}";    # Like in "&#xA0;".
 8. $gen_char_ent = "[0-9A-Za-z]{1,6}";      # Like in "&nbsp;".
 9. $choices      = "(?:${dec_char_ent}|${hex_char_ent}|${gen_char_ent})";
10. # (?:...) groups the choices so the ";" in line 11 follows any of them.
11. $num =  s/&(?!${choices};)/$tbl{"&"}/g ; # Escape SOME Ampersands; Not all.
12. if ($num > 0)
13.     {
14.     print STDERR "Replaced $num Ampersands (\"&\").\n";
15.     }
16. $num =  s/([<>"])/$tbl{$1}/g ;  # Escape Less-Than, Greater-Than, & Quote.
17. if ($num > 0)
18.     {
19.     print STDERR "Replaced $num special HTML characters " .
20.                  "with SGML Standard Entities.\n";
21.     }

