Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Web Development

Automating the Web with WebL

Hannes Marais and Tom Rodeheffer

, January 01, 1999


Jan99: Automating the Web with WebL

Hannes and Tom are members of the research staff at Compaq Computer Corp. Systems Research Center. They can be contacted at [email protected] and [email protected], respectively.


Many people regard the World Wide Web as a world-wide distributed hypertext with content that is read and understood by humans. However, the Web is also home to a growing number of software applications such as spiders, softbots, and intelligent agents. These applications perform various operations on web data, primarily automating tasks that humans find too repetitive and time consuming.

Such applications need to be programmed to discern the organization, protocols, and data formats encountered on the Web. Although rudimentary software libraries exist for communicating across the Internet, there are few comprehensive libraries for fetching and serving web pages, resolving URLs, filling out input forms, parsing HTML, and so on. To bring such facilities to the application programmer, we at Compaq Computer Corporation's Systems Research Center (SRC) have developed a new web scripting language called "WebL" (pronounced "webble").

WebL is an ideal language for prototyping web applications. Shopping robots, meta-search engines, site validators, and page-information extractors are typical applications that are easy to write in WebL. The WebL language interpreter is written in Java, and is freely available (including documentation) at http://www.research.digital.com/SRC/WebL/. We described an earlier version of WebL at the 7th International World Wide Web Conference (see "WebL: A Programming Language for the Web," Proceedings of the 7th International World Wide Web Conference, In Computer Networks and ISDN Systems 30, 1998).

As Figure 1 shows, a typical WebL program fetches documents from the Web, extracts and refines data from the documents, then converts the resulting data back into a document on the Web. In addition to the usual assortment of high-level control statements and data types, WebL contains two special features to help the web-application programmer: service combinators, which help handle unreliable services; and a markup algebra, which helps handle structured text like HTML and XML.

Special Features In WebL

In theory, the Web is ubiquitous, but in practice, you often find traffic jams and construction delays on the information superhighway. For example, a web server might provide intermittent service or a data connection might abort or stall. These are modes of failure. Humans know when to click "stop" and try again, but in most computer languages, failure is difficult to handle and programmers often do not bother to try. WebL provides service combinators to simplify handling failures. Service combinators can also be used to exploit the inherent parallelism of replicated servers, as we will explain later.

In addition, most web pages contain structured text written in a markup language such as HTML or XML. In theory, structured text is a carefully organized presentation of data, but in practice, you often find clutter, misuse, and unexpected reorganization. For example, many pages use manual formatting in place of headings or lists; tables are frequently used for presentation effects; illegal HTML is not rare, and formats unexpectedly change from day to day. WebL provides a markup algebra to simplify the extraction of data from structured text. The operators of the markup algebra are designed to make it easy to selectively pick information from a page while ignoring the surrounding clutter. WebL programs can also be updated quickly to reflect page format changes because, in general, only a few lines of code need to be changed.

Service Combinators

First introduced by L. Cardelli and R. Davies (see "Service Combinators for Web Computing," Research Report 148, Digital Equipment Corporation Systems Research Center, June 1997 at http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-148.html), service combinators are used to handle service failures and exploit replicated servers. Service combinators make it easy to mimic human reflexes such as reloading stalled links, retrying requests after pauses, switching to lesser used servers, and running multiple requests in parallel. By providing an easy way to express these reflexes, it becomes easier to write robust scripts.

A service is either a primitive action (such as fetching a web page) or a compound action involving control structures and component services. A service terminates with either success or failure and, in the event of success, the service also produces a value. A service combinator is a control structure that determines the execution order of its component services based on their success or failure. Table 1 gives brief descriptions of service combinators, and Example 1 uses them in some WebL program fragments.

The GetURL and PostURL functions perform the primitive service of fetching a web page. GetURL uses the HTTP GET protocol and returns a WebL object that encapsulates the parsed structure of the page. If the fetch fails, then the function fails. The optional second argument to GetURL provides query arguments. PostURL is similar but uses the HTTP POST protocol, which fills in web-based input forms. Example 1(a) attempts to fetch the named URL and Example 1(b) looks up the word "java" on the AltaVista search engine.

The sequential execution combinator ? allows a secondary service to be consulted in case the primary service fails. The service S ? T acts like the service S, except that if S fails, it then executes the service T. Example 1(c) first attempts to contact AltaVista in California and, in case of failure, then attempts to contact a mirror site in Australia.

The concurrent execution combinator | allows two services to be executed concurrently. The service S | T starts both services S and T in parallel and returns the result of whichever succeeds first. If both S and T fail, then the combined service also fails. Example 1(d) attempts to fetch a page from one of two alternate sites. Both sites are attempted concurrently, and the result is the site successfully contacted first.

The timeout combinator allows a time limit to be placed on a service. The service Timeout(t,S) acts like S, except that it fails after t milliseconds if S has not terminated within that time. Example 1(e) attempts to fetch a page from one of two alternate sites, but gives a limit of 10 seconds to succeed.

The retry combinator provides a way to repeatedly invoke a service until it succeeds. The service Retry(S) acts like S, except that each time S fails, S starts again. Combining retry with sequential execution, Example 1(f) makes repeated attempts in the case of failure, alternating between two sites.

Finally, the primitive service Stall() never succeeds or fails. Combining stall with timeout and retry, Example 1(g) repeatedly attempts to fetch the URL, but waits 10 seconds after each failure before trying again.

Markup Algebra

The markup algebra is used to extract information from, and manipulate the contents of, web pages written in HTML and XML format. The markup algebra is based on the WebL data types piece and piece-set and on a set of algebraic operators.

A piece is a contiguous region of text in a document, identified by a starting position and an ending position. A piece is either named or unnamed.

A named piece is a markup element and its starting and ending positions are the element's open and close tags. The name of a named piece is simply the name of the markup element itself, for example, "img," "h1," or "p."

An unnamed piece is a region of text that does not correspond to a markup element. The starting and ending positions of an unnamed piece are unnamed, invisible open and close tags that WebL automatically inserts as necessary. These invisible tags function merely as placeholders in the document and have no other effect. Unnamed pieces are created, for example, by a text search.

Figure 2 shows some pieces in a fragment of an HTML document. As you can see, pieces may overlap in an arbitrary manner.

A piece-set is an ordered set of pieces. Pieces within a piece-set may overlap but must belong to the same document. Pieces are ordered according to starting position, with ties broken according to ending position. By comparing the starting and ending positions of a piece p and a piece q, WebL determines whether p is before, after, inside, contains, or overlaps with q. Figure 3 illustrates the possible comparisons.

Typically, piece-sets are created by searching a page for markup elements (Elem) or for occurrences of given text (Pat). For example, consider the HTML page in Listing One, the WebL program in Listing Two, and the output of this program in Listing Three.

The Elem function searches a page for all HTML or XML markup elements of a specific name. In our example, the Elem on line 5 searches for hyperlink anchor elements. The result is a piece-set that contains a piece for each anchor element in the page. Note that the attributes of the element can be accessed as fields of the piece (lines 10, 20). The pieces in a piece-set can be enumerated in order (lines 9-15), or an individual piece can be extracted by using a numerical index (line 20).

The Pat function searches a page for all occurrences of a Perl5-style regular expression (see J.E.F. Friedl's Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools, O'Reilly & Associates, 1997). In our example, the Pat on line 24 searches for the character sequence "Web" or "web." The search ignores markup. The result is a piece-set that contains a piece for each matched text segment. WebL automatically inserts unnamed, invisible tags to serve as starting and ending positions for these unnamed pieces.

In addition, a function called Seq searches for a sequence of markup elements. This function is useful for matching computer-generated HTML, which often contains highly stylized markup patterns without hierarchical structure. In our example, the Seq on line 40 searches for a bold element, followed by text, followed by an italic element. For each piece in the resulting piece-set, a numerical index can be used to extract the region of text matched by the corresponding specifier in the sequence. These regions are actually pieces as well.

In addition to searching a page, WebL's searching functions can also search inside a single piece. The results are the same, but limited as to what appears inside the piece.

Algebraic Operators

Once we have some initial piece-sets, we can combine them algebraically to produce additional piece-sets. Each of WebL's algebraic operators takes two piece-sets, P and Q, and produces a third piece-set as a result. WebL provides set operators, positional operators, and hierarchical operators. Table 2 gives brief definitions and Example 2 uses them in some WebL program fragments. WebL also has negated versions of the positional and hierarchical operators, as well as functions to traverse the document in a hierarchical manner.

In addition to piece-sets, the algebraic operators also work with single pieces as operands. The algebraic operators treat a piece as if it was a piece-set with a single element.

Set operators are used for basic set manipulation. The set union operator merges the two piece-sets P and Q, eliminating duplicates. The set intersection operator returns the set of all pieces that are in both P and Q. The set exclusion operator calculates the set of pieces that are in P but not in Q. Example 2(a) extracts level-one and level-two headings, then merges them.

Positional operators are used to express relationships between pieces based on their locations in the linear text flow of the document. The before operator returns the set of pieces in P that is located before some piece in Q. In Figure 3, a piece p is before a piece q if the ending position of p precedes the starting position of q. Similarly, the after operator returns the set of pieces in P that is located after some piece in Q, and the overlap operator returns the set of pieces in P that overlaps with some piece in Q. Example 2(b) finds all instances of "Web" or "web" that occur before the fourth level-one heading. (Recall that index zero extracts the first element of a piece-set.)

Although they are very effective, the before and after operators are not always sufficient. For example, you might be interested in only the first occurrence of a link after a special keyword, rather than all occurrences. For this purpose, WebL provides the stronger operator directlyafter, which returns the set that contains, for each qQ, the first piece (if any) in P that follows that q. The directlybefore operator is analogous. Example 2(c) finds the first anchor element that occurs after each occurrence of the word "CITE."

Hierarchical operators are used to express relationships between pieces based on containment and inclusion. The inside operator returns the set of pieces in P that is contained in some piece in Q. A piece p is contained in a piece q if: p does not start before q starts, p does not end after q ends, and either the starting positions or the ending positions (or both) are different. Correspondingly, the contain operator returns the set of pieces in P that contains some piece in Q. Example 2(d) finds all hyperlink anchors that contain an image. As in the case of positional operators, WebL also provides the stronger hierarchical operators directlyinside and directlycontain.

The true power of the algebraic operators becomes clear when they are combined in useful ways. In fact, one of WebL's innovations is the ability to combine piece-set operators with arguments calculated using Elem, Pat, Seq, and so on. Example 2(e) extracts all rows of all tables that mention the phrase "Stock Quotes," except the first such table -- which is omitted. Our example uses directlycontain to extract only the innermost tables that mention the phrase. This is because web pages often use nested tables for presentation effects, and we want to ignore such clutter.

WebL also contains functions for manipulating web pages to insert, delete, and replace the contents of pieces (some of which are in Listing Five).

Putting it all Together

Using service combinators and the markup algebra, it becomes quite easy to program the Web. For example, Listing Four implements a meta-search engine that combines search results from the AltaVista and HotBot public-search services. The tricky part is determining how to parse the search results, but whatever parse you choose will be easy to write. Listing Five implements a highlight server that fetches an arbitrary web document, modifies it to highlight all occurrences of some given text, and serves the result back to a browser. It also rewrites all hyperlinks to query through the highlight server so that clicking on a link gets the highlighted version of the linked document. Each of these examples takes less than a single page of WebL code.

WebL is distributed as a research prototype and includes source code. It is easy to dynamically add new functionality to the system in the form of modules written in either WebL or Java. Modules that control your web browser, manipulate URLs, and implement a web server and a customizable web crawler are already available. More information about WebL can be found at our release page at http://www.research.digital.com/SRC/WebL/.

DDJ

Listing One

<html><head><title>WebL Test Page</title></head>
<body><img SRC="pix/webllogo.gif"><p>
<font size=+2>W<b>e</b>bL</font> (pronounced "webble")
is a <font size=+1>scripting language</font> for automating tasks on the
World-Wide Web.
It is an imperative, interpreted language that has built-in support
for common web protocols like HTTP and FTP, and
popular data types like HTML and XML.
<p>
<a name="features">WebL has two special features.</a>
<p>
<b>Service combinators</b> provide an <i>exception-handling mechanism</i>
that makes computations on the web more reliable.
<p>
A <b>markup algebra</b> provides a way
to <i>extract <b>and</b> manipulate data</i> in web pages.
<p>
These features make it easy to implement tools like web shopping robots,
meta-search engines, HTML analysis and checking routines, and so on.
<p>
WebL's implementation language is Java, and the complete source code is
<font size=+1><a href="http://www.research.digital.com/SRC/WebL/index.html">
freely available</a></font>.
Extensions (in the form of modules) are easy to add by Java programmers.
<p>
<a href=http://www.compaq.com><img SRC="pix/cpqlogo.gif"></a>
</body>
</html>


</p>

Back to Article

Listing Two

1 var page = GetURL("file:demo.html");   // fetch the demo page2 
3 
4 
5 var links = Elem(page,"a");         // extract anchor elements
6 
7 // loop for each piece, same order as in document
8 
9 every link in links do
10     PrintLn(link.href) ? nil        // print its href attribute
11 
12     // link.href fails if piece has no href attribute,
13     // service combinator handles failure with a secondary
14     // execution of nil (which does nothing)
15 end;
16 
17 // since piece-sets are ordered, the individual pieces can be
18 // extracted by applying a numeric index
19 
20 PrintLn("name = ",links[0].name);   // name attribute of first anchor
21 
22 
23 
24 var words = Pat(page,"(W|w)eb");    // extract text occurrences
25 
26 // loop for each piece, same order as in document
27 
28 every word in words do
29     Print(Markup(word)," ")         // print word including markup
30 end;
31 
32 PrintLn("= ", Size(words), " times.");    // how many words
33 
34 
35 
36 // extract sequences of <bold element> text <italic element>
37 // each resulting piece is such a sequence
38 // loop for each piece, same order as in document
39 
40 every item in Seq(page,"b # i") do
41 
42     // by applying a numeric index, we get the component that
43     // matched the respective specifier of the search sequence
44 
45     // print the bold element and the italic element,
46     // just text, ignoring markup
47 
48     PrintLn(Text(item[0])," -- ",Text(item[2]))
49 end;


</p>

Back to Article

Listing Three

http://www.research.digital.com/SRC/WebL/index.htmlhttp://www.compaq.com/
name = features
Web W<b>e</b>b web Web web Web web web web Web = 10 times.
Service combinators -- exception-handling mechanism
markup algebra -- extract and manipulate data

Back to Article

Listing Four

1 import Str;2 
3 var QueryAltaVista = fun(query)
4   var results = [];               // initially empty result list
5   var page = GetURL(              // how to query AltaVista
6     "http://www.altavista.digital.com/cgi-bin/query",
7     [. pg="q", q=query .]);
8   every i in Seq(page,"b a br # font br #") do  // find the answers
9     results = results +          // concatenate this answer
10       [[.                        // make an object with these fields
11         title = Text(i[1]),12         href = i[1].href,
13         abstract = Str_Trim(Text(i[6]))
14       .]]
15   end;
16   results
17 end;
18 
19 var QueryHotBot = fun(query)
20   var results = [];              // initially empty result list
21   var page = GetURL(             // how to query HotBot
22     "http://www.hotbot.com/default.asp",
23     [. MT=query .]);
24   every i in Seq(page,"b br # br") do     // find the answers
25     // Elem can search inside a piece too
26     // URL is the 2nd anchor in the 1st component 
27     var a = Elem(i[0],"a")[1];  
28     results = results +          // concatenate this answer
29       [[.                        // make an object with these fields
30         title = Text(a),
31         href = a.href,
32         abstract = Text(i[2])
33       .]]
34   end;
35   results
36 end;
37 
38 var q = ARGS[1];                 // get query word from command line
39 var results = QueryAltaVista(q) + QueryHotBot(q);
40 PrintLn("Results for ",q,":");
41 every r in results do            // print each result
42   PrintLn();
43   PrintLn("    ",r.title);
44   PrintLn("    ",r.href);
45   PrintLn(r.abstract);
46 end


</p>

Back to Article

Listing Five

1 import Url;2 import WebServer;
3 
4 var port = 9092;
5 var where = "/bin/highlight";
6 
7 
8 var Highlight = fun(req,res)
9   // access the url and word parameters from the request
10   // missing parameters cause the field access to fail
11   // default values supplied using a service combinator
12 
13   var url = req.param.url ? "http://www.compaq.com";
14   var word = req.param.word ? "Compaq";
15 
16   PrintLn("url=",url," word=",word);   // log it on the console
17 
18   var page = GetURL(url);              // fetch the page
19 
20   // for each matching text not inside the title 
21   every w in Pat(page,word) !inside Elem(page,"title") do
22     var p = NewNamedPiece("font",w);   // wrap a font element around it
23     p.size := "+1";                    // define its size attribute
24     p.color := "red";                  // define its color attribute
25   end;
26 
27   every a in Elem(page,"a") do         // for each anchor
28     a.href = where +                   // rewrite its href to be me
29              "?word=" + Url_Encode(word) +    // word parameter
30              "&url=" + Url_Encode(a.href)     // url parameter
31       ? nil;                           // but do nothing if no href
32   end;
33 
34   res.result = Markup(page);           // this is the result
35 end;
36 
37 
38 WebServer_Publish(where,Highlight);    // associate url with function
39 WebServer_Start("/dev/null",port);     // disk pages root, server port
40 
41 // when a browser tries to fetch a "published" page,
42 // web server task calls back to our function to handle the request
43 
44 PrintLn("Highlight Server running. Contact :",port,where);
45 
46 Stall()                                // server task runs in background

Back to Article


Copyright © 1999, Dr. Dobb's Journal

Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.