Hannes and Tom are members of the research staff at Compaq Computer Corp. Systems Research Center. They can be contacted at email@example.com and firstname.lastname@example.org, respectively.
Many people regard the World Wide Web as a world-wide distributed hypertext with content that is read and understood by humans. However, the Web is also home to a growing number of software applications such as spiders, softbots, and intelligent agents. These applications perform various operations on web data, primarily automating tasks that humans find too repetitive and time consuming.
Such applications need to be programmed to discern the organization, protocols, and data formats encountered on the Web. Although rudimentary software libraries exist for communicating across the Internet, there are few comprehensive libraries for fetching and serving web pages, resolving URLs, filling out input forms, parsing HTML, and so on. To bring such facilities to the application programmer, we at Compaq Computer Corporation's Systems Research Center (SRC) have developed a new web scripting language called "WebL" (pronounced "webble").
WebL is an ideal language for prototyping web applications. Shopping robots, meta-search engines, site validators, and page-information extractors are typical applications that are easy to write in WebL. The WebL language interpreter is written in Java, and is freely available (including documentation) at http://www.research.digital.com/SRC/WebL/. We described an earlier version of WebL at the 7th International World Wide Web Conference (see "WebL: A Programming Language for the Web," Proceedings of the 7th International World Wide Web Conference, In Computer Networks and ISDN Systems 30, 1998).
As Figure 1 shows, a typical WebL program fetches documents from the Web, extracts and refines data from the documents, then converts the resulting data back into a document on the Web. In addition to the usual assortment of high-level control statements and data types, WebL contains two special features to help the web-application programmer: service combinators, which help handle unreliable services; and a markup algebra, which helps handle structured text like HTML and XML.
Special Features In WebL
In theory, the Web is ubiquitous, but in practice, you often find traffic jams and construction delays on the information superhighway. For example, a web server might provide intermittent service or a data connection might abort or stall. These are modes of failure. Humans know when to click "stop" and try again, but in most computer languages, failure is difficult to handle and programmers often do not bother to try. WebL provides service combinators to simplify handling failures. Service combinators can also be used to exploit the inherent parallelism of replicated servers, as we will explain later.
In addition, most web pages contain structured text written in a markup language such as HTML or XML. In theory, structured text is a carefully organized presentation of data, but in practice, you often find clutter, misuse, and unexpected reorganization. For example, many pages use manual formatting in place of headings or lists; tables are frequently used for presentation effects; illegal HTML is not rare, and formats unexpectedly change from day to day. WebL provides a markup algebra to simplify the extraction of data from structured text. The operators of the markup algebra are designed to make it easy to selectively pick information from a page while ignoring the surrounding clutter. WebL programs can also be updated quickly to reflect page format changes because, in general, only a few lines of code need to be changed.
First introduced by L. Cardelli and R. Davies (see "Service Combinators for Web Computing," Research Report 148, Digital Equipment Corporation Systems Research Center, June 1997 at http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-148.html), service combinators are used to handle service failures and exploit replicated servers. Service combinators make it easy to mimic human reflexes such as reloading stalled links, retrying requests after pauses, switching to lesser used servers, and running multiple requests in parallel. By providing an easy way to express these reflexes, it becomes easier to write robust scripts.
A service is either a primitive action (such as fetching a web page) or a compound action involving control structures and component services. A service terminates with either success or failure and, in the event of success, the service also produces a value. A service combinator is a control structure that determines the execution order of its component services based on their success or failure. Table 1 gives brief descriptions of service combinators, and Example 1 uses them in some WebL program fragments.
The GetURL and PostURL functions perform the primitive service of fetching a web page. GetURL uses the HTTP GET protocol and returns a WebL object that encapsulates the parsed structure of the page. If the fetch fails, then the function fails. The optional second argument to GetURL provides query arguments. PostURL is similar but uses the HTTP POST protocol, which fills in web-based input forms. Example 1(a) attempts to fetch the named URL and Example 1(b) looks up the word "java" on the AltaVista search engine.
The sequential execution combinator ? allows a secondary service to be consulted in case the primary service fails. The service S ? T acts like the service S, except that if S fails, it then executes the service T. Example 1(c) first attempts to contact AltaVista in California and, in case of failure, then attempts to contact a mirror site in Australia.
The concurrent execution combinator | allows two services to be executed concurrently. The service S | T starts both services S and T in parallel and returns the result of whichever succeeds first. If both S and T fail, then the combined service also fails. Example 1(d) attempts to fetch a page from one of two alternate sites. Both sites are attempted concurrently, and the result is the site successfully contacted first.
The timeout combinator allows a time limit to be placed on a service. The service Timeout(t,S) acts like S, except that it fails after t milliseconds if S has not terminated within that time. Example 1(e) attempts to fetch a page from one of two alternate sites, but gives a limit of 10 seconds to succeed.
The retry combinator provides a way to repeatedly invoke a service until it succeeds. The service Retry(S) acts like S, except that each time S fails, S starts again. Combining retry with sequential execution, Example 1(f) makes repeated attempts in the case of failure, alternating between two sites.
Finally, the primitive service Stall() never succeeds or fails. Combining stall with timeout and retry, Example 1(g) repeatedly attempts to fetch the URL, but waits 10 seconds after each failure before trying again.
The markup algebra is used to extract information from, and manipulate the contents of, web pages written in HTML and XML format. The markup algebra is based on the WebL data types piece and piece-set and on a set of algebraic operators.
A piece is a contiguous region of text in a document, identified by a starting position and an ending position. A piece is either named or unnamed.
A named piece is a markup element and its starting and ending positions are the element's open and close tags. The name of a named piece is simply the name of the markup element itself, for example, "img," "h1," or "p."
An unnamed piece is a region of text that does not correspond to a markup element. The starting and ending positions of an unnamed piece are unnamed, invisible open and close tags that WebL automatically inserts as necessary. These invisible tags function merely as placeholders in the document and have no other effect. Unnamed pieces are created, for example, by a text search.
Figure 2 shows some pieces in a fragment of an HTML document. As you can see, pieces may overlap in an arbitrary manner.
A piece-set is an ordered set of pieces. Pieces within a piece-set may overlap but must belong to the same document. Pieces are ordered according to starting position, with ties broken according to ending position. By comparing the starting and ending positions of a piece p and a piece q, WebL determines whether p is before, after, inside, contains, or overlaps with q. Figure 3 illustrates the possible comparisons.
Typically, piece-sets are created by searching a page for markup elements (Elem) or for occurrences of given text (Pat). For example, consider the HTML page in Listing One, the WebL program in Listing Two, and the output of this program in Listing Three.
The Elem function searches a page for all HTML or XML markup elements of a specific name. In our example, the Elem on line 5 searches for hyperlink anchor elements. The result is a piece-set that contains a piece for each anchor element in the page. Note that the attributes of the element can be accessed as fields of the piece (lines 10, 20). The pieces in a piece-set can be enumerated in order (lines 9-15), or an individual piece can be extracted by using a numerical index (line 20).
The Pat function searches a page for all occurrences of a Perl5-style regular expression (see J.E.F. Friedl's Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools, O'Reilly & Associates, 1997). In our example, the Pat on line 24 searches for the character sequence "Web" or "web." The search ignores markup. The result is a piece-set that contains a piece for each matched text segment. WebL automatically inserts unnamed, invisible tags to serve as starting and ending positions for these unnamed pieces.
In addition, a function called Seq searches for a sequence of markup elements. This function is useful for matching computer-generated HTML, which often contains highly stylized markup patterns without hierarchical structure. In our example, the Seq on line 40 searches for a bold element, followed by text, followed by an italic element. For each piece in the resulting piece-set, a numerical index can be used to extract the region of text matched by the corresponding specifier in the sequence. These regions are actually pieces as well.
In addition to searching a page, WebL's searching functions can also search inside a single piece. The results are the same, but limited as to what appears inside the piece.
Once we have some initial piece-sets, we can combine them algebraically to produce additional piece-sets. Each of WebL's algebraic operators takes two piece-sets, P and Q, and produces a third piece-set as a result. WebL provides set operators, positional operators, and hierarchical operators. Table 2 gives brief definitions and Example 2 uses them in some WebL program fragments. WebL also has negated versions of the positional and hierarchical operators, as well as functions to traverse the document in a hierarchical manner.
In addition to piece-sets, the algebraic operators also work with single pieces as operands. The algebraic operators treat a piece as if it was a piece-set with a single element.
Set operators are used for basic set manipulation. The set union operator merges the two piece-sets P and Q, eliminating duplicates. The set intersection operator returns the set of all pieces that are in both P and Q. The set exclusion operator calculates the set of pieces that are in P but not in Q. Example 2(a) extracts level-one and level-two headings, then merges them.
Positional operators are used to express relationships between pieces based on their locations in the linear text flow of the document. The before operator returns the set of pieces in P that is located before some piece in Q. In Figure 3, a piece p is before a piece q if the ending position of p precedes the starting position of q. Similarly, the after operator returns the set of pieces in P that is located after some piece in Q, and the overlap operator returns the set of pieces in P that overlaps with some piece in Q. Example 2(b) finds all instances of "Web" or "web" that occur before the fourth level-one heading. (Recall that index zero extracts the first element of a piece-set.)
Although they are very effective, the before and after operators are not always sufficient. For example, you might be interested in only the first occurrence of a link after a special keyword, rather than all occurrences. For this purpose, WebL provides the stronger operator directlyafter, which returns the set that contains, for each qQ, the first piece (if any) in P that follows that q. The directlybefore operator is analogous. Example 2(c) finds the first anchor element that occurs after each occurrence of the word "CITE."
Hierarchical operators are used to express relationships between pieces based on containment and inclusion. The inside operator returns the set of pieces in P that is contained in some piece in Q. A piece p is contained in a piece q if: p does not start before q starts, p does not end after q ends, and either the starting positions or the ending positions (or both) are different. Correspondingly, the contain operator returns the set of pieces in P that contains some piece in Q. Example 2(d) finds all hyperlink anchors that contain an image. As in the case of positional operators, WebL also provides the stronger hierarchical operators directlyinside and directlycontain.
The true power of the algebraic operators becomes clear when they are combined in useful ways. In fact, one of WebL's innovations is the ability to combine piece-set operators with arguments calculated using Elem, Pat, Seq, and so on. Example 2(e) extracts all rows of all tables that mention the phrase "Stock Quotes," except the first such table -- which is omitted. Our example uses directlycontain to extract only the innermost tables that mention the phrase. This is because web pages often use nested tables for presentation effects, and we want to ignore such clutter.
WebL also contains functions for manipulating web pages to insert, delete, and replace the contents of pieces (some of which are in Listing Five).
Putting it all Together
Using service combinators and the markup algebra, it becomes quite easy to program the Web. For example, Listing Four implements a meta-search engine that combines search results from the AltaVista and HotBot public-search services. The tricky part is determining how to parse the search results, but whatever parse you choose will be easy to write. Listing Five implements a highlight server that fetches an arbitrary web document, modifies it to highlight all occurrences of some given text, and serves the result back to a browser. It also rewrites all hyperlinks to query through the highlight server so that clicking on a link gets the highlighted version of the linked document. Each of these examples takes less than a single page of WebL code.
WebL is distributed as a research prototype and includes source code. It is easy to dynamically add new functionality to the system in the form of modules written in either WebL or Java. Modules that control your web browser, manipulate URLs, and implement a web server and a customizable web crawler are already available. More information about WebL can be found at our release page at http://www.research.digital.com/SRC/WebL/.
<html><head><title>WebL Test Page</title></head> <body><img SRC="pix/webllogo.gif"><p> <font size=+2>W<b>e</b>bL</font> (pronounced "webble") is a <font size=+1>scripting language</font> for automating tasks on the World-Wide Web. It is an imperative, interpreted language that has built-in support for common web protocols like HTTP and FTP, and popular data types like HTML and XML. <p> <a name="features">WebL has two special features.</a> <p> <b>Service combinators</b> provide an <i>exception-handling mechanism</i> that makes computations on the web more reliable. <p> A <b>markup algebra</b> provides a way to <i>extract <b>and</b> manipulate data</i> in web pages. <p> These features make it easy to implement tools like web shopping robots, meta-search engines, HTML analysis and checking routines, and so on. <p> WebL's implementation language is Java, and the complete source code is <font size=+1><a href="http://www.research.digital.com/SRC/WebL/index.html"> freely available</a></font>. Extensions (in the form of modules) are easy to add by Java programmers. <p> <a href=http://www.compaq.com><img SRC="pix/cpqlogo.gif"></a> </body> </html> </p>
1 var page = GetURL("file:demo.html"); // fetch the demo page2 3 4 5 var links = Elem(page,"a"); // extract anchor elements 6 7 // loop for each piece, same order as in document 8 9 every link in links do 10 PrintLn(link.href) ? nil // print its href attribute 11 12 // link.href fails if piece has no href attribute, 13 // service combinator handles failure with a secondary 14 // execution of nil (which does nothing) 15 end; 16 17 // since piece-sets are ordered, the individual pieces can be 18 // extracted by applying a numeric index 19 20 PrintLn("name = ",links.name); // name attribute of first anchor 21 22 23 24 var words = Pat(page,"(W|w)eb"); // extract text occurrences 25 26 // loop for each piece, same order as in document 27 28 every word in words do 29 Print(Markup(word)," ") // print word including markup 30 end; 31 32 PrintLn("= ", Size(words), " times."); // how many words 33 34 35 36 // extract sequences of <bold element> text <italic element> 37 // each resulting piece is such a sequence 38 // loop for each piece, same order as in document 39 40 every item in Seq(page,"b # i") do 41 42 // by applying a numeric index, we get the component that 43 // matched the respective specifier of the search sequence 44 45 // print the bold element and the italic element, 46 // just text, ignoring markup 47 48 PrintLn(Text(item)," -- ",Text(item)) 49 end; </p>
http://www.research.digital.com/SRC/WebL/index.htmlhttp://www.compaq.com/ name = features Web W<b>e</b>b web Web web Web web web web Web = 10 times. Service combinators -- exception-handling mechanism markup algebra -- extract and manipulate data
1 import Str;2 3 var QueryAltaVista = fun(query) 4 var results = ; // initially empty result list 5 var page = GetURL( // how to query AltaVista 6 "http://www.altavista.digital.com/cgi-bin/query", 7 [. pg="q", q=query .]); 8 every i in Seq(page,"b a br # font br #") do // find the answers 9 results = results + // concatenate this answer 10 [[. // make an object with these fields 11 title = Text(i),12 href = i.href, 13 abstract = Str_Trim(Text(i)) 14 .]] 15 end; 16 results 17 end; 18 19 var QueryHotBot = fun(query) 20 var results = ; // initially empty result list 21 var page = GetURL( // how to query HotBot 22 "http://www.hotbot.com/default.asp", 23 [. MT=query .]); 24 every i in Seq(page,"b br # br") do // find the answers 25 // Elem can search inside a piece too 26 // URL is the 2nd anchor in the 1st component 27 var a = Elem(i,"a"); 28 results = results + // concatenate this answer 29 [[. // make an object with these fields 30 title = Text(a), 31 href = a.href, 32 abstract = Text(i) 33 .]] 34 end; 35 results 36 end; 37 38 var q = ARGS; // get query word from command line 39 var results = QueryAltaVista(q) + QueryHotBot(q); 40 PrintLn("Results for ",q,":"); 41 every r in results do // print each result 42 PrintLn(); 43 PrintLn(" ",r.title); 44 PrintLn(" ",r.href); 45 PrintLn(r.abstract); 46 end </p>
1 import Url;2 import WebServer; 3 4 var port = 9092; 5 var where = "/bin/highlight"; 6 7 8 var Highlight = fun(req,res) 9 // access the url and word parameters from the request 10 // missing parameters cause the field access to fail 11 // default values supplied using a service combinator 12 13 var url = req.param.url ? "http://www.compaq.com"; 14 var word = req.param.word ? "Compaq"; 15 16 PrintLn("url=",url," word=",word); // log it on the console 17 18 var page = GetURL(url); // fetch the page 19 20 // for each matching text not inside the title 21 every w in Pat(page,word) !inside Elem(page,"title") do 22 var p = NewNamedPiece("font",w); // wrap a font element around it 23 p.size := "+1"; // define its size attribute 24 p.color := "red"; // define its color attribute 25 end; 26 27 every a in Elem(page,"a") do // for each anchor 28 a.href = where + // rewrite its href to be me 29 "?word=" + Url_Encode(word) + // word parameter 30 "&url=" + Url_Encode(a.href) // url parameter 31 ? nil; // but do nothing if no href 32 end; 33 34 res.result = Markup(page); // this is the result 35 end; 36 37 38 WebServer_Publish(where,Highlight); // associate url with function 39 WebServer_Start("/dev/null",port); // disk pages root, server port 40 41 // when a browser tries to fetch a "published" page, 42 // web server task calls back to our function to handle the request 43 44 PrintLn("Highlight Server running. Contact :",port,where); 45 46 Stall() // server task runs in background
Copyright © 1999, Dr. Dobb's Journal