Writing Intelligent Web Agents
By Michael Schrenk
Browser Creator Predicts Its Demise

Back in the mid-1990s I attended a panel discussion where Marc Andreessen, the cofounder and then chief technology officer of Netscape, was asked for his predictions of where the World Wide Web was headed. He predicted, to a surprised crowd, that the days of the browser, as the primary means of viewing the Internet, were numbered.
Andreessen continued, "That's why it's called Netscape Communications, and not Netscape Browser Company." He was referring to the fact that a Web browser is a general-purpose Web agent. And while they do a fine job of displaying Web pages, browsers do a poor job of identifying relevant information. Plus, their sheer size makes them impractical for use on anything other than a computer. He foresaw the browser's dominance threatened by Web agents running on appliances that perform specific tasks like playing online radio stations, providing email services, or allowing low-cost long distance Internet phone calls.
Today, due to the bandwidth limitations of wireless networks, portable devices depend heavily on special Web clients and intelligent agents. The Palm VII, for example, speeds Web surfing by employing special filtering agents in proxy servers that work in concert with the Web client to preserve bandwidth.
The new Web agents are important tools because they facilitate new uses for the Internet. For example, you can use a Web agent on a cell phone to see if your flight is on time while you're sitting in the back of a cab. Try doing that with a standard browser!
Today many agents are tied to servers and accessible through Web sites. Some examples of intelligent agents accessible through Web servers include the Ask Jeeves search engine (www.askjeeves.com), and the glut of financial sites that let you place conditional buy and sell orders. Many strategists suggest that using special Web agents, which scour e-commerce sites for the best prices, will forever change the way we make purchases. One such site is www.mySimon.com. MySimon not only considers prices of products in thousands of online stores and auctions, but it also analyzes other factors that sway buyers, like shipping options and warranties.
Web agents can do much more than pull information off the Web. Since they aren't bound by the security restrictions of browsers, agents running as Web clients blur the functional distinctions between servers and clients. They can combine data from the Internet with locally generated data, read and write to files, automatically print documents, send or receive email and faxes, or upload data to Web sites.
The goal of this article is to introduce you to the basic concepts of writing intelligent Web agents, and then to point you to where you can get ideas for agents of your own design.
Intelligent Web agents can be written in nearly any language. After bouncing around with several ideas for this article using C and Perl, I settled on Tcl/Tk because it easily interfaces with Web servers and makes scripting graphical user interfaces easy.
As the name implies, Tcl/Tk is two entities. Tcl (pronounced "tickle") is a Perl-like interpreter, and Tk is a library of commands used to create graphical user interfaces on Linux/UNIX, Windows, and Macintosh platforms. Tcl/Tk is also a product of the Open Source movement, so it's available as a free download from many sites across the Internet.
All of this article's examples are written on a Win32 platform with Tcl/Tk version 8.0. You'll need at least version 8.0 because earlier versions of Tcl/Tk don't include the HTTP package that's used heavily in the examples.
Despite its growing popularity, Tcl/Tk is still unfamiliar ground to many developers. Don't be discouraged if you're in this group, because the language is very easy to learn. An excellent Tcl/Tk tutorial and reference is Eric Foster Johnson's book, Graphical Applications with Tcl and Tk, published by M&T Books (ISBN: 1-55851-569-0). This book also includes a companion CD-ROM with Tcl/Tk binaries for several platforms. There are also many fine Web sites featuring Tcl/Tk information (see "Online Resources").
Getting Information from the Web
All of the examples in this article use the procedure getFromWeb (Listing 1) to download files from Web servers. It has only five lines of actual code, so you might scratch your head and wonder how modern browsers justify consuming tens of megabytes of disk space. It's important to remember that the example script was designed for maximum simplicity. A more robust agent would include error checking and possibly a "callback routine" to indicate download progress. This code also doesn't reflect the effort required to download and render Web pages.
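Listing 1 itself isn't reproduced here, but a bare-bones getFromWeb built on Tcl's HTTP package might look something like the following. This is a sketch only; the published listing may differ in its details.

```tcl
# Minimal sketch of a getFromWeb procedure, assuming Tcl's http package.
package require http

proc getFromWeb {url fileName} {
    set token [::http::geturl $url]               ;# fetch the document
    set file [open $fileName w]
    puts -nonewline $file [::http::data $token]   ;# save the body to disk
    close $file
    ::http::cleanup $token                        ;# release the token's resources
}
```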
The getFromWeb procedure is loaded into the interpreter with Tcl's source command. Once loaded, the procedure is very easy to use; for example, one short command downloads my home page into a file called myFile.
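In sketch form, loading and calling the procedure might look like this (the script's file name and the URL are assumptions, not taken from the article):

```tcl
source getFromWeb.tcl                      ;# load Listing 1 into the interpreter
getFromWeb http://www.schrenk.com/ myFile  ;# save a home page into myFile
```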
The program in Listing 2 (properties.tcl) uses this procedure to download the NASA home page. The program then analyzes the downloaded Web page and displays server and Web page properties in a text widget.
One of the first things you'll notice when running Listing 2 is that, compared with a standard browser, it takes very little time to execute. This is because a browser needs to download the HTML document plus any related images and then render the Web page. Our little Web agent, in contrast, downloads only a single HTML document and does not need to download images or render anything (see Figure 1).
Since the data you download from the Web may have little or no predictable structure, the job of discriminating between important and unimportant information is probably the most difficult thing a Web agent needs to do.
The Inclusive Parse
The parse procedure in Listing 3 (parse.tcl) was written specifically to parse HTML documents. It uses an "inclusive" parse criterion. In other words, when we call this procedure we specify the criteria we want to pull out of the HTML document. Everything else is ignored.
Once the parse procedure is loaded into the interpreter, again with the source command, we can capture everything in an input file that matches our parse criteria, and write the matching data into an output file with the following command:
parse inputfile $bCrit $gap $eCrit outputfile
This isn't as complicated as it looks. For example, if we want to parse a file called "temp" for all references to GIF files and store those references into a file called gifFile, we'd use the following command:
parse temp "src=" 40 ".gif" gifFile
In this case, we parse temp for all information that starts with "src=" and is followed by ".gif" (with no more than 40 characters in between). All matches are then stored in a file named gifFile. Always check the parse criteria first if you have any problems with these examples!
Parsing HTML Files
Parsing HTML files presents special demands that aren't satisfied with a simple regular expression. The gap between search criteria, for instance, is vital to discriminating between HTML tags with common attributes. For example, Java scripts, applets, and images can all use the src attribute. By specifying the distance between beginning and ending parse criteria you can greatly limit the chances of false matches.
Since case sensitivity is an issue on UNIX servers, it also becomes an issue for our parser. Our sample parse procedure simplifies searching for parse criteria by first converting everything to lower case. Once the parse criteria are satisfied, the data is extracted from an image of the original HTML file, where all the original case information is retained.
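Putting the gap limit and the case handling together, a parse procedure along the lines of Listing 3 might be structured like this. The names and details are illustrative; the published listing is the authoritative version.

```tcl
# Illustrative sketch of an inclusive parse: find bCrit...eCrit pairs no
# more than $gap characters apart, matching case-insensitively but
# extracting from the original text so case information is preserved.
proc parse {inFile bCrit gap eCrit outFile} {
    set in [open $inFile r]
    set orig [read $in]                         ;# keep original case for output
    close $in
    set lower [string tolower $orig]            ;# search a lower-cased image
    set bCrit [string tolower $bCrit]
    set eCrit [string tolower $eCrit]
    set out [open $outFile w]
    set pos 0
    while {[set b [string first $bCrit $lower $pos]] >= 0} {
        set afterB [expr {$b + [string length $bCrit]}]
        set e [string first $eCrit $lower $afterB]
        if {$e < 0} break
        if {$e - $afterB <= $gap} {             ;# criteria close enough together?
            set last [expr {$e + [string length $eCrit] - 1}]
            puts $out [string range $orig $b $last]  ;# extract from the original
            set pos [expr {$last + 1}]
        } else {
            set pos [expr {$b + 1}]             ;# false match; keep scanning
        }
    }
    close $out
}
```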
As is often the case, you'll need to clean up parsed data before you use it. In Listing 2, some simple regular expressions are used to remove HTML tags from the actual title.
The Exclusive Parse
Earlier I referred to the parse procedure as an inclusive parse. Another useful type of parse is the exclusive parse, where everything except the parse criteria is retrieved. An example of an exclusive parse is found in Listing 4 (deleteTags.tcl). Listing 5 is a source file used by nearly all the other programs to do File I/O.
This routine uses the inclusive qualities of the parse procedure in Listing 3 to capture all of the HTML tags in a given file. These HTML tags (and their attributes and data) are extracted from the original file, leaving you with everything but (exclusive of) the parse criteria. This information, without HTML tags, is ready for applications that require plain ASCII text, like Palm Pilots or text-to-speech converters.
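The effect of an exclusive parse can also be sketched in a single regsub command that strips every tag and keeps the text between them. (Listing 4 instead builds on the parse procedure; the file name below is illustrative.)

```tcl
set in   [open webPage.html r]            ;# input file name is illustrative
set html [read $in]
close $in
regsub -all {<[^>]*>} $html "" plainText  ;# delete every <...> tag
puts $plainText                           ;# plain ASCII text remains
```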
In addition to HTML documents, Web servers also return HTTP metadata. Metadata should not be confused with HTML META tags. The actual metadata varies from server to server and from configuration to configuration, but some metadata common to most servers includes the server name, document size, MIME type, and cookies. In Listing 2, this data is returned through the token that the HTTP package creates. Since $token is a "pass by reference" variable, we need to use the Tcl upvar command to extract its contents. The metadata values are contained within curly braces, so our program uses regular expressions to format the information by placing a newline character after each right curly brace.
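Under the http package, the extraction might be sketched like this; the URL matches the NASA example in Listing 2, and the formatting here is illustrative:

```tcl
package require http
set token [::http::geturl http://www.nasa.gov/]  ;# returns a token name
upvar #0 $token state                            ;# the token names a global array
foreach {key value} $state(meta) {               ;# metadata is a key-value list
    puts "$key: $value"                          ;# e.g. server name, MIME type
}
::http::cleanup $token
```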
There's no need to limit yourself to text. The previous example downloaded a text document, but the getFromWeb routine can also download images, applets, or anything else on the Web. The example program in Listing 6 is a special-purpose browser that interfaces directly with the Internet and downloads a weather map of the San Francisco area (see Figure 2).
This is a very simple example, but it wouldn't be difficult to analyze the weather map for specific weather patterns by downloading the image periodically, analyzing the colors of the pixels at specific coordinates, and correlating those colors with a weather advisory.
This technique may yield information that's more reliable than data based solely on an agent's ability to parse text. The weather advisory would be even more accurate, however, if it were combined with accurate textual information. A decision based on a variety of input is usually more intelligent than a decision made from a single source of information. Decisions based on text alone also depend on the agent's ability to discern information from a language designed for people. Unfortunately, dependence on language creates inherent limitations. A classic example of this was described in Fred Moody's book, I Sing the Body Electronic (1995, Viking Press), in which he explained that Microsoft used an agent to automatically cross-reference topics in an early prerelease version of Encarta. According to Moody, the limitations of the agent were apparent when the section on camping suggested further reading on concentration camps.
The image in Figure 2 was generated by the National Weather Service with tax dollars, and is in the public domain. There are many image files on the Internet that contain similar real-time information. Many financial Web sites, for example, use GIF files to display graphs of daily stock movement. It's important to remember, however, that much of the information on the Web is someone's intellectual property and may be protected by copyright. Make sure you're not violating copyright law if you use information that you don't own.
The program in Listing 7 (auction.tcl) is a Web agent that alerts the user when a Web page changes. If the Web page is from an auction site, it can indicate when a new bid is placed.
Once a URL is entered and the button pressed (see Figure 3), the page is retrieved from the Web with the getFromWeb procedure. After a predetermined period, the same Web page is downloaded again. If the newly retrieved data doesn't match the initial data, a dialog box is displayed, alerting the user that a new bid has been made (see Figure 4).
If the OK button is pressed, a browser is launched, the Web page is displayed, and action can be taken depending on the auction's status. If you're a careful programmer (and willing to live with the results!) it's possible to augment this program to automatically reply with a higher bid. The getFromWeb procedure can do this if you place HTML <FORM> information after the "?" in the URL.
The auction Web agent uses the parse procedure to discard any data outside the document's body. This helps, but it's important to remember that any change, including a page counter, will register as a change. You may have to further parse the retrieved data to discard counters and rotating banner ads.
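The polling loop at the heart of such an agent might be sketched as follows, using Tk's after command to reschedule itself. The procedure name, file name, and interval are illustrative, and the real auction.tcl also parses out the document body as described above.

```tcl
# Sketch of a change-detection loop; assumes getFromWeb (Listing 1) is loaded.
proc watch {url} {
    global baseline
    getFromWeb $url current.html               ;# download the latest copy
    set f [open current.html r]
    set page [read $f]
    close $f
    if {![info exists baseline]} {
        set baseline $page                     ;# first pass: remember the page
    } elseif {[string compare $page $baseline] != 0} {
        tk_messageBox -message "The page changed -- possibly a new bid."
        set baseline $page
    }
    after 60000 [list watch $url]              ;# check again in one minute
}
```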
When It's Your Web Server
It can be daunting to identify useful information on the Internet since you have no control over the content of most Web pages. If you do control the content, however, your job becomes much easier if you create your own easy-to-parse tags.
For example, your Web server might pull a name from a database. You can make that name easy to parse if you format it like this:
<name> "Abe Lincoln" </name>
As a rule, Web browsers ignore tags they don't understand. As long as the tags you create are unique, these pseudo-XML tags should be invisible to people not using your custom Web agent.
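Given tags like these, the inclusive parse procedure from Listing 3 pulls the value out directly; the file name and the gap of 60 characters below are arbitrary choices:

```tcl
;# Extract the tagged name from a downloaded page with Listing 3's parse.
parse downloadedPage "<name>" 60 "</name>" nameFile
```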
The second line of uncommented code in getFromWeb is used to identify the Web agent to the Web server. Your server can use this information to decide whether the data is being accessed from a standard browser or from a specialized Web agent. You can program your server to authenticate users and provide special data or security depending on which Web agent is used.
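With Tcl's http package, that identifying string is set through ::http::config, which places it in the User-Agent request header. The agent name here is whatever you choose:

```tcl
package require http
::http::config -useragent "mySpecialAgent/1.0"  ;# sent as the User-Agent header
```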
Ideas for Your Own Projects
You have the basic skills to pull information from the Web, parse what you need, and interface with the rest of the world. Now, where do you start getting ideas for new Web agents? Possibly the best way is to watch the way you use the Internet. Look for things that could be done more effectively. Do you find yourself replicating efforts? Can you combine individual functions on existing Web sites into a single effective application? Do you miss important information because you're busy doing something else? Are there aspects of the Web that you'd like to incorporate with non-Internet based technology? Is there a better way to format information? If you develop something interesting, let me know.
Michael is an Internet strategist living in Minneapolis. His email address is [email protected].