Dr. Dobb's | Implementing A Smart Spider

Implementing A Smart Spider

The smart spider presented here crawls the Web, locating links of interest based on topic keywords.

August 01, 2002
URL:http://www.drdobbs.com/web-development/implementing-a-smart-spider/184405127

Aug02: Implementing a Smart Spider

Manu is a lead engineer and Ray chief scientist at the Center for Integrated Intelligence Systems at the Mitre Corp. They can be contacted at [email protected] and [email protected], respectively.

For most users, searches on the Web begin with a query to one or more search engines. However, with the size of the Web growing at a rate of nearly a million pages a day, it is increasingly difficult for general-purpose search engines such as Google (http://www.google.com/), Fast (http://www.alltheweb.com/), and Inktomi (http://www.inktomi.com/) to keep up with the flow of information by indexing the entire Web. The limits imposed by search engines restrict access to a small fraction of the number of hits for a query. And while these can provide effective access to information in highly specific topic areas, they are limited in their ability to provide coverage on broad or rapidly changing issues for a number of reasons, including:

Current search engines are most often engineered to support broad user populations and have designed web crawlers (spiders) that collect pages based on web site popularity, collection cost, or other economic criteria.
Broad queries can generate many thousands of hits.
The ranking algorithm of most search engines is proprietary and the user has no control over the order of hits.
Search engines usually lag the Web by a period based on the frequency of the spider's visit to a site. This period could range from a day to three months. (For example, to estimate the period between visits to a site on Google, try the query "cache:www.cnn.com" — new links are not usually indexed at the time of a query.)
Search engines retain a percentage of dead links in the index.

In essence, search engine limitations have the effect of restricting access to a small subset of the relevant information on the Web and provide only a starting set of pages, leaving it to users to manually traverse and locate other relevant pages not provided by the search. We assume that the most relevant pages on the Web are accessible from a small collection of relevant pages; however, it is possible for relevant pages not to be linked to any page on the Web, making it inaccessible to any crawler. However, most relevant pages will have at least one or more links from other pages.

With that in mind, we'll present in this article a smart spider that crawls the Web, locating links of interest based on topic keywords. Rather than indexing and collecting all accessible web documents, our approach relies on a small group of URLs initially chosen by users to expand the search. It then exploits links inserted by authors of web pages and ignores navigational links in a page. We assign higher importance to links across sites. When authors add a link to a page in another site, it is usually an implicit endorsement of the linked page. The smart spider follows these links and extracts the associated text. Irrelevant pages are ignored in the crawl, limiting the crawl to a small, but relevant, area of the Web; see Figure 1. Links from relevant pages (green circles) are traversed. By intelligently pruning the web tree, the smart spider can follow paths up to 10-12 links away from an initial set of pages without collecting millions of documents and overwhelming the system.

Among the advantages of this method over the use of search engines are:

There is no limit on the number of pages that can be retrieved.
The ranking algorithm is determined by users and the results can be sorted in any order.
Valuable pages, many links away from the initial set of pages, can be located. These pages may not be available from a search engine.
A high-quality document collection can be built for a specific topic. The volume of data collected depends on the query and constraints applied to the spider.

Approach

The smart spider expands the search within some specified constraints; for example, setting the number of links from the initial set of URLs or the total number of collected URLs. The automated approach can be used to collect 10,000 or more URLs. This list of URLs can be ranked based on user preferences, such as the URL with the most "in" links (links from other URLs to a URL), the URL with the most "out" links (links from a URL to other URLs), or the site with the most relevant pages. A page with many relevant in links is possibly an authority for a topic, while a page with many out links is a hub. Locating the hubs and authorities for a particular topic is usually sufficient for a researcher to identify key sites and collect data. None of this information is directly available from a search engine. In a broad query, a search engine provides a set of relevant pages from which a user must manually expand the search: It is most probable that valuable sources will be overlooked while following links to manually build a database of major sources (it can take many hours to manually follow links and build a database of major sources).

One of the well-known algorithms for finding hubs and authorities is the algorithm presented by Jon Kleinberg (see "Authoritative Sources in a Hyperlinked Environment" at http://www.cs.cornell.edu/home/kleinber/auth.ps), which is based on a mutually reinforcing relationship between hubs and authorities. A good hub points to many good authorities and a good authority is pointed to by many good hubs. In an iterative algorithm, hub and authority weights for each URL can be adjusted to reflect the mutually reinforcing relationships. At the end of the HITS algorithm, each URL has an associated hub and authority weight. The major hubs and authorities can be located by ranking the URLs in descending order of weight. In our experiments for several broad queries, we were unable to obtain a sufficient number of reinforcing relationships within our sample of URLs. This is not surprising considering the size of the Web.

For example, a "censorship" query on Google produces over 500,000 hits. If we use the initial 200 URLs to expand the results to about 10,000 URLs, we still have access to only about 2 percent of all the hits. Collecting all 500,000 is not a simple task. Therefore, we have a very small subset of the complete web graph for the query. This makes it unlikely that a sufficient number of mutually reinforcing hubs and authorities will be found in the subset of the web graph. Therefore, the hubs and authority weights for most pages will not reveal importance or ranking in the web community for the topic.

For those queries that generate several hundred thousand hits, excessive resources would be required to collect all hits with our smart spider. It is likely that there would be convergence without collecting all relevant pages on the Web. Both the initial set of URLs and termination criteria determine the results for the crawl — to expand the crawl, it is possible to use a larger initial set of URLs by combining the query results from multiple search engines.

Using Multiple Spiders

We've used a MySQL database, although other relational databases can be used. All SQL statements are executed through a common module. Similarly, a common module is used for external communication. The code for the smart spider has been parallelized by using the fork() function, which is available in UNIX and Windows implementations of Perl. If there are problems with fork() in Windows, the Win32 module can be used instead. SQL statements to lock/unlock tables are used to synchronize database access. This parallel implementation is not scalable beyond eight or 16 spiders. The purpose of this implementation is to improve performance over the sequential implementation where a single URL is processed at a time. When URLs are processed sequentially, the spider is subject to long Internet delays due to dead links or traffic congestion. In a parallel implementation with multiple spiders, it is unlikely that all spiders will face identical delays.

In a scalable parallel implementation, the database tables could be replicated across multiple processors with periodic exchange of messages to synchronize the tables. However, at many locations with shared access to an Internet connection, bandwidth limitations may restrict the total number of spiders. With our parallel implementation on a PC, the performance of other tasks on the same machine may be affected. For minimum impact on the bandwidth of an Internet connection and the performance of a machine, two to four spiders can be used. Example 1 is the algorithm for the spider.

The process begins by collecting an initial set of URLs from a Google query. The size of the initial set can vary from 100 to 1000 URLs. Results from multiple search engines can be used to expand the initial set. The process of collecting the initial set of URLs is similar to the manual process of submitting requests to a search engine through a browser. The source returned by the search engine is parsed and the URLs returned are saved in a table.

The smart spider is implemented in Perl (available electronically; see "Resource Center," page 5) and includes embedded SQL calls.

For every query, two tables are created — a URL table and a link table. A separate query table tracks all the queries submitted by the smart spider. The URL table for a query contains the following columns — url, status, site, relevancy, level, and spider. For each URL, we have the associated parameters:

Status, which indicates if a URL has been processed, is not processed, or is being processed.
Site, the web site of the URL; useful for grouping URLs by site.
Relevancy, a real number between 0 and 1 indicating the degree of relevancy to the query. The value of this column can be set based on user-determined criteria.
Level, the number of links from the initial set when a particular URL is first encountered. URLs in the initial set are at level 0 — those that can be accessed in one hop from the initial set are at level 1, and so on. The value does not necessarily indicate the length of the shortest path from the initial set because the same URL may be accessible through a shorter path. In our source, we have not tracked duplicates or changed the level number as necessary.
Spider, the number of the spider that processed a particular URL. This information is useful to track load distribution in a parallel implementation. The link table contains two columns, a URL, and a link.

Following the collection of an initial set of URLs, the URL table has a number of unprocessed URLs available for individual spiders. Each spider enters a loop and exits when the termination criteria have been met; in our case, when the required URLs have been processed or when there are no more URLs to process. For each URL:

Individual spiders verify that the text contains the query keywords as defined in the Boolean query: A recursive match function checks for keyword matches and handles Boolean AND and OR operators. Multiple keywords entered without Boolean operators are assumed to have a default AND operator.
A constant value of 0.2 is assigned for relevancy if the URL text matches the query keywords.
Higher relevance can be set for documents that have keywords between header, title, or metatags.
For queries with multiple keywords, relevancy can be inversely proportional to the word distance between keywords in URL text.
Other user criteria such as the length of the text or URL site can be used to compute relevancy. Customization of relevancy allows for user-driven preferences when ranking URLs.

After assigning the relevancy of a URL, all out links are extracted. These links include links to external and internal sites. The following links are not traversed:

Links to advertisements, images, and other formatted documents.
Links to pages in the same site, with the exception of links that contain any of the query keywords in the URL or the anchor text (description of the link in the page).
Links from high-volume sites (by setting a limit on the number of URLs from any particular site).

Entries are made in the URL and links tables for URLs that will be traversed. After the smart spider has terminated, a list of URLs can be printed in descending order of relevancy. The hubs and authorities can be extracted from the links table. This information is often sufficient to find key sources for a topic on the Web. A reasonable number of URLs should be collected to find the hubs and authorities. The web tree for each topic is unique and it is not possible to set a fixed URL limit for all queries.

Results

We evaluated the smart spider by running a few broad queries. We submitted a "censorship" query to Google, which returned about 530,000 hits. The top 100 URLs from Google were used as the initial set of URLs for the smart spider. A page was considered relevant if at least two of four words — censorship, ban, speech, or censor occurred in the text. The expansion through the smart spider was limited to three levels from the initial set of URLs.

About 10,000 URLs for the query were collected. The list of URLs was evaluated based on the number of in links and the number of out links. The list of sites with the highest volume of pages was also identified. In the initial collection of 100 pages from Google, pages from sites such as http://www.epic.org/, http://www.aclu.org/, http://www.cato.org/, http://www.ala.org/, http://www.eff.org/, http://www.efa.org.au/, and http://www.cyber-rights.org/ were identified. These sites were not ranked in any order. With the smart spider, we were able to rank these sites based on the number of pages and locate additional sites such as http://www.cpsr.org/, http://www.cdt.org/, http://civilliberty.about.com/, http://www.hazlitt.org/, http://techlawjournal.com/, http://www.efc.ca/, and http://www.newshare.com/.

The following sites were the top four sites ranked by the number of pages: 628 URLs were located at the http://www.aclu.org/ site, http://rene.efa.org.au/ (469), http://techlawjournal.com/ (453), and http://www.epic.org/ (284). These sites are potentially good sources to mine for censorship-related information. Since there is no benchmark for the censorship query, it is difficult to validate the results. We did find some overlap between the sites in our URL collection and the sites under the "Censorship and Net" category in the Yahoo search engine taxonomy.

We also ranked the URLs based on the number of links from pages at other sites. Several pages from the http://epic.org/ and http://civilliberty.about.com/ sites were most often referenced by other pages. The list of these URLs provides another set of URLs to begin locating censorship information. We did not find many useful URLs among the list of URLs with the most out links. Some of these pages contained as many as 6000 links. Several of these pages were just a long list of URLs with a few links to censorship-related pages.

Conclusion

Searching for relevant information on the Web is a tedious task. Search engines provide an initial set of URLs to begin locating information. This set of URLs may include some valuable sources, but it is up to the user to scan each URL and manually determine the most relevant sites. The smart spider presented here relieves the burden of the user by intelligently crawling the Web and locating as many relevant URLs as possible. At the end of the crawl, the user can identify key sources of information on the Web for a particular topic. The smart spider adds a higher degree of confidence for a search on the Web over the typical search engine query, but it does not find every relevant URL on the Web. The number of hits returned by search engines for broad queries can be in the hundreds of thousands of URLs. It can take a long time to retrieve and process 100,000 URLs. The search can be limited by restricting the number of links from the initial set of URLs or the total number of URLs. The smart spider is a useful tool to expand a web search within user-specified constraints.

DDJ

Aug02: Implementing a Smart Spider


1. Load an initial set of URLs for a query from Google
2. Begin Parallel
   * Lock the URL table
   * Find an unprocessed URL in the URL table
   * Set the status of the URL to v (visiting)
   * Unlock the URL table
   * Get the text and links for the URL
   * If the text contains the query keywords
     *  add the links from the URL to the url and link tables
   * else
     *  delete the URL and all associated links from the url and link tables
   * check for termination 
3. End Parallel
4. Wait until all spiders processed in the parallel section complete
5. Clean up and remove URLs and links that have not been processed

Approach

Using Multiple Spiders

Results

Conclusion

Example 1: The smart spider algorithm.

Figure 1: A web tree for a query from an initial set at level 0.