Web Site Searching & Indexing in Perl


April 2001

Neil is a web site developer who can be contacted at [email protected].


Most nontrivial web sites these days provide some kind of local search facility. These let visitors look for words and phrases in a manner similar to that used by sites such as AltaVista and Google. To implement a search facility, you basically need two components — an indexing tool that can build up a database and keep it up-to-date as the site changes, and a search tool to query the database and present results. There are many options out there, both commercial and free.

As a web site developer, I work using purely open-source tools such as Linux, Apache, mod_perl (http://www.cpan.org/), Embperl (http://perl.apache.org/embperl/), and MySQL (http://www.mysql.com/). So when I needed to provide search capability, I wanted to stay in the open-source arena rather than going commercial. Also, being the typical lazy programmer, I was really hoping that someone had already written a magic script that I could just plug in and run.

But I was a little surprised to find that there were no immediately obvious choices out there in open-source land — there just didn't seem to be any simple, easy black-box scripts that could crawl my site and index it into a real (that is, relational) database. I did find, however, a number of Perl modules that provided the necessary building blocks and could be wrapped up relatively easily. It turns out that these building blocks are actually very substantial, providing most of the required functionality. So Perl fulfills its classic purpose yet again — programming glue.

I'd like to get on a soapbox for a moment regarding Perl. For 10 years, I have been heavily into C++ and Java. I steered clear of Perl because whenever I looked at Perl programs, they seemed totally incomprehensible — all those weird symbols! Also, I believed the myth that Perl was old hat, and making web sites with Perl was the way of the dinosaurs. No, the only way to go was with Java Servlets and C++ for speed. Then, about a year ago, I finally sat down and tried Perl out on a small personal web site I was developing, using Embperl. I was promptly blown away by its power and flexibility, and all the stuff that's out there on CPAN. Also, with the advent of Apache/mod_perl, all the old criticisms of Perl associated with CGI go out the window (since Perl is always loaded up, each server-page access is very fast). Finally, I have been mightily impressed that since I started using Perl on my web sites, they haven't crashed. Never. Sure, I get bugs, but they never seem to last long, and they are never those mysterious memory bugs that you get with C++. And because there is no compilation phase, it's much faster to develop than Java. The combination of Linux, Apache, mod_perl, MySQL, and Embperl makes a world-class platform. Perl even does the object-oriented thing in a way that is so intuitive and neat that I will probably never go back to C++ or Java again. Friends who use Java and C++ a lot always seem to be spending time debugging code, whereas the Perl stuff just seems to work. And as for the incomprehensibility of many Perl programs — I have realized that while Perl certainly lets you do write-only code (just like pretty much any other language), there's nothing stopping you from writing perfectly clear programs. Perl lives! And it rocks!

Back to our Regularly Scheduled Programming

Mysearchbot only took a couple of days to write — it's a simple script that crawls any web site you specify and indexes all the pages it finds into a MySQL database. This database can then be searched from a web browser HTML form in much the same manner as the major search engines. Users type in phrases, and the database query returns a list of URLs and page titles that contain the phrase. To demonstrate this, I've included an example Embperl form (available electronically; see "Resource Center," page 5). As you'll see, most of the real work is done by other CPAN Perl modules.

You can see Mysearchbot in action at one of my web sites; for instance, go to http://www.crazyguyonabike.com/ and select the Search option in the navigation bar. This web site (a journal of a bicycle ride across America) has a fair amount of text in it, and you can get a good feel for how the search index works.

I'll also provide a quick overview of Mysearchbot's functionality, describe how it all works, and present some potential areas of improvement. It has been designed to operate almost completely from the command line. Finally, I won't provide a line-by-line account of the script here — you can examine the extensive comments in the program listing (also available electronically) for the gory details regarding the inner workings.

Functionality

At a high level, here's how the bot works: Mysearchbot connects to your web server through TCP/IP, just like any other web client (such as a browser). Because of this, it really doesn't matter how your web site is implemented — the bot just sees the final HTML. It doesn't care if that HTML was generated by Apache, Perl, Python, Java, or Visual Basic. The bot reads in the root page, indexes it in the MySQL database, and parses out all the links contained in the page. It then follows all those links until the link stack is empty.
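To make the flow concrete, here is a minimal sketch of that crawl loop, assuming the modules named later in this article (LWP::RobotUA and HTML::LinkExtor). It is illustrative only, not the actual Mysearchbot code; the agent string, e-mail, and mysite.com are placeholders.

  use strict;
  use LWP::RobotUA;
  use HTTP::Request;
  use HTML::LinkExtor;

  my $root = 'http://www.mysite.com/';
  my $ua   = LWP::RobotUA->new('Mysearchbot/1.0', 'you@example.com');

  my @stack = ($root);        # links still to visit
  my %seen  = ($root => 1);   # never queue the same URL twice

  while (my $url = pop @stack) {
      my $response = $ua->request(HTTP::Request->new(GET => $url));
      next unless $response->is_success;
      my $html = $response->content;

      # ... index $html under $url in the MySQL database here ...

      # Parse out the A HREF links; giving the base URL resolves relative links
      my $extor = HTML::LinkExtor->new(undef, $url);
      $extor->parse($html);
      for my $link ($extor->links) {
          my ($tag, %attr) = @$link;
          next unless $tag eq 'a' && $attr{href};
          my $abs = "$attr{href}";            # already absolute
          next if $seen{$abs}++;
          next unless $abs =~ /^\Q$root\E/;   # stay on our own site by default
          push @stack, $abs;
      }
  }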

One issue that arises here is links that go off your site — you probably don't want the bot to follow those, so by default, it won't. But there is a command-line option to allow it, just in case you feel like going off and indexing the entire Web...

Only "A HREF" links are followed; IMG links, for example, are ignored. Also, the bot stores a checksum of the page in the database so that when it sees it in future runs, it can skip the indexing if the page is unchanged. This doesn't save on web-server accesses, but it does on database and CPU time during indexing. At the end of each indexing run, stale links are removed automatically. Examples 1 through 5 illustrate how you would make an index.

Bots should be polite to the web servers they visit. To this end, Mysearchbot respects any robots.txt file it finds. Also, there is a delay that you can specify, which makes the bot wait for some period of time between page requests, so that it doesn't swamp your server. A good default is something like 10 seconds: a really busy site won't notice it, and a quiet site shouldn't be unduly bothered by hits at that frequency. Again, you can change the default if you like. When testing my own site, I liked to set the delay to zero, just so I could see it doing its thing in real time.
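With LWP::RobotUA (the module used later in this article), both the robots.txt handling and the delay come almost for free. One wrinkle worth noting in a sketch: its delay() method is expressed in minutes, not seconds.

  use LWP::RobotUA;

  # The agent name and e-mail here are placeholders; --email supplies the real one
  my $ua = LWP::RobotUA->new('Mysearchbot/1.0', 'you@example.com');

  $ua->delay(10 / 60);   # wait 10 seconds between requests to the same server
  # $ua->delay(0);       # handy when testing against your own quiet server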

When Mysearchbot indexes a page, it associates the TITLE tag of the page with the URL in the database. Consequently, make sure that the TITLEs of your pages make sense — when you do a search later, what is returned is a list of URLs that contained the search term. This list can then be used to look up the associated TITLE in the database, and it is this string that is displayed for users. You can see this in the example Embperl form that demonstrates a search. Embperl is a cool tool for embedding Perl code in your HTML files. Even if you haven't used it before, you should have no problem understanding this example if you are familiar with Perl. Embperl is an equivalent tool to PHP, but it's better because of all the extra stuff you get for free when you use Perl.
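The example Embperl form itself isn't reproduced here, but a stripped-down search page along the same lines might look like the following sketch. The index name fts_web, the fts_doc table of URL/TITLE pairs, and the connection details are illustrative assumptions, not the names used in the actual example.

  [# Sketch of an Embperl search page -- not the example form from the
     article. Table and index names below are illustrative assumptions. #]
  <html>
  <head><title>Search this site</title></head>
  <body>
  <form method="post">
    Search for: <input type="text" name="phrase">
    <input type="submit" value="Search">
  </form>

  [$ if defined $fdat{phrase} and $fdat{phrase} ne '' $]
    [-
      use DBI;
      use DBIx::FullTextSearch;
      $dbh  = DBI->connect('dbi:mysql:mydb', 'root', '');
      $fts  = DBIx::FullTextSearch->open($dbh, 'fts_web');
      # contains() works word-by-word in the simplest setup
      @urls = $fts->contains(split ' ', $fdat{phrase});
    -]
    <ul>
    [$ foreach $url (@urls) $]
      [- ($title) = $dbh->selectrow_array(
           'SELECT title FROM fts_doc WHERE url = ?', undef, $url) -]
      <li><a href="[+ $url +]">[+ $title || $url +]</a></li>
    [$ endforeach $]
    </ul>
  [$ endif $]
  </body>
  </html>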

If you're anything like me, you'll find that you're at times a little lazy with regard to the TITLE tags on your pages. It's easy to forget them, or to make them all say the same thing. So when you first see your search results, don't be discouraged if a long list of identical titles comes back — you just need to do a little work to make sure each page title is unique and gives some context to the page.

Finally, you probably won't be indexing your site just once and then never again. Things change, and you'd like to just say, "Go forth and index!" and forget about it. So, Mysearchbot has an option to just loop, indexing and reindexing the site.

Example 4, for instance, adds a URL to the database, sets the delay for that site to 10 seconds, lists all the URLs currently in the index, and finally does a single indexing run. You don't need to do all this in one command. Example 5, by contrast, keeps looping and reindexing, never exiting; there, I have also added the option for keeping the script "quiet" — suppressing the output to the screen about which URLs are being indexed. Remember, since it stores checksums of the pages, it won't be doing a lot of work after the first pass unless things have changed. Thus, you can just leave Mysearchbot to chug away happily, and depending on the delay you have set and the size of your web site, you can probably expect it to cycle every few hours. If you have several different sites that you want to index in different databases, then you could just run Mysearchbot multiple times in parallel.
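A continuously looping, quiet run would look something like the command below. The individual options are covered in the next section; since the original example listings aren't reproduced here, the script name and exact spellings are a sketch rather than a verbatim Example.

  mysearchbot --database=mydb --email=you@example.com --loop --verbose=0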

Using Mysearchbot

To use the script, you first enter the command without any parameters, and it prints out its syntax and some examples. Rather than repeat the whole thing here, you can just look at the listing (or run the script) and see a comprehensive list of the options and some examples of common usage. I will merely comment here on some of the basic aspects of using the script.

There are two options that are always required. The first is --database, because just about anything we do will require the use of the database. The other required option is --email, because your contact e-mail is passed into the RobotUA and is used, by convention, to identify the robot to the web site it is indexing.

All other options are optional. The first thing you probably want to do is initialize your index database. Example 1, for instance, takes an existing MySQL database, mydb, and creates the necessary tables in it using the default prefix fts_. Also, I give a URL for indexing (mysite.com) and specify a delay of 0 seconds between hits, because it's our server and nobody's using it right now anyway, and I want to see the thing go through in real time. Finally, I specify --index to make it do the index run. This demonstrates multiple operations being specified on the same command line.
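That command would look something along these lines; the script name and the --delay_secs spelling are my guesses, while --database, --email, --create, --insert_url, and --index are the options named in this article:

  mysearchbot --database=mydb --email=you@example.com \
              --create --insert_url=http://www.mysite.com/ \
              --delay_secs=0 --index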

You do need to make sure that the database already exists in MySQL before running Mysearchbot. You can have other tables in the database if you like, or it can be empty. Mysearchbot creates its tables using the prefix fts_, which you can change with the --index_name option. Once you have a database, you can initialize the Mysearchbot tables using the --create option. At the same time, you can use --insert_url to specify your web site address. Then, --index does the work. Example 3, for instance, reindexes all sites that were previously inserted using --insert_url. By default, URLs are printed to stdout as they are indexed (or skipped, if they haven't changed since the last run); you can turn this off using --verbose=0. If you want the bot to keep looping indefinitely, use --loop instead of --index; see Example 5. To test out the index, use --search to look for a particular phrase that you know to be present; the bot prints out the URL and TITLE of the pages that contain the phrase, as in Example 6.
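For instance, a quick test of the index from the command line might look like this (the search term is obviously just an illustration):

  mysearchbot --database=mydb --email=you@example.com --search="bicycle"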

If you want to delete the tables that Mysearchbot created, then use the --drop option. You can use this with --create to reinitialize the whole thing. Example 2 is for when you have an existing index in the database but still want to start from scratch; the only difference from Example 1 is that you add --drop to clear the existing tables. You could also use --drop by itself, without --create and the subsequent options, if you merely want to remove the index from the database altogether. The order of the options on the command line doesn't matter. Example 7, for instance, involves several operations: first drop the URL from the index (because there is currently no option to update it), then reinsert the same URL with the desired delay_secs. Dropping and reinserting a URL does not in itself affect the index; it merely deletes and inserts records in a database table that is used during indexing to determine which sites are to be visited. If you decide to drop a URL completely from the index, just use --drop_url; on the next --index or --loop run, the documents from that URL are automatically purged, since all documents in the index that were not visited during a given run are deleted at the end of that run. In Example 7, I also list the current URLs as a check, and finally perform a single-pass reindexing run.
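In the same sketched notation, starting over from scratch (the Example 2 situation) and changing a URL's delay (the Example 7 situation) might look like the commands below; --delay_secs and --list_urls are guessed spellings, since those options' exact names aren't given in the text.

  # Drop and recreate the index tables, then reindex from scratch
  mysearchbot --database=mydb --email=you@example.com \
              --drop --create --insert_url=http://www.mysite.com/ --index

  # Change a URL's delay by dropping and reinserting it, then list the
  # URLs as a check and do a single indexing run
  mysearchbot --database=mydb --email=you@example.com \
              --drop_url=http://www.mysite.com/ \
              --insert_url=http://www.mysite.com/ --delay_secs=10 \
              --list_urls --index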

There really isn't that much more to it. The whole idea is for the script to be as simple as possible and self-contained.

How it Works

In the grand tradition of Perl, I stand on the shoulders of other programmers to implement this indexing engine. In particular, the DBIx::FullTextSearch module performs the crucial indexing step into MySQL. The LWP::RobotUA module provides the basic functionality to fetch web pages while respecting the robots.txt rules. I use the Getopt::Long module to help with the command-line arguments. HTML::LinkExtor is used to extract the links from the web pages, and of course, DBI is used for all the database access. It goes without saying that you need to have MySQL already installed on your system, though the database doesn't have to be on the local box. I have set up my MySQL installation to require no username or password for the root user, because I am the only user on my server and it sits safely behind a firewall (the web server is accessible from the outside, but that's about it).
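To show how the central piece fits together, here is a rough sketch of the DBIx::FullTextSearch calls involved. It is not the actual Mysearchbot code; the index name fts_web and the create() options are assumptions, so check the module's documentation for the variations it supports.

  use DBI;
  use DBIx::FullTextSearch;

  my $dbh = DBI->connect('dbi:mysql:mydb', 'root', '');

  # First run (--create): build the index structures in the database
  my $fts = DBIx::FullTextSearch->create($dbh, 'fts_web',
                frontend => 'string', backend => 'blob');

  # Later runs just reopen the existing index:
  # my $fts = DBIx::FullTextSearch->open($dbh, 'fts_web');

  # For each fetched page, index its text under its URL...
  my $url       = 'http://www.mysite.com/';
  my $page_text = '... text extracted from the fetched HTML ...';
  $fts->index_document($url, $page_text);

  # ...and a search returns the URLs whose pages contain the word
  my @urls = $fts->contains('bicycle');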

If you're interested in the inner workings of the code, take a look at the listings available online. They have extensive comments and should be largely self-explanatory. If you are wondering about any of the calls to the other modules, then take a look at the man page documentation for that module.

Areas of Possible Improvement

I've intentionally kept the script simple to make it easier to type in and read. As with any utility of this kind, there are enhancements that could be added. For example:

  • Support for databases other than MySQL. This is not as simple as it sounds, since the DBIx::FullTextSearch module currently supports only MySQL, so the script depends on that.
  • More support for the different indexing and search options that are available in DBIx::FullTextSearch. I have used the simplest configuration, allowing searching for a single phrase at a time. If you look at the documentation for that module, you'll see that it can handle other types of indices.

  • The ability to index multiple web sites into multiple databases from a single process, perhaps in parallel using threads. Currently, you can specify multiple URLs to index, but all the URLs registered in a particular database are indexed into that same database. You may want to index several different web sites, each into its own database, simultaneously, so that the search results for one web site won't be mixed up with those of another. You can do this currently — just run the script multiple times, once for each database. But if you don't like having the Perl interpreter running multiple times (memory issues, for example), then you may want to enhance the script to handle multiple databases on a single run.

  • Weights and prioritization of results. This is really dependent on the functionality of DBIx::FullTextSearch. Also, the output currently just prints out the TITLE of the web page, not any context of the search phrase within the document. I figured that for most web sites doing local searches, this is sufficient. You may want to play with enhancing this functionality.

Conclusion

Mysearchbot is really nothing new, but it does a basic job. It makes a good starting point for your own indexing engine, since it is always easier to begin with some working code and adapt it to your needs. If you have a major enhancement that you think other people would find useful, then please do pass it back to me. If it makes sense, then we could even put the script into some kind of open-source forum so that it can be developed properly. In the meantime, happy indexing!

DDJ

