Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

JVM Languages

A Java Applet Search Engine


Feb99: A Java Applet Search Engine

Tim is senior technical editor for DDJ. He can be reached at [email protected].


Sidebar: Java Applets on CD-ROM

HTML is perfect for CD-ROM publishing: Tools for authoring and managing HTML are mature and stable, and most users already have a browser (and if they don't, you can usually include one on your CD for free). One weak point, on the other hand, is searching. If you have 200 MB of HTML, you'd like to provide users with a way to rapidly find information. However, most search engines are designed for use on web sites, and have to be started by a web server in response to certain requests.

The search engine I present here uses a Java applet to handle searches. This works well on CD-ROMs, although not without complications (see the accompanying text box, entitled "Java Applets on CD-ROM"). The applet uses an index database built by a Perl indexing program I described in "Full-Text Searching in Perl" (DDJ, January 1999).

My full-text search applet works like this: To find all HTML files that contain "java search engine," you first look up each separate word in a prebuilt database. Thus, when you look up "java," you might get the list 27, 28, 29, 38, 43, 79, 88. Looking up "search" and "engine" will give similar lists. You can easily combine these lists to find the files that match any particular Boolean expression. You then look up each file number to obtain a file description and display that information to users.

The Database

The first requirement for my applet was that it be able to access the on-disc database. Since I used Perl to create the database, I needed a database format that could be accessed from both Perl and Java. Berkeley DB (http://www.sleepycat. com/) is a portable C library with a clean interface to Perl. Although Berkeley DB does have a Java API, it works by calling the C library, an approach I could not use. Consequently, I wrote my own Java classes to read Berkeley DB B-Tree files.

A Berkeley DB file consists of a series of pages, all the same size. Page 0 contains the metadata in Table 1. The magic number is used not only to identify the file type, but also to determine the endianess of the platform that created the file. By recording the order of the 4 bytes 0x00, 0x05, 0x31, and 0x62 at the beginning of the file, you have a template for decoding 4-byte integers throughout the file.

Except for Page 0, all pages have a layout like Figure 1. The fixed-size header (Table 2) indicates certain basic facts about this page. The pointers are 4-byte offsets to nodes. As the page is filled, the pointers grow up and the nodes grow down into the free space.

In Table 3, a node on an internal page contains a key and the number of the page containing elements greater than or equal to that key but smaller than the next key. For leaf pages, each node has a key and a data item.

If a key is too big to fit in a page, the overflow-key bit is set in the flags, and the key is replaced by the page number where the actual key is stored. If the key spans more than one overflow page, the "next page" field can be used to chain together multiple overflow pages. Very long data items are handled similarly.

The DBBTree Classes

My Java code that reads Berkeley DB files consists of three classes. The primary class (and currently the only public class) is DBBTree (Listing One). You create a DBBTree object by giving it a filename or File object. It opens the file, reads the metadata, and then provides access to the database through the search() method, which accepts a byte array with the desired key, and returns another byte array with the corresponding data.

DBBTree handles the file management. It uses a RandomAccessFile() object to access the data and manages a cache of pages. The searching is handled by two auxiliary classes. DBBTreePage takes a byte array containing a DB page and provides a set of methods to access that page. This includes a search() method that searches that page for the indicated key. DBBTree handles searches by simply reading Page 1, creating a DBBTreePage object, then asking that object to search itself.

A DBBTreePage searches itself using a simple binary search. It uses the DBBTreeNode class to access particular nodes and compare them to the requested key. If the DBBTreePage is an internal page, it identifies the correct child page, asks the DBBTree object to fetch that page, then asks that page to search itself. This process continues recursively until a leaf page searches itself and either returns the desired data or returns a null value to indicate failure.

Every page or node object has a reference to the top-level DBBTree object. This allows any page or node to request a page. For example, if I need to add support for overflow keys or data, nodes will then be able to directly request those pages. This way, I preserve the abstraction that leaf nodes contain key/data pairs directly.

My current DBBTree package does not support writing to the database, deleting items, database cursors, or duplicate keys. However, the file format does support them, and my package should be easily extendible if I ever find such features necessary. Eventually, I hope to flesh this out and package it as a JDK 1.2-style collection class. That would make it easy to write fairly generic Java code that operated essentially the same regardless of whether the underlying storage was memory or disk.

Java and Files

Java supports URLs very well, but is somewhat myopic about filenames. For my search applet, I need to open a file called "index.db." But, it's probably not in the current directory, so I need to somehow generate the full path name.

Java does not provide a direct way to obtain the filename from which a document was loaded. Applet.getDocumentBase() will tell you the URL, but a URL is not a filename. Java does not provide a way to convert a "file:" URL into a local filename.

My solution is the messy piece of code in Listing Two that uses Applet.getDocumentBase() and the java.net.URL class to build a "file:" URL for the index.db file. I then simply try several combinations of path separators and other changes to build a filename that will successfully open.

Interactive Searching

One advantage of running on CD-ROM rather than over a network connection is that you have direct access to the database. Figure 2 shows the search applet interface. Note there are no buttons. I don't ask you to type your request and press a button to see the results. Instead, I update the search results with every keypress.

Full-text searching against a database is suited for this type of interactive searching. To rebuild the "Words" listbox in Figure 2, for instance, I need to first parse the query string and then look up each word in the database. Because I'm repeating the search on every keypress, probably only one word has changed since the last search. The words that haven't changed are stored in memory, so I only need to look up one word in the database. B-Trees store their data in sorted order, and the changed word has probably changed at the end. So the changed word is probably on the same database page, and that page is probably still in the database cache. Hence, It's quite likely that you can rebuild the "Words" listbox without going to the disk at all.

However, the "Articles" listbox is more time-consuming to update. A particular search may have completely different results than the previous search, so you can't rely on caching to save you from several hundred database probes. Worse, most AWT implementations completely redraw the listbox on every change, even if that change is not visible on the screen. This policy dramatically slows listbox updates; I've seen it take nearly 15 seconds to add 1000 or so entries to a standard listbox, even on fairly fast machines.

Lazy Listbox

Of course, you don't need all of the results immediately. All you really need are the first 15 or so entries (so you can draw the listbox display) and the total number of entries (so you can size the vertical scroll bar appropriately).

My lazy listbox accepts an array of integers and a database handle. It then proceeds to look up file numbers in the database only when they are needed for the display. The complete listbox implementation requires a Canvas (for the actual display), two scroll bars, and some layout glue to package it together. I deliberately wrote for JDK 1.0 to support older browsers.

My lazy listbox involves some interesting details. For example, since I don't already know all of the strings, I don't know how to best size the horizontal scroll bar. Instead, I simply track the maximum width of the strings I have seen and read just the horizontal scrollbar as I find out more information. In practice, this means that as you scroll the listbox vertically, the horizontal scroll button will shrink.

Things I learned about Java when implementing this custom listbox include:

  • JDK 1.0 and JDK 1.1 handle the maximum and minimum values for scroll bars differently. I haven't seen this mentioned in any Java documentation.
  • java.awt.Graphics.draw3DRect() computes highlight colors using its own technique, which doesn't match any particular platform, much less the current one. To get a nice border around my custom listbox, I drew it manually.
  • On different Java systems, the scrollbars send different events. (On one platform, the scrollbars send a string of mouse-move events when you move the mouse over the scrollbar.) It was important to explicitly check for the events I wanted to avoid needless redraws.
  • Generally, Java AWT implementations do a lot of needless redraws; be careful not to compound this problem.
  • Java AWT doesn't provide a blit function, so there's no way to accelerate scrolling; you have to simply redraw the entire display.

Fine Points

Thanks to feedback from a number of people, I incorporated numerous details to improve my search applet and make it easier to use and configure. For example, every text string you see in the user interface is defined by a parameter. That way, you can simply change the HTML page to translate the interface into a new language.

When you point to an article name, the corresponding URL is handed to Applet.showStatus(). This follows the common convention that reference targets are displayed in the browser status line. My original applet used double-click to display an article; I changed this because most web browsers use single-click to follow a reference.

The full source is available electronically (see "Resource Center," page 5). I've also included a sample index database generated from the last 10 years of DDJ. This database was created from 2500 articles, comprising 64 MB of HTML.

DDJ

Listing One

public class DBBTree {  protected RandomAccessFile file;
  protected File fileName;
  private int pageSize;  // page size
  private boolean msbFirst; // True = MSB byte order, False = LSB byte order
  /** The constructor takes a File name or File object. */
  public DBBTree(String name) throws IOException {
    this(new File(name));
  }
  public DBBTree(java.io.File name) throws IOException {
    fileName = name;
    try {
      file = new RandomAccessFile(name,"r");
    } catch (SecurityException e) {
      throw new IOException("Wrong permissions for "+name+": "+e);
    } catch (IOException e) {
      throw new IOException("file open failed: " + e);
    }
    /* Read the metadata.  I only read the first three values. */
    pageSize = 256; // Big enough to get all of the metadata
    byte [] metaData = readRawPage(0); // Page 0 holds metadata
    // Read magic number and determine endianness of DB file
    msbFirst = true;  // Try MSB format first...
    int magic = bytesToInt(metaData,0);
    if(magic != 0x053162) { // Failed? Try LSB...
      msbFirst = false;
      magic = bytesToInt(metaData,0);
      System.out.println("Opening LSB format database");
    } else
      System.out.println("Opening MSB format database");
    // Read version number and abort if not a DB file.
    int version = bytesToInt(metaData,4);
    if((magic != 0x053162) || (version != 3)) // BTREE Magic number & version
      throw new IOException("Not a DB file (magic: 0x" 
                    + Integer.toHexString(magic) + ", version: " 
                    + Integer.toString(version) + ")");
    // Set actual page size
    pageSize = bytesToInt(metaData,8);
    setCacheSize(8);    // Set the cache to 8 pages as default
    primeCache();       // Pre-read first 8 pages into cache
  }
  /** Search for a key.  A 'key' here is a byte array. Returns the data 
   * associated with the key or null if the key wasn't found. */
  public byte [] search(byte [] key) throws IOException {
    return readPage(1).search(key);    // Search always begins on page 1
  }
  /** Read a raw page, return an array of bytes containing that page. */
  protected byte []  readRawPage(int pageno) throws IOException {
    try{
      byte data[] = new byte[pageSize];
      file.seek(pageno*pageSize);
      file.read(data);
      return data;
    } catch (IOException e) {
      throw new IOException("readRawPage("+pageno+") failed: " + e);
    }
  }
  /** The cache is kept in LRU (least-recently-used) order; that is, whenever 
   * a page is accessed, it gets moved to the front of the list, and pages 
   * are lost when they fall off the end of the list. Happily, this code is 
   * simpler than the corresponding C version, thanks to the GC. */
  int cachePageNumbers[];
  byte cachePages[][];
  /** Set the cache size.  By default, it's set to 8 pages. */
  public void setCacheSize(int size) {
    int[] newCachePageNumbers = new int[size];
    if(cachePageNumbers != null)
      for(int i=0;i<size && i<cachePageNumbers.length;i++)
    newCachePageNumbers[i] = cachePageNumbers[i];
    else
      for(int i=0;i<size;i++)
    newCachePageNumbers[i] = 0;
    cachePageNumbers = newCachePageNumbers;
    byte[][] newCachePages = new byte[size] [];
    if(cachePages != null)
      for(int i=0;i<size && i<cachePages.length;i++)
    newCachePages[i] = cachePages[i];
    cachePages = newCachePages;
  }
  /** Fill the cache with the first pages in the database. This can improve 
   * perceived performance noticably. In particular, this pulls in page 1, 
   * which is the top page of the B-Tree, and is searched on every lookup. */
  public void primeCache() throws IOException {
    int i = 1;
    while(cachePageNumbers[cachePageNumbers.length-1] == 0) {
      readPage(i);
      i++;
    }
  }
  /* Because I'm only caching a few pages, it's perfectly reasonable to use a 
   * simple array and just move elements down in the array to maintain LRU 
   * order. For a large cache, something more sophisticated would be 
   * appropriate. */
  DBBTreePage readPage(int pageno) throws IOException {
    // Try to find page in cache, return it if found
for(int i=0; i< cachePageNumbers.length; i++) {
      if (cachePageNumbers[i] == pageno) {
    byte page[] = cachePages[i];
    if(i>0) {
      // Move this page to front of cache
      System.arraycopy(cachePageNumbers,0, cachePageNumbers,1,i);
      System.arraycopy(cachePages,0, cachePages,1,i);
      cachePageNumbers[0] = pageno;
      cachePages[0] = page;
    }
    // Return found page
    return new DBBTreePage(this,page);
      }
    }
    // Page wasn't in cache, read it from disk
    byte page[] = readRawPage(pageno);
    // Insert new page at front of cache.
    System.arraycopy(cachePageNumbers,0,
             cachePageNumbers,1,cachePageNumbers.length-1);
    System.arraycopy(cachePages,0, cachePages,1,cachePages.length-1);
    cachePageNumbers[0] = pageno;
    cachePages[0] = page;
    return new DBBTreePage(this,page);
  }
  /** Convert four bytes from a byte array into an integer, using LSB or MSB 
   * data order, as appropriate. 
   * @param barray Array of bytes
   * @param offset Offset in barray to read integer
   * @return integer formed from indicated bytes */
  int bytesToInt(byte barray[],int offset) {
    if(msbFirst)
      return 16777216 * (((int)barray[offset  ])&255) +
                65536 * (((int)barray[offset+1])&255) +
                  256 * (((int)barray[offset+2])&255) +
                        (((int)barray[offset+3])&255);
    else
      return            (((int)barray[offset  ])&255) +
                  256 * (((int)barray[offset+1])&255) +
                65536 * (((int)barray[offset+2])&255) +
             16777216 * (((int)barray[offset+3])&255);
  }
  /** Convert two bytes from a byte array into a short.
   * @param barray Array of bytes
   * @param offset Offset in barray to read short
   * @return short formed from indicated bytes */
  short bytesToShort(byte barray[],int offset) {
    if(msbFirst)
      return (short)(256 * (((short)barray[offset  ])&255) +
                           (((short)barray[offset+1])&255));
    else
      return (short)(      (((short)barray[offset  ])&255) +
                     256 * (((short)barray[offset+1])&255));
  }
}

Back to Article

Listing Two

  // There doesn't seem to be any one way to open a file that  // works on all platforms, so this function tries several...
  DBBTree openDatabase(String filename) {
    StringBuffer work;
    DBBTree db;
    // First, try seeing if there's such a thing as a 'current directory'
    try {
      db = new DBBTree(new File(filename));
      System.out.println("    Opened " + filename);
      return db;
    } catch (IOException e) {
      System.out.println("    Couldn't open " + filename);
      System.out.println("    IOException: " + e);
    }
    // Manipulate the document base URL to build a more complete pathname.
    try {
      filename = new java.net.URL(getDocumentBase(),filename).getFile();
    } catch (java.net.MalformedURLException e) {
      // If you can't build a more complete pathname, you're sunk
      return null;
    }
    // Try filename part of file: URL
    try {
      db = new DBBTree(filename);
      System.out.println("    Opened " + filename);
      return db;
    } catch (IOException e) {
      System.out.println("    Couldn't open " + filename);
      System.out.println("    IOException: " + e);
    }
    // Try changing the / to the native separator
    // Note: some platforms lie about the native separator... <sigh>
    work = new StringBuffer(filename);
    try {
      char separator = File.separatorChar; 
      System.out.println("    file.separator = " + separator);
      for(int i=0;i<work.length();i++) // Change / to native separator
    if(work.charAt(i) == '/') work.setCharAt(i,separator);
      db = new DBBTree(work.toString());
      System.out.println("    Opened " + work);
      return db;
    } catch (IOException e) {
      System.out.println("    Couldn't open " + work);
      System.out.println("    IOException: " + e);
    }
    // Strip first character ...
    filename = filename.substring(1);
    try {
      db = new DBBTree(filename);
      System.out.println("    Opened " + filename);
      return db;
    } catch (IOException e) {
      System.out.println("    Couldn't open " + filename);
      System.out.println("    IOException: " + e);
    }
    // Everything failed... <sigh>
    return null;
  }

Back to Article


Copyright © 1999, Dr. Dobb's Journal

Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.