Implementation
An experimental Q&A system that implements some of the features described here is available electronically (see http://www.ddj.com/code/) and on SourceForge (mustru.sf.net).
You first create an index from the text of files in a set of selected directories. There is no need to index the entire hard disk, since many directories hold system files that are of little interest to the typical desktop user. The selected directories are scanned recursively for files whose text can be extracted.
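The recursive scan can be sketched in a few lines of Java. This is a minimal illustration, not the code shipped with the system: the class name DirectoryScanner, the method names, and the short suffix list are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Stream;

class DirectoryScanner {
    // Hypothetical subset of suffixes, chosen only for illustration.
    static final Set<String> SUFFIXES = Set.of("pdf", "doc", "odt", "html", "txt");

    // Returns the lowercase suffix of a filename, or "" if it has none.
    static String suffixOf(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Recursively collects regular files with a known suffix under each root.
    static List<Path> scan(List<Path> roots) throws IOException {
        List<Path> found = new ArrayList<>();
        for (Path root : roots) {
            // Files.walk visits the root and everything beneath it.
            try (Stream<Path> paths = Files.walk(root)) {
                paths.filter(Files::isRegularFile)
                     .filter(p -> SUFFIXES.contains(suffixOf(p)))
                     .forEach(found::add);
            }
        }
        return found;
    }
}
```

Calling DirectoryScanner.scan with a list of the selected directories returns every file beneath them that carries one of the listed suffixes. Anything outside those directories, including system directories, is never visited.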
Files whose text can be extracted include formatted files (such as .pdf, .doc, or .odt files) and web pages. The suffix of a file determines which text filter is called; more than 40 different suffixes are currently handled. Image, audio, and video files are treated as media files, and the only text associated with them is the filename.
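One way to picture suffix-based dispatch is a table that maps each suffix to a filter function. The sketch below is an assumption-laden illustration: the class FilterDispatch, the two toy filters, and the list of media suffixes are hypothetical, and real filters for .pdf, .doc, or .odt files would call a format-specific parser instead of working on a string.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class FilterDispatch {
    // Hypothetical filters: each takes raw file content and returns plain text.
    static final Map<String, Function<String, String>> FILTERS = Map.of(
        "txt", content -> content,
        // Crude tag stripping, for illustration only; not a real HTML parser.
        "html", content -> content.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim()
    );

    // Hypothetical list of media suffixes.
    static final Set<String> MEDIA = Set.of("jpg", "png", "mp3", "wav", "avi");

    // Returns the lowercase suffix of a filename, or "" if it has none.
    static String suffixOf(String filename) {
        int dot = filename.lastIndexOf('.');
        return dot < 0 ? "" : filename.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Returns the text to index, or null when no filter handles the suffix.
    static String textFor(String filename, String content) {
        String suffix = suffixOf(filename);
        if (MEDIA.contains(suffix)) {
            return filename; // media file: the filename is the only text
        }
        Function<String, String> filter = FILTERS.get(suffix);
        return filter == null ? null : filter.apply(content);
    }
}
```

With a table like this, supporting a new file type only requires registering one more suffix and filter pair.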
After the text has been extracted, it is indexed and stored in a database. I used Berkeley DB Java Edition to manage the database. The filename is the key for each document stored in the database. A document whose text is identical to that of an already stored document is excluded. The text is broken into sentences, and entities are extracted from each sentence. The index entry for a sentence consists of keywords from its text and the set of entities found in it.
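The storage step can be illustrated with a small sketch that keys documents by filename, skips identical text, and splits text into sentences. It makes several assumptions: in-memory collections stand in for the Berkeley DB database, a SHA-256 digest of the text is one possible way to detect identical documents, and java.text.BreakIterator is one possible sentence splitter. The class DocumentStore is hypothetical, and entity extraction is left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class DocumentStore {
    // In-memory stand-ins for the database (an assumption for this sketch).
    private final Map<String, List<String>> byFilename = new HashMap<>();
    private final Set<String> digests = new HashSet<>();

    // Hex-encoded SHA-256 digest of the text, used to spot identical documents.
    static String digest(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(text.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 support is required of every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // Splits text into trimmed, non-empty sentences.
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    // Stores a document under its filename; returns false for duplicate text.
    boolean add(String filename, String text) {
        if (!digests.add(digest(text))) {
            return false; // identical text is already stored
        }
        byFilename.put(filename, sentences(text));
        return true;
    }

    // Returns the stored sentences, or null if the filename is unknown.
    List<String> get(String filename) {
        return byFilename.get(filename);
    }
}
```

Comparing digests rather than full text keeps the duplicate check cheap, since only a short fixed-length string is kept per document.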
A web-based interface to search the index is included (see Figure 3). A submitted question is parsed and processed by Lucene. The best sentence from the text of each hit is returned to the client.
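How the best sentence is chosen is not spelled out here, so the following is only a sketch of one plausible approach: score each sentence of a hit by the number of distinct terms it shares with the question and return the top scorer. The class BestSentence and its methods are hypothetical, and the code uses no Lucene calls.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class BestSentence {
    // Lowercases the text and splits it on runs of non-alphanumeric characters.
    static Set<String> terms(String text) {
        String[] parts = text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        Set<String> out = new HashSet<>(Arrays.asList(parts));
        out.remove(""); // split can yield an empty leading token
        return out;
    }

    // Returns the sentence sharing the most distinct terms with the question.
    // Ties keep the earliest sentence; an empty list yields null.
    static String pick(String question, List<String> sentences) {
        Set<String> questionTerms = terms(question);
        String best = null;
        int bestScore = -1;
        for (String sentence : sentences) {
            Set<String> shared = terms(sentence);
            shared.retainAll(questionTerms);
            if (shared.size() > bestScore) {
                bestScore = shared.size();
                best = sentence;
            }
        }
        return best;
    }
}
```

A fuller version would probably ignore stop words such as "who" or "the" and give extra weight to entities, but plain term overlap is enough to show the idea.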