Design

A Fast Q&A System

By Manu Konchady, June 08, 2007

Search engines do not answer queries directly; they return a ranked list of documents. Users instead depend on question-answering (Q&A) systems to scan the text of those documents and find answers.

Implementation

The implementation of an experimental Q&A system that includes some of the features described here is available electronically (see http://www.ddj.com/code/) and on SourceForge (mustru.sf.net).

You first create an index from the text of files in a set of selected directories. There is no need to index a machine's entire hard disk, because many directories hold system files that are of little interest to the typical desktop user. The selected directories are scanned recursively for files whose text can be extracted.
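The recursive scan can be pictured with a short sketch. It is written in Python for brevity and is not the author's Java code; the function name `scan_directories` and the idea of passing the handled suffixes as a set are assumptions made for illustration.

```python
import os

def scan_directories(roots, suffixes):
    """Recursively collect files under the selected roots whose suffix is handled.

    `roots` is a list of directory paths chosen by the user; `suffixes` is a
    set of lowercase suffixes such as {".txt", ".pdf"} (hypothetical values).
    """
    found = []
    for root in roots:
        # os.walk visits the root and every subdirectory below it.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                suffix = os.path.splitext(name)[1].lower()
                if suffix in suffixes:
                    found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Only the directories passed in `roots` are visited, so system directories are skipped simply by leaving them out of the list.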

Files whose text can be extracted include formatted files (such as .pdf, .doc, or .odt files) and web pages. A text filter is chosen according to the suffix of the file; currently, more than 40 different suffixes are handled. Image, audio, and video files are treated as media files, and the only text associated with them is the filename.
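One plausible way to organize suffix-based filters is a lookup table from suffix to filter function. The sketch below is in Python, not the Java of the actual system, and every name in it (`FILTERS`, `extract_text`, the individual filter functions) is hypothetical. It covers only a handful of suffixes, and the HTML filter is deliberately naive: it keeps all character data, including any script or style text.

```python
import os
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collects the character data found between HTML tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def filter_html(path, raw):
    collector = _TextCollector()
    collector.feed(raw)
    collector.close()
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join(" ".join(collector.parts).split())

def filter_plain(path, raw):
    return raw

def filter_media(path, raw):
    # For media files the only associated text is the filename.
    return os.path.basename(path)

# Hypothetical table; the real system handles more than 40 suffixes.
FILTERS = {
    ".html": filter_html, ".htm": filter_html,
    ".txt": filter_plain,
    ".jpg": filter_media, ".mp3": filter_media, ".avi": filter_media,
}

def extract_text(path, raw=""):
    """Return the extracted text, or None when no filter handles the suffix."""
    suffix = os.path.splitext(path)[1].lower()
    text_filter = FILTERS.get(suffix)
    return None if text_filter is None else text_filter(path, raw)
```

Adding support for a new file type then means writing one filter function and registering its suffix in the table.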

After the text has been extracted, it is indexed and stored in a database. I used Berkeley DB Java Edition to manage the database. The filename is the key for each document stored in the database, and documents that contain identical text are excluded. The text is broken into sentences, and entities are extracted from each sentence. The index entry for a sentence consists of keywords from its text and a set of entities.
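The indexing steps can be sketched as follows. This is an illustration under several assumptions, not the system's implementation: an in-memory dictionary stands in for Berkeley DB Java Edition, identical text is detected with a SHA-1 digest, sentences are split with a simple regular expression, and "entities" are approximated as capitalized words that do not start the sentence. The class and function names are hypothetical.

```python
import hashlib
import re

STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "was"}

def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def keywords(sentence):
    words = re.findall(r"[A-Za-z0-9]+", sentence.lower())
    return sorted({w for w in words if w not in STOPWORDS})

def entities(sentence):
    # Placeholder for a real entity extractor: capitalized words
    # other than the first word of the sentence.
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return sorted({t for t in tokens[1:] if t[0].isupper()})

class SentenceIndex:
    def __init__(self):
        self.documents = {}   # filename -> text (stand-in for the database)
        self.digests = set()  # digests of text already stored
        self.sentences = []   # one entry per indexed sentence

    def add(self, filename, text):
        """Index a document; return False if identical text was already stored."""
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in self.digests:
            return False
        self.digests.add(digest)
        self.documents[filename] = text
        for sentence in split_sentences(text):
            self.sentences.append({
                "file": filename,
                "sentence": sentence,
                "keywords": keywords(sentence),
                "entities": entities(sentence),
            })
        return True
```

Hashing the text rather than comparing filenames is what lets two copies of the same document under different names collapse into a single entry.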

A web-based interface to search the index is included (see Figure 3). A submitted question is parsed and processed by Lucene. The best sentence from the text of each hit is returned to the client.
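How the best sentence is chosen is not described here, so the following Python sketch substitutes a simple stand-in: it counts the question terms that each sentence shares and keeps the highest-scoring sentence from each hit. The overlap score, the stopword list, and the names `best_sentence` and `answer` are all assumptions; in the actual system the question is parsed and processed by Lucene.

```python
import re

STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to",
             "what", "which", "who"}

def terms(text):
    """Lowercased content words of a question or sentence."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}

def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def best_sentence(question, text):
    """Return the sentence sharing the most terms with the question, or None."""
    question_terms = terms(question)
    best, best_score = None, 0
    for sentence in split_sentences(text):
        score = len(question_terms & terms(sentence))
        if score > best_score:
            best, best_score = sentence, score
    return best

def answer(question, hits):
    """`hits` maps filename -> document text for each document the search returned."""
    results = []
    for filename, text in hits.items():
        sentence = best_sentence(question, text)
        if sentence is not None:
            results.append((filename, sentence))
    return results
```

A hit that shares no terms with the question contributes no sentence, so the client receives at most one sentence per matching document.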

Figure 3: A web-based interface for searching.
