A Fast Q&A System

Search engines don't give answers in response to queries; instead, users must scan the text of a ranked list of documents to find them. Question/answering (Q&A) systems address this problem.


June 08, 2007
URL:http://www.drdobbs.com/architecture-and-design/a-fast-qa-system/199902692

Manu, a consultant working on open-source text-mining software, is the author of Text Mining Application Programming (Charles River Media, 2006). He can be contacted at [email protected].


For over a decade, search engines have been the most popular tools for finding information on the Web. Nearly instant access to information and ease of use have been the main reasons for their popularity. Yet there are drawbacks to using search engines. For instance, it is more natural to express an information need as a question than as a set of keywords and operators. The choice of keywords strongly influences the relevance of the results, and a keyword that does not match the author's choice of words may fail to return a relevant hit. This occurs more often than you would assume. Slang words can have 10 or more synonyms, while other common words may have at least two or three. For example, the word "bought" has several synonyms, among them "purchased" and "acquired."

A search engine doesn't give an answer in response to a query; instead, users must scan the text of a ranked list of documents to find answers. Most question/answering (Q&A) systems address both of these problems: the mismatch between query and document words, and the burden of scanning documents for answers. In this article, I describe the design and implementation of one such system. Other Q&A systems include those at www.brainboost.com, www.answerbus.com, and start.csail.mit.edu.

Design

Most of the initial Q&A system designs were built on existing search engines with preprocessing and postprocessing modules, like those depicted in Figure 1. The preprocessing module parsed a natural-language question into a search-engine query. This translation is inherently imprecise because there are countless ways to pose a question. However, using a set of patterns and question words, you can categorize a factoid question into one of several defined categories with reasonable accuracy. Question words include single words and phrases such as "who," "what," "when," "how much," "how long," and so on.

Figure 1: Q&A system architecture.

The postprocessing module extracted the text from the top n hits for the generated query, segmented the text into passages, and assigned ranks to passages based on a measure of "closeness" to the original question. Each passage was scored based on the degree of overlap between query words and passage words, as well as a density measure of the occurrence of query words in the passage.
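
To make the scoring concrete, here is a minimal sketch of passage scoring under the scheme just described. The method names and the equal blend of overlap and density are my own assumptions for illustration, not the article's actual code.

import java.util.*;

public class PassageScorer {
    // Score a passage by (1) the fraction of query words it contains and
    // (2) how densely those words are packed together in the passage.
    public static double score(List<String> queryWords, List<String> passageWords) {
        Set<String> query = new HashSet<>(queryWords);
        List<Integer> matchPositions = new ArrayList<>();
        Set<String> matched = new HashSet<>();
        for (int i = 0; i < passageWords.size(); i++) {
            if (query.contains(passageWords.get(i))) {
                matchPositions.add(i);
                matched.add(passageWords.get(i));
            }
        }
        if (matched.isEmpty()) return 0.0;
        // Overlap: distinct query words found / total query words.
        double overlap = (double) matched.size() / query.size();
        // Density: matches packed into a short span score higher.
        int span = matchPositions.get(matchPositions.size() - 1) - matchPositions.get(0) + 1;
        double density = (double) matchPositions.size() / span;
        return 0.5 * overlap + 0.5 * density; // assumed equal weighting
    }
}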

Factoid questions are among the easier types of questions to answer. The two other types, opinion and summary questions, are more complex and may involve compiling, analyzing, and synthesizing text to generate answers. Most answers to factoid questions can be extracted verbatim from the text as a single sentence. Therefore, the challenge in answering factoid questions is finding the passage most likely to contain the answer.

Question Categorization

A question category represents the type of answer—a person, place, animal, organization, time, currency, dimension, and so on. Fine-grained question classifiers may have 30 or more different categories. The accuracy of the classifier is critical because this is the first step in the processing pipeline (Figure 1). Any errors introduced in this step are propagated to the following steps and likely lead to the extraction of a wrong answer.

However, classifying questions is harder than classifying longer documents because a question contains so little text. Pattern matching classifies questions more accurately than standard classification algorithms do. One of the biggest clues to a question's category is the first noun following the question word. For example, in "What is the wingspan of a condor?", the noun "wingspan" following the question word "what" indicates that the answer should contain a dimension. A fine-grained categorizer uses a larger number of question categories, and some Q&A systems also use a hierarchy of question categories. The type of categorization, fine-grained or coarse-grained, is linked to the extraction of entities from the text. The likelihood of an answer is partially based on matching the question category with extracted entities; see Table 1 and the sketch that follows it.

Word/Phrase         Answer Entity
Who/Whose           Person
Who is              Organization, Person
What/Which          Person, Organization, Company, Place
When                Time
How many            Number
How much            Currency
How far/How long    Dimension

Table 1: Question words/phrases and associated entities.
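
A minimal pattern-based categorizer in the spirit of Table 1 might look like the following sketch. The patterns and category names are illustrative assumptions, not the article's actual rule set.

import java.util.*;
import java.util.regex.*;

public class QuestionCategorizer {
    // Ordered patterns: more specific phrases ("how much") must be
    // tested before more general ones ("who", "what").
    private static final LinkedHashMap<Pattern, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put(Pattern.compile("^how (far|long|tall|wide)\\b"), "DIMENSION");
        RULES.put(Pattern.compile("^how much\\b"), "CURRENCY");
        RULES.put(Pattern.compile("^how many\\b"), "NUMBER");
        RULES.put(Pattern.compile("^when\\b"), "TIME");
        RULES.put(Pattern.compile("^who\\b|^whose\\b"), "PERSON");
        RULES.put(Pattern.compile("^(what|which)\\b"), "PERSON_ORG_PLACE");
    }

    public static String categorize(String question) {
        String q = question.trim().toLowerCase();
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            if (rule.getKey().matcher(q).find()) return rule.getValue();
        }
        return "UNKNOWN";
    }
}

A fuller implementation would also inspect the first noun after the question word, as described above, to narrow a broad category such as "what" to a dimension or an organization.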

Extracting Entities

Entities are typically nouns that represent the name of a person, place, or organization. The entities extracted from text supplement the traditional inverted index used in search engines. For example, in "The headquarters of the UNIDO is located in Vienna," an organization entity and a place entity would be extracted in addition to the keywords. It's difficult to form accurate search-engine queries about organizations in general using keywords alone; a query such as "Which international organization has its main office in Vienna" will return some irrelevant hits. The process of extracting entities may be time consuming, depending on the type of extraction tool. Tools like GATE (gate.ac.uk), for instance, use a gazetteer containing lists of organizations and abbreviations, along with pattern-matching code, to find entities in text. Other tools, such as LingPipe (www.alias-i.com), use a trained model based on context to identify the most likely entity type and can be faster.
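
As a rough illustration of the gazetteer approach that GATE takes, the following sketch scans text for entries from hand-built lists. The lists and class names here are hypothetical; real gazetteers contain thousands of curated entries.

import java.util.*;

public class GazetteerExtractor {
    // Toy gazetteer: in a real system these entries come from large
    // curated files of organizations, places, abbreviations, etc.
    private static final Map<String, String> GAZETTEER = new HashMap<>();
    static {
        GAZETTEER.put("unido", "ORGANIZATION");
        GAZETTEER.put("vienna", "PLACE");
    }

    // Returns token -> entity type for every gazetteer hit in the text.
    public static Map<String, String> extract(String text) {
        Map<String, String> entities = new LinkedHashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            String type = GAZETTEER.get(token);
            if (type != null) entities.put(token, type);
        }
        return entities;
    }
}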

The use of either model adds overhead and increases the response time. Some Q&A systems pretag the text with the associated entities to reduce the response time. This approach has the advantage of speed, but increases indexing time and the size of the index. The alternative is to extract entities from just the text of the hits returned for the translated query. There is less text to process; however, because the extraction occurs while users are waiting for a response, this approach usually feels slower than the first.

WordNet

Consider questions that mention the nouns "Holland" and "mountain" or the verb "buy." The answers to these questions may not contain those exact words. Passages using synonyms such as "the Netherlands," "peak," and "purchase" may be potential answers.

WordNet (wordnet.princeton.edu) is a popular, freely available lexical database of English containing over 150,000 words from four parts of speech (nouns, verbs, adverbs, and adjectives), organized into synonym sets (synsets) with various relationships between words and meanings. Many Q&A systems use WordNet, even though it was not designed for any particular linguistic process.

Relationships between words are more familiar than relationships between synsets. Although words and synsets have distinct sets of relationships (Figure 2), there is a many-to-many relationship between words and synsets. A single word such as "purchase" is found in four noun synsets and one verb synset. Each of these synsets represents a single sense of the word. Some senses are used more frequently than others; for example, the verb sense of "purchase" is used more often than any of the noun senses.

Figure 2: Relationship among words.

Hypernyms

We are mainly interested in the hypernym/hyponym relationships between synsets for the Q&A task. A hypernym is a more general sense of a meaning—the hypernym of the synset {plant, flora} is {living thing, organism} whose hypernym is {thing, entity}. There are several such hierarchies of synsets that represent the specificity of a particular synset. For example, the word "horse" is found lower in the hierarchy than the more general word "animal."

Hypernyms enable matches between questions and answers without exact word comparisons. A hypernym ("animal") in a question matches a specific word ("horse") from an answer. You associate a set of hypernyms with every specific word. However, if you include hypernyms that are too specific, then words will have long lists of associated hypernyms, unnecessarily expanding the size of the index. On the other hand, if the hypernyms are too general, then you may have "noisy" matches where the question and answer words match only at a high level. Therefore, you select hypernyms such that the number of words assigned to the hypernym falls within a range, and the depth of the hypernym in the hierarchy is not too great.

Every hypernym is also assigned a score, computed from a training set of questions and answers. A hypernym that frequently links a question to its answer is assigned a higher weight than one that merely occurs broadly across diverse questions and answers. The hypernym score makes query generation more precise: a hypernym with a higher score receives a proportionately higher keyword boost in the generated query.
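
The selection and scoring criteria might be sketched as follows. The thresholds and the ratio-based score are assumptions I've made to illustrate the idea, not values from the article.

public class HypernymWeights {
    // Keep a hypernym only if it covers a moderate number of words and
    // is not too deep (too specific) in the WordNet hierarchy.
    // The bounds below are assumed for illustration.
    public static boolean select(int wordsCovered, int depth) {
        return wordsCovered >= 5 && wordsCovered <= 500 && depth <= 6;
    }

    // Weight a hypernym by how often it links a training question to its
    // answer, relative to how often it appears anywhere in the training set.
    public static double score(int pairCoOccurrences, int totalOccurrences) {
        if (totalOccurrences == 0) return 0.0;
        return (double) pairCoOccurrences / totalOccurrences;
    }
}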

Inflected words are another problem in matching words. A search engine may not match words such as "buy" and "bought," or "purchased" and "purchase." WordNet uses a dictionary and a set of rules to strip suffixes and find the root form of a given word. These three word-usage problems (synonyms, hypernyms, and inflected words) are the leading causes of word mismatches between potential answers and questions.
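
WordNet's morphological processing combines an exception dictionary for irregular forms with suffix-stripping rules whose candidates are checked against a dictionary. A drastically simplified sketch of that idea follows; the rules, exceptions, and tiny dictionary are my own toy examples, not WordNet's data.

import java.util.*;

public class SimpleLemmatizer {
    // Irregular forms that suffix rules cannot handle.
    private static final Map<String, String> EXCEPTIONS =
        Map.of("bought", "buy", "went", "go", "mice", "mouse");

    // A real system uses a full dictionary; this toy set is illustrative.
    private static final Set<String> DICTIONARY =
        Set.of("buy", "purchase", "go", "mouse", "locate", "answer");

    // Candidate suffix replacements, tried in order. A candidate is
    // accepted only if the result is a dictionary word.
    private static final String[][] RULES = {
        {"ies", "y"}, {"es", "e"}, {"es", ""}, {"ed", "e"}, {"ed", ""},
        {"ing", "e"}, {"ing", ""}, {"s", ""}
    };

    public static String lemma(String word) {
        String w = word.toLowerCase();
        if (EXCEPTIONS.containsKey(w)) return EXCEPTIONS.get(w);
        if (DICTIONARY.contains(w)) return w;
        for (String[] rule : RULES) {
            if (w.endsWith(rule[0])) {
                String candidate =
                    w.substring(0, w.length() - rule[0].length()) + rule[1];
                if (DICTIONARY.contains(candidate)) return candidate;
            }
        }
        return w; // no rule applied; return the word unchanged
    }
}

For example, lemma("purchased") strips "ed," appends "e," finds "purchase" in the dictionary, and returns it, while lemma("bought") is resolved through the exception list.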

Query Transformation

The best answer for a question may not be the one that contains the most keywords in common with the question. For example, the answer to the question "Where is the main office of UNIDO?" may contain phrases such as "located in" or "is located."

A simple question categorizer looks for the occurrence of question words such as "who," "when," "where," "which," "what," "how," and "why." A set of phrases commonly associated with a question category can be found using a collection of training questions and answers. The questions and answers are grouped by question category, and we identify the phrases that appear frequently in answers for a particular category. For example, a "when" question will have answers containing phrases such as "on the" and "will be." These phrases supplement the query generated from the question. Phrases containing nouns are excluded, since such phrases are usually specific to a question; a technical training set may contain many occurrences of words such as "computer" or "database" that are not seen in general answers.

A weight is computed for every phrase depending on the number of occurrences in all answers, the set of relevant answers (for the question category), and the set of irrelevant answers (for all other question categories). The n top-ranked phrases are assigned to a question category and become part of the generated query for a question.
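
One plausible weighting contrasts the relevant and irrelevant answer sets and scales by overall frequency. The formula below is my illustration of the idea, not the article's exact computation.

public class PhraseWeight {
    // Weight a candidate phrase for a question category by how much more
    // often it appears in that category's answers than in answers to all
    // other categories. Add-one smoothing avoids division by zero.
    public static double weight(int countInRelevantAnswers,
                                int countInIrrelevantAnswers,
                                int countInAllAnswers) {
        double relevanceRatio =
            (countInRelevantAnswers + 1.0) / (countInIrrelevantAnswers + 1.0);
        // Scale by overall frequency so very rare phrases do not dominate.
        return relevanceRatio * Math.log(1.0 + countInAllAnswers);
    }
}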

Query Generation

As I pointed out earlier, generating the query from a question is possibly the most important step. There are five components in a generated query: the unigrams from the question; the bigrams from the question; the entity type expected from the question word; the entity type implied by the first noun following the question word; and the transformation phrases associated with the question category.

For example, the question "Who is the Prime Minister of Britain?" has the following unigrams and bigrams—"Prime," "Minister," "Britain," "Prime Minister," "the Prime," "Minister of," and "of Britain." The expected entity based on the question word is a person, and the entity from the first noun is a head of state or representative. The transformations are phrases like "responsible for" and "eligible to."

The Lucene search engine (lucene.apache.org) allows specific keywords in a query to be boosted. In a typical search-engine query, all keywords carry equal importance, although in some search engines the order of keywords may affect the results, with earlier keywords given higher importance than later ones. In Lucene, this can be made explicit by assigning boosts to individual keywords.
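
Here is a sketch of a boosted query using the Lucene 2.x-era API that was current when this article was written. The field names and boost values are assumptions chosen to mirror the weighting described in the next paragraph.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

public class BoostedQueryBuilder {
    // Build a query in which specific terms outweigh general entity terms.
    public static BooleanQuery build() {
        BooleanQuery query = new BooleanQuery();

        PhraseQuery bigram = new PhraseQuery();    // "prime minister"
        bigram.add(new Term("contents", "prime"));
        bigram.add(new Term("contents", "minister"));
        bigram.setBoost(4.0f);                     // specific terms discriminate best
        query.add(bigram, BooleanClause.Occur.SHOULD);

        TermQuery unigram = new TermQuery(new Term("contents", "britain"));
        unigram.setBoost(2.0f);
        query.add(unigram, BooleanClause.Occur.SHOULD);

        TermQuery entity = new TermQuery(new Term("entity", "person"));
        entity.setBoost(1.0f);                     // entities are general indicators
        query.add(entity, BooleanClause.Occur.SHOULD);

        return query;
    }
}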

Some components of the query may have a higher weight than others since they are better discriminators. The weights of unigrams and bigrams are usually higher than the weight of entities in most question types. This is expected since the unigrams and bigrams are specific terms, while the entities are general indicators of what is expected in an answer. Similarly, the transformation also has a lower weight than the specific components of the query.

Each question type is assigned a set of weights for the five components of the generated query. The weights are computed from a training set of questions and answers. An initial set of weights is assigned and a hill-climbing algorithm searches for an optimized set of weights for all training questions belonging to a particular question type.
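
A minimal hill-climbing loop over the component weights could look like the following sketch. The step size, iteration budget, and the accuracy function passed in are placeholders I've assumed.

import java.util.Random;
import java.util.function.ToDoubleFunction;

public class WeightOptimizer {
    // Hill climbing: repeatedly nudge one weight at random; keep the
    // change only if accuracy on the training questions improves.
    public static double[] optimize(double[] initialWeights,
                                    ToDoubleFunction<double[]> trainingAccuracy) {
        Random random = new Random(42);
        double[] best = initialWeights.clone();
        double bestScore = trainingAccuracy.applyAsDouble(best);
        for (int i = 0; i < 1000; i++) {           // assumed iteration budget
            double[] candidate = best.clone();
            int j = random.nextInt(candidate.length);
            candidate[j] = Math.max(0.0,
                candidate[j] + (random.nextDouble() - 0.5) * 0.1);
            double score = trainingAccuracy.applyAsDouble(candidate);
            if (score > bestScore) {
                best = candidate;
                bestScore = score;
            }
        }
        return best;
    }
}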

The bag-of-words approach to query generation uses only the unigrams and bigrams to find answers. This approach finds an answer roughly 40 percent of the time. Adding entities and query transformations can increase the accuracy to 60 percent or higher.

Implementation

The implementation of an experimental Q&A system that includes some of the features described here is available electronically (see http://www.ddj.com/code/) and on SourceForge (mustru.sf.net).

You first create an index using the text of files from a set of selected directories. It's not necessary to index the entire hard disk of a machine. Many directories containing system files may not be of interest to the typical desktop user. The selected directories are recursively scanned for files whose text can be extracted.

Files whose text can be extracted include formatted files (such as .pdf, .doc, or .odt files) and web pages. A text filter is called depending on the suffix of the file; currently, more than 40 different suffixes are handled. For media files (images, audio, and video), the only text associated with the file is the filename.
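
A minimal recursive scan with suffix-based dispatch might look like this sketch. The filter interface and suffix table are illustrative, not the project's actual classes.

import java.io.File;
import java.util.*;

public class DirectoryScanner {
    interface TextFilter { String extract(File file) throws Exception; }

    // Suffix -> filter table; a real system registers 40+ suffixes.
    private final Map<String, TextFilter> filters = new HashMap<>();

    public void scan(File dir, List<File> indexable) {
        File[] children = dir.listFiles();
        if (children == null) return;              // unreadable directory
        for (File child : children) {
            if (child.isDirectory()) {
                scan(child, indexable);            // recurse into subdirectories
            } else {
                String name = child.getName();
                int dot = name.lastIndexOf('.');
                String suffix = (dot < 0) ? "" : name.substring(dot + 1).toLowerCase();
                if (filters.containsKey(suffix)) {
                    indexable.add(child);          // a filter can extract its text
                }
            }
        }
    }
}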

After the text has been extracted, it is indexed and stored in a database. I used the Berkeley DB Java Edition to manage the database. The filename is the key for any document stored in the database. Documents that contain identical text are excluded. The text is broken into sentences and entities are extracted from each sentence. The index for any sentence consists of keywords from the text and a set of entities.
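
Storing extracted text keyed by filename takes only a few calls with the standard Berkeley DB Java Edition API. In this sketch, the environment directory and database name are assumptions.

import java.io.File;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

public class DocumentStore {
    private final Environment env;
    private final Database db;

    public DocumentStore(File dir) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        env = new Environment(dir, envConfig);
        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        db = env.openDatabase(null, "documents", dbConfig); // assumed name
    }

    // The filename is the key; the extracted text is the value.
    public void put(String filename, String text) throws Exception {
        DatabaseEntry key = new DatabaseEntry(filename.getBytes("UTF-8"));
        DatabaseEntry value = new DatabaseEntry(text.getBytes("UTF-8"));
        db.put(null, key, value);
    }

    public void close() {
        db.close();
        env.close();
    }
}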

A web-based interface to search the index is included (see Figure 3). A submitted question is parsed and processed by Lucene. The best sentence from the text of each hit is returned to the client.

Figure 3: A web-based interface for searching.
