Design

Getting Better Search Results

By Bob Zeidman, April 23, 2008

Search engines are great, but more often than not they bring you too much useless information. That's when human-aided filtering can make the difference.

Information Retrieval

Since the best-known information retrieval process is probably web searching, I use it as an example. Information retrieval can be classified into two types—exact match and best match.

The exact match type of information retrieval is represented by the Boolean retrieval method. In these cases, Boolean equations of keywords are entered by users and all objects in the information domain that meet the criteria are retrieved. Most search engines use exact matching; the information domain is the Web. Even the more sophisticated search engines that let users input natural-language queries are typically parsing the language to retrieve the keywords and Boolean equations.
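To make exact matching concrete, here is a minimal sketch of Boolean retrieval over a tiny in-memory domain. The documents, the `DOMAIN` name, and the two helper functions are hypothetical and exist only for illustration; a real search engine would consult an index rather than scan every text.

```python
# Hypothetical three-document domain D; the texts are made up for illustration.
DOMAIN = {
    "doc1": "search engines index the web",
    "doc2": "boolean queries combine keywords",
    "doc3": "web search uses boolean keywords",
}


def boolean_and(domain, keywords):
    """Return ids of documents that contain every keyword (Boolean AND)."""
    wanted = {k.lower() for k in keywords}
    return sorted(
        doc_id for doc_id, text in domain.items()
        if wanted <= set(text.lower().split())
    )


def boolean_or(domain, keywords):
    """Return ids of documents that contain at least one keyword (Boolean OR)."""
    wanted = {k.lower() for k in keywords}
    return sorted(
        doc_id for doc_id, text in domain.items()
        if wanted & set(text.lower().split())
    )


print(boolean_and(DOMAIN, ["web", "boolean"]))
print(boolean_or(DOMAIN, ["index", "queries"]))
```

A document either satisfies the Boolean expression or it does not, so nothing in the result says which match is better.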

Best-match information retrieval uses vector space and probabilistic retrieval methods that essentially try to understand what information a user wants, sometimes based on past searches or other stored user parameters, and then present the information to the user that seems closest to what the user wants. An example of this would be the book suggestions that Amazon.com presents to customers, based on the customer's search criteria and past searches.

Figure 1: Information retrieval.

Figure 1 is a graphical representation of information retrieval, where D is the information domain, Q is the user's query, and O is the object retrieved by the query. D_U is the subset of the domain that meets the user's information need based on the retrieval process. Each arrow from an object to the query represents the relationship R_i between the query and object. For all retrieval methods, D_U is the set of all objects, such that R_i>0:


D_U = {O_i : R_i > 0 for all i}

However, the relationship can be further refined depending on whether the retrieval method is Boolean, probabilistic, or vector space. For a Boolean retrieval method, R_i is the scalar 1 for every retrieved object. In other words, a Boolean method only retrieves objects that exactly match the query:


R_i = 1 for all i

For a probabilistic retrieval method, R_i equals r(Q|O_i), which is the probability that a user's query is met by retrieved object O_i. A probabilistic method retrieves information based on some threshold probability that the object is what the user wants:


R_i = r(Q|O_i) for all i
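As a rough sketch of the thresholding step in probabilistic retrieval, assume the probabilities R_i have already been estimated somehow. The values, the threshold, and the function name below are invented for illustration; the article does not say how r(Q|O_i) is computed.

```python
def retrieve_by_probability(probabilities, threshold):
    """Keep only objects whose estimated probability R_i meets the threshold."""
    return {obj: p for obj, p in probabilities.items() if p >= threshold}


# Hypothetical R_i values; a real system would estimate these from data.
R = {"O1": 0.82, "O2": 0.10, "O3": 0.47}

print(retrieve_by_probability(R, threshold=0.4))
```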

For a vector space retrieval method, R_i equals (Q,O_i), which is the correlation between a user's query and some object O_i. A vector space method retrieves information that is similar to another object that the user desires:



R_i = (Q,O_i) for all i
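One common way to compute such a correlation is cosine similarity between term-count vectors. The article does not say which measure it has in mind, so treat the choice of cosine similarity, the whitespace tokenization, and the function name as assumptions in this sketch.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as unrelated to everything
    return dot / (norm_a * norm_b)


query = "boolean web search"
print(cosine_similarity(query, "web search uses boolean keywords"))
print(cosine_similarity(query, "cooking recipes"))
```

With non-negative term counts the score falls between 0 (no shared terms) and 1 (same term distribution), so it can serve directly as R_i.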

In addition to relationships between a query and the retrieved objects, note that there are relationships between various retrieved objects, represented by the arrows R_ik between objects O_i and O_k. I make use of this fact for post-process filtering.
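The object-to-object relationships R_ik can be tabulated in the same spirit. The sketch below uses Jaccard overlap of word sets as a stand-in measure; that choice, the sample texts, and the function names are assumptions, since this section does not define how R_ik is calculated.

```python
from itertools import combinations


def jaccard(text_a, text_b):
    """Share of distinct words that two texts have in common."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pairwise_relationships(objects):
    """Map each unordered pair (i, k) of object ids to a relationship R_ik."""
    return {
        (i, k): jaccard(objects[i], objects[k])
        for i, k in combinations(sorted(objects), 2)
    }


# Hypothetical retrieved objects.
retrieved = {
    "O1": "web search",
    "O2": "web search engine",
    "O3": "cooking recipes",
}

for pair, r_ik in pairwise_relationships(retrieved).items():
    print(pair, round(r_ik, 2))
```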

Information Display

Once the information objects are retrieved from the domain, they must be displayed to users. There are two types of criteria that can be used for this display in order to rank the retrieved objects.

Internal criteria are criteria derived from the relationships determined during the retrieval process.
External criteria are criteria determined in a new step unrelated to the retrieval process.

Of course, combinations of internal and external criteria can also be used.

For best-match retrieval methods, the objects can be displayed in order according to their relationship to the query. Objects with higher probabilities or higher correlation values are displayed first. The relationships are used as the criteria for displaying the objects. For exact-match retrieval methods, internal criteria do not provide a way to display the results because all retrieved objects have a relationship of 1.
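A short sketch shows why internal criteria can order best-match results but not exact-match ones. The relationship values and the function name are hypothetical.

```python
def rank_by_relationship(relationships):
    """Order object ids by R_i, highest first; ties keep their input order."""
    return sorted(relationships, key=lambda obj: relationships[obj], reverse=True)


# Hypothetical best-match scores: the values differ, so the order is meaningful.
best_match = {"O1": 0.2, "O2": 0.9, "O3": 0.5}
print(rank_by_relationship(best_match))

# Exact match: every R_i is 1, so sorting cannot distinguish the objects.
exact_match = {"O1": 1, "O2": 1, "O3": 1}
print(rank_by_relationship(exact_match))
```

In the exact-match case the output order is simply the order in which the objects arrived, which is why some external criterion is needed.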

External criteria are often used to display the results of search engines. Perhaps the best-known example of external display criteria is the PageRank method used by Google, which ranks a page according to how many other pages link to it. Other methods include the Hub-Threshold Kleinberg algorithm. I use the term P_i to represent ranking methods in general.
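As an illustration of an external criterion, the sketch below computes a ranking P_i by counting inbound links. This is a deliberately simplified stand-in based on the description above, not Google's actual PageRank computation; the link graph and function name are made up.

```python
def inlink_counts(links):
    """Count, for each page, how many distinct other pages link to it.

    `links` maps a page to the list of pages it links to.
    """
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in set(targets):  # count each linking page only once
            if target != source:     # ignore self-links
                counts[target] = counts.get(target, 0) + 1
    return counts


# Hypothetical link graph.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C", "C"],
}

P = inlink_counts(LINKS)
# Highest count first; page name breaks ties.
ranked = sorted(P, key=lambda page: (-P[page], page))
print(P)
print(ranked)
```

Note that the ranking depends only on the link structure, not on the query, which is what makes it external.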

Regardless of which kind of ranking criteria are used, there is often also a display threshold. Retrieved objects that have a ranking below the display threshold are not shown to users. An object with a very low ranking is thought to be irrelevant, and its relationship with the query is thought to be random rather than due to any relationship that would be significant to users.
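The display threshold can be sketched as a final pass that splits ranked objects into those shown and those suppressed. The ranking values, the threshold, and the function name are assumptions for illustration.

```python
def apply_display_threshold(rankings, display_threshold):
    """Split objects into (shown, hidden) lists, each ordered best first."""
    ordered = sorted(rankings, key=lambda obj: rankings[obj], reverse=True)
    shown = [obj for obj in ordered if rankings[obj] >= display_threshold]
    hidden = [obj for obj in ordered if rankings[obj] < display_threshold]
    return shown, hidden


# Hypothetical rankings for four retrieved objects.
rankings = {"O1": 0.90, "O2": 0.05, "O3": 0.40, "O4": 0.01}

shown, hidden = apply_display_threshold(rankings, display_threshold=0.1)
print("shown:", shown)
print("hidden:", hidden)
```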
