Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Design

Getting Better Search Results


Information Retrieval

Since the best-known information retrieval process is probably web searching, I use it as an example. Information retrieval can be classified into two types—exact match and best match.

The exact match type of information retrieval is represented by the Boolean retrieval method. In these cases, Boolean equations of keywords are entered by users and all objects in the information domain that meet the criteria are retrieved. Most search engines use exact matching; the information domain is the Web. Even the more sophisticated search engines that let users input natural-language queries are typically parsing the language to retrieve the keywords and Boolean equations.

Best-match information retrieval uses vector space and probabilistic retrieval methods that essentially try to understand what information a user wants, sometimes based on past searches or other stored user parameters, and then present the information to the user that seems closest to what the user wants. An example of this would be the book suggestions that Amazon.com presents to customers, based on the customer's search criteria and past searches.

Figure 1: Information retrieval.

Figure 1 is a graphical representation of information retrieval, where D is the information domain, Q is the user's query, and O is the object retrieved by the query. DU is the subset of the domain that meets the user's information need based on the retrieval process. Each arrow from an object to the query represents the relationship Ri between the query and object. For all retrieval methods, DU is the set of all objects, such that Ri>0:


D<sub>U</sub>={O<sub>i</sub> : R<sub>i</sub> > 0 for all i}


However, the relationship can be further refined depending on whether the retrieval method is Boolean, Probabilistic, or Vector. For a Boolean retrieval method, Ri is a scalar 1 for all i. In other words, a Boolean method only retrieves objects that exactly match the query:


R<sub>i</sub> = 1 for all i

For a probabilistic retrieval method, Ri equals r(Q|Oi), which is the probability that a user's query is met by retrieved object Oi. A probabilistic method retrieves information based on some threshold probability that the object is what the user wants:


R<sub>i</sub> = r(Q|O<sub>i</sub>) for all i

For a vector space retrieval method, Ri equals (Q,Oi), which is the correlation between a user's query and some object Oi. A vector space method retrieves information that is similar to another object that the user desires:



R<sub>i</sub> = (Q,O<sub>i</sub>) for all i

In addition to relationships between a query and the retrieved objects, note that there are relationships between various retrieved objects, represented by the arrows Rik between objects Oi and Ok. I make use of this fact for post-process filtering.

Information Display

Once the information objects are retrieved from the domain, they must be displayed to users. There are two types of criteria that can be used for this display in order to rank the retrieved objects.

  • Internal criteria are criteria derived from the relationships determined during the retrieval process.
  • External criteria are criteria determined in a new step unrelated to the retrieval process.

Of course, combinations of internal and external criteria can also be used.

For best-match retrieval methods, the objects can be displayed in order according to their relationship to the query. Objects with higher probabilities or higher correlation values are displayed first. The relationships are used as the criteria for displaying the objects. For exact-match retrieval methods, internal criteria do not provide a way to display the results because all retrieved objects have a relationship of 1.

External criteria are often used to display the results of search engines. Perhaps the best-known example of external display criteria is the PageRank method used by Google that ranks pages according to how many other pages are linked to it. Other methods include the Hub-Threshold Kleinberg algorithm. I use the term Pi to generally represent ranking methods.

Regardless of which kind of ranking criteria is used, there is often also a display threshold. Retrieved objects that have a ranking below the display threshold are not shown to users. An object with a very low ranking is thought to be irrelevant and its relationship with the query is thought to be random, rather than due to any relationship that would be significant to users.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.