Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.



Getting Better Search Results


Post-Retrieval Filtering

What I propose is another step after retrieval and display to further refine the results and reduce the number of retrieved objects to one that is reasonable to examine. There are several ways this can be accomplished using combinations of object filters, new query filters, negative query filters, threshold filters, and object relationship filters.

Object filtering is the process of letting users eliminate individual objects or whole sets of objects from the user information domain DU. This can be done by letting users specify objects to remove or categories of objects to remove. The removal process is dependent on the type of information being retrieved. When the retrieved objects are files, the criteria used to remove objects might be file name, location (path name), size, modification date, or creation date. With regard to CodeMatch, the general file filtering, specific file filtering, and folder filtering are examples of object filtering. In the case of a search engine, a user could specify to remove certain websites from the search results, or users could specify certain web page names to be removed. Users could also click on certain specific pages to be removed.
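As a minimal sketch of object filtering on retrieved files, the following Python removes objects by file name pattern, by folder, and by size. The `FileObject` type, the `object_filter` function, and the sample paths are hypothetical names chosen for illustration; they are not taken from CodeMatch or any search engine.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from pathlib import PurePosixPath


@dataclass(frozen=True)
class FileObject:
    path: str  # full path name of the retrieved file (illustrative)
    size: int  # size in bytes


def object_filter(domain, name_patterns=(), folders=(), max_size=None):
    """Return the objects in `domain` that survive every removal rule."""
    kept = []
    for obj in domain:
        p = PurePosixPath(obj.path)
        if any(fnmatch(p.name, pat) for pat in name_patterns):
            continue  # removed by a file-name rule
        if any(PurePosixPath(f) in p.parents for f in folders):
            continue  # removed by a folder rule
        if max_size is not None and obj.size > max_size:
            continue  # removed by the size rule
        kept.append(obj)
    return kept


# Hypothetical user information domain D_U
d_u = [
    FileObject("src/main.c", 1200),
    FileObject("src/generated/parser.c", 90000),
    FileObject("docs/readme.txt", 300),
]
filtered = object_filter(d_u, name_patterns=["*.txt"], folders=["src/generated"])
print([o.path for o in filtered])
```

Modification and creation dates could be handled the same way by adding further fields and rules.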

New query filtering refers to using a new query on the retrieved user domain DU to create a new domain DU' that is a subset of DU. Some search engines provide this kind of filtering by allowing the user to further search the retrieved results with a new Boolean expression of keywords.
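One way to picture new query filtering is as a Boolean predicate applied to the objects already retrieved. The sketch below assumes documents are plain strings and builds queries from three hypothetical helpers (`contains`, `all_of`, `any_of`); a real search engine would use its own query language.

```python
def contains(term):
    """Predicate: true when the text contains `term`, ignoring case."""
    return lambda text: term.lower() in text.lower()


def all_of(*preds):
    """Boolean AND of several predicates."""
    return lambda text: all(p(text) for p in preds)


def any_of(*preds):
    """Boolean OR of several predicates."""
    return lambda text: any(p(text) for p in preds)


def new_query_filter(domain, query):
    """Return D_U', the subset of D_U whose objects satisfy the new query."""
    return [doc for doc in domain if query(doc)]


# Hypothetical retrieved domain D_U
d_u = [
    "Detecting source code plagiarism",
    "Source code formatting tools",
    "Plagiarism in student essays",
    "Source code and copyright law",
]
query = all_of(contains("source code"),
               any_of(contains("plagiarism"), contains("copyright")))
d_u_prime = new_query_filter(d_u, query)
print(d_u_prime)
```

Because the filter only ever drops objects, the result is always a subset of the domain it was given.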

Negative query filtering refers to applying a query to the retrieved information in order to eliminate objects. For instance, suppose the original query is a Boolean query to find all documents with the phrases "software," "source code," or "correlation." A negative query filter would then be one where the user eliminates all retrieved documents that contain the keyword "correlation." The result is equivalent to an original query to find all documents with the phrases "software" or "source code" but not "correlation." There are two reasons in particular that negative query filtering is useful:

  • If the user were to submit a new query to the search engine, it would require more resources (compute power, storage space, network bandwidth, and so on) than a negative query filter applied to the objects already retrieved. For example, the negative query could be run locally on the client computer.
  • If the query is a best-match query rather than an exact-match query, the negative query filter can be used to get results that may be difficult for users to define with a single query.
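The "correlation" example can be sketched in a few lines. The corpus and the substring matching below are assumptions made for illustration. The point is that the negative filter runs only over the retrieved list, yet yields the same result as a fresh query for "software" or "source code" but not "correlation" run against the whole corpus.

```python
def matches_any(text, terms):
    """True when the text contains at least one of the terms."""
    return any(t in text for t in terms)


def negative_query_filter(domain, terms):
    """Eliminate every retrieved object that matches the negative query."""
    return [d for d in domain if not matches_any(d, terms)]


# Hypothetical corpus held by the search engine
corpus = [
    "software metrics",
    "source code correlation",
    "correlation of stock prices",
    "source code review",
    "gardening tips",
]

# Original query, run once on the server side
retrieved = [d for d in corpus
             if matches_any(d, ["software", "source code", "correlation"])]

# Negative query filter, run locally on the retrieved objects only
filtered = negative_query_filter(retrieved, ["correlation"])

# The equivalent single query, which would need a second trip to the engine
direct = [d for d in corpus
          if matches_any(d, ["software", "source code"])
          and "correlation" not in d]

print(filtered)
```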

Threshold filtering involves setting thresholds for displaying the retrieved objects to the user after the information has been retrieved. I define three kinds of threshold filters: relationship thresholds, ranking thresholds, and number thresholds. Combinations of these thresholds are also possible.

With a relationship threshold, the value compared against the threshold is the relationship Ri between the query and object Oi. In other words, any object Oi whose relationship Ri is less than threshold T is eliminated from the results.

Ranking threshold filtering is based on the information display ranking. For example, the Google PageRank criterion can be used such that any object Oi whose rank Pi is less than threshold T is eliminated from the results.
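Relationship and ranking thresholds differ only in which value is compared against T, so a single function can sketch both. The `Retrieved` type and the sample scores are invented for illustration, and the scale of Ri and Pi is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Retrieved:
    name: str
    relationship: float  # R_i, relationship to the query (assumed in [0, 1])
    rank: float          # P_i, a display-ranking score (scale is illustrative)


def threshold_filter(domain, score, t):
    """Keep objects whose score is at least t; objects below t are eliminated."""
    return [o for o in domain if score(o) >= t]


d_u = [
    Retrieved("a", 0.9, 2.0),
    Retrieved("b", 0.4, 7.5),
    Retrieved("c", 0.7, 5.0),
]

by_relationship = threshold_filter(d_u, lambda o: o.relationship, 0.5)
by_rank = threshold_filter(d_u, lambda o: o.rank, 4.0)

print([o.name for o in by_relationship])
print([o.name for o in by_rank])
```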

Number threshold filtering is the process of simply reducing the number of objects in the user domain DU to one that is more manageable. It requires that the information retrieval method be a best-match method or that the information display process use a ranking method. Otherwise, for the exact-match method, all retrieved objects have equal relationships to the query, and eliminating a specific number of them would have to be arbitrary. Given a number threshold N, if the retrieval method is best match, the objects Oi are ordered from highest to lowest by their relationship Ri, and only the first N are displayed. If the retrieval method is exact match but the display process uses a ranking method, the objects Oi are ordered from highest to lowest by their ranking Pi, and only the first N are displayed.
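A number threshold amounts to sorting by the available score and keeping the first N objects. In this sketch the objects are hypothetical (name, score) pairs, where the score stands for Ri under best-match retrieval or Pi under a display ranking.

```python
def number_threshold(domain, score, n):
    """Order objects from highest to lowest score and keep the first n."""
    ordered = sorted(domain, key=score, reverse=True)
    return ordered[:n]


# Hypothetical (name, score) pairs
d_u = [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7)]

top_two = number_threshold(d_u, lambda o: o[1], 2)
print(top_two)
```

When every score is equal, as in an unranked exact-match result, the cut falls wherever the input order happens to put it, which is the arbitrariness described above.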

Thresholds need not be minimum thresholds. Maximum thresholds and combinations of minimum and maximum thresholds may be appropriate if the user wishes to study various aspects of the retrieved information such as statistical distributions of the information. Effectively, these thresholds can be used as low-pass, high-pass, and band-pass filters.
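The filter analogy can be made concrete with one function that takes an optional minimum and an optional maximum. This is only a sketch: the `band_filter` name and the scores are made up, and the mapping of a minimum threshold to "high pass" and a maximum threshold to "low pass" is one reading of the analogy.

```python
def band_filter(scores, minimum=None, maximum=None):
    """Keep entries whose score lies within the given bounds (inclusive)."""
    return {
        name: s
        for name, s in scores.items()
        if (minimum is None or s >= minimum)
        and (maximum is None or s <= maximum)
    }


# Hypothetical relationship values R_i keyed by object name
scores = {"a": 0.1, "b": 0.5, "c": 0.9}

high_pass = band_filter(scores, minimum=0.4)               # minimum only
low_pass = band_filter(scores, maximum=0.6)                # maximum only
band_pass = band_filter(scores, minimum=0.4, maximum=0.6)  # both bounds

print(sorted(high_pass), sorted(low_pass), sorted(band_pass))
```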

An object relationship filter lets users select an object Oi that they feel is characteristic of an object that the user wants to see or that the user does not want to see. All similar objects are then removed, or all dissimilar objects are removed, depending on whether the filter is a positive object relationship filter or a negative object relationship filter.

For a positive object relationship filter, the user selects an object Oi and specifies a minimum relationship value RM. Object Oi and all objects Ok such that the relationship Rik between objects Oi and Ok is greater than or equal to the minimum relationship value RM are eliminated from the user information domain DU. In this case, object Oi is selected as an example of an object that the user feels is not relevant. In the case of a search engine, the user would select a web page that the user feels is not relevant. That web page and all similar web pages would be eliminated from the search results.

For a negative object relationship filter, the user selects an object Oi and specifies a minimum relationship value RM. All objects Ok such that the relationship Rik between objects Oi and Ok is less than the minimum relationship value RM are eliminated from the user information domain DU. In this case, object Oi is selected as an example of an object that the user feels is particularly relevant. In the case of a search engine, the user would select a web page that the user feels is most relevant, and all dissimilar web pages would be eliminated from the search results.
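Both object relationship filters can be sketched once a relationship measure Rik between two objects is chosen. The article does not fix one, so the code below assumes Jaccard similarity over word sets purely for illustration; the function names and the sample items are likewise hypothetical.

```python
def jaccard(a, b):
    """Assumed relationship R_ik: shared words divided by total distinct words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def positive_object_filter(domain, example, r_m, rel=jaccard):
    """Eliminate objects whose relationship to `example` is at least r_m.

    The example itself has relationship 1.0 and is eliminated as well.
    """
    return [o for o in domain if rel(example, o) < r_m]


def negative_object_filter(domain, example, r_m, rel=jaccard):
    """Eliminate objects whose relationship to `example` is less than r_m."""
    return [o for o in domain if rel(example, o) >= r_m]


# Hypothetical e-commerce search results
d_u = [
    "red leather sofa",
    "red leather armchair",
    "blue cotton shirt",
    "red cotton shirt",
]
example = "red leather sofa"

# The user marks the sofa as not relevant: it and similar items disappear.
without_similar = positive_object_filter(d_u, example, 0.5)

# The user marks the sofa as most relevant: dissimilar items disappear.
only_similar = negative_object_filter(d_u, example, 0.5)

print(without_similar)
print(only_similar)
```

With the same example object and the same RM, the two filters split the domain into complementary halves, so the choice between them depends only on whether the selected object is a good or a bad example.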

Object relationship filtering lets users select objects to be included in or excluded from the results without understanding the details of why an object is or is not relevant. This can be very important because unsophisticated users can look at the results of a web search and recognize good results and bad results, yet may not be able to define those good and bad results in terms of keywords. Object relationship filtering has great potential for e-commerce. Consumers often cannot easily come up with keywords that define the items they desire, but they know those items when they see them. They can click on an exemplary item from a search and obtain a list of all similar items.

Conclusion

Better methods of information retrieval will always be needed, and these methods are improving regularly. Better methods of information display are also important, and there is great demand for them, as evidenced by the success of Google, one of whose major innovations was in the area of information display. Automatic filtering of retrieved information is a worthy goal, and research is under way in that area as well. However, automatic filtering may never be 100-percent accurate, and human-aided filtering has many benefits that have yet to be fully exploited.

References

Nicholas J. Belkin and W. Bruce Croft, "Information Filtering and Information Retrieval: Two Sides of the Same Coin?" Communications of the ACM, 35(12), 29-38, 1992.

Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," WWW7/Computer Networks 30(1-7): 107-117, 1998.

Peter W. Foltz and Susan T. Dumais, "Personalized Information Delivery: An Analysis of Information Filtering Methods," Communications of the ACM, 35(12), 51-60, 1992.

Lee Gomes, "Computer Scientists Pull a Tom Sawyer To Finish Grunt Work," The Wall Street Journal, June 27, 2007.

Peter Pirolli and Stuart K. Card, "Information foraging," Psychological Review, 106: 643-675, 1999.

G. Salton and M.J. McGill, Introduction to Modern Information Retrieval, McGraw Hill, New York, 1983.

Bob Zeidman, "Detecting Source-Code Plagiarism," Dr. Dobb's Journal, July 2004, 55-60.

Bob Zeidman, "Iterative Filtering of Retrieved Information to Increase Relevance," The 11th World Multi-Conference on Systemics, Cybernetics and Informatics, July 11, 2007.

