Getting Better Search Results

Search engines are great, but more often than not they bring back too much useless information. That's when human-aided filtering can make the difference.


April 23, 2008
URL:http://www.drdobbs.com/architecture-and-design/getting-better-search-results/207401584

Bob is the president of Zeidman Consulting. He can be reached at [email protected].


Search engines are great. Put in keywords and out pop hundreds, thousands, sometimes millions of web pages. But then what? How can you effectively look at all of those pages? Maybe it's time to put people back in the equation. After all, we can still do a few things better than computers, like quickly filtering out irrelevant information. With the right tools, the computer can even help us do this more efficiently.

In this article (which is based on a more detailed paper presented at the 11th World Multi-Conference on Systemics, Cybernetics and Informatics), I use human-aided filtering to home in on useful information. I have incorporated human-aided filtering into a tool for finding software plagiarism. After the tool finds similar sections of code in two programs, the human and the computer work together to pinpoint the results that are most relevant.

CodeMatch

For the past decade, I've been an expert witness in intellectual property cases, asked to examine software source code from a plaintiff or defendant to determine whether one party has plagiarized code from the other. Over time, I've found that the few existing tools for plagiarism detection were too inaccurate for situations where hundreds of millions of dollars could be at stake. Consequently, I developed my own tool called "CodeMatch" (www.safe-corp.biz/products_codesuite.htm).

CodeMatch uses four algorithms to determine the correlation between source-code files for different programs:

After using CodeMatch on a number of cases, I found that although it had great accuracy, it shared one deficiency with the other tools—too much output. After examining the results, I often found information that was not relevant to the particular case on which I was working. Because a large comparison could take a week to produce results, it was impractical to rerun the comparison with new settings. I began to spend time manually filtering the results to obtain a more manageable and more relevant set. The main purpose of CodeMatch was to reduce the time I spent looking at lines of code. While it cut that time by at least an order of magnitude compared to manually examining code files, I now wanted to reduce the time I spent poring over the results. (My wife thinks this is a bit crazy since I get paid per hour.)

Superfluous Results

In reviewing the results of the comparison, specific files or specific source-code elements would often show up throughout the results, skewing the scores and hiding important correlation information. For example, open-source files may have been used in one or both sets of files. In searching for plagiarized code, the open-source files would be highly correlated with each other, but these correlations were not important. Pieces of these files would show up throughout both sets of files and be flagged as highly correlated.

Similarly, there were specific statements, comments, and identifiers that showed up in many places, but were not relevant to finding plagiarized code. Users searching for plagiarized code may find that two programs running on Linux both use the same system calls. Thus, files with these system calls will have a higher correlation. Common identifier names like "index," "count," and "result" showed up in many files, increasing correlation values, but were not necessarily signs of plagiarism.

Had these results been known upfront, some of them could have been eliminated before CodeMatch was run. However, given the number of files and the number of source-code elements, it was impractical to find these elements before performing the correlation. Also, the correlation itself pointed out many of these superfluous elements.

CodeMatch Post-Process Filtering

To make examining the correlation results more useful and let users focus on the kinds of correlation that are most important, I added the ability to filter the results. After CodeMatch produces a database of results, this filtering can be performed on the database:

After the filtering is performed on the database, the correlation scores between file pairs are adjusted accordingly. I found that for large file sets, this filtering reduced the manual process of reviewing the results in order to find plagiarized source-code files from days to hours or even minutes.

My experience using filtering with CodeMatch can be generalized to any kind of information retrieval process. To understand how filtering can be used, it is important to first understand the different kinds of information retrieval processes and information display methods.

Information Retrieval

Since the best-known information retrieval process is probably web searching, I use it as an example. Information retrieval can be classified into two types—exact match and best match.

The exact-match type of information retrieval is represented by the Boolean retrieval method. In these cases, users enter Boolean equations of keywords, and all objects in the information domain that meet the criteria are retrieved. Most search engines use exact matching; their information domain is the Web. Even the more sophisticated search engines that let users input natural-language queries typically parse the language to extract keywords and Boolean equations.
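To make the exact-match idea concrete, here is a minimal sketch of Boolean AND retrieval over a tiny in-memory corpus. The document names and contents are invented for illustration:

```python
# Hypothetical mini-corpus; each "document" is just a string of words.
docs = {
    "d1": "software source code correlation",
    "d2": "software patents and litigation",
    "d3": "source code plagiarism detection",
}

def boolean_and(query_terms, documents):
    """Exact match: retrieve every document containing ALL query terms."""
    return {
        doc_id for doc_id, text in documents.items()
        if all(term in text.split() for term in query_terms)
    }

print(boolean_and(["software", "source"], docs))  # {'d1'}
```

Only "d1" contains both terms; an exact-match method either retrieves an object or it doesn't, with no notion of partial relevance.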

Best-match information retrieval uses vector space and probabilistic retrieval methods that essentially try to understand what information a user wants, sometimes based on past searches or other stored user parameters, and then present the information to the user that seems closest to what the user wants. An example of this would be the book suggestions that Amazon.com presents to customers, based on the customer's search criteria and past searches.
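A minimal sketch of best-match retrieval under the vector space model follows. Documents and the query are represented as term-frequency vectors, and the relationship is their cosine similarity; the documents here are invented for illustration:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = {
    "d1": Counter("software source code correlation".split()),
    "d2": Counter("cooking recipes and menus".split()),
}
query = Counter("source code".split())

# Rank documents by similarity to the query, best match first.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```

Unlike the Boolean case, every document gets a graded relationship to the query, so "d1" simply ranks above "d2" rather than "d2" being excluded by a hard criterion.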

Figure 1: Information retrieval.

Figure 1 is a graphical representation of information retrieval, where D is the information domain, Q is the user's query, and O is the object retrieved by the query. DU is the subset of the domain that meets the user's information need based on the retrieval process. Each arrow from an object to the query represents the relationship Ri between the query and object. For all retrieval methods, DU is the set of all objects, such that Ri>0:


DU={Oi : Ri > 0 for all i}


However, the relationship can be further refined depending on whether the retrieval method is Boolean, Probabilistic, or Vector. For a Boolean retrieval method, Ri is a scalar 1 for all i. In other words, a Boolean method only retrieves objects that exactly match the query:


Ri = 1 for all i

For a probabilistic retrieval method, Ri equals r(Q|Oi), which is the probability that a user's query is met by retrieved object Oi. A probabilistic method retrieves information based on some threshold probability that the object is what the user wants:


Ri = r(Q|Oi) for all i

For a vector space retrieval method, Ri equals c(Q,Oi), which is the correlation between a user's query and some object Oi. A vector space method retrieves information that is similar to another object that the user desires:


Ri = c(Q,Oi) for all i

In addition to relationships between a query and the retrieved objects, note that there are relationships between various retrieved objects, represented by the arrows Rik between objects Oi and Ok. I make use of this fact for post-process filtering.
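The three definitions above can be collapsed into one retrieval rule: build DU from every object whose relationship to the query is positive. The sketch below uses invented relationship values; only the thresholding logic is the point:

```python
def retrieve(R, threshold=0.0):
    """D_U = {O_i : R_i > threshold}, for any relationship definition."""
    return {obj for obj, r in R.items() if r > threshold}

# Boolean: R_i is the scalar 1 for every object matching the query.
R_boolean = {"o1": 1, "o2": 1}
# Probabilistic: R_i is a probability per object (values invented here).
R_prob = {"o1": 0.9, "o2": 0.2, "o3": 0.0}
# Vector space: R_i is a correlation per object (values invented here).
R_vector = {"o1": 0.7, "o2": 0.0}

print(retrieve(R_prob))  # objects with nonzero probability
```

The retrieval rule is identical in all three cases; only the source and interpretation of the Ri values differ.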

Information Display

Once the information objects are retrieved from the domain, they must be displayed to users. Two types of criteria can be used to rank the retrieved objects for display: internal criteria, based on each object's relationship to the query, and external criteria, based on information outside the query, such as relationships among the objects themselves.

Of course, combinations of internal and external criteria can also be used.

For best-match retrieval methods, the objects can be displayed in order according to their relationship to the query. Objects with higher probabilities or higher correlation values are displayed first. The relationships are used as the criteria for displaying the objects. For exact-match retrieval methods, internal criteria do not provide a way to display the results because all retrieved objects have a relationship of 1.

External criteria are often used to display the results of search engines. Perhaps the best-known example of external display criteria is the PageRank method used by Google, which ranks pages according to how many other pages link to them. Other methods include the Hub-Threshold Kleinberg algorithm. I use the term Pi to represent ranking methods in general.
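As a rough illustration of an external ranking criterion, here is a minimal power-iteration sketch of PageRank-style scoring. The link graph and damping factor are invented for illustration; real implementations must also handle dangling pages and convergence testing:

```python
# links[p] is the set of pages that page p links to (invented graph).
links = {"a": {"b"}, "b": {"a", "c"}, "c": {"a"}}

def pagerank(links, damping=0.85, iters=50):
    """Iteratively redistribute rank along links until (roughly) stable."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs)  # each outlink gets an equal share
            for q in outs:
                new[q] += damping * share
        rank = new
    return rank

rank = pagerank(links)
```

Page "a" receives links from both "b" and "c", so it ends up with the highest score; note the criterion depends only on the link structure, never on the user's query.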

Regardless of which kind of ranking criteria is used, there is often also a display threshold. Retrieved objects that have a ranking below the display threshold are not shown to users. An object with a very low ranking is thought to be irrelevant and its relationship with the query is thought to be random, rather than due to any relationship that would be significant to users.

Post-Retrieval Filtering

What I propose is another step after retrieval and display to further refine the results and reduce the number of retrieved objects to one that is reasonable to examine. There are several ways this can be accomplished using combinations of object filters, new query filters, negative query filters, threshold filters, and object relationship filters.

Object filtering is the process of letting users eliminate individual objects or whole sets of objects from the user information domain DU. This can be done by letting users specify objects to remove or categories of objects to remove. The removal process is dependent on the type of information being retrieved. When the retrieved objects are files, the criteria used to remove objects might be file name, location (path name), size, modification date, or creation date. With regard to CodeMatch, the general file filtering, specific file filtering, and folder filtering are examples of object filtering. In the case of a search engine, a user could specify to remove certain websites from the search results, or users could specify certain web page names to be removed. Users could also click on certain specific pages to be removed.
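A small sketch of object filtering on retrieved file results follows; the file names, paths, and exclusion rules are invented, but they mirror the criteria named above (name, location, and so on):

```python
# Invented retrieval results: each object is a file with some metadata.
results = [
    {"name": "main.c",  "path": "src/",         "score": 0.9},
    {"name": "qsort.c", "path": "open_source/", "score": 0.8},
    {"name": "util.c",  "path": "src/",         "score": 0.4},
]

def object_filter(objects, excluded_names=(), excluded_paths=()):
    """Drop any object whose name or path matches an exclusion rule."""
    return [
        o for o in objects
        if o["name"] not in excluded_names
        and not any(o["path"].startswith(p) for p in excluded_paths)
    ]

# E.g., remove known open-source files that inflate correlation scores.
filtered = object_filter(results, excluded_paths=("open_source/",))
```

The user never re-runs the retrieval; whole categories of objects are simply struck from DU after the fact.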

New query filtering refers to using a new query on the retrieved user domain DU to create a new domain DU' that is a subset of DU. Some search engines provide this kind of filtering by allowing the user to further search the retrieved results with a new Boolean expression of keywords.
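New query filtering can be sketched as re-querying the retrieved set rather than the original domain; the page texts and query terms below are invented:

```python
# D_U: the already-retrieved objects (invented contents).
d_u = {
    "p1": "software source code correlation",
    "p2": "software licensing terms",
}

def new_query_filter(retrieved, new_terms):
    """Apply a fresh Boolean AND query to D_U, yielding the subset D_U'."""
    return {
        k: v for k, v in retrieved.items()
        if all(t in v.split() for t in new_terms)
    }

d_u_prime = new_query_filter(d_u, ["correlation"])  # keeps only p1
```

Because DU is typically far smaller than the original domain, the second query is cheap even when the original retrieval took days.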

Negative query filtering refers to applying a query to the retrieved information in order to eliminate objects. For instance, suppose the original query is a Boolean query to find all documents with the phrases "software," "source code," or "correlation." A query-based elimination filter would be one where the user eliminates all retrieved documents that contain the keyword "correlation." This would be equivalent to an original query to find all documents with the phrases "software" or "source code" but not "correlation." There are two reasons in particular that negative query filtering is useful:

Threshold filtering involves setting thresholds for displaying the retrieved objects to the user after the information has been retrieved. I define three kinds of threshold filters—relationship thresholds, ranking thresholds, and number thresholds. Combinations of these thresholds are also possible.

With a relationship threshold, the value used for determining the threshold is the Ri relationship between the query and the objects. In other words, any object Oi with relationship Ri that is less than threshold T gets eliminated from the results.

Ranking threshold filtering can be based on the information display ranking. For example, the Google PageRank criteria can be used such that any object Oi with rank Pi that is less than threshold T gets eliminated from the results.

Number threshold filtering is the process of simply reducing the number of objects in the user domain DU to one that is more manageable. It requires that the information retrieval method be a best-match method or that the information display process use a ranking method. Otherwise, for the exact-match method, all retrieved objects have equal relationships to the query, and eliminating a specific number of them would have to be arbitrary. Given a number threshold N, if the retrieval method is best match, the objects Oi are ordered from highest to lowest by their relationship Ri until the number of objects displayed is N. If the retrieval method is exact match but the display process uses a ranking method, the objects Oi are ordered from highest to lowest by their ranking Pi until the number of objects displayed is N.

Thresholds need not be minimum thresholds. Maximum thresholds and combinations of minimum and maximum thresholds may be appropriate if the user wishes to study various aspects of the retrieved information such as statistical distributions of the information. Effectively, these thresholds can be used as low pass, high pass, and band pass filters.
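The three threshold filters can be sketched as follows, using invented objects that carry both a relationship value R and a display rank P:

```python
# Invented retrieved objects with relationship R_i and ranking P_i.
objects = [
    {"id": "o1", "R": 0.9, "P": 0.8},
    {"id": "o2", "R": 0.6, "P": 0.2},
    {"id": "o3", "R": 0.1, "P": 0.9},
]

def relationship_threshold(objs, t):
    """Drop objects whose relationship R_i falls below threshold T."""
    return [o for o in objs if o["R"] >= t]

def ranking_threshold(objs, t):
    """Drop objects whose display rank P_i falls below threshold T."""
    return [o for o in objs if o["P"] >= t]

def number_threshold(objs, n, key="R"):
    """Keep only the N objects with the highest relationship or rank."""
    return sorted(objs, key=lambda o: o[key], reverse=True)[:n]
```

Swapping the `>=` comparisons for `<=`, or combining a minimum and a maximum, gives the high-pass, low-pass, and band-pass behavior described above.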

An object relationship filter lets users select an object Oi that they feel is characteristic of an object that the user wants to see or that the user does not want to see. All similar objects are then removed, or all dissimilar objects are removed, depending on whether the filter is a positive object relationship filter or a negative object relationship filter.

For a positive object relationship filter, the user selects an object Oi and specifies a minimum relationship value RM. Object Oi and all objects Ok such that the relationship Rik between objects Oi and Ok is greater than or equal to the minimum relationship value RM are eliminated from the user information domain DU. In this case, object Oi is selected as an example of an object that the user feels is not relevant. In the case of a search engine, the user would select a web page that the user feels is not relevant. That web page and all similar web pages would be eliminated from the search results.

For a negative object relationship filter, the user selects an object Oi and specifies a minimum relationship value RM. All objects Ok such that the relationship Rik between objects Oi and Ok is less than the minimum relationship value RM are eliminated from the user information domain DU. In this case, object Oi is selected as an example of an object that the user feels is particularly relevant. In the case of a search engine, the user would select a web page that the user feels is most relevant, and all dissimilar web pages would be eliminated from the search results.
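The positive and negative object relationship filters can be sketched together; the pairwise similarity values below are invented, standing in for whatever Rik the retrieval system computes:

```python
# R[(i, k)]: invented pairwise relationship between objects i and k.
R = {("o1", "o2"): 0.9, ("o1", "o3"): 0.1}

def pairwise(i, k):
    """Look up R_ik symmetrically; an object relates fully to itself."""
    return 1.0 if i == k else R.get((i, k), R.get((k, i), 0.0))

def positive_filter(objects, exemplar, r_min):
    """Exemplar is irrelevant: drop it and everything similar to it."""
    return [o for o in objects
            if o != exemplar and pairwise(exemplar, o) < r_min]

def negative_filter(objects, exemplar, r_min):
    """Exemplar is relevant: drop everything dissimilar to it."""
    return [o for o in objects if pairwise(exemplar, o) >= r_min]

objs = ["o1", "o2", "o3"]
# User flags "o1" as irrelevant: "o2" (similar) goes too, "o3" survives.
kept = positive_filter(objs, "o1", r_min=0.5)  # ["o3"]
```

In both cases the user only points at an example; the pairwise relationships Rik do the work of generalizing that judgment to the rest of DU.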

Object-relationship filtering lets users select objects to be included in or excluded from the results without understanding the details of why an object is or is not relevant. This can be very important because unsophisticated users can look at the results of a web search and recognize good and bad results when they see them, but may not be able to define those good and bad results in terms of keywords. Object-relationship filtering has great potential for e-commerce. Consumers often cannot easily create keywords to define the items they desire. But they know it when they see it. They can click on an exemplary item from a search and obtain a list of all similar items.

Conclusion

Better methods of information retrieval will always be needed, and these methods are improving regularly. Better methods of information display are also important, and there is great demand for them, as evidenced by the success of Google, one of whose major innovations was in the area of information display. Automatic filtering of retrieved information is a worthy goal, and research is ongoing in that area as well. However, automatic filtering may never be 100-percent accurate, and human-aided filtering has many benefits that have yet to be fully exploited.

References

Nicholas J. Belkin and W. Bruce Croft, "Information Filtering and Information Retrieval: Two Sides of the Same Coin?" Communications of the ACM, 35(12), 29-38, 1992.

Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," WWW7/Computer Networks 30(1-7): 107-117, 1998.

Peter W. Foltz and Susan T. Dumais, "Personalized Information Delivery: An Analysis of Information Filtering Methods," Communications of the ACM, 35(12), 51-60, 1992.

Lee Gomes, "Computer Scientists Pull a Tom Sawyer To Finish Grunt Work," The Wall Street Journal, June 27, 2007.

Peter Pirolli and Stuart K. Card, "Information foraging," Psychological Review, 106: 643-675, 1999.

G. Salton and M.J. McGill, Introduction to Modern Information Retrieval, McGraw Hill, New York, 1983.

Bob Zeidman, "Detecting Source-Code Plagiarism," Dr. Dobb's Journal, July 2004, 55-60.

Bob Zeidman, "Iterative Filtering of Retrieved Information to Increase Relevance," The 11th World Multi-Conference on Systemics, Cybernetics and Informatics, July 11, 2007.
