Getting Better Search Results

Search engines are great, but more often than not they bring back too much useless information. That's when human-aided filtering can make the difference.


April 23, 2008
URL:http://www.drdobbs.com/architecture-and-design/getting-better-search-results/207401584

Bob is the president of Zeidman Consulting. He can be reached at [email protected].


Search engines are great. Put in keywords and out pop hundreds, thousands, sometimes millions of web pages. But then what? How can you effectively look at all of those pages? Maybe it's time to put people back in the equation. After all, we can still do a few things better than computers, like quickly filtering out irrelevant information. With the right tools, the computer can even help us do this more efficiently.

In this article (which is based on a more detailed paper presented at the 11th World Multi-Conference on Systemics, Cybernetics and Informatics), I use human-aided filtering to home in on useful information. I have incorporated human-aided filtering into a tool for finding software plagiarism. After the tool finds similar sections of code in two programs, the human and the computer work together to pinpoint the results that are most relevant.

CodeMatch

For the past decade, I've been an expert witness in intellectual property cases, asked to examine software source code from a plaintiff or defendant to determine whether one party has plagiarized code from the other. Over time, I've found that the few existing tools for plagiarism detection were too inaccurate for situations where hundreds of millions of dollars could be at stake. Consequently, I developed my own tool called "CodeMatch" (www.safe-corp.biz/products_codesuite.htm).

CodeMatch uses four algorithms to determine the correlation between source-code files for different programs:

After using CodeMatch on a number of cases, I found that although it had great accuracy, it shared one deficiency with the other tools—too much output. After examining the results, I often found information that was not relevant to the particular case on which I was working. Because a large comparison could take a week to produce results, it was impractical to rerun the comparison with new settings. I began to spend time manually filtering the results to obtain a more manageable and more relevant set. The main purpose of CodeMatch was to reduce the time I spent looking at lines of code. While it cut that time by at least an order of magnitude compared to manually examining code files, I now wanted to reduce the time I spent poring over the results. (My wife thinks this is a bit crazy since I get paid per hour.)

Superfluous Results

In reviewing the results of the comparison, specific files or specific source-code elements would often show up throughout the results, skewing the scores and hiding important correlation information. For example, open-source files may have been used in one or both sets of files. In searching for plagiarized code, the open-source files would be highly correlated with each other, but these correlations were not important. Pieces of these files would show up throughout both sets of files and be flagged as highly correlated.

Similarly, there were specific statements, comments, and identifiers that showed up in many places, but were not relevant to finding plagiarized code. Users searching for plagiarized code may find that two programs running on Linux both use the same system calls. Thus, files with these system calls will have a higher correlation. Common identifier names like "index," "count," and "result" showed up in many files, increasing correlation values, but were not necessarily signs of plagiarism.

Had these results been known upfront, some of them could have been eliminated before CodeMatch was run. However, given the number of files and the number of source-code elements, it was impractical to find these elements before performing the correlation. Also, the correlation itself pointed out many of these superfluous elements.

CodeMatch Post-Process Filtering

To make examining the correlation results more useful and let users focus on the kinds of correlation that are most important, I added the ability to filter the results. After CodeMatch produces a database of results, this filtering can be performed on the database:

After the filtering is performed on the database, the correlation scores between file pairs are adjusted accordingly. I found that for large file sets, this filtering reduced the manual process of reviewing the results in order to find plagiarized source-code files from days to hours or even minutes.

My experience using filtering with CodeMatch can be generalized to any kind of information retrieval process. To understand how filtering can be used, it is important to first understand the different kinds of information retrieval processes and information display methods.

Information Retrieval

Since the best-known information retrieval process is probably web searching, I use it as an example. Information retrieval can be classified into two types—exact match and best match.

The exact-match type of information retrieval is represented by the Boolean retrieval method. In these cases, users enter Boolean equations of keywords, and all objects in the information domain that meet the criteria are retrieved. Most search engines use exact matching; their information domain is the Web. Even the more sophisticated search engines that let users input natural-language queries typically parse the language to extract keywords and Boolean equations.
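To make the exact-match idea concrete, here is a minimal sketch of Boolean AND retrieval over a tiny in-memory corpus. The document names and contents are invented for illustration:

```python
# Hypothetical mini-corpus; each "document" is just a string of words.
docs = {
    "d1": "software source code correlation",
    "d2": "software patents and litigation",
    "d3": "source code plagiarism detection",
}

def boolean_and(query_terms, documents):
    """Exact match: retrieve every document containing ALL query terms."""
    return {
        doc_id for doc_id, text in documents.items()
        if all(term in text.split() for term in query_terms)
    }

print(boolean_and(["software", "source"], docs))  # {'d1'}
```

Only "d1" contains both terms; an exact-match method either retrieves an object or it doesn't, with no notion of partial relevance.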

Best-match information retrieval uses vector space and probabilistic retrieval methods that essentially try to understand what information a user wants, sometimes based on past searches or other stored user parameters, and then present the information to the user that seems closest to what the user wants. An example of this would be the book suggestions that Amazon.com presents to customers, based on the customer's search criteria and past searches.
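A minimal sketch of best-match retrieval under the vector space model follows. Documents and the query are represented as term-frequency vectors, and the relationship is their cosine similarity; the documents here are invented for illustration:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = {
    "d1": Counter("software source code correlation".split()),
    "d2": Counter("cooking recipes and menus".split()),
}
query = Counter("source code".split())

# Rank documents by similarity to the query, best match first.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```

Unlike the Boolean case, every document gets a graded relationship to the query, so "d1" simply ranks above "d2" rather than "d2" being excluded by a hard criterion.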

Figure 1: Information retrieval.

Figure 1 is a graphical representation of information retrieval, where D is the information domain, Q is the user's query, and O is the object retrieved by the query. DU is the subset of the domain that meets the user's information need based on the retrieval process. Each arrow from an object to the query represents the relationship Ri between the query and object. For all retrieval methods, DU is the set of all objects, such that Ri>0:


DU={Oi : Ri > 0 for all i}


However, the relationship can be further refined depending on whether the retrieval method is Boolean, Probabilistic, or Vector. For a Boolean retrieval method, Ri is a scalar 1 for all i. In other words, a Boolean method only retrieves objects that exactly match the query:


Ri = 1 for all i

For a probabilistic retrieval method, Ri equals r(Q|Oi), which is the probability that a user's query is met by retrieved object Oi. A probabilistic method retrieves information based on some threshold probability that the object is what the user wants:


Ri = r(Q|Oi) for all i

For a vector space retrieval method, Ri equals c(Q,Oi), which is the correlation between a user's query and some object Oi. A vector space method retrieves information that is similar to another object that the user desires:


Ri = c(Q,Oi) for all i

In addition to relationships between a query and the retrieved objects, note that there are relationships between various retrieved objects, represented by the arrows Rik between objects Oi and Ok. I make use of this fact for post-process filtering.
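The three definitions above can be collapsed into one retrieval rule: build DU from every object whose relationship to the query is positive. The sketch below uses invented relationship values; only the thresholding logic is the point:

```python
def retrieve(R, threshold=0.0):
    """D_U = {O_i : R_i > threshold}, for any relationship definition."""
    return {obj for obj, r in R.items() if r > threshold}

# Boolean: R_i is the scalar 1 for every object matching the query.
R_boolean = {"o1": 1, "o2": 1}
# Probabilistic: R_i is a probability per object (values invented here).
R_prob = {"o1": 0.9, "o2": 0.2, "o3": 0.0}
# Vector space: R_i is a correlation per object (values invented here).
R_vector = {"o1": 0.7, "o2": 0.0}

print(retrieve(R_prob))  # objects with nonzero probability
```

The retrieval rule is identical in all three cases; only the source and interpretation of the Ri values differ.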

Information Display

Once the information objects are retrieved from the domain, they must be displayed to users. Two types of criteria can be used to rank the retrieved objects for display: internal criteria, based on each object's relationship to the query, and external criteria, based on information outside the query, such as relationships among the objects themselves.

Of course, combinations of internal and external criteria can also be used.

For best-match retrieval methods, the objects can be displayed in order according to their relationship to the query. Objects with higher probabilities or higher correlation values are displayed first. The relationships are used as the criteria for displaying the objects. For exact-match retrieval methods, internal criteria do not provide a way to display the results because all retrieved objects have a relationship of 1.

External criteria are often used to display the results of search engines. Perhaps the best-known example of external display criteria is the PageRank method used by Google, which ranks pages according to how many other pages link to them. Other methods include the Hub-Threshold Kleinberg algorithm. I use the term Pi to represent ranking methods in general.
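As a rough illustration of an external ranking criterion, here is a minimal power-iteration sketch of PageRank-style scoring. The link graph and damping factor are invented for illustration; real implementations must also handle dangling pages and convergence testing:

```python
# links[p] is the set of pages that page p links to (invented graph).
links = {"a": {"b"}, "b": {"a", "c"}, "c": {"a"}}

def pagerank(links, damping=0.85, iters=50):
    """Iteratively redistribute rank along links until (roughly) stable."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs)  # each outlink gets an equal share
            for q in outs:
                new[q] += damping * share
        rank = new
    return rank

rank = pagerank(links)
```

Page "a" receives links from both "b" and "c", so it ends up with the highest score; note the criterion depends only on the link structure, never on the user's query.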

Regardless of which kind of ranking criteria is used, there is often also a display threshold. Retrieved objects that have a ranking below the display threshold are not shown to users. An object with a very low ranking is thought to be irrelevant and its relationship with the query is thought to be random, rather than due to any relationship that would be significant to users.

Post-Retrieval Filtering

What I propose is another step after retrieval and display to further refine the results and reduce the number of retrieved objects to one that is reasonable to examine. There are several ways this can be accomplished using combinations of object filters, new query filters, negative query filters, threshold filters, and object relationship filters.

Object filtering is the process of letting users eliminate individual objects or whole sets of objects from the user information domain DU. This can be done by letting users specify objects to remove or categories of objects to remove. The removal process is dependent on the type of information being retrieved. When the retrieved objects are files, the criteria used to remove objects might be file name, location (path name), size, modification date, or creation date. With regard to CodeMatch, the general file filtering, specific file filtering, and folder filtering are examples of object filtering. In the case of a search engine, a user could specify to remove certain websites from the search results, or users could specify certain web page names to be removed. Users could also click on certain specific pages to be removed.
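A small sketch of object filtering on retrieved file results follows; the file names, paths, and exclusion rules are invented, but they mirror the criteria named above (name, location, and so on):

```python
# Invented retrieval results: each object is a file with some metadata.
results = [
    {"name": "main.c",  "path": "src/",         "score": 0.9},
    {"name": "qsort.c", "path": "open_source/", "score": 0.8},
    {"name": "util.c",  "path": "src/",         "score": 0.4},
]

def object_filter(objects, excluded_names=(), excluded_paths=()):
    """Drop any object whose name or path matches an exclusion rule."""
    return [
        o for o in objects
        if o["name"] not in excluded_names
        and not any(o["path"].startswith(p) for p in excluded_paths)
    ]

# E.g., remove known open-source files that inflate correlation scores.
filtered = object_filter(results, excluded_paths=("open_source/",))
```

The user never re-runs the retrieval; whole categories of objects are simply struck from DU after the fact.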

New query filtering refers to using a new query on the retrieved user domain DU to create a new domain DU' that is a subset of DU. Some search engines provide this kind of filtering by allowing the user to further search the retrieved results with a new Boolean expression of keywords.
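New query filtering can be sketched as re-querying the retrieved set rather than the original domain; the page texts and query terms below are invented:

```python
# D_U: the already-retrieved objects (invented contents).
d_u = {
    "p1": "software source code correlation",
    "p2": "software licensing terms",
}

def new_query_filter(retrieved, new_terms):
    """Apply a fresh Boolean AND query to D_U, yielding the subset D_U'."""
    return {
        k: v for k, v in retrieved.items()
        if all(t in v.split() for t in new_terms)
    }

d_u_prime = new_query_filter(d_u, ["correlation"])  # keeps only p1
```

Because DU is typically far smaller than the original domain, the second query is cheap even when the original retrieval took days.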

Negative query filtering refers to applying a query to the retrieved information in order to eliminate objects. For instance, suppose the original query is a Boolean query to find all documents with the phrases "software," "source code," or "correlation." A query-based elimination filter would be one where the user eliminates all retrieved documents that contain the keyword "correlation." This would be equivalent to an original query to find all documents with the phrases "software" or "source code" but not "correlation." There are two reasons in particular that negative query filtering is useful:

Threshold filtering involves setting thresholds for displaying the retrieved objects to the user after the information has been retrieved. I define three kinds of threshold filters—relationship thresholds, ranking thresholds, and number thresholds. Combinations of these thresholds are also possible.

With a relationship threshold, the value used for determining the threshold is the Ri relationship between the query and the objects. In other words, any object Oi with relationship Ri that is less than threshold T gets eliminated from the results.

Ranking threshold filtering can be based on the information display ranking. For example, the Google PageRank criteria can be used such that any object Oi with rank Pi that is less than threshold T gets eliminated from the results.

Number threshold filtering is the process of simply reducing the number of objects in the user domain DU to one that is more manageable. It requires that the information retrieval method be a best-match method or that the information display process use a ranking method. Otherwise, for the exact-match method, all retrieved objects have equal relationships to the query, and eliminating a specific number of them would have to be arbitrary. Given a number threshold N, if the retrieval method is best match, the objects Oi are ordered from highest to lowest by their relationship Ri until the number of objects displayed is N. If the retrieval method is exact match but the display process uses a ranking method, the objects Oi are ordered from highest to lowest by their ranking Pi until the number of objects displayed is N.

Thresholds need not be minimum thresholds. Maximum thresholds and combinations of minimum and maximum thresholds may be appropriate if the user wishes to study various aspects of the retrieved information such as statistical distributions of the information. Effectively, these thresholds can be used as low pass, high pass, and band pass filters.
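The three threshold filters can be sketched as follows, using invented objects that carry both a relationship value R and a display rank P:

```python
# Invented retrieved objects with relationship R_i and ranking P_i.
objects = [
    {"id": "o1", "R": 0.9, "P": 0.8},
    {"id": "o2", "R": 0.6, "P": 0.2},
    {"id": "o3", "R": 0.1, "P": 0.9},
]

def relationship_threshold(objs, t):
    """Drop objects whose relationship R_i falls below threshold T."""
    return [o for o in objs if o["R"] >= t]

def ranking_threshold(objs, t):
    """Drop objects whose display rank P_i falls below threshold T."""
    return [o for o in objs if o["P"] >= t]

def number_threshold(objs, n, key="R"):
    """Keep only the N objects with the highest relationship or rank."""
    return sorted(objs, key=lambda o: o[key], reverse=True)[:n]
```

Swapping the `>=` comparisons for `<=`, or combining a minimum and a maximum, gives the high-pass, low-pass, and band-pass behavior described above.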

An object relationship filter lets users select an object Oi that they feel is characteristic of an object that the user wants to see or that the user does not want to see. All similar objects are then removed, or all dissimilar objects are removed, depending on whether the filter is a positive object relationship filter or a negative object relationship filter.

For a positive object relationship filter, the user selects an object Oi and specifies a minimum relationship value RM. Object Oi and all objects Ok such that the relationship Rik between objects Oi and Ok is greater than or equal to the minimum relationship value RM are eliminated from the user information domain DU. In this case, object Oi is selected as an example of an object that the user feels is not relevant. In the case of a search engine, the user would select a web page that the user feels is not relevant. That web page and all similar web pages would be eliminated from the search results.

For a negative object relationship filter, the user selects an object Oi and specifies a minimum relationship value RM. All objects Ok such that the relationship Rik between objects Oi and Ok is less than the minimum relationship value RM are eliminated from the user information domain DU. In this case, object Oi is selected as an example of an object that the user feels is particularly relevant. In the case of a search engine, the user would select a web page that the user feels is most relevant, and all dissimilar web pages would be eliminated from the search results.
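The positive and negative object relationship filters can be sketched together; the pairwise similarity values below are invented, standing in for whatever Rik the retrieval system computes:

```python
# R[(i, k)]: invented pairwise relationship between objects i and k.
R = {("o1", "o2"): 0.9, ("o1", "o3"): 0.1}

def pairwise(i, k):
    """Look up R_ik symmetrically; an object relates fully to itself."""
    return 1.0 if i == k else R.get((i, k), R.get((k, i), 0.0))

def positive_filter(objects, exemplar, r_min):
    """Exemplar is irrelevant: drop it and everything similar to it."""
    return [o for o in objects
            if o != exemplar and pairwise(exemplar, o) < r_min]

def negative_filter(objects, exemplar, r_min):
    """Exemplar is relevant: drop everything dissimilar to it."""
    return [o for o in objects if pairwise(exemplar, o) >= r_min]

objs = ["o1", "o2", "o3"]
# User flags "o1" as irrelevant: "o2" (similar) goes too, "o3" survives.
kept = positive_filter(objs, "o1", r_min=0.5)  # ["o3"]
```

In both cases the user only points at an example; the pairwise relationships Rik do the work of generalizing that judgment to the rest of DU.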

Object-relationship filtering lets users select objects to be included in or excluded from the results without understanding the details of why an object is or is not relevant. This can be very important because unsophisticated users can look at the results of a web search and recognize good and bad results when they see them, but may not be able to define those good and bad results in terms of keywords. Object-relationship filtering has great potential for e-commerce. Consumers often cannot easily create keywords to define the items they desire. But they know it when they see it. They can click on an exemplary item from a search and obtain a list of all similar items.

Conclusion

Better methods of information retrieval will always be needed, and these methods are improving regularly. Better methods of information display are also important, and there is great demand for them, as evidenced by the success of Google, one of whose major innovations was in the area of information display. Automatic filtering of retrieved information is a worthy goal, and research is ongoing in that area as well. However, automatic filtering may never be 100-percent accurate, and human-aided filtering has many benefits that have yet to be fully exploited.

References

Nicholas J. Belkin and W. Bruce Croft, "Information Filtering and Information Retrieval: Two Sides of the Same Coin?" Communications of the ACM, 35(12), 29-38, 1992.

Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," WWW7/Computer Networks 30(1-7): 107-117, 1998.

Peter W. Foltz and Susan T. Dumais, "Personalized Information Delivery: An Analysis of Information Filtering Methods," Communications of the ACM, 35(12), 51-60, 1992.

Lee Gomes, "Computer Scientists Pull a Tom Sawyer To Finish Grunt Work," The Wall Street Journal, June 27, 2007.

Peter Pirolli and Stuart K. Card, "Information foraging," Psychological Review, 106: 643-675, 1999.

G. Salton and M.J. McGill, Introduction to Modern Information Retrieval, McGraw Hill, New York, 1983.

Bob Zeidman, "Detecting Source-Code Plagiarism," Dr. Dobb's Journal, July 2004, 55-60.

Bob Zeidman, "Iterative Filtering of Retrieved Information to Increase Relevance," The 11th World Multi-Conference on Systemics, Cybernetics and Informatics, July 11, 2007.
