Search engines are great. Put in keywords and out pop hundreds, thousands, sometimes millions of web pages. But then what? How can you effectively look at all of those pages? Maybe it's time to put people back in the equation. After all, we can still do a few things better than computers, like quickly filtering out irrelevant information. With the right tools, the computer can even help us do this more efficiently.
In this article (which is based on a more detailed paper that was presented at the 11th World Multi-Conference on Systemics, Cybernetics and Informatics), I use human-aided filtering to focus on useful information. I have incorporated human-aided filtering into a tool for finding software plagiarism. After the tool finds similar sections of code in two programs, the human and the computer work together to pinpoint the results that are most relevant.
For the past decade, I've served as an expert witness in intellectual property cases, where I'm asked to examine software source code from a plaintiff or defendant to determine whether one has plagiarized code from the other. Over time, I found that the few existing tools for plagiarism detection were too inaccurate for situations where hundreds of millions of dollars could be at stake. Consequently, I developed my own tool called "CodeMatch" (www.safe-corp.biz/products_codesuite.htm).
CodeMatch uses four algorithms to determine the correlation between source-code files for different programs:
- Statement Correlation. A measure of the number of identical statements.
- Comment Correlation. A measure of the number of identical comments.
- Identifier Correlation. A measure of the number of identical and nearly identical identifiers.
- Instruction Sequence Correlation. A measure of the longest sequence of identical instructions.
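To make the four measures concrete, here is a simplified sketch in Python. These are not CodeMatch's actual algorithms (which, among other things, also score *nearly* identical identifiers); they are minimal illustrations that count exact matches, assuming C-style `//` comments and treating each nonblank, noncomment line as one statement:

```python
import re

def statement_correlation(lines_a, lines_b):
    # Count distinct statements (nonblank, noncomment lines) common to both files.
    stmts_a = {l.strip() for l in lines_a if l.strip() and not l.strip().startswith("//")}
    stmts_b = {l.strip() for l in lines_b if l.strip() and not l.strip().startswith("//")}
    return len(stmts_a & stmts_b)

def comment_correlation(lines_a, lines_b):
    # Count distinct comments common to both files.
    com_a = {l.strip() for l in lines_a if l.strip().startswith("//")}
    com_b = {l.strip() for l in lines_b if l.strip().startswith("//")}
    return len(com_a & com_b)

def identifier_correlation(src_a, src_b):
    # Count identifiers common to both files (exact matches only in this sketch).
    ids_a = set(re.findall(r"[A-Za-z_]\w*", src_a))
    ids_b = set(re.findall(r"[A-Za-z_]\w*", src_b))
    return len(ids_a & ids_b)

def instruction_sequence_correlation(seq_a, seq_b):
    # Length of the longest run of consecutive identical instructions.
    best = 0
    for i in range(len(seq_a)):
        for j in range(len(seq_b)):
            k = 0
            while (i + k < len(seq_a) and j + k < len(seq_b)
                   and seq_a[i + k] == seq_b[j + k]):
                k += 1
            best = max(best, k)
    return best
```

A production tool would normalize whitespace and token spelling before comparing, and would combine the four scores into a single correlation figure; the brute-force longest-run search above is also far too slow for large codebases, where a suffix-based or dynamic-programming approach would be used instead.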
After using CodeMatch on a number of cases, I found that although it had great accuracy, it shared one deficiency with the other tools: too much output. After examining the results, I often found information that was not relevant to the particular case on which I was working. Because a large comparison could take a week to run, it was impractical to rerun the comparison with new settings. Instead, I began to spend time manually filtering the results to obtain a more manageable and more relevant set. The main purpose of CodeMatch was to reduce the time I spent looking at lines of code. While it did reduce my time by at least an order of magnitude compared to manually examining code files, I now wanted to reduce the time I spent poring over the results. (My wife thinks this is a bit crazy since I get paid per hour.)
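The human-aided part of that filtering can be sketched very simply. The idea is that once the reviewer marks a matched snippet as irrelevant (common boilerplate, a library idiom), the computer suppresses every remaining result containing it, so the human never has to dismiss the same noise twice. The `matched_text` result field below is a hypothetical structure for illustration, not CodeMatch's actual output format:

```python
def filter_results(results, rejected_snippets):
    # Keep only result pairs whose matched text contains none of the
    # snippets the human reviewer has already flagged as irrelevant.
    return [r for r in results
            if not any(s in r["matched_text"] for s in rejected_snippets)]

# Example: the reviewer rejects a ubiquitous loop header, and every
# match built on it disappears in one step.
results = [
    {"files": ("a.c", "b.c"), "matched_text": "for (i = 0; i < n; i++)"},
    {"files": ("a.c", "b.c"), "matched_text": "decode_frame_header(buf)"},
]
relevant = filter_results(results, ["for (i = 0;"])
```

Each rejection shrinks the result set without rerunning the week-long comparison, which is the whole point: the expensive correlation pass happens once, and the cheap human-guided filtering pass happens as often as needed.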