Dr. Dobb's is part of the Informa Tech Division of Informa PLC

Getting Better Search Results

Bob is the president of Zeidman Consulting. He can be reached at [email protected].

Search engines are great. Put in keywords and out pop hundreds, thousands, sometimes millions of web pages. But then what? How can you effectively look at all of those pages? Maybe it's time to put people back in the equation. After all, we can still do a few things better than computers, like quickly filtering out irrelevant information. With the right tools, the computer can even help us do this more efficiently.

In this article (which is based on a more detailed paper presented at the 11th World Multi-Conference on Systemics, Cybernetics and Informatics), I use human-aided filtering to home in on useful information. I have incorporated human-aided filtering into a tool for finding software plagiarism. After the tool finds similar sections of code in two programs, the human and the computer work together to pinpoint the results that are most relevant.


For the past decade, I've served as an expert witness in intellectual property cases, where I am asked to examine software source code from a plaintiff or defendant to determine whether one has plagiarized code from the other. Over time, I've found that the few existing tools for plagiarism detection were too inaccurate for situations where hundreds of millions of dollars could be at stake. Consequently, I developed my own tool called "CodeMatch" (www.safe-corp.biz/products_codesuite.htm).

CodeMatch uses four algorithms to determine the correlation between source-code files for different programs:

  • Statement Correlation. A measure of the number of identical statements.
  • Comment Correlation. A measure of the number of identical comments.
  • Identifier Correlation. A measure of the number of identical and nearly identical identifiers.
  • Instruction Sequence Correlation. A measure of the longest sequence of identical instructions.
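
To make the four measures concrete, here is a minimal Python sketch of what each one counts. It is not CodeMatch's implementation: the function names, the `//` comment marker, and the simplifications are all assumptions made for illustration. In particular, it counts only identical identifiers (not "nearly identical" ones), does not exclude language keywords, treats each whole line as one statement or instruction, and reports raw counts rather than normalized correlation scores.

```python
import re

def split_lines(source):
    """Return non-blank lines with surrounding whitespace removed."""
    return [line.strip() for line in source.splitlines() if line.strip()]

def statements_and_comments(source, marker="//"):
    """Separate each line into its statement part and its comment part.

    Assumes single-line comments introduced by `marker` (hypothetical choice).
    """
    statements, comments = [], []
    for line in split_lines(source):
        code, _, comment = line.partition(marker)
        if code.strip():
            statements.append(code.strip())
        if comment.strip():
            comments.append(comment.strip())
    return statements, comments

def shared_count(a, b):
    """Count distinct items that appear in both collections."""
    return len(set(a) & set(b))

def identifiers(statements):
    """Collect identifier-like tokens (keywords are not filtered out here)."""
    return {tok for s in statements for tok in re.findall(r"[A-Za-z_]\w*", s)}

def longest_common_run(a, b):
    """Length of the longest run of consecutive identical items in a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def compare(src_a, src_b):
    """Return raw counts for the four measures described in the list above."""
    stm_a, com_a = statements_and_comments(src_a)
    stm_b, com_b = statements_and_comments(src_b)
    return {
        "statements": shared_count(stm_a, stm_b),
        "comments": shared_count(com_a, com_b),
        "identifiers": shared_count(identifiers(stm_a), identifiers(stm_b)),
        "sequence": longest_common_run(stm_a, stm_b),
    }
```

Calling `compare(text_of_file_a, text_of_file_b)` returns a dictionary with one count per measure. Even this toy version hints at why such a tool can produce a lot of output: common statements and identifiers match across many unrelated files.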

After using CodeMatch on a number of cases, I found that although it had great accuracy, it shared one deficiency with the other tools: too much output. When I examined the results, I often found information that was not relevant to the particular case on which I was working. Because a large comparison could take a week to produce results, it was impractical to rerun the comparison using new settings. I began to spend time manually filtering the results to obtain a more manageable and more relevant set. The main purpose of CodeMatch was to reduce the time I spent looking at lines of code. While it did reduce that time by at least an order of magnitude compared with manually examining code files, I now wanted to reduce the time I spent poring over the results. (My wife thinks this is a bit crazy since I get paid per hour.)
