Search engines are great. Put in keywords and out pop hundreds, thousands, sometimes millions of web pages. But then what? How can you effectively look at all of those pages? Maybe it's time to put people back in the equation. After all, we can still do a few things better than computers, like quickly filtering out irrelevant information. With the right tools, the computer can even help us do this more efficiently.
In this article (which is based on a more detailed paper that was presented at the 11th World Multi-Conference on Systemics, Cybernetics and Informatics), I use human-aided filtering to focus in on useful information. I have incorporated human-aided filtering into a tool for finding software plagiarism. After the tool finds similar sections of code in two programs, the human and the computer work together to pinpoint the results that are most relevant.
CodeMatch
For the past decade, I've served as an expert witness in intellectual property cases, where I am asked to examine software source code from a plaintiff and a defendant to determine whether one has plagiarized code from the other. Over time, I found that the few existing tools for plagiarism detection were too inaccurate for situations where hundreds of millions of dollars could be at stake. Consequently, I developed my own tool called "CodeMatch" (www.safe-corp.biz/products_codesuite.htm).
CodeMatch uses four algorithms to determine the correlation between source-code files for different programs:
- Statement Correlation. A measure of the number of identical statements.
- Comment Correlation. A measure of the number of identical comments.
- Identifier Correlation. A measure of the number of identical and nearly identical identifiers.
- Instruction Sequence Correlation. A measure of the longest sequence of identical instructions.
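To make the four measures concrete, here is a minimal Python sketch of how each one could be computed for a pair of files. It is an illustration only, not CodeMatch's actual implementation: the function names, the one-statement-per-line assumption, the "//" comment convention, the 0.8 similarity threshold for "nearly identical" identifiers, and the use of whole statements as a stand-in for instructions are all assumptions made for this example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical simplification: an identifier is any C-style name.
# Language keywords are not filtered out here.
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def parse(source):
    """Split source text into a list of statements and a list of comments.

    Assumes one statement per line and '//' comments only; a '//' inside
    a string literal would be mis-split by this simple approach.
    """
    statements, comments = [], []
    for line in source.splitlines():
        code, _, comment = line.partition("//")
        if code.strip():
            statements.append(code.strip())
        if comment.strip():
            comments.append(comment.strip())
    return statements, comments


def shared_count(items_a, items_b):
    """Number of distinct items that appear in both lists.

    Used for both the statement and the comment measure.
    """
    return len(set(items_a) & set(items_b))


def identifiers(statements):
    """Set of identifier-like tokens found in the statements."""
    found = set()
    for statement in statements:
        found.update(IDENT_RE.findall(statement))
    return found


def similar_identifiers(ids_a, ids_b, threshold=0.8):
    """Count identifiers in ids_a with an identical or near-identical match in ids_b."""
    return sum(
        1
        for a in ids_a
        if any(SequenceMatcher(None, a, b).ratio() >= threshold for b in ids_b)
    )


def longest_common_run(statements_a, statements_b):
    """Length of the longest run of consecutive identical statements."""
    matcher = SequenceMatcher(None, statements_a, statements_b, autojunk=False)
    match = matcher.find_longest_match(0, len(statements_a), 0, len(statements_b))
    return match.size
```

A real tool would also need language-aware parsing, keyword filtering, and a way to combine the four numbers into a single score; this sketch leaves all of that out.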
After using CodeMatch on a number of cases, I found that although it had great accuracy, it shared one deficiency with the other tools: too much output. When examining the results, I often found information that was not relevant to the particular case on which I was working. Because a large comparison could take a week to produce results, it was impractical to rerun the comparison using new settings. I began to spend time manually filtering the results to obtain a more manageable and more relevant set. The main purpose of CodeMatch was to reduce the time I spent looking at lines of code. While it did reduce that time by at least an order of magnitude compared with manually examining code files, I now wanted to reduce the time I spent poring over the results. (My wife thinks this is a bit crazy since I get paid per hour.)