Tools and algorithms for finding plagiarism in source code
Bob is a senior member of the IEEE and president of Zeidman Consulting, a contract research and development firm. He can be contacted at bobzeidmanconsulting.com.
Plagiarism detection programs have been around for a number of years. Examples of some widely used programs include Plague, the YAP programs (YAP, YAP2, and YAP3), JPlag, and MOSS. These programs have recently gotten more attention for two main reasons:
- The Internet and search engines (like Google) have made source code easy to obtain.
- The open-source movement lets programmers all over the world write, distribute, and share code. Moreover, the increased mobility of high-tech employees contributes to code leaving one company and showing up at another company, intentionally or not.
It follows that plagiarism detection programs have become more sophisticated in recent years. In his excellent paper "Plagiarism in Natural and Programming Languages: An Overview of Current Tools and Technologies" (http://ir.shef.ac.uk/cloughie/papers/plagiarism2000.pdf/), Paul Clough discusses tools and algorithms for finding plagiarism in generic text documents, as well as in programming- language source-code files. Since I am only interested in tools and algorithms for finding plagiarism in source code, I confine my discussion to those relevant tools.
Plague (http://www.csse.monash.edu.au/projects/plague/), YAP (http://www.cs.su.oz.au/~michaelw/YAP.html), and JPlag (http://wwwipd.ira.uka.de:2222/) can be described in general terms as using the following three-phase algorithm. Each program uses different methods for each phase, but essentially the steps are the same:
- Remove comments, identifier names, and whitespace.
- Replace code statements with a sequence of tokens.
- Compare token sequences to find similarity.
In the first phase, all comments are discarded. Also, identifier names such as variable name and function names are discarded. Whitespace is removed. In the second phase, programming-language statements are replaced with generic tokens that describe the basic functionality of the program, as in Figure 1. In the third phase, the token sequences of pairs of files are compared to find matches.
There are a number of problems with this type of approach:
- These programs are difficult to adapt to new programming languages because they are so dependent on expert knowledge of the programming language's source code they are examining.
- They are vulnerable to changing the order of code lines in the source code.
- They throw out useful information when they discard comments, variable names, function names, and other identifiers.
The second point is a particular problem because code sections can be rearranged and individual lines can be reordered to fool these programs into giving lower scores or miscopied code altogether. This is one method that sophisticated plagiarists use to hide malicious code theft.
The third point is a problem because comments, variable names, function names, and other identifiers can be useful in finding plagiarism. These identifiers can be used to pinpoint copied code immediately. Even in many cases of intentional copying, comments are inadvertently left in the copied code and can be used to find matches. Common misspellings or the use of particular words throughout the program in two sets of source code can help identify them as having the same author, even if the code structures themselves do not match. In Cadence v. Avant!, in which Avant! was found guilty of stealing code from Cadence Design Systems, the code for Avant!'s ArcCell and Aquarius programs included words in comments that were consistently misspelled in the same way as the comments in the code for Cadence's Symbad program. It appeared that sections of the code had been rewritten but many comments remained. It was thus proven that Avant! had not started its code development from scratch but had gotten a head start by illegally appropriating source code. ("'Smoking gun' presented at Avant! restitution hearing," EE Times, June 27, 2001.) Ignoring comments and identifier names is a common problem with plagiarism tools.
The MOSS program (http://www.cs.berkeley.edu/~aiken/moss.html) works differently than the others. Its algorithm can be described by these phases:
- Remove all whitespace and punctuation from each source-code file and convert all characters to lowercase.
- Divide the remaining nonwhitespace characters of each file into k-grams, which are contiguous substrings of length k, by sliding a window of size k through the file. In this way, the second character of the first k-gram is the first character of the second k-gram, and so on.
- Hash each k-gram and select a subset of all k-grams to be the fingerprints of the document. The fingerprint includes information about the position of each selected k-gram in the document.
- Compare file fingerprints to find similar files.
As with the other programs, MOSS discards comments, variable names, function names, and other identifiers that can be very useful in finding plagiarism.
MOSS considers source-code files as if they were generic text files. Because of this, much structural information that can be used to find matches is lost. For example, whitespace, punctuation, uppercase characters, and nonalphanumeric symbols have significant meaning in programming languages but are thrown out by MOSS.
To speed execution time, MOSS uses a statistical algorithm to find matches. While this results in the program completing faster than if it used a deterministic algorithm, it also results in a small but nonzero chance of missing plagiarized program pairs.
CodeMatch, a program I wrote, takes an entirely different approach to plagiarism detection than the programs previously described. First, CodeMatch compares features of each pair of source-code files completely, rather than using a sampling method for comparing a small number of hashed samples of code. This may require the program to run for hours, or in some cases days, to find plagiarism among large sets of large files. Given the stakes in many intellectual property theft cases, this more accurate method is worth the processing time involved. And it is certainly less expensive than hiring experts on an hourly basis to manually pore over code.
Second, CodeMatch makes use of knowledge of programming languages and program structures to improve the matching results. There is a small amount of information needed in the form of a list of common programming-language statements that CodeMatch must recognize. This list is specific to the programming language being examined. In addition, CodeMatch needs information on characters that are used to identify comments and characters that are used as separators.
CodeMatch uses a combination of five algorithms to find plagiarism: Source Line Matching, Comment Line Matching, Word Matching, Partial Word Matching, and Semantic Sequence Matching. Each algorithm is useful in finding different clues to plagiarism that the other algorithms may miss. By using all five algorithms, chances of missing plagiarized code is significantly diminished. Before any of the algorithm processing takes place, some preprocessing is done to create string arrays. Each file is represented by three arraysan array of source lines that consists of lines of functional source code and does not include comments; an array of comment lines that consists of comments and does not include functional code; and an array of identifiers found in the source code. Identifiers include variable names, constant names, function names, and any other words that are not keywords of the programming language.
For each file pair, the CodeMatch Word Matching algorithm counts the number of matching words that are not programming-language keywords. To determine whether a word is a programming-language keyword, comparison is done with a list of programming-language keywords. For example, the word while in a C source code file would be ignored as a keyword by this algorithm. In some programming languages such as C and Java, keywords are case sensitive. In other programming languages like Basic, keywords are not case sensitive. CodeMatch has a switch to turn case sensitivity on/off depending on the language being examined. So, for a case-sensitive language like C, the word While would not be considered a language keyword and would not be ignored. In a case-insensitive language such as Basic, however, the word While would be considered a language keyword and would be ignored. In either case, when comparing nonkeyword words in the file pairs, case is ignored so that the word Index in one file would be matched with the word index in the other. This case-insensitive comparison is done to prevent being fooled by simple case changes in plagiarized code.
Partial Word Matching
The CodeMatch Partial Word Matching algorithm examines each nonkeyword word in the source code of one file and finds all words that match a sequence within one or more nonkeyword words in the other file. Like the word-matching algorithm, this one is also case insensitive. Figure 2 shows the results of this algorithm. In Figure 2(a), the nonkeyword words from the two files are displayed. In Figure 2(b), every word from one file that can be found as a sequence within a word from the other file is listed. So the word abc in file 1 can be found within words aabc, abc1111111, and abcxxxyz in file 2. Note that word pdq is not listed in the array of partially matching words because it matches a complete word in both files and was already considered in the word-matching algorithm. Also note that x is not listed in the array because one-character words are ignored for being too common.
Source Line Matching
The CodeMatch Source Line Matching algorithm compares each line of source code from both files, ignoring case. I refer to functional program language lines as source lines and exclude comment lines. Also, sequences of whitespace are converted to single spaces so that the syntax structure of the line is preserved. Note that a line of source code may have a comment at the end. The comment is stripped off for this comparison. Source lines that contain only programming-language keywords are not considered matching. For source lines to be considered matches, they must contain at least one nonkeyword such as a variable name or function name. Otherwise, lines containing basic operations would be reported as matching. Figure 3 shows two files along with line numbers and the source lines that are considered matching by this algorithm.
Comment Line Matching
The CodeMatch Comment Line Matching algorithm compares each line of comments from both files, again ignoring case. As before, sequences of whitespace are converted to single spaces so that the syntax structure of the line is preserved. Note that a line of source code may have a comment at the end. The source code is stripped off for this comparison. The entire comment is compared, regardless of whether there are keywords in the comment or not. Figure 4 shows two files along with line numbers and the comment lines that are considered matching by this algorithm.
Semantic Sequence Matching
The CodeMatch Semantic Sequence Matching algorithm compares the first word of every source line in the pair of files, ignoring blank lines and comment lines. This algorithm finds sequences of code that appear to perform the same functions despite changed comments and identifier names. The algorithm finds the longest common semantic sequence within both files. Look at the example code in Figure 5. The sequence of lines in both files is considered matching because the first word is identical when blank lines, whitespace, and punctuation are removed. This algorithm yields a score representing the number of source lines in the longest semantically matching sequence in the two files.
Finally, a single score is given for the similarity of the file pairs. If a file pair has a higher score, it implies that these files are more similar and may be plagiarized from each other or from a common third file. Each of the scores from each of the five individual algorithms is weighted and added to give the Total Match Score. These weights can be adjusted to give the optimal results.
Over time, these weights have been adjusted by comparing results of files in which code has been copied and disguised. However, unlike the other algorithms described in this article, CodeMatch is not intended to give a specific cutoff threshold for file similarity. I recognize that there are many kinds of plagiarism and many ways of fooling plagiarism detection programs. For this reason, CodeMatch produces an HTML output report with a list of file pairs ordered by their Total Match Scores. Users can click on a file-pair hyperlink to bring up a detailed HTML report showing exact matches that occurred between the selected files. In this way, experts are directed to suspicious similarities and allowed to make their own judgments. CodeMatch is not a tool for precisely pinpointing plagiarized code, but rather a tool to assist an expert in finding plagiarized code. CodeMatch reduces the effort needed by the expert by allowing him to narrow his focus from hundreds of thousands of lines of code in hundreds of files to dozens of lines of code in dozens of files.
Experts in intellectual property litigation have used the commercial version of CodeMatch to successfully search for plagiarized source code. A demo version of CodeMatch is available at no charge from DDJ (see "Resource Center," page 3) and my web site (http:/www.zeidmanconsulting.com/). The demo version compares two files and lists all their similarities as determined by the five algorithms. The commercial version compares lots of files in different directories and ranks them according to most similar pairs. Then you can click on a pair to see their similarities as determined by the five algorithms.
Thanks to the following people who have years of engineering experience designing software and hardware. Each has contributed suggestions for the CodeMatch program and ran tests to help determine its usefulness: Bob Wedig, Mike Potel, Charlie Neuhauser, Brian Berg, and Rich Belgard.
Whale, Geoff. "Identification of Program Similarity in Large Populations," The Computer Journal, Vol. 33, Number 2, 1990.
Wise, Michael J. "YAP3: Improved Detection of Similarities in Computer Program and Other Texts," Proceedings of SIGCSE '96.
Prechelt, Lutz, Guido Malpohl, and Michael Philippsen. "Finding Plagiarisms Among a Set of Programs with JPlag," Journal of Universal Computer Science, Vol. 8, no. 11, 2002.
Schleimer, Saul, Daniel Wilkerson, and Alex Aiken. "Winnowing: Local Algorithms for Document Fingerprinting," Proceedings of SIGMOD 2003, 2003.
Clough, Paul. "Plagiarism in Natural and Programming Languages: An Overview of Current Tools and Technologies," Research Memoranda, CS-00-05, Department of Computer Science, University of Sheffield, UK, 2000.