Superfluous Results
When the results of a comparison were reviewed, specific files or source-code elements would often show up throughout the results, skewing them and hiding important correlation information. For example, open-source files may have been used in one or both sets of files. In a search for plagiarized code, the open-source files would be highly correlated with each other, but these correlations were not important. Pieces of these files would show up throughout both sets of files and be flagged as highly correlated.
Similarly, there were specific statements, comments, and identifiers that showed up in many places but were not relevant to finding plagiarized code. For example, two programs that run on Linux may both use the same system calls, so files containing those calls have a higher correlation. Common identifier names like "index," "count," and "result" showed up in many files, increasing correlation values, but were not necessarily signs of plagiarism.
Had these results been known in advance, some of them could have been eliminated before CodeMatch was run. However, given the number of files and the number of source-code elements, it was impractical to find these elements before performing the correlation. Also, the correlation itself pointed out many of these superfluous elements.
CodeMatch Post-Process Filtering
To make examining the correlation results more useful and let users focus on the kinds of correlation that are most important, I added the ability to filter the results. After CodeMatch produces a database of results, this filtering can be performed on the database:
- Statement filtering. A list of statements is created by users. Any correlation due to a statement on this list is eliminated.
- Comment filtering. A list of comments is created by users. Any correlation due to a comment on this list is eliminated.
- Identifier filtering. A list of identifiers is created by users. Any correlation due to an identifier on this list is eliminated.
- General file filtering. A list of file names is created by users. Any correlation between any file whose name appears on the list and any other file is removed from the results database.
- Specific file filtering. A list of file names with specific paths is created by users. Any correlation between a specific file on the list and any other file is removed from the results database.
- Folder filtering. A list of folders is created by users. Any correlation between a file in a folder on the list and any other file is removed from the results database.
- Threshold filtering. Users can change threshold parameters, reducing the number of correlated file pairs that are displayed. Users can set minimum and maximum correlation scores to display and can set a maximum number of correlated files to display.
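The file, folder, and threshold filters in the list above can be pictured with a small sketch. It is only an illustration under stated assumptions: the `PairResult` record, the 0 to 100 score range, the function names, and the reading of "maximum number of correlated files" as "keep the top N pairs" are all hypothetical and do not describe CodeMatch's actual database or interface.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class PairResult:
    """One correlated file pair from a hypothetical results database."""
    file_a: str  # path of the file from the first set
    file_b: str  # path of the file from the second set
    score: int   # correlation score, assumed here to run from 0 to 100


def is_excluded(path, names, paths, folders):
    """Return True if a file matches any of the file-level filter lists."""
    p = PurePosixPath(path)
    if p.name in names:  # general file filtering: file name only
        return True
    if any(PurePosixPath(x) == p for x in paths):  # specific file filtering
        return True
    # Folder filtering: the file sits in, or anywhere below, a listed folder.
    return any(PurePosixPath(folder) in p.parents for folder in folders)


def filter_pairs(results, names=(), paths=(), folders=(),
                 min_score=0, max_score=100, max_pairs=None):
    """Drop pairs that touch an excluded file or fall outside the thresholds."""
    kept = [
        r for r in results
        if min_score <= r.score <= max_score
        and not is_excluded(r.file_a, names, paths, folders)
        and not is_excluded(r.file_b, names, paths, folders)
    ]
    kept.sort(key=lambda r: r.score, reverse=True)  # highest scores first
    return kept if max_pairs is None else kept[:max_pairs]


results = [
    PairResult("a/zlib/inflate.c", "b/zlib/inflate.c", 98),  # open source
    PairResult("a/src/parse.c", "b/core/parser.c", 74),
    PairResult("a/src/util.c", "b/core/helpers.c", 12),
    PairResult("a/src/getopt.c", "b/lib/getopt.c", 90),      # common utility
]
kept = filter_pairs(results, names={"getopt.c"}, folders={"a/zlib"},
                    min_score=20)
for r in kept:
    print(r.file_a, r.file_b, r.score)
# prints: a/src/parse.c b/core/parser.c 74
```

In this sketch a pair is dropped as soon as either of its files matches a list, which mirrors the "and any other file" wording of the file and folder filters.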
After the filtering is performed on the database, the correlation scores between file pairs are adjusted accordingly. I found that for large file sets, this filtering reduced the time needed to manually review the results and find plagiarized source-code files from days to hours or even minutes.
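As a rough illustration of how a pair's score could change once listed statements, comments, and identifiers are removed, consider the sketch below. The `PairMatches` record and the score, which simply counts the matched elements that remain, are placeholders invented for this example; they are not CodeMatch's actual data model or scoring formula.

```python
from dataclasses import dataclass, field


@dataclass
class PairMatches:
    """Matched elements for one file pair (hypothetical record layout)."""
    file_a: str
    file_b: str
    statements: set = field(default_factory=set)
    comments: set = field(default_factory=set)
    identifiers: set = field(default_factory=set)

    def score(self):
        # Placeholder score: the number of matched elements still present.
        # This is an assumption for illustration, not CodeMatch's formula.
        return (len(self.statements) + len(self.comments)
                + len(self.identifiers))


def apply_element_filters(pair, statements=(), comments=(), identifiers=()):
    """Return a copy of the pair without the matches named on the lists."""
    return PairMatches(
        pair.file_a,
        pair.file_b,
        pair.statements - set(statements),
        pair.comments - set(comments),
        pair.identifiers - set(identifiers),
    )


pair = PairMatches(
    "a/src/parse.c",
    "b/core/parser.c",
    statements={"i++;", "fd = open(path, O_RDONLY);",
                "x = crc_table_init(seed);"},
    comments={"/* TODO */"},
    identifiers={"index", "count", "result", "crc_table_init"},
)
filtered = apply_element_filters(
    pair,
    statements={"i++;", "fd = open(path, O_RDONLY);"},
    identifiers={"index", "count", "result"},
)
print(pair.score(), filtered.score())
# prints: 8 3
```

The matches that survive the filters are the unusual ones, such as the shared `crc_table_init` identifier in this made-up data, which is the kind of correlation a reviewer would want to examine first.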
My experience using filtering with CodeMatch can be generalized to any kind of information retrieval process. To understand how filtering can be used, it is important to first understand the different kinds of information retrieval processes and information display methods.