Channels ▼

Web Development

Catching Cheats with the Perl Compiler

March, 2004: Catching Cheats With the Perl Compiler

Catching Cheats With the Perl Compiler

The Perl Journal March, 2004

By Deborah Pickett

Debbie teaches Perl and assembly at Monash University in Australia. She can be reached at

Laziness, impatience, hubris. Perl users have been raised to believe that these are the virtues of a good programmer, but they have a dark side. They are also the character flaws of the cheat and plagiarist:

Laziness: I can't be bothered learning how to program in this language.

Impatience: If I copy off my friend, then I'll be able to do stuff I actually enjoy doing sooner.

Hubris: I won't get caught.

The issue of plagiarism doesn't often come up in the world of Perl, perhaps because of the Perl community's commitment to open source and giving credit where it's due. But it's a different story in the introductory Perl programming course that I teach at Monash University. Here, the assignments I set for my students must be the students' own work, and students who copy others' work without giving credit are considered to be cheating the system. Transgressors are punished, swiftly and mercilessly.

At least they would be if a tool existed for comparing Perl programs with each other. There are plenty of tools for comparing C and Java and other languages, but I couldn't locate any for Perl. Ironically, a package my university uses to compare C code, called "Moss," uses Perl, but doesn't compare Perl source code itself.

Perhaps this absence of a comparison tool is partly due to the aforementioned lack of need, but it surely must also be because Perl is a notoriously difficult language to parse. Simple substring comparison isn't good for detecting similarities in code because people change indentation, comments and variable names. To properly get a picture of what the program is doing, it's necessary to parse the source.

Only perl can Parse Perl

There are two choices when it comes to parsing Perl. The first option is to write a Perl grammar in whatever yacc-like notation you prefer, and generate a parser that accepts that grammar. While this is easy in C, it's close to impossible in Perl because of the language's syntactic idiosyncrasies. However, this may not be too great a handicap since typical Perl programs, as written by neophytes, don't use such features; it may be possible to parse a decent subset of Perl using off-the-shelf tools such as Parse::RecDescent. A big advantage of this approach is that whitespace, comments, and other nontokens that the Perl parser ignores could be examined, too, for hints of common source-code ancestry.

The second solution is to make perl (the executable) parse Perl (the language), something that, by definition, it will always get right. There are two ways: The first is to use Perl's -Dx command-line option, which spits out an ugly syntax dump of a program, and parses the output into some other form. A few years ago, this would have been the only option. But with the introduction of the B::* suite of compiler back ends, there is a better choice: Create a new subclass of B that picks salient features in the source code's parse tree, and pipe the program through it. Unfortunately, some features of the source, such as whitespace, will be lost because the Perl tokenizer strips these before the parse tree is built.

I think that a robust solution to the code-similarity problem needs to use some of each of the aforementioned two approaches. For a quick-and-dirty solution, however, I opted to make use of the Perl compiler and wrote a module called B::Fingerprint, which turns a program into a reasonably short and descriptive string, which, in turn, can be analyzed using more traditional string-comparison tools.

The M.O. of a Plagiarist

Plagiarists typically start with a working, completed piece of code written by someone else, and either try to work it into their own broken code or scrap their own code and spend the rest of their time trying to make the original code look different. Because they don't have a great deal of confidence in the language, they tend to make small, incremental changes to the code and hope the program still works (as a rule, plagiarists aren't terribly good at testing code).

The most common transformations are:

  • Rewriting the comments;
  • Indenting the code differently;
  • Changing variable names, and
  • Reordering subroutines in the program.

Somewhat rarer changes include changing if to unless and reordering a bunch of independent initialization statements.

Any technique that compares programs for evidence of copying should try to downplay the effects of these transformations and look at the program's deeper structure, which will probably be left untouched.


As its prefix suggests, B::Fingerprint is a compiler back end. Back ends are modules that can examine or manipulate the opcode tree of a Perl program, and usually finish up printing something interesting about the program. Perhaps the most well known is B::Deparse, which emits a human- (and perl-) readable rendition of Perl code. That something like B::Deparse can even exist means that there is a great deal of information available in the opcode tree for B::Fingerprint to examine.

Some back ends (such as B::Deparse) are interested in the tiny details that make up a piece of code. Others, like B::Showlex (which identifies the lexical variables that a subroutine uses), are interested in only one part of the code. B::Fingerprint, on the other hand, needs to give a broad overview of all of the code, so that similarities between two programs will engender similar fingerprints. In this case, a fingerprint is a long string that characterizes the program.

To understand how to detect when programs come from the same source, you have to use an almost forensic technique. You have the scene of the crime (the programs) but nothing else. The rest you have to assemble yourself from the evidence. So it helps to understand what usually happens to a piece of code when someone tries to cover their tracks. B::Fingerprint manages to work because it completely ignores the things that a plagiarist usually changes. For instance, B::Fingerprint doesn't care about variable names at all; all it knows is that a scalar was used here in the code. Even if you change all the scalar variable names in a program to $fish, the fingerprint will be unchanged.

From a technical viewpoint, B::Fingerprint walks the opcode tree of the program, printing a symbol for each tree node it sees. Perl opcodes come in about a dozen different kinds; for instance, there are binary operators that correspond to two-argument Perl operations like addition, and list operators that appear anywhere a sequence of operators needs to be evaluated in some order, such as in a Perl list or a sequence of Perl statements. Each opcode type that B::Fingerprint sees produces a different character of output in the fingerprint. Some operator types have child nodes; these are always printed as suffixes, between braces.

Here's the fingerprint for B::Fingerprint itself:

perl -MO=Fingerprint B/ 


Probably the only salient feature you could pick out easily is the string of 24 "$" characters, corresponding to the list of initializers for the %opclass variable.

Comparing Fingerprints

Creating the fingerprints of programs is only half of the problem. It's still necessary to compare two fingerprints to see how similar they are (hence, how similar the original programs are). Doing this well turns out to be surprisingly difficult.

One metric that can be used to establish how similar programs are is to take one fingerprint, and find out how many changes need to be made to it to arrive at the other program's fingerprint. This isn't always symmetrical (so, for instance, program A can be 80 percent the same as program B, but B may be only 65 percent the same as A), but it's capable of ranking similar pairs of fingerprints above dissimilar ones.

Ideally, the comparison algorithm should be able to distinguish small changes from large changes. However, "small" and "large" don't necessarily relate to lines of code affected. For instance, changing the order of subroutines in a file is a trivial modification, even though several hundred lines may have been relocated. Wrapping an if condition around a block is a more significant change to a program, though it may result in only a small change to its fingerprint.

The algorithm I settled on is Walter Tichy's string-to-string block-move algorithm, used in his RCS source-code revision control package. The challenges faced in keeping track of a program's revisions are similar to those involved in detecting plagiarism: You want to keep the deltas between revisions as short as possible, so it is a good idea to try to eliminate the parts of each revision that are the same. So, it turns out that the block-move algorithm is also good at detecting the less innocuous kinds of "revision" that happen in a case of plagiarism.

The Block-Move Algorithm

The block-move algorithm acts rather like a repeated cut-and-paste operation. Given two strings, A and B, it tries to reconstruct string B using only substrings from A. For example, if string A contained "full hands" and string B contained "handfuls," then B could be built with three block moves:

0 1 2 3 4 5 6 7 8 9
f u l l   h a n d s

  <li>From position 5, for 4 characters (<i>hand</i>)</li>
  <li>From position 0, for 3 characters (<i>ful</i>)</li>
  <li>From position 9, for 1 character (<i>s</i>)</li>

In RCS, these numbers constitute the delta from A to B, and are all that is actually stored in the revision control directory. For my purposes, all I care about is that it took three block moves to create a string of length 8. This ratio is a good indicator of how much of B came from A: the lower the ratio, the more similar the code is.

(In rare cases, there might be a character in B that doesn't appear in A at all. In RCS, such characters have to be encoded directly into the delta; in my application, they will simply count as an additional block move.)

Suffix Trees

In the aforementioned example, three block moves are necessary to create B from A. It's important to get the best figure here because it's possible to get a higher number by choosing blocks badly. Thus, the block-move algorithm needs to be greedy, always choosing the longest possible block to copy at each point. Greediness guarantees an optimum result and, in terms of the block-move algorithm, means that it must scan one string (A) for the longest prefix from another string (B).

A naive implementation of the longest-prefix problem is likely to run very slowly, as there are many choices to make at each character—so it makes sense to transform the search string (A) into some appropriate data structure to accelerate the process. The appropriate data structure turns out, in this case, to be a suffix tree. A suffix tree is a form of trie, which is an n-ary tree optimized for fast lookup. To give you an idea, a trie containing the strings camel, cat, catfish, dog, dromedary, and fish is shown in Figure 1.

A suffix tree for a string A is simply a trie of all substrings in A from each character to the end of the string (i.e., substr($A, $i) foreach $i 0..length $A). A suffix tree for "abracadabra" is given in Figure 2.

Armed with a suffix tree of A, it is now possible to determine the longest substring of A that matches the beginning of B: Simply walk down the tree, matching characters, turning at each node according to the next character in B. When the next character in B isn't available at the current node in the tree, the substring is complete and guaranteed to be the longest. To count the number of block moves, simply repeat the procedure from the tree root on the remainder of B until there is nothing left. The number of block moves is equal to the number of times you visited the root node. Because each character is examined only once, this takes time proportional to the length of B.

The program compare (See Listing 1) accepts a number of file names as arguments and constructs fingerprints for each of them with B::Fingerprint. For each fingerprint, it then constructs a suffix tree, storing it in a hash. (I took the code for creating the suffix trees from the Allison web page listed in References.) With this "forest" of suffix trees, the program calculates the number of block moves required to convert every fingerprint into every other fingerprint. It then prints out the most similar cases.


To test the program, I ran it on a selection of assignment submissions from my 190 Perl course students. I already knew of one case of plagiarism in these assignments, so I hoped to find that one near the top of the resulting list. The assignment source code was, on average, 300 lines long, resulting in fingerprints of about 2000 characters. It took my three-year-old laptop about 400 MB and half an hour to finish processing every pair of fingerprints. Sure enough, back came the plagiarism case I already knew, along with at least 10 other cases involving more than 20 students. There was a lot more laziness, impatience, and hubris in my course than I'd expected.

Interviews with the flagged students revealed that compare had been spot on. Explanations I received from the students ranged from outright copying to working together on the program structure before going off and coding separately. There was only one obvious false positive, and that I classified as such by looking at the source code and deciding that, though there was probably a shared heritage, I didn't have enough evidence to convict.


I should reiterate that compare isn't completely automatic; I did need to examine the source code of the programs that compare flagged, and look for other signs of commonality between the programs. In this respect, compare is nothing more than a quick way of weeding out all the negatives in the n2 pairs in any set. But the fact that I originally detected only 10 percent of the plagiarism cases on a visual inspection suggests that this is still a useful tool.

A couple of years ago, students at Monash University did a similar kind of project comparing C files. It sort of worked, but not nearly as well. So why does it work so well with Perl? I think there are two reasons. First, there's more than one way to do it. Perl has such a rich syntax and such a wide variety of approaches to solving a problem that the likelihood of any two given programs using the same algorithm is smaller with Perl than C. Second, compare doesn't compare Perl source code, but compiled Perl syntax trees. The transformation that Perl's compiler makes to a program's source code makes the resulting fingerprint a truer representation of the program's execution order, reducing the impact of the source's sometimes nonlinear execution (compare if (condition) {code} to code if condition).

Further Work

Now that I've released B::Fingerprint and compare, it's only a matter of time before students learn to pipe their future assignments through it before submission, just to see if I'm going to catch them. This doesn't worry me greatly; the amount of effort needed to change a program so that it no longer resembles the original is large enough that the programmer will learn something about Perl through pure osmosis. Nonetheless, I have some backup plans in case compare's success rate falls.

For instance, the block-move algorithm is only able to perform exact matches. If two strings are identical except for one character in the middle, then the block count increases. A better solution would be to allow approximate matches. This turns out to be a significantly harder problem, however, as some classes are in the computational too-hard basket called "NP-complete" (see the Lopresti and Tomkins paper in References). Approximate matching would likely increase the quantity of false positives, too.

Related to this is the fact that the block-move algorithm can't accurately tell me how much, as a percentage, of one program can be found in another, which is perhaps a more useful metric than the one compare reports. This isn't because the information is lost in the creation of the suffix tree, but rather because the block move algorithm is greedy and always picks the longest substrings. This means that blocks can and often do overlap, and while this results in the optimal number of block moves, those blocks don't necessarily produce the best coverage of the fingerprint. I briefly experimented with the aspect of coverage but it turned out to be an unreliable measurement under the greedy block-move model.

compare reports back on pairs of similar programs, but often there are cliques of students who all work together on a piece of code. It'd be nice if some clustering analysis could be performed on the results, so that I don't have to figure out the "study groups" manually.

On another front, it's worth noting that B::Fingerprint cannot detect whitespace and commenting. Indentation style and other cues (some of which I classify as trade secrets) are often big giveaways that code has changed hands and simply undergone a search-and-replace regime. Comparing whitespace and commenting will greatly reduce the false positives to the point where it may even be possible to trust double-checking cases to a program.

Finally, there's a lot more information available in a syntax tree than B::Fingerprint extracts. For instance, scalar literals have a value that is often an important part of the algorithm, and variable names, while they can be changed, are usually modified globally over a function. Comparing these aspects of the syntax tree will probably require an overhaul of the comparison algorithm and might even necessitate switching to a hierarchical tree-comparison algorithm rather than the flat block move that I am presently using.

Each of these enhancements will probably highlight slightly different pairs of similar programs, so a robust plagiarism detector will likely contain a combination of them.


compare is capable of comparing a fairly large number of Perl programs to each other. It reports back on pairs that are likely to be related, with human inspection required. On real-world sample data, it correctly identified 10 percent of the population as not being original work.

B::Fingerprint and compare are available for download at


L. Allison, "Suffix Trees," ~lloyd/tildeAlgDS/Tree/Suffix/ (contains an explanation of Ukkonen's algorithm and pseudocode, which I copied with permission).

D. Lopresti and A. Tomkins. "Block Edit Models for Approximate String Matching," Theoretical Computer Science (1997), vol. 181, no. 1, pages 159-179.

Moss (Measure of Software Similarity): http://www.cs.berkeley .edu/~aiken/moss.html

W.F. Tichy, "The String-to-String Correction Problem with Block Moves," ACM Transactions on Computer Systems (1984), vol. 2, no. 4, pages 309-321.

W.F. Tichy, "RCS: A System for Version Control," Software—Practice and Experience (1991), vol. 15, no. 7, pages 637-654.

E. Ukkonen, "On-line Construction of Suffix Trees," Algorithmica (1995), vol. 14, no. 3, pages 249-260.


Listing 1

#!/usr/bin/perl -w

# compare: compare N Perl programs with each other.
# usage:
#   compare [-n max] file ...
# where
#   max is the maximum number of pairs of similar programs to report.

use strict;

use Getopt::Std;

our %opts;
getopts("n:", \%opts);

# How many cases to report?
our $topcases = $opts{"n"};

# Suffix-tree-building code, adapted from
# based on
# E. Ukkonen's linear-time suffix tree creation algorithm.  Used with
# permission.
  my $infinity = 999999;  # Just has to be longer than any string passed in.

  sub buildTree
    my $fp = shift;
    # Build root state node.
    my $rootState = { };
    my $bottomState = { };
    my ($sState, $k, $i);
    for ($i = 0; $i < length $fp; $i++)
      addTransition($fp, $bottomState, $i, $i, $rootState);
    $rootState->{sLink} = $bottomState;
    $sState = $rootState;
    $k = 0;
    # Add each character to the suffix tree.
    for ($i = 0; $i < length $fp; $i++)
      ($sState, $k) = update($rootState, $fp, $sState, $k, $i);
      ($sState, $k) = canonicalize($fp, $sState, $k, $i);

    return $rootState;
  sub update
    my ($rootState, $fp, $sState, $k, $i) = @_;

    my ($oldRootState) = $rootState;
    my ($endPoint, $rState) = testAndSplit($fp, $sState, $k, $i-1, substr($fp, $i, 1));
    while (!$endPoint)
      addTransition($fp, $rState, $i, $infinity, { });

      if ($oldRootState != $rootState) { $oldRootState->{sLink} = $rState; }

      $oldRootState = $rState;
      ($sState, $k) = canonicalize($fp, $sState->{sLink}, $k, $i-1);
      ($endPoint, $rState) = testAndSplit($fp, $sState, $k, $i-1, substr($fp, $i, 1));

    if ($oldRootState != $rootState) { $oldRootState->{sLink} = $sState; }

    return ($sState, $k);

  sub canonicalize
    my ($fp, $sState, $k, $p) = @_;

    if ($p < $k)
      return ($sState, $k);
    my ($k1, $p1, $sState1) = @{$sState->{substr($fp, $k, 1)}};
    while ($p1 - $k1 <= $p - $k)    
      $k += $p1 - $k1 + 1;
      $sState = $sState1;
      if ($k <= $p)
        ($k1, $p1, $sState1) = @{$sState->{substr($fp, $k, 1)}};
    return ($sState, $k);
  sub testAndSplit
    my ($fp, $sState, $k, $p, $t) = @_;

    if ($k <= $p)
      my ($k1, $p1, $sState1)  = @{$sState->{substr($fp, $k, 1)}};

      if ($t eq substr($fp, $k1 + $p - $k + 1, 1))
        return (1, $sState);
        my $rState = { };
        addTransition($fp, $sState, $k1, $k1 + $p - $k, $rState);
        addTransition($fp, $rState, $k1 + $p - $k + 1, $p1, $sState1);
        return (0, $rState);
      return (exists $sState->{$t}, $sState);

  sub addTransition
    my ($fp, $thisState, $left, $right, $thatState) = @_;

    $thisState->{substr($fp, $left, 1)} = [$left, $right, $thatState];

$| = 1;

# Perl executable.
our $perl = $^X;

# All fingerprints, keyed by filename.
our %fp;

# Suffix trees of all fingerprints, keyed by filename.
our %tree;

# Stop comparing fingerprints after this many blocks.
our $ceiling;

# Get all fingerprints.
foreach my $filename (@ARGV)
  # This is OK as long as characters in name of $file are safe.
  my $fingerprint = `$perl -MO=Fingerprint $filename`;
  if (! $?)
    # Remember this file's fingerprint.
    $fp{$filename} = $fingerprint;
    # Insert the fingerprint into the suffix tree forest.
    $tree{$filename} = buildTree($fingerprint);

# Now compare each pair of fingerprints.
my @result;
my $count = 0;
foreach my $file1 (keys %fp)
  foreach my $file2 (keys %fp)
    next if $file1 eq $file2;

    my $length1 = length $fp{$file1};
    my $length2 = length $fp{$file2};

    # Progress meter.
    print int ($count++ / ((keys %fp) * (keys %fp)) * 100), "% complete\r" if -t STDOUT;

    # Do we have a maximum number of cases to report?
    if (defined $topcases && @result >= $topcases)
      $ceiling = $result[-1]{ratio} * $length2;
      undef $ceiling;
    # Compare the files.
    my $blocks = compare($file1, $file2);
    push @result, {
      file1 => $file1,
      length1 => $length1,
      file2 => $file2,
      length2 => $length2,
      blocks => $blocks,
      ratio => $blocks / $length2,

    # Ripple down new element in @result to keep it sorted.
    # If keeping only the top N cases, this is quicker than
    # sorting afterwards.
    if (defined $topcases)
      my $pos;
      my $new = $result[-1];
      # Insertion sort algorithm.
      for ($pos = @result - 2; $pos >= 0; $pos--)
        if ($new->{ratio} < $result[$pos]->{ratio})
          # Ripple up an element.
          $result[$pos+1] = $result[$pos];
          # Found the right place.
      # Insert the new item at its place.
      $result[$pos+1] = $new;

      # Lose the (now) last element?
      if (@result > $topcases) { pop @result; }

# If collecting all cases, sort so that more similar code is near start
# of list.
if (!defined $topcases)
  @result = sort {$a->{ratio} <=> $b->{ratio}} @result;

# Present results.
foreach my $result (@result)
    $result->{ratio}, " ", $result->{blocks},
    "/", $result->{length2}, ": ",
    $result->{file1}, " => ", $result->{file2},

sub compare
  my ($file1, $file2) = @_;

  # We're trying to reconstruct fingerprint $fp2 from $fp1, so need
  # suffix tree from $fp1.
  my $tree1 = $tree{$file1};
  my $fp1 = $fp{$file1};
  my $fp2 = $fp{$file2};

  # Number of blocks counted so far.
  my $blocks = 0;

  my $pos2 = 0;
  # Keep going while there's any of fingerprint 2 to do.
  BLOCK: while ($pos2 < length $fp2)
    # Find a path through $tree1 that matches the part of $fp2
    # we're up to.
    if (!exists $tree1->{substr($fp2, $pos2, 1)})
      # This character doesn't exist at all in $tree1.  Next block.
      next BLOCK;

    # There's an entry in the suffix tree.
    for (my $state = $tree1->{substr($fp2, $pos2, 1)}; # Start at root.
      defined $state;                    # Stop if finished a leaf node.
      $state = $state->[2]{substr($fp2, $pos2, 1)})  # Next node.
      # Walk through characters in this state, comparing with $fp2.
      for (my $count = 0; $count <= $state->[1] - $state->[0]; $count++)
        # Are there any more characters, and if so, do they match?
        if ($state->[0] + $count < length $fp1 &&
          $pos2 < length $fp2 &&
          substr($fp1, $state->[0] + $count, 1)
            eq substr($fp2, $pos2, 1))
          # Got a match, move on to the next character.
          # Characters don't match; this is the end of a block.
          next BLOCK;
      # Finished this state, and it all matched.  Go do the next one.
    # Count the blocks as we go.
    if (defined $ceiling && $blocks > $ceiling)
      # Exceeded the ceiling, return.

  return $blocks;
Back to article

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.