Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Web Development

Finding Duplicate Files


December, 2003: Finding Duplicate Files

Julius is a freelance network consultant in the Philippines. He can be contacted at [email protected].


Numerous programs in the UNIX system (and its clones) are small and yet excel at what they do. Individually, they can't do much, but when used together they can make a large task easier to implement, debug, and maintain. In keeping with this UNIX philosophy, I'll show you how to use a little UNIX utility, find, and two small Perl scripts to accomplish the job of finding duplicate files in a given directory.

Using find

find is a small but powerful utility that is available on all UNIX/Linux systems. The following command, for example, tells find to descend into /tmp (and recursively descend into all subdirectories it encounters), and print to the standard output the names of all files and subdirectories it finds:

find /tmp -name "*" 

find's output is similar to this:

/tmp
/tmp/tex2pdf-root
/tmp/Gladman
/tmp/Gladman/sha2.c
/tmp/Gladman/uitypes.h
/tmp/Gladman/test.c
/tmp/Gladman/sha2.h
/tmp/Gladman/a.out
/tmp/guile-1.6.4

The best feature of find, just like any good UNIX tool, is that its output can be redirected as input to another program. So, instead of displaying its output to the screen, you can use a pipe to "hand over" its output to the next program for processing, as in

find /tmp -name "*" | ./md5

The symbol "|" is the pipe, and the command above tells us that md5 is the next program that takes in find's output.

The md5 Script

md5 is a short Perl script that uses the MD5 one-way hash function to generate a unique message digest for a given input file. The md5 script is shown in Listing 1. The MD5 algorithm guarantees that every file has its own unique digest. (Well, not quite. But mathematically speaking, the probability of two different files having identical digests is 1 in 264, which is practically nil.) So, if MD5 says that two files have the same digest, you can be sure, with a very high degree of confidence, that they are indeed identical.

The output of the command:

find /tmp -name "*" | ./md5

gives us something like:

294da4cc09dd0024a1f21a7f810dfe60  /tmp/Gladman/sha2.c
7218b118cc8de06b13bd4d997cb00250  /tmp/Gladman/uitypes.h
72f0f6f900800fbd7b14af552b40d8a6  /tmp/Gladman/test.c
b0b8bae1aaf4169ec6dffb1a78b13372  /tmp/Gladman/sha2.h
009b12ab19c4e5640f2bd263f4970d70  /tmp/Gladman/a.out

Each line of md5's output consists of a 32-character MD5 digest, followed by two single spaces, and lastly, a filename.

Line 7 of md5 uses the "diamond operator" (<>) to receive whatever that is output by find. Although invisible to the naked eye, each line of find's output ends in a newline, something we don't need. md5 strips off these newlines via the chomp command (line 8). Line 9 checks whether or not the input line is a regular file (the -f switch).

Most novice Perl programmers would be puzzled by lines 7-9, since <>, chomp, and -f seem to operate on a global variable that was not explicitly defined anywhere in the script. Let me explain lines 7-9 in more detail.

By default, every input line read by <> is loaded into the special variable, $_. When chomp is called without any argument, it is understood that we mean to discard the newline character of $_. The -f switch (guess what?) also acts on $_ implicitly. $_ is updated every time a new input line is read.

Lines 11-13 generate an MD5 digest of the filename, and line 15 prints the digest and the filename.

The finddup Script

finddup takes the output of md5, again, through the use of a pipe. finddup then produces an associative array, %dups, using the digests computed by md5 as keys. See Listing 2 for the whole finddup script.

On lines 11 and 12 of finddup, we again employ the familiar <> and chomp, respectively. On line 13, the first 32 characters of $_ are assigned to $digest, while the rest of $_ gets assigned to $filename.

On line 15, $filename is saved in %dups, using $digest as key. Normally, when the contents of %dups are finally printed out, duplicate files for every key are displayed on one line only, with just a single space separating them. This is ugly to look at. To make the output pleasing to the eye (I hope), I saved a newline every time $filename is push-ed into %dups (line 14). This way, the filenames are displayed in one nice-looking column. In doing this, however, notice that every key of %dups now contains at least two elements (the filenames and the newlines).

On lines 18-24, finddup loops through %dups, and checks for keys with multiple elements. On line 19, if a certain key (the MD5 digest) contains more than two elements (take the newlines into account), we know that duplicate files exist for this digest, and they get printed to the standard output (line 21).

If finddup is called with the --verbose or -v options switched on, the digests for duplicate files are printed as well (line 20). Line 22 is there for cosmetic purposes only.

Sample Output

On my Linux machine, I made a directory and called it testdir. I then created six small test files, some with duplicates.

The command:

find ./testdir -name "*" | ./md5 | ./finddup —verbose

or

find ./testdir -name "*" | ./md5 | ./finddup -v

produces these results:

1  900150983cd24fb0d6963f7d28e17f72
2     ./testdir/ba3e2571
3     ./testdir/9cd0d89d
4     ./testdir/a9993e36
5  
6  d4b7c284882ca9e208bb65e8abd5f4c8
7     ./testdir/1a3472da
8     ./testdir/f2c8d22e
9     ./testdir/hello world

The three files listed on lines 2-4 are all identical; their MD5 digest is printed on line 1. Likewise, line 6 shows the digest of identical files printed on lines 7-9.

Unfortunately, you'll find out that, using either of the commands above, it's not possible to get the digest of a file that starts with a dot ("."). There's a way, though, but I'll leave this as an exercise for the interested reader. (Hint: check the find man page.)

TPJ



Listing 1

1  #!/usr/local/bin/perl
2  use diagnostics;
3  use strict;
4  use warnings;
5  use Digest::MD5;
6  
7  while (<>) {
8      chomp;
9      if (-f) {
10          open INFILE, $_;
11          my $context = new Digest::MD5;
12          $context->addfile(*INFILE);
13          my $digest = $context->hexdigest;
14          close INFILE;
15          print "$digest  $_\n";
16      }
17  }
Back to article


Listing 2
1  #!/usr/local/bin/perl
2  use diagnostics;
3  use strict;
4  use warnings;
5  use Getopt::Long;
6  
7  my %dups = ();
8  my ($digest, $filename, $verbose) = ();
9  GetOptions("verbose" => \$verbose);
10  
11  while (<>) {
12      chomp;
13      ($digest, $filename) = /(.{32})(.*)/;
14      push(@{$dups{$digest}}, "\n");
15      push(@{$dups{$digest}}, $filename);
16  }
17  
18  foreach $digest (sort keys %dups) {
19      if (scalar(@{$dups{$digest}}) > 2) {
20          print "$digest" if ($verbose);
21          print "@{$dups{$digest}}\n";
22          print "\n" if ($verbose);
23      }
24  }
Back to article


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.