Removing Duplicate Files Across Disk Drives

February 1991

Jerzy Tomasik


Jerzy Tomasik develops scientific application software for Dynamic Solutions Corp. His prior experience includes designing scientific experiments and developing data analysis tools in C, Oracle, and RS/1 under MS-DOS and VMS operating systems. Mr. Tomasik started programming in FORTRAN as a student in Krakow, Poland. You may contact him at 205 S. Redwood Ave. #29, Brea, CA 92621.

All disk drives, regardless of their size, have one thing in common: sooner than you think, they will run out of storage space. Sometimes all the files on your hard disk are really necessary, but more often you'll find a lot of files you could remove. Some excellent disk maintenance utilities are available, but they all lack one function: finding duplicate files. Few things are more frustrating than finding multiple copies of the same file scattered among different subdirectories. There are .EXE files that you have compiled, linked, and copied to your utilities directory, leaving an extra copy in the development directory. There are INSTALL and README files left from software installations that you no longer need. The Turbo C Professional Development package creates UNPACK.EXE and README.COM in three different directories.

For developers, multiple copies of source files scattered around the disk among different projects can be a real nightmare. Maybe you would like to put them in a library, but how do you find them all? The problems on a single-user disk drive pale in comparison to what you would face if you were responsible for a network drive. Invariably, every user will have a personal copy of something that could easily be shared. And don't forget that the smallest unit of disk allocation is the cluster: on my disk a one-byte file occupies 4,096 bytes. Multiple copies of even very small files add up quickly.

Since none of the popular disk maintenance utilities are able to weed out the duplicates, I started thinking about what I would need to make this task easy. I believe in simple, reusable tools and in the power of piping, redirection, and batch files, but I could not come up with a scheme that would generate output appropriate for the task. The MS-DOS DIR command does not even scan subdirectories. CHKDSK/V scans subdirectories, but it provides no file information beyond the pathname. Third-party and shareware directory utilities are usually a great improvement over DIR, but they all produce output that requires extensive processing. Clearly, I needed a custom program.

Design

The general flow of the program is simple:

  • read all files on the disk, recursing for subdirectories
  • find all duplicate names
  • print the duplicates, sorted by date/time and directory name.

MS-DOS supports a hierarchical directory structure, in which every directory except the root has exactly one parent and any directory can have zero or more children. Both the files and the subdirectories of a directory are stored as entries in that directory. (Advanced MS-DOS Programming by Ray Duncan describes DOS internals in detail.)

DOS provides interrupt 21H, functions 4EH and 4FH, for locating directory entries. Function 4EH finds the first file, and 4FH finds subsequent files matching a wildcard pattern. Your program must communicate with these functions through a Data Transfer Area (DTA), which can be set by interrupt 21H, function 1AH.

Microsoft C provides the library functions _dos_findfirst and _dos_findnext to shelter the programmer from making these interrupt calls directly. (Turbo C calls these functions findfirst and findnext.) An additional advantage of calling these functions is that _dos_findfirst implicitly sets up the DTA; subsequent calls to _dos_findnext search for files according to what is stored in the DTA. The role of the DTA will become clearer after I discuss the algorithms for scanning the disk.
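
A minimal fragment, written with the Turbo C names (under Microsoft C the same loop would use _dos_findfirst and _dos_findnext with the arguments reordered), shows the basic search loop:

/* A minimal sketch using the Turbo C names: list every entry in the
   current directory matching *.*.  findfirst() sets up the DTA and
   fills in the ffblk structure; each findnext() call continues the
   search from where the previous call left off. */
#include <stdio.h>
#include <dir.h>     /* findfirst, findnext, struct ffblk */
#include <dos.h>     /* FA_DIREC attribute constant       */

int main(void)
{
    struct ffblk fb;
    int status;

    status = findfirst("*.*", &fb, FA_DIREC);   /* include subdirectories */
    while (status == 0) {
        printf("%-12s %8ld %s\n", fb.ff_name, fb.ff_fsize,
               (fb.ff_attrib & FA_DIREC) ? "<DIR>" : "");
        status = findnext(&fb);
    }
    return 0;
}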

There are two ways to process subdirectories while scanning the disk. When reading directory entries you may find either files or subdirectories, and when you find a subdirectory you have two options. The first is to descend into that subdirectory immediately and process it, returning to the parent upon completion; the program must handle this recursively. The alternative is to save the name of each new subdirectory and never interrupt processing of the current directory. The directories are saved in a structure that contains a flag indicating whether each has been processed. Whenever the program finishes a directory, it marks that directory as processed and searches the directory list for another unprocessed one.

The first method requires that you save the current DTA, call _dos_findfirst with the new path, process the new directory, and restore the DTA when you are done. After you restore the DTA, instead of calling _dos_findfirst you continue calling _dos_findnext; essentially you pick up where you left off. The second method relieves you from saving and restoring the DTA. (Saving and restoring the DTA is not difficult with the Microsoft or Turbo C libraries, but I find it too close to the operating system.) The second method processes directories in order of nesting depth, starting with the root: it handles all first-level subdirectories, then all second-level subdirectories, and so on until it reaches the deepest level.
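
The sketch below shows the shape of the second method, using hypothetical names and a fixed-size directory table in place of the dynamically allocated records the actual program uses; files are simply printed where the real program would record them:

/* A sketch of the second method (hypothetical names; the actual
   program allocates its records with malloc rather than using a
   fixed table).  Newly found subdirectories are appended to a list
   instead of being descended into immediately, so the DTA never has
   to be saved and restored. */
#include <stdio.h>
#include <string.h>
#include <dir.h>
#include <dos.h>

#define MAXDIRS 500
#define PATHLEN 80

static struct { char path[PATHLEN]; int done; } dirs[MAXDIRS];
static int ndirs = 0;

static void add_dir(const char *p)
{
    if (ndirs < MAXDIRS) {
        strcpy(dirs[ndirs].path, p);
        dirs[ndirs].done = 0;              /* not yet processed */
        ndirs++;
    }
}

void scan_disk(const char *root)
{
    struct ffblk fb;
    char pattern[PATHLEN], sub[PATHLEN];
    int i, status;

    add_dir(root);
    for (i = 0; i < ndirs; i++) {          /* new entries extend the loop */
        sprintf(pattern, "%s\\*.*", dirs[i].path);
        status = findfirst(pattern, &fb, FA_DIREC);
        while (status == 0) {
            if (fb.ff_attrib & FA_DIREC) {
                if (fb.ff_name[0] != '.') {            /* skip . and .. */
                    sprintf(sub, "%s\\%s", dirs[i].path, fb.ff_name);
                    add_dir(sub);                      /* process it later */
                }
            } else {
                /* the real program records the file instead of printing it */
                printf("%s\\%s\n", dirs[i].path, fb.ff_name);
            }
            status = findnext(&fb);
        }
        dirs[i].done = 1;                  /* mark this directory processed */
    }
}

int main(void)
{
    scan_disk("C:");                       /* start at the root of drive C */
    return 0;
}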

If there is any performance difference between the two methods, it is minimal. On my 386/25 machine with an 80Mb drive, the program scans the entire disk faster than the Norton Utilities FileFind.

After choosing the directory-processing algorithm, I had to design the data structures, not a trivial task under the Intel 80x86 segmented architecture. Storing a full pathname with each file (in the DirEntry structure) is out of the question due to size limitations, and it is also unnecessary. Each directory entry can instead hold a pointer to a record in a directory list (the DirList structure), so that only a single copy of each pathname is stored. (Note: You could further reduce the required storage by keeping only the directory name and a pointer to the parent, instead of the entire path. With deeply nested subdirectories and long pathnames, the gain may be significant.)
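
The two record types might look roughly like this (an illustrative sketch; the exact declarations in Listing 2 may differ):

/* Illustrative sketch only -- field names and sizes are assumptions,
   not the exact declarations from Listing 2. */
struct DirList {                  /* one record per directory        */
    char            path[66];     /* full pathname, stored only once */
    struct DirList *next;
};

struct DirEntry {                 /* one record per file             */
    char            name[13];     /* 8.3 filename plus terminator    */
    long            size;
    unsigned        date, time;   /* packed DOS date and time words  */
    struct DirList *dir;          /* owning directory's DirList record */
};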

Even without the pathnames, the DirEntry structure occupies 28 bytes under the compact memory model. My 80Mb disk drive currently holds about 5,000 files, which translates to over 140Kb of RAM, since malloc allocates 28 bytes for each structure plus a block header for heap management.

Notice that the DirEntry records will not be stored as one contiguous block in memory, since DirList records are interspersed throughout the same memory area. malloc allocates space for a DirList or a DirEntry as needed, and you have no control over the allocation order; whatever is found on the disk next gets the next chunk of memory.

Although there are many ways to find duplicate records, I chose to sort all the records and look for identical adjacent ones. The library version of qsort requires a contiguous array as input. Since the DirEntry records are not contiguous, you would have to add pointers to the previous and next record to each DirEntry and then write a custom sort routine.

Instead of using a doubly linked list for DirEntry and writing my own sort routine, I opted for an array of pointers (PtrList) to the DirEntry records.

In relational terminology, you have two tables, DirList and DirEntry, with a "one-to-many" relationship. (For each record in DirList there are zero or more records in DirEntry, although the program actually uses the relation in the opposite direction, i.e., from DirEntry to DirList.) The PtrList is an index into the DirEntry table, and it can be sorted with the library qsort function; all you need is a comparison function.

The PtrList is an array limited to 15,000 files, although you will run out of DOS base memory before reaching that limit.
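
A minimal sketch of that comparison function and the sorting pass, using hypothetical names and the DirEntry layout sketched earlier:

/* A minimal sketch (hypothetical names) of sorting the pointer index
   with the library qsort.  Only the pointers are reordered; the
   DirEntry records themselves never move.  After the sort, duplicate
   names sit next to each other and a single pass finds them. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct DirList;                               /* as sketched above */
struct DirEntry {
    char name[13];
    long size;
    unsigned date, time;
    struct DirList *dir;
};

/* qsort hands the comparison function pointers to two PtrList
   elements, i.e. pointers to pointers to DirEntry. */
static int by_name(const void *a, const void *b)
{
    struct DirEntry * const *ea = (struct DirEntry * const *) a;
    struct DirEntry * const *eb = (struct DirEntry * const *) b;
    return strcmp((*ea)->name, (*eb)->name);
}

void report_duplicates(struct DirEntry *PtrList[], int nfiles)
{
    int i;

    qsort(PtrList, nfiles, sizeof(PtrList[0]), by_name);
    for (i = 1; i < nfiles; i++)
        if (strcmp(PtrList[i - 1]->name, PtrList[i]->name) == 0)
            printf("%-12s appears more than once\n", PtrList[i]->name);
}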

Language

I currently use Turbo C at work and Microsoft C at home. Since both are very good compilers, I don't have a strong preference. Unfortunately, while both are very close to ANSI C, they differ significantly in the MS-DOS-specific areas. The functionality is similar, but function names, argument order, structure definitions, and compiler defines are different, so porting a program from one to the other could prove time-consuming. I attempted to use the C preprocessor to hide the differences.

This exercise proved to be worthwhile. I managed to hide all the differences in a header file (Listing 1); the executable code (Listing 2) contains no compiler-specific references. Several solutions I've found are worth passing on, since they may save you a lot of time.

I have seen many listings that laboriously define their own macros with names like MICROSOFT_C or TURBO_C. Both compilers provide predefined macros for exactly this purpose: Microsoft C defines _MSC_VER and Turbo C defines __TURBOC__. By using these macros you don't have to modify the code or pass a compile-time define. This can be extremely valuable if you plan to distribute source code supporting more than one compiler.

Another useful trick is using macros to substitute not only function names but also argument order, as done for findfirst and _dos_findfirst. I chose to use Turbo nomenclature in the code, but the choice is arbitrary. You may define your own names and provide defines for all the compilers you want to support.
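
For example, a header along these lines (an illustrative sketch, not the actual Listing 1) lets code written with the Turbo C names compile unchanged under Microsoft C:

/* Illustrative sketch only -- the details of the real Listing 1 may
   differ.  Under Microsoft C, map the Turbo C names used in the code
   onto their Microsoft equivalents, fixing up the argument order of
   findfirst along the way. */
#if defined( _MSC_VER )

#include <dos.h>

#define ffblk              find_t      /* structure tag      */
#define ff_name            name        /* ...and its members */
#define ff_fsize           size
#define ff_fdate           wr_date
#define ff_ftime           wr_time
#define ff_attrib          attrib

/* Turbo C:     findfirst(path, block, attrib)
   Microsoft C: _dos_findfirst(path, attrib, block) */
#define findfirst(p, b, a) _dos_findfirst((p), (a), (b))
#define findnext(b)        _dos_findnext((b))

#elif defined( __TURBOC__ )

#include <dir.h>
#include <dos.h>

#endif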

Notice also how the structures ffblk and find_t are translated along with their members. These defines are a little more dangerous, since they may start colliding with your variable names (in this case size and name are likely candidates). If you use this technique in larger programs, I recommend a systematic approach to name substitution that guarantees the names you translate are unique.

To make your code even more maintainable, you could put compiler-specific defines in their own headers, and include them in your program headers as:

#if defined( _MSC_VER )
#include <local\msc.h>
#elif defined( __TURBOC__ )
#include <local\turboc.h>
#endif

Conclusion

The first programming language I learned was FORTRAN, followed by BASIC. When I started learning C, I had difficulty with pointers. My trouble was not in learning how to use them, but in seeing what they were actually good for. Most of the pointer examples in popular C books show string functions, and BASIC has great string handling without any need for pointers, so I admit I had my doubts. Try to imagine implementing this program without pointers and structures. I'm sure it can be done, but readability and performance would suffer.

This program can serve as a skeleton for other custom programs, since it is essentially a bare-bones directory-scanning program. You could enhance it by processing wildcards other than *.*, supporting multiple disk drives, and/or spawning a user program (e.g., for file comparison). As a more advanced exercise, you could add virtual memory support.

