Optimization Versus Flexibility — An Example
Last week, I discussed how optimization can sometimes mislead programmers into writing code that works better in the lab than in the field, perhaps because the problems that programs encounter in the field are larger than those that are used during testing. As luck would have it, since I wrote that article, I encountered a real-world example of that phenomenon.
- Wrangling Actionable Insights from Organizational Data
- Top 5 Ways Process and Content Management Helps Insurers Increase Profitability
- Intrusion Prevention Systems: What to Look for in a Solution
- 5 Reasons to Choose an Open Platform for Cloud
I put a fair amount of energy into electronic music, which, these days, often means dealing with sample libraries. A sample library is a collection of files, each of which represents a sound. There are many programs on the market that take such sounds and combine them in musically interesting ways.
It is not unusual for each sample in a library to be in a separate file. As a result, the samples on my computer are scattered among literally hundreds of thousands of files. Most of these files are tiny, although a few are quite large. This situation naturally raises a question: How does one back up such data? More generally, how does one go about backing up a few hundred thousand files?
The author of every backup system must decide how to store the data that it backs up. For example, does each original file have a corresponding file in the backup archive, or are multiple files combined into a smaller (or larger!) number of backup files? Does the folder structure in the original have a counterpart in the backup? Different authors answer such questions in different ways.
One factor that affects the answer is how the operating system behaves when you put a lot of files in a single folder. For example, if a backup program has to store 1,000,000 files, it might put them in 100 folders, each of which contains 100 folders, each of which contains 100 files. Alternatively, it might store chunks of 1,000 files each in archives (such as Zip files), and then put those archives in a single folder. At an extreme, it might even try to create a single folder with 1,000,000 files in it.
The choice of strategy affects the backup program's performance in a way that might be dramatic if the operating system doesn't perform well with very large folders. A particularly important factor is what happens when a program tries to create a new file in a folder that has a lot of files in it already. The system has to search the folder to see whether there is already a file with that name, and that search time depends on the number of files already in the folder.
To see why this dependency is crucial, think about what happens if the system looks at every file in a folder to determine whether the newcomer is a duplicate. The first file in a folder takes one comparison to check; the second takes two comparisons, and so on. So the number of comparisons needed to put n files in a folder is 1+2+…+n, which is O(n2). If, on the other hand, the system can check for duplicates in constant time, then putting n files in a folder takes only O(n) time.
Now let's return to the concrete problem of backing up several hundred thousand files. If such a task is to complete in a reasonable amount of time, it is essential for the time that it takes the operating system to add a file to a folder to grow much more slowly than the number of files already in the folder.
One backup program I tried simply copies all of the files to be backed up into one of 16 folders, apparently at random. More precisely, it somehow (perhaps by hashing the file's contents) creates a name for the copy of the file. This name includes a long string of hex digits. The first of these digits is used to select a folder with a single-hex-character name; each of these folders contains a bunch of files.
One of my backups using this program had about 30,000 files in each folder, which covered my nearly half a million files overall. Watching this backup program at work convinced me that the file-system designers must have done quite a bit toward allowing the file system to handle a large number of files in a single folder.
However, when I switched from using a directly connected external disk to using a network storage unit, the roof caved in. I'd start running my backup, and 18 hours later, I'd see that it was still running. Moreover the data-transfer rate had slowed from tens of megabytes per second to a few thousand bytes per second.
The source of the problem became clear when I looked at one of the folders on my network storage device: It had the expected 30,000 files or so in each of its 16 folders, and when I tried to create a new file in one of these folders, it took nearly five seconds. In other words, adding another 1,000 files to these 30,000 would take well over an hour just to create the files, not counting transferring any data into those files.
In short, some aspect of the network storage device is killing performance for folders with a large number of files. At the moment, I don't know which aspect it is: It might be the device itself, or the network protocol, or something else I haven't anticipated. What is important for the purpose of this article is:
- The operating-system developers decided that creating lots of files in a single folder should run quickly.
- The backup-program developers observed that the operating system handled this operation efficiently, and took advantage of that efficiency.
- The efficiency does not carry over to the network-storage device.
- As a result, it is unacceptably slow to use this particular program to back up a very large number of files on this particular network-storage device.
Two interesting questions come out of this state of affairs, one general and one specific.
The general question is what assumptions software developers should make about how facilities such as file systems perform. Wouldn't it be nice as a user to be able to assume that you can put as many files as you like in a folder without worrying about performance degradation? If it's not fair to make that assumption, wouldn't it be nice for file-system developers to be able to assume that folders won't grow to absurd sizes?
Unfortunately, like most general questions, these have no easy answers. No matter what you do, a network file-storage device is not going to have exactly the same performance characteristics as a disk directly attached to the computer. So the question is not whether there will be performance tradeoffs, but rather which performance tradeoffs are acceptable. In this case, the backup software designers and the network storage designers made decisions about these tradeoffs, each of which was reasonable by itself, but which interacted with each other in unfortunate ways. All I can do about it is to report this problem to the authors and hope for the best.
Here is the specific question: What options do I have if I want my backups to run more quickly? After thinking about for a bit, I realized that I could subdivide my collection of files into eight smaller collections, each roughly the same size. As a result, instead of having 30,000 files in each backup folder, I now have about 4,000 files in each. I haven't measured precisely yet, but if adding each file takes an amount of time proportional to the number of files already in the folder, I would expect this subdivision to reduce the total overhead by a factor of eight or so. That pragmatic strategy will at least let me back up my files while I figure out whether to file a bug report with the manufacturer of the backup software or the network storage device.