Is Optimization Immoral?
Dynamic memory allocation in C++ (and C) can be surprisingly expensive. A typical memory allocator keeps a list of available free memory blocks. Each time you ask to allocate memory, it searches the list for an appropriate block, subdivides it if necessary, and returns its address. When you free the memory, that memory must go back into the data structure for future reuse. Moreover, it often makes sense to check whether a block being freed is adjacent to one or two blocks that are already free, so that the memory can be coalesced into a single larger block. As a result, allocating and freeing even a small block of memory can cost dozens, or even hundreds, of instructions, and can easily dominate the execution time of applications that use a lot of dynamic memory.
White PapersMore >>
- Real results: Speeding quality application delivery with DevOps [in financial services]
- Defense Against the Dark Arts
Once upon a time, I wrote a simple addition to the memory allocator I was using at the time. This addition took advantage of the fact that unlike C, C++ generally supplies the size of the requested memory both when that memory is allocated and when it is freed. Here's how the code worked:
- I assumed that all memory allocation would be in units of words rather than bytes, so I started by rounding the requested size up to the next multiple of a word.
- Every distinct block size had its own list of available blocks. In other words, there was a list of blocks one word long, another list of blocks two words long, and so on until the size reached an arbitrary upper limit.
- Whenever the user asked to allocate memory, my code would round the size up to a word multiple; determine whether the size exceeded the limit; if the size was greater than the limit, the program would use the system's memory allocator; otherwise, it would find the free list associated with blocks that size. If there was already a block available on the free list, it would return the first such block. Otherwise, it would allocate a single large block, break it into smaller blocks, put all but one of those blocks on the free list, and return that last block as the result.
- Whenever the user asked to free memory, my code would simply put the memory onto the appropriate free list for the block's size.
During execution, this code would maintain literally hundreds of free lists, one for each distinct block size. Many of those lists would be empty, so the only overhead for those lists would be the list headers. None of the memory in the nonempty free lists would ever be returned to the operating system, thereby saving the time that would otherwise be needed to free that memory. I felt that this code would save the time involved in allocating and freeing a large number of small blocks (as, for example, a string library might do), while still freeing large blocks on request. I thought that this was a reasonable compromise, as code that allocates a large block of memory will presumably spend a lot of time working with the contents of that memory anyway.
I was proud to see that there were some applications that became 10 times faster when I added this code to the memory allocator. Moreover, I never saw any evidence that not freeing small blocks made any applications use an unacceptable amount of additional memory.
Several months later, I described this code to a colleague, who said: "I think that what you're doing is immoral." Not surprisingly, I asked him to explain. Although he didn't completely convince me, I think that his argument is interesting enough to be worth considering.
Think about the performance of this allocator as I've described it. For blocks smaller than my arbitrary limit, the allocator runs very quickly: It finds the appropriate list, checks whether it's empty, and then — most of the time — simply takes a block off the head of the list and returns it. If it is necessary to create a new list, or add a collection of blocks to an existing list, there is the additional overhead of a single call to the system allocator; but that happens only once in a while. In general, it is reasonable to assume that it is very time-efficient to allocate or free memory.
When the block size exceeds my arbitrary threshold, the performance characteristics change abruptly: Now, the system allocator is called for every block. That's the bad news; the good news is that now, memory is returned to the system, whereas it is not returned for smaller blocks.
In effect, the optimizing allocator has added a performance breakpoint: The program's performance is sharply different for sizes larger than the threshold than it is for smaller sizes. My colleague argued that such performance breakpoints are evil, because when people who use the code learn about the breakpoints, they will change their own behavior to compensate for the breakpoints. Once they do that, the people who use their code will change their own behavior, and so on. The original performance breakpoint may well result in ripple effects that complicate entire systems. As a result, my colleague argued that it is more important to design systems so that their performance changes smoothly with input size than it is to make them run very quickly on particular input sizes. Nevertheless, my memory allocator really did make some programs run 10 times faster.
Such situations often yield difficult design decisions — especially because software is often used with much larger input than that with which it is tested. Therefore, an optimization such as my memory allocator might make an application appear to run much more quickly during testing, only to have it perform unacceptably slowly in production. On the other hand, even a program that deals with a very large input may do so by requesting a large number of small blocks of memory — a case that this particular memory allocator would handle well.
In short, rather than considering performance breakpoints as always being immoral, I am inclined to think of them as yet one more factor among the many to consider as part of designing any significant piece of software. I'd like to invite readers to mention other examples of performance breakpoints that they may have encountered.