If Order Relations are Such a Pain, Why Bother?
My last several posts have talked about how hard it is to get order relations right. Whenever a problem turns out to be harder to solve than we expected, it makes sense to ask whether perhaps we were solving the wrong problem in the first place. What is it about order relations that makes it so important to bother with them?
To understand one answer to that question, let's take a look at a paper that I published nearly 25 years ago. It describes a small class library that, to my knowledge, was the first implementation of associative arrays in C++. This library appeared several years before C++ templates, so I had to abuse the C preprocessor to make it work; nevertheless, it is similar in several ways to the current standard C++ map facility. One important similarity is the requirement that the index type for an associative array have an order relation.
Many other programming languages that implement associative arrays do so via hash tables. The idea behind a hash table is to have a way of computing a pseudo-random integer from each value that is used as an index. The main requirement is that a given value must always yield the same integer as a hash; there is no requirement that distinct values must always yield distinct hashes.
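To make that requirement concrete, here is a minimal sketch of a deliberately crude hash function. The name toy_hash is hypothetical and chosen only for illustration; no real library hashes strings this way. The same string always yields the same integer, yet distinct strings with the same letters collide, which the requirement permits.

```cpp
#include <cstddef>
#include <string>

// Toy hash for illustration only: adds up the character codes.
// Equal strings always produce equal results, but so do anagrams,
// so distinct values can share a hash.
std::size_t toy_hash(const std::string& s) {
    std::size_t h = 0;
    for (unsigned char c : s) {
        h += c;
    }
    return h;
}
```

With this function, "stop" and "pots" receive the same hash even though they are different strings; a hash table has to cope with such collisions by some other means, typically an equality test.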
Hash tables usually perform well, provided that the programmer can give even a rough estimate of how large the table is apt to grow. Until a hash table is nearly full, the time to access an element is usually O(1). The order-based associative arrays that I implemented had an access time of O(log n), where n is the number of elements in the table. In general, then, a hash table will be faster than an order-based data structure.
However, order-based structures have three important advantages over hash tables, and these advantages were enough to convince me to choose order-based structures.
First, when you use the elements of an ordered container as a sequence, rather than associatively, the elements are already in order. In the paper, I gave an example of a program that reads a file and counts how many times each distinct word appears. When the program prints the list of words and corresponding counts, those words already appear in alphabetical order; there is no need to look at the output and think "$#!+ I forgot to sort the words!" before rewriting the code.
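The word-counting example might look roughly like this in today's C++. This is a sketch using std::map rather than the code from the paper, and the function names count_words and format_counts are my own inventions for illustration. Note that std::string compares by character code, so "alphabetical" here means that uppercase letters sort before lowercase ones.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Count how many times each whitespace-separated word appears in `in`.
// std::map keeps its keys ordered by operator< on std::string.
std::map<std::string, int> count_words(std::istream& in) {
    std::map<std::string, int> counts;
    std::string word;
    while (in >> word) {
        ++counts[word];  // operator[] creates a zero count on first use
    }
    return counts;
}

// Render one "word count" pair per line. Iterating over the map
// visits the keys in sorted order, so no separate sort is needed.
std::string format_counts(const std::map<std::string, int>& counts) {
    std::ostringstream out;
    for (const auto& entry : counts) {
        out << entry.first << ' ' << entry.second << '\n';
    }
    return out.str();
}
```

Calling count_words on a stream such as std::cin and printing the result of format_counts gives the sorted list directly; at no point does the program ask for a sort.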
Second, it is easy to treat values that are conceptually equal as if they are "unordered," thus causing them to be treated interchangeably. This property is useful, for example, for storing sets of file names on a case-insensitive filesystem. Using an order-based container makes it easy to arrange that two spellings that differ only in case, such as file and File, will be considered the same name, even though they appear different. Moreover, whichever name is actually used will appear in the program's output. It is more difficult to make hash-based data structures behave this way, because they often use an equality test to distinguish between two different values that happen to have the same hash code. Such a container will treat file and File as different from each other unless they really compare equal.
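Here is one way the case-insensitive file-name idea might be sketched with std::set. The comparator name CaseInsensitiveLess and the alias FileNameSet are hypothetical, and the sketch assumes plain ASCII names; real filesystems have more complicated case-folding rules.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Hypothetical comparator: orders strings while ignoring ASCII case.
// Two names that differ only in case are "unordered" with respect to
// each other, so the container treats them as the same key.
struct CaseInsensitiveLess {
    bool operator()(const std::string& a, const std::string& b) const {
        return std::lexicographical_compare(
            a.begin(), a.end(), b.begin(), b.end(),
            [](unsigned char x, unsigned char y) {
                return std::tolower(x) < std::tolower(y);
            });
    }
};

// A set of file names in which case differences do not matter.
using FileNameSet = std::set<std::string, CaseInsensitiveLess>;
```

Inserting "File" and then "FILE" into a FileNameSet leaves a single element, and that element keeps the spelling that was inserted first. The only thing I had to supply was the order relation; no separate equality test was involved.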
However, the real reason that I preferred order-based containers was pragmatic. Using either an order-based or a hash-based container requires the user to supply an appropriate order or, respectively, hash function on the values in the container. We have seen that order relations are hard to get right. The good news is that the cost of getting an order relation wrong is usually that the program gives nonsensical results or fails in another obvious way. In contrast, a typical cost of getting a hash function wrong is that the program works, but runs very slowly.
There was one easy case that I was particularly eager to avoid. Suppose you want to use an associative container, and in order to do so, you must supply a hash function. Won't you be tempted to write a hash function that ignores its argument completely and returns a constant value? Such a function will work; it's just that it will cause every value to have the same hash code. The effect of such a function will be to change the container's average access time from O(1) to O(n), or the time to traverse the entire container from O(n) to O(n^2). Such changes are particularly insidious because they're often not enough to notice when the program is run on small test cases, but kill performance when the program is used in production. Users who find out about such performance-killing properties tend to blame the library that they're using, rather than their own laziness.
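The lazy hash function is easy to reproduce with std::unordered_map. In this sketch the names ConstantHash, LazyTable, and largest_bucket are hypothetical helpers of my own. The standard requires keys with the same hash code to land in the same bucket, so every element ends up in a single bucket that must be searched linearly.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>

// A deliberately lazy hash: it ignores its argument entirely.
// It satisfies the rule that equal keys hash equally, so the
// container still gives correct answers.
struct ConstantHash {
    std::size_t operator()(const std::string&) const { return 0; }
};

using LazyTable = std::unordered_map<std::string, int, ConstantHash>;

// Report the size of the fullest bucket. With a good hash this stays
// small; with ConstantHash it equals the number of elements.
std::size_t largest_bucket(const LazyTable& table) {
    std::size_t largest = 0;
    for (std::size_t i = 0; i < table.bucket_count(); ++i) {
        largest = std::max(largest, table.bucket_size(i));
    }
    return largest;
}
```

A LazyTable filled with a thousand keys still answers every lookup correctly, which is exactly the problem: nothing fails, and a handful of test keys will never reveal that each lookup now walks one long chain.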
In other words, order relations as the basis for containers are useful not only because they make the containers more pleasant to use, but because a program that uses order-based data structures either works or doesn't; whereas a program that uses hash-based data structures instead may well work, but work horrendously slowly.
One can readily argue that there is nothing wrong with a library offering its users the opportunity for greater performance, while warning them that carelessness on their part is apt to compromise that performance. These days, with programmers generally comfortable with using associative containers with user-defined types, it might even be a plausible decision.
But this was 1988. Most of the programmers who would first use this library had never seen an associative array before. I believed — and still believe — that it was better to design the library in a way that would cause users' programs to fail outright unless the users got the details right. Moreover, I felt that it was better to have a library with reasonable performance in all cases than to have one with excellent performance most of the time but horrendous performance at the extremes.
This last principle — that reasonable performance all the time is better than excellent performance that deteriorates horrendously at the margins — deserves a more detailed discussion; I'll return to it in a future article.