All Sorts of Sorts
P.J. Plauger might best be known as the coauthor (with Brian W. Kernighan) of The Elements of Programming Style, and as an active figure in the development of both the C and C++ Standards. But over the years, he has also consistently written for numerous programming magazines and journals, contributing hundreds of articles to the likes of Embedded Systems Programming and C/C++ Users Journal, among others. Compiled into a three-volume Programming on Purpose series of books on software design, personalities, and technologies, his columns for Computer Language magazine are classics. This excerpt on sorts is taken from the January 1992 issue.
ALL SORTS OF SORTS
by P.J. Plauger
I am amazed at how seldom these properties are spelled out for comparison functions. I suppose they are sufficiently obvious to most programmers that they are hard to see as important semantic constraints, It's only when you violate one or more and spend three days debugging a program that you are reminded of their importance.
A classic failure is to change ordering functions in the middle of the stream. For example, the Standard C library has a function called bsearch. You specify an array in sort, an item to look up, and a comparison function. bsearch returns a pointer to an array element that compares equal or a null pointer if it can find none. As the b in the name implies, bsearch speeds its search by performing a binary chop. Thus, it can search an array with 1,000 elements with at most 10 comparisons.
The binary chop algorithm assumes the array is sorted by the same criterion as is enforced by the comparison function you specify on the call to bsearch. You can use another Standard C library function called qsort to ensure this result. One argument to qsort is a comparison function that behaves much like the one for bsearch. With a bit of care, you can ensure that both functions use exactly the same comparison function.
What many people do, however, is initialize the search table statically, right in the C source code. They sort the initializers for each element by hand. Or they use a sort utility that comes with their development system of choice. Usually, that tactic works fine. Probably, the sort utility and your comparison function specify the same ordering. The key is often just a text string that sorts in one obvious way. Right?
Move the code to a machine with a different character set, and you can be surprised. Character data in sort on the original host no longer sorts the same on the new target. The comparison function yields answers inconsistent with the ordering of the table. Disaster ensues.
I have learned to be wary of arrays whose order depends on character codes. I still initialize them statically, but I no longer make them constant. Instead, I sort such [arrays using the] same comparison function I use for later searches. (I also comment each such table clearly so future maintainers retain this discipline.) My method saves all sorts of nasty surprises.
You find similar problems among the UNIX utilities. The sort command lets you specify all kinds of special options for the comparison function. You can specify multiple keys or ordering criteria. (Later keys are tested only when earlier ones compare equal.) Each key can specify a different subfield of the text lines to be sorted. Each subfield can be interpreted in different ways. The main thing wrong with all this flexibility is its complexity. It usually takes me between 10 minutes and an hour of experimentation to get all the keys right for a specialized sort.
Another thing is wrong, however. Several other UNIX utilities also require a comparison function. For example, you drop duplicate adjacent lines from a file by piping the sorted file through uniq. And you identify lines common to two files (or unique to just one file) by sorting both files and passing them through comm. uniq and comm have some notion of before and equal when comparing two text lines. If their findings differ from the notion imposed by the earlier sort, the utilities misbehave.
My knowledge of UNIX is admittedly out of date. But the last time I looked, the utilities uniq and comm lacked all the ordering options of sort. I have slammed into that incompatibility with annoying regularity over the years. I was inspired many years ago to write library functions for parsing sort keys and use them in comparisons. I still have versions of sort, uniq, and comm that can agree on fairly ornate their worth many times over.
The comparison functions you use with sorts and searches are critical pieces of code. They tailor standard algorithms for particular applications. Coded properly, comparison functions help you order data sets efficiently and robustly. Done wrong, they cause all sorts of pernicious errors.
The set of all useful comparison functions is very large. I have used dozens of distinct ordering criteria with various sort utilities over the years. I would gladly have used more had the sort options been more flexible. I have written more comparison functions in C than I care to count. It is clearly hard to guess the parameters you need for a flexible parametric comparison function. It is also onerous to require users to specify comparison functions by writing C code.
Both those problems came back to haunt me recently. As I mentioned [in my last column], my latest magnum opus is a book on the Standard C library. One interesting feature added to C is the concept of locale-dependent collation. In other words, you can specify how to compare two text strings using ordering rules that vary across cultures (and subcultures).
The workhorse function is called strcoll. You call it with pointers to two null-terminated strings. It returns the usual three-way value you need to order strings by pairwise comparisons. If you expect to repeat such comparisons often, another function can speed the process. You call strxfrm to translate a string into an alternate form. Byte-by-byte comparisons of two such transformed strings yield the same ordering as calling strcoll with the untransformed strings.
The C Standard says an implementation must contain the functions strcoll and strxfrm. It says changing the locale category LC_COLLATE can change the behavior of these functions. But that's all it says. It gives no hints about:
- How to specify various "collations"
- What names to give them
- How to implement the functions
- How to change their behavior when the locale changes.
Those were the issues I tackled when I chose to implement a complete version of the Standard C library. My approach to locales was to introduce a locale file, which is a text file you can prepare with your favorite text editor. You use it to describe one or more locales. Each specifies a number of culture-dependent parameters. For example, Table 1 is a sensible locale entry for Australia.
Some fields speak for themselves. Some are downright cryptic. I don't pretend that casual users can or should learn the mysteries of locale files. More likely, a system administrator or programmer would modify an existing locale file for each peculiar need. The point is that you can capture many peculiarities of a given culture with just a few hundred bytes of text. A program can read this file at program startup and adapt in several critical ways to the preferences of the locals. Or it can adapt repeatedly to various locales as it runs, if it is more ambitious. Or it can tailor a specialized locale by choosing French dates, say, and accountants' conventions for formatting monetary values. A program shared in a multiprocessing system, such as UNIX, can even adapt simultaneously in different ways to different users. This example specifies how the locals prefer to format monetary and nonmonetary amounts and how they write their dates. That information is of widest interest to programmers writing adaptable applications. The locale file can also specify:
- Additions to various character classes, such as letters with accent marks or additional punctuation characters
- The encoding of multibyte character strings and their corresponding wide-character codes for large character sets
- The ordering rules for strcoll and strxfrm. I want to focus on this last item.
Despite what I said previously, strcoll is not the fundamental function. You can write strcoll in terms or strxfrm, but not the other way around. Each ordering rule is thus defined in terms of a string transformation. How the transformed strings sort determines the behavior of the ordering rule. So you want some flexible way to specify string transformations.
Perhaps you can see the quandary I faced. I want people to be able to specify a large assortment of ordering rules by adding text to a local file. That rules out writing a predetermined set of functions. Each may be very fast at enforcing a given ordering rule, but it is also limited. Choosing among a finite set of ordering rules is not a happy solution.
I have already described my frustration at using parametric sort packages. And I can't say I've done any better than others with my own designs. The sort-key parser I wrote was nice, but it was hardly more flexible than the UNIX sort utility. Unless the locale file reader includes a C compiler, it is hard to provide adequate flexibility. And having written the odd C compiler over the years, I knew better than to try anything that ambitious.
Nevertheless, I did design and write several parametric versions of strcoll and strxfrm. Each worked for an interesting subset of cases. Each foundered on the shoals of cultural diversity. The problem was that I knew too much about what people wanted, but I didn't know enough about how to deliver.
What I knew was courtesy of IBM and the POSIX subcommittee on internationalization. Both organizations have studied collation rules used around the world. Both have a stake in helping computer users implement those rules. Let me give you a few examples.
The easiest collation rule to implement is one where strxfrm transforms each string unchanged. That makes strcoll equivalent to the old warhorse C function strcmp. The ordering is determined by the codes assigned to the execution character set. I chose that behavior for the C locale you get a program startup. I figured it was most culturally neutral.
A rule almost as simple is to make letter comparisons case insensitive. You get this behavior by translating, say, all upper-case letters to lower-case. The mapping still produces one character for each character in the input string. That leaves you one step away from a dictionary sort. Such an ordering rule is case insensitive, but it also ignores any punctuation. In the extreme version of a dictionary sort, you discard all characters other than letters. Thus, "can't" and "cant" sort equal, and "cane" sorts before "can't" regardless of the values assigned punctuation codes. (I assume letters sort in alphabetical order.)
[In my last column], I mentioned a peculiarity I discovered with Swedish telephone directories. Seems they treat "I" and "J" as equivalent in determining sort order. You can handle such rules by an obvious variation on the dictionary sort. Simply map equivalent characters to the same character in the transformed string.
Most languages other than English have characters with accent marks. Occasionally, these are encoded as a sequence of two characters--an accent mark followed by the letter. That derives from the days of mechanical typewriters with "dead" keys for the accent marks. The carriage spaced only after you struck the letter following. The modern trend, however, is to define separate character codes for each combination of letter and accent. Display a binary file on your terminal screen, and you will probably see some of them.
The ordering rules for accented characters (and other funny characters) strain the bounds of creativity. Some simply sort the same as their unaccented cousins. That's easy to deal with. Some sort as if they were some other combination of letters. (The German S-tset, which looks roughly like a Greek beta, often sorts as if it were "ss"). That's still not bad.
Where life gets exciting is when you get to the disambiguating rules. Seems some cultures are happy to order accented characters the same as unaccented, provided the rest of the two strings differ. If they prove to be equal, one form of the letter precedes the other. That's somewhat messier to capture in a transformed string.
What you have to do is transform the string more than once. Take one pass over the string replacing each character with its sorting equivalent. Then tack a distinctive marker on the end. Then take a second pass over the string adding characters for each one that may have to be disambiguated.
Here is an example you can read. Let's say you want to order strings by dictionary order. That means ignoring case distinctions among letters and effectively discarding nonletters. But you also want distinct strings to compare unequal. Effectively, you want to take a second look at strings that sort equal by the dictionary rule. On the second pass, you want to perform a comparison of the original text.
One way to do this is to:
- Change all upper-case letters to lower-case and discard all the nonletters
- Append a period to this string
- Append the original string following
- the period. Thus, the string "Can't Happen" becomes "canthappen. Can't Happen." It will sort very close to "can't happen," but will not compare equal to it.
The reports I read from IBM and POSIX go on for pages. Every time you think you know all possible sorting rules, some central European country throws you another breaking fast ball. It's fun reading if you're perverse.
My final solution was almost as extreme as allowing C code in locale files. I went back to the simplest programmable system that is also powerful, fast, and flexible. I coded strcoll and strxfrm in terms of a table-driven, finite-state machine. What you put in the locale file is the specifications for the code tables. Table 2, for example, is how you specify the modified dictionary sort.
Again, I don't pretend this is readable. I won't even try to interpret it for you at present. That's my topic [for my next column].
All I will say for now is that this approach has proved fruitful. I have been able to encode a wide variety of comparison rules with this (albeit cryptic) scheme. It is satisfying terse, so as not to built up a locale file. And I think it is even reasonably efficient.