Channels ▼
RSS

Web Development

Sorting Out the Linguistic Mess


December, 2004: Sorting Out the Linguistic Mess

Simon is a freelance programmer and author, whose titles include Beginning Perl (Wrox Press, 2000) and Extending and Embedding Perl (Manning Publications, 2002). He's the creator of over 30 CPAN modules and a former Parrot pumpking. Simon can be reached at simon@ simon-cozens.org.


I have a friend who works for a web consultancy putting together all kinds of web sites and applications for clients. One of his latest projects was an internationalized dictionary. He was having some problems with the Japanese terms, so he wanted me to have a look at it.

Well, we got the terms sorted out, but sensing the opportunity of a subcontract, I said "You know, a Japanese dictionary wouldn't be in this order." And it wouldn't. He had sorted the words with a simple sort routine, which would sort them according to their Unicode codepoint. Japanese dictionaries don't work like that; Japanese people don't have a mental mapping of characters to Unicode codepoints they can consult when they want to look up a word.

But Japanese is an extreme case: Even European languages sort in ways we might not expect. The point is that, in general, sort just isn't good enough for sorting. Collation (the technical term for sorting) of nonEnglish text is tricky, but we'll look at a few ways to make it easier.

ArbBiLex—Simple Two-Level Sort

How complicated your sorting needs to be depends on the languages you're dealing with. We'll look first at a nice, simple language like Spanish. Spanish sorts just like English, except that it has "ch" as a separate letter after "c," "ll" after "l," "--" after "n," and "rr" after "r."

To be honest, this ordering is pretty purist. In 1994, the Real Academia Espantildena officially dropped the separate diglyph letters "ch," "ll," and "rr," partly because it made sorting difficult for computer programmers. We're now going to prove them wrong.

There are two approaches we could take to this collation problem. We could replace all the new characters with sentinels: We replace "ch" with "c{" or something similar, meaning "chorizo" will become "c{orizo." Now we do a standard lexicographic sort, and "c{orizo" will sort after "cz" but before "d." Then once we've finished, we can replace them all back again. This is messy, and requires us to pick our sentinels carefully so that they don't interfere with the rest of our text.

The alternate approach is to write our own sort routine. However, that's sufficiently boring so Sean Burke has written a module to write sort routines. Sort::ArbBiLex allows you to construct efficient sort functions that understand nonASCII orders and multicharacter glyphs.

The easiest way to use Sort::ArbBiLex is to specify on the import line the name of a subroutine that you'd like it to create, and then a space-separated list of characters that makes up your alphabet.

So, for our Spanish alphabet, we can say:

use Sort::ArbBiLex ( spanish_sort => 
"a A b B c C ch CH d D e E f F g G h H i I j J k 
K l L ll LL m M n N --  o O p P q Q r R rr RR s
S t T u U v V x X y Y z Z");

This will import a function called spanish_sort into our namespace, which will act just like the built-in sort, but will sort in the order described.

That was the easy bit. We call it a one-level sort, since you're only interested in looking at one character at a time. I forgot to say that Spanish also has some accent marks, which may well appear in your text. These don't actually affect the sorting at all: "á" sorts the same as "a."

What we want to be able to do is group the accented and/or capitalized letters together for the purposes of sorting. This produces a two-level sort. We could describe this with the following Perl data structure:

    [qw/ a A á /],
    [qw/ b B /],
    [qw/ c C /],
    [qw/ ch CH /],
    [qw/ d D /],
    [qw/ e E é  /],
    ...
]

Fortunately, instead of giving Sort::ArbBiLex a plain string as a description of an alphabet, we can also pass an array of arrays just like this:

use Sort::ArbBiLex ( spanish_sort => 
[
[qw/ a A á /],
    [qw/ b B /],
    [qw/ c C /],
    [qw/ ch CH /],
    [qw/ d D /],
    [qw/ e E é  /],
    ...
]
);

That's one language (and many others like it) out of the way.

Sorting Thai

Unfortunately, not all languages are this easy. Thai, for instance, is an alphabetic language—that is, each character represents a sound—but it has some odd sorting rules. Initial vowels before the first consonant are not counted when sorting Thai words, and tonal accents should be ignored.

In order to sort Thai words, then, we're going to have to use one of the best known Perl sorting techniques, the Schwartzian transform. This comes in handy every time you have to transform a value in some complex way to sort it, and then get the original value back. First we construct a list of pairs: the original value, and the transformation. Let's say we're going to sort a list of English words but discount all their vowels:

@pairs = map {
  my ($vowelless = $_) =~ tr/aeiou//d;
  [ $_, $vowelless ]
} @words;

We use map, naturally, to transform a list. Now, if we wanted to get the original words back, we could say:

@originals = map { $_->[0] } @pairs

picking up the first element of each pair. But before we do that, we could sort the list on the transformed value:

@sorted = map { $_->[0] }
          sort { $a->[1] cmp $b->[1] } @pairs

using the second element.

But @pairs is just a temporary variable used for the transformation, so we could say this:

@sorted = map { $_->[0] }
         sort { $a->[1] cmp $b->[1] } @pairs
          map { ($copy = $_) =~ tr/aeiou//d;
                [ $_, $copy ]
              } @words;

This map-sort-map idiom is the hallmark of the Schwartzian. We can generalize this for our linguistic sorting: If we have a function sort_key that transforms a word into some kind of regular, sortable string, then we can sort any language, like so:

@sorted = map { $_->[0] }
         sort { $a->[1] cmp $b->[1]  } @pairs
          map { [ $_, sort_key($_) ] } @words;

All we need to do now is find an adequate sort_key for Thai. Here's one I made earlier. It starts with some definitions gleaned from the Unicode Standard:

my $thai_tone = qr/[\x{0e48}-\x{0e4b}]/;
my $thai_vowel = qr/[\x{0x40-\x{0e45}]/;

and now:

sub sort_key {
    my $word = shift;
    $word =~ s/$thai_tone//g;
    $word =~ s/^($thai_vowel)+//;
    return $word;
}

(Another Thai sorting algorithm, more useful for languages without Perl's string handling, can be found at http://linux.thai.net/ ~thep/tsort.html.)

Sorting Han Languages

The languages we've looked at so far have been alphabetic—each character has a well-defined "reading." Unfortunately, not all languages are like that. The so-called Han languages—Chinese, Japanese, and to a much lesser extent, Korean and Vietnamese—use the Chinese character set "hanzi" ("kanji" in Japanese). There are between forty and eighty thousand kanji, and their readings are dependent on context: The characters can have multiple readings and can also be used to convey a meaning as well as a sound.

This naturally makes it rather difficult to sort words: If you want to sort them by sound, you have to work out what that sound is, and this can't necessarily be looked up in a table because multiple sounds are possible. There are two ways around this; the Chinese way and the Japanese way.

The Chinese way is very simple: Don't even bother trying to sort by sound; sort by some other property. So some indexes of Chinese books sort their terms based on the number of strokes used to write each hanzi character. There's a freely available kanji dictionary that contains stroke-count information, so if we want to do the same, we can use the Lingua::JP::KanjiDic module to interface to it:

use Lingua::JP::KanjiDic;
my $reader = Lingua::JP::KanjiDic->new;

sub sort_key {
    my $word = shift;
    my $key;
    for my $char (split //, $word) {
        my $kanji = $reader->lookup($char);
        next unless $kanji;
        $key .= chr(96 + $kanji->strokes);
    }
    return $key;
}

Here, we look up each character in the dictionary, which returns a kanji object. The object contains all sorts of information, including the stroke count. We convert it into a letter: Character 97 is "a," so a one-stroke hanzi would be "a." There aren't any characters with more than 26 strokes, so we have a nice readable key representing the number of strokes in each character of the word.

The Japanese method, however, is a bit more difficult. The Japanese use two alphabets as well as kanji, and sorts words—even kanji—based on the phonetic order used in these alphabets. So, in order to sort a list of Japanese words, you have to find out how they're pronounced, and really, there's no easy way to do this.

All is not lost. The folks at Nara University have produced a piece of free software called chasen; this is a morphological analyzer that takes apart sentences and words, puts them in their uninflected dictionary form, and also reports their pronunciation. Even better, they have produced a Perl interface to it, Text::ChaSen. We can use this to turn kanji words into hiragana, one of the phonetic alphabets:

use Encode;
my $kanji = qr/[\x{4e00}-\x{9fff}]/;

sub kanji_to_kana {
    my $string = shift;
    return $string unless /$kanji/;
    require Text::ChaSen;
    my $string_euc = encode("euc-jp", $string);
    Text::ChaSen::getopt_argv('chasen-perl', '-F', 
                              '%a0');
    $string = decode("euc-jp", 
           Text::ChaSen::sparse_tostr($string_euc))
           || return $string;
    chomp $string;
    # Turn any katakana found into hiragana.
    $string =~ 
       tr/\x{30a1}-\x{30ff}/\x{3041}-\x{309f}/;
    $string;
}

There are two things to note here. First, chasen expects its input in the Japanese EUC character set, not in Unicode, so we use Encode to convert between the two. Second, chasen returns its output as katakana strings, whereas we want to deal with a single phonetic alphabet. We settled on hiragana because that's more often used for nonkanji words as well. A simple translation maps between the two alphabets.

There are several other rules that need to be followed before this turns into a usable sort key. For instance, some characters, even in the phonetic alphabet, are phonetic mutations of other characters: "ga," for instance, is a softer form of "ka," and these are sorted as identical.

All of the rules are wrapped up in the Lingua::JA::Sort::ReadableKey module: This exports a japanese_sort_order subroutine that we can use in our Schwartzian transform.

Putting It All Together

To solve the problem of the multilingual dictionary, we used a variety of sort key routines, one for each language we were concerned with. We also defined the order in which each language would appear in a full list, so that if we were sorting lists containing words from more than one language, they would be sorted by language first. The array looked something like this:

my @languages = 
  qw(English French German Thai Chinese Japanese);
my %language_order = 
  map { $languages[$_] => $_ } 0..$#languages;

%language_order is a hash where we can look up, say, Chinese, and get the order 4. Then we had a correlation between each language and a subroutine that would generate a key for a word in that language:

my %key_generators;
@key_generators{@languages} = (
    sub { $_[0] },
    \&french_key,
    \&german_key,
    \&thai_key,
    \&chinese_key,
    \&japanese_sort_order
);

The final sort received not just a list of words, but a list of word objects containing the word and its language. The sorting process then went like this:

sub lexicographic_sort {
  map { $_->[0] }
  sort {
          $a->[2] <=> $b->[2]
                   ||
          $a->[1] cmp $b->[1]
       }
   map {
       [
          $_, 
          $key_generators{$_->language}->($_->word),
          $language_order{$_->language}
       ] 
       } @_;
}

Now, this is not necessarily obvious, so we'll look at it a step at a time. The overall idiom is our Schwartzian transform again. However, this time, instead of a pair, we've created an array of three-element arrays. This is because we're sorting on two factors: the order of the language of the word and the key generated by the key generator.

For each object, we look for the key generator for its language, and then pass the word to it; this will generate a key in the appropriate language. Then we look up the language order, so that we can sort first by language. You'll see the idiom

$a->{some_property} cmp $b->{some_property}
                    ||
$a->{another_property} cmp $b->{another_property}

quite often to do multiple-level sorting. It means that if the comparison with the first property is a tie (the comparison returns 0), then the || operator does not return a result, but moves on to the second comparison and returns its result instead. In our case, we're testing the language, then the key within that language. Finally, we transform the three-element array back into the original object, and the job is done.

More About Sorting

We've deliberately looked at some of the nastier languages to sort, and you may not be dealing with some of these. If your sorting needs are relatively simple, the standard Unicode Collation Algorithm may be good enough. This is detailed in Unicode Technical Report 10, at http://www.unicode.org/reports/tr10/.

The UCA does not deal with all possible languages; instead, it has particular translation tables that can be slotted into the algorithm to explain how a particular language sorts words. There's a Perl implementation of this in the Unicode::Collate CPAN module. It takes tables such as the Default Unicode Collation Element Table, and uses these to sort and provide comparison operators for any kind of Unicode strings.

It doesn't, however, deal with nonalphabetic scripts like Chinese or Japanese. If you're having to deal with these, you still need some of the techniques we've discussed in this article to sort them.

One thing's for certain, though—sort won't do it.

TPJ


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 
Dr. Dobb's TV