Internationalization: From the Sublime to the Ridiculous


Internationalization is one of those difficult activities that lies in wait in the background. You can go for years without needing it; then one day, you step lightly across the line of making your apps somewhat friendlier for users who don't speak English. You can sense the beast hiding in the bushes, calling you to go just a bit farther. Just a couple of steps…and boom! You're suddenly entangled in a battle that seemed so innocent at the start, but will now consume your project for the rest of its days.

You might have started out with good intentions by carefully putting all literals in a resource bundle that could be swapped out at program start-up for one containing translations that match the user's native language. That's a good start, but insufficient. Maybe your language doesn't support locale-specific date and time routines or currency representation, and you don't have the time to read the hundreds of pages of documentation to figure out how to configure those features correctly. But if you push through these challenges, you've attained a reasonable first effort, one that is sufficient, in many cases, for users whose languages use a Latin-based alphabet.
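To make the resource-bundle idea concrete, here is a minimal sketch in Python. The bundle contents, the `translate` helper, and the choice of English as the fallback are all assumptions made for illustration; a real project would more likely load its bundles from files or use an established library.

```python
# Hypothetical resource bundles: the keys and translations are invented
# for illustration and would normally live in external files.
BUNDLES = {
    "en": {"greeting": "Hello", "farewell": "Goodbye"},
    "fr": {"greeting": "Bonjour"},  # deliberately incomplete
}
DEFAULT_LANGUAGE = "en"  # assumed fallback language


def translate(key, language):
    """Look up a key in the user's bundle, falling back to the default bundle.

    If the key is missing everywhere, the key itself is returned so that
    the gap is visible rather than fatal.
    """
    bundle = BUNDLES.get(language, {})
    if key in bundle:
        return bundle[key]
    return BUNDLES[DEFAULT_LANGUAGE].get(key, key)


print(translate("greeting", "fr"))   # found in the French bundle
print(translate("farewell", "fr"))   # missing in French, falls back to English
```

The fallback is the part that tends to matter in practice: translations are rarely complete, and showing the default language is usually better than showing nothing.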

You do support extended character sets by using UTF-8, right? No? Ah, well now you do have a serious problem and you're about to be swallowed up in the tortures of internationalization. (By the way, the term "internationalization" is most often abbreviated as i18n. The 18 is actually the number of letters between the initial i and the final n. Saying "letters" is a little risky here. Depending on how you look at them, they might be referred to as glyphs or code points — which are not the same thing as letters, although in this case, they do refer to exactly the same result.)
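Both points in that aside can be checked in a few lines of Python. The `numeronym` helper below is a hypothetical name used only for this sketch; the second half shows why "letters" is risky, since one visible character can be one code point or two depending on how it is composed.

```python
import unicodedata


def numeronym(word):
    """Abbreviate a word as first letter + count of inner letters + last letter."""
    if len(word) < 3:
        return word
    return f"{word[0]}{len(word) - 2}{word[-1]}"


print(numeronym("internationalization"))  # i18n

# The same visible character, "é", written two ways.
composed = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" plus COMBINING ACUTE ACCENT

# Python's len() counts code points, not what a reader perceives as letters.
print(len(composed), len(decomposed))

# UTF-8 byte counts differ again from code point counts.
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

# Normalization (NFC here) makes the two spellings compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Counting "characters" therefore depends on whether you mean bytes, code points, or what the user sees, and the three can all disagree for the same string.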

Code points, in particular, are important because they refer to an entry in the Unicode standard. In my estimation, Unicode was one of the great triumphs of technology as originally formulated: a complete inventory of all characters from all languages in use anywhere on Earth. It was later expanded to include some dead languages and a few much-needed symbols. In its early releases, it represented an extraordinary coordination between groups the world over. At last, there was a definitive list of characters, so that fonts could be designed that provided the necessary characters at fixed, known points in the character set (the code points).

It will come as no surprise that politics played a significant role in what to include and at which code point. The politics were much more complex and personalized than Apple squaring off with Microsoft regarding which characters to use in the upper 128 8-bit characters of TrueType fonts. (The remnants of this kerfuffle are still with us, as some documents designed on Macs show up with stray characters, especially for opening and closing quotation marks, when displayed on Windows systems.)

National preferences were another factor. As explained in the excellent book, Fonts and Encodings, symbols in previous alphabets, such as ISO-8859-1 (a forebear to Unicode) were often chosen because of the insistence of one country or the individual preferences of representatives. To wit: "¤ the 'universal currency sign.' The Italians were the ones to propose this as a replacement for the dollar sign in certain localized and 'politically correct' versions of ASCII. This author has never seen this symbol used in text and cannot imagine any use for it." Similarly, the widely used French ligature, œ, did not make it into the initial release of the alphabet because the French delegate was an engineer and convinced that the ligature was useless, although it appears throughout French and is known to all native speakers as a different character than the combined letters O and E. Well, apparently not all native speakers.

Many of these kinks were ironed out in the formalized Unicode document. For a long time, the participants in the process hoped that Unicode could fit all necessary code points into 64K characters, which would simplify the encodings. But the standard started including non-alphabetic and non-numeric symbols, at which point almost any viable symbol could be, and was, added. Every mah-jong tile now has a code point. If you were writing a book on mah-jong, you might appreciate having a font with all the symbols. But what possible value could there be for including a symbol for pig nose (1F43D), woman with rabbit ears (1F46F), or, most indecorously, a steaming pile of poo (1F4A9)? Yes, indeed, all those characters have been voted into the standard. I dare not think about the nature of the dialog on these matters. I hope instead that there was none and a happy jokester just "slipped one through."
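The cost of outgrowing 64K shows up directly in the encodings. The sketch below, with a hypothetical `encoding_sizes` helper, compares a code point that fits in 16 bits (the œ ligature, U+0153) with the three symbols just mentioned, which do not.

```python
def encoding_sizes(code_point):
    """Report how much space one code point needs in UTF-8 and UTF-16."""
    ch = chr(code_point)
    return {
        # Code points up to U+FFFF fit in a single 16-bit value.
        "fits_in_16_bits": code_point <= 0xFFFF,
        "utf8_bytes": len(ch.encode("utf-8")),
        # UTF-16 uses 2 bytes per unit; code points above U+FFFF
        # need two units (a surrogate pair).
        "utf16_units": len(ch.encode("utf-16-be")) // 2,
    }


for cp in (0x0153, 0x1F43D, 0x1F46F, 0x1F4A9):
    print(f"U+{cp:04X}", encoding_sizes(cp))
```

Any code that assumes one 16-bit unit per character, as much older software did, will miscount or split these symbols.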

The inclusion of ridiculous and unneeded glyphs undercuts Unicode's importance and belittles its otherwise significant contribution to internationalization. In addition, their presence makes a choice of full Unicode fonts improbable, given the sheer size of such a font and the effort and time wasted on producing glyphs that have no value.

If your app can find alphabets for the languages you want to support, you'll then have to take on the even more difficult challenge of bidirectional scripts, that is, if you have any intention of selling the app in the Middle East and in East Asia. Not only is the bidirectional aspect very hard, but the ligatures required in Arabic are immensely challenging. Only true experts can help you there.
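Detecting that a string needs bidirectional handling is the easy part, and it can be sketched with Python's standard `unicodedata` module. The `contains_rtl` helper is a hypothetical name for this example; it only flags right-to-left characters and does not attempt the actual layout, which is governed by the Unicode Bidirectional Algorithm and is where the real difficulty lies.

```python
import unicodedata

# Bidirectional classes for strongly right-to-left characters:
# "R" covers scripts such as Hebrew, "AL" covers Arabic letters.
RTL_CLASSES = {"R", "AL"}


def contains_rtl(text):
    """Return True if any character in text is strongly right-to-left."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


hebrew = "\u05e9\u05dc\u05d5\u05dd"          # Hebrew word "shalom"
arabic = "\u0645\u0631\u062d\u0628\u0627"    # Arabic word "marhaba"

print(contains_rtl("hello"))                 # Latin only
print(contains_rtl(hebrew))                  # Hebrew letters are class R
print(contains_rtl("hello " + arabic))       # mixed text: Arabic letters are class AL
```

The mixed-direction case in the last line is the one that causes trouble: once left-to-right and right-to-left runs share a line, ordering, cursor movement, and selection all need expert treatment.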

Internationalization is so difficult that developers are motivated to support the smallest possible subset of languages and cultures. And even then, the support most often provided is that requiring the least effort. The quagmire of robust internationalization is left to expert groups at companies that can afford them, with the net result that many useful apps don't find their way to speakers of lesser-used languages, and a persistent and important digital divide results.

— Andrew Binstock
Editor in Chief
alb@drdobbs.com
Twitter: platypusguy

