Internationalization: From the Sublime to the Ridiculous


Internationalization is one of those difficult activities that lies in wait in the background. You can go for years without needing it; then one day, you step lightly across the line of making your apps somewhat friendlier for users who don't speak English. You can sense the beast hiding in the bushes, calling you to go just a bit farther. Just a couple of steps…and boom! You're suddenly entangled in a battle that seemed so innocent at the start, but will now consume your project for the rest of its days.

You might have started out with good intentions by carefully putting all literals in a resource bundle that could be swapped out at program start-up for one containing translations that match the user's native language. That's a good start, but insufficient. Maybe your language doesn't support locale-specific date and time routines or currency representation, and you don't have the time to read the hundreds of pages of documentation to figure out how to configure those features correctly. But if you push through these challenges, you've attained a reasonable first effort — sufficient, in many cases, for readers who use a Latin-based alphabet.
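
To make that first step concrete, here is a minimal Java sketch. It assumes a hypothetical Messages_fr.properties file on the classpath containing a "greeting" key; the date and currency formatting come straight from the standard library.

import java.text.NumberFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;
import java.util.ResourceBundle;

public class LocaleDemo {
    public static void main(String[] args) {
        Locale locale = Locale.FRANCE;

        // Literals live in a properties file (Messages_fr.properties here),
        // loaded at start-up instead of being hard-coded in the source.
        ResourceBundle messages = ResourceBundle.getBundle("Messages", locale);
        System.out.println(messages.getString("greeting"));

        // Locale-specific date and currency formatting come from the platform,
        // not from hand-rolled string concatenation.
        DateTimeFormatter dateFmt =
                DateTimeFormatter.ofLocalizedDate(FormatStyle.LONG).withLocale(locale);
        System.out.println(LocalDate.now().format(dateFmt));   // e.g. "14 août 2013"

        NumberFormat money = NumberFormat.getCurrencyInstance(locale);
        System.out.println(money.format(1234.56));              // e.g. "1 234,56 €"
    }
}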

You do support extended character sets by using UTF-8, right? No? Ah, well now you do have a serious problem and you're about to be swallowed up in the tortures of internationalization. (By the way, the term "internationalization" is most often abbreviated as i18n. The 18 is actually the number of letters between the initial i and the final n. Saying "letters" is a little risky here. Depending on how you look at them, they might be referred to as glyphs or code points — which are not the same thing as letters, although in this case, they do refer to exactly the same result.)
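
In Java, for instance, the first line of defense is never relying on the platform default encoding at an I/O boundary. A minimal sketch (the file name notes.txt is purely illustrative):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Utf8Demo {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("notes.txt");

        // Always name the charset explicitly; the platform default varies
        // from machine to machine and is the classic source of garbled text.
        byte[] bytes = "naïve café œuvre".getBytes(StandardCharsets.UTF_8);
        Files.write(path, bytes);

        String roundTripped = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        System.out.println(roundTripped.equals("naïve café œuvre"));   // true
    }
}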

Code points, in particular, are important because they refer to an entry in the Unicode standard. In my estimation, Unicode was one of the great triumphs of technology as originally formulated: a complete inventory of all characters from all languages in use anywhere on Earth. It was later expanded to include some dead languages and a few much needed symbols. In its early releases, it represented an extraordinary coordination between groups the world over. At last, there was a definitive list of characters, so that fonts could be designed that provided the necessary characters at fixed, known points in the character set (the code points).

It will come as no surprise that politics played a significant role in what to include and at which code point. The politics were much more complex and personalized than Apple squaring off with Microsoft over which characters to assign to the upper 128 positions in 8-bit TrueType fonts. (The remnants of this kerfuffle are still with us, as some documents designed on Macs show up with stray characters, especially for opening and closing quotation marks, when displayed on Windows systems.)

National preferences were another factor. As explained in the excellent book Fonts and Encodings, symbols in earlier character sets, such as ISO-8859-1 (a forebear of Unicode), were often chosen because of the insistence of one country or the individual preferences of representatives. To wit: "¤ the 'universal currency sign.' The Italians were the ones to propose this as a replacement for the dollar sign in certain localized and 'politically correct' versions of ASCII. This author has never seen this symbol used in text and cannot imagine any use for it." Similarly, the widely used French ligature, œ, did not make it into the initial release of that character set because the French delegate, an engineer, was convinced that the ligature was useless, although it appears throughout French and is known to all native speakers as a different character from the combined letters O and E. Well, apparently not all native speakers.

Many of these kinks were ironed out in the formalized Unicode document. For a long time, the participants in the process hoped that Unicode could fit all necessary code points into 64K, which would simplify the encodings. But the standard started including non-alphabetic and non-numeric symbols, at which point almost any viable symbol could be, and was, added. Every mah-jong tile now has a code point. If you were writing a book on mah-jong, you might appreciate having a font with all the symbols. But what possible value could there be in including a symbol for pig nose (1F43D), woman with rabbit ears (1F46F), or — most indecorously — a steaming pile of poo (1F4A9)? Yes, indeed, all those characters have been voted into the standard. I dare not think about the nature of the dialog on these matters. I hope instead that there was none and a happy jokester just "slipped one through."
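
There is a side effect here that matters to programmers: all of these upper-plane symbols lie outside the original 64K Basic Multilingual Plane, so in Java's UTF-16 strings each one occupies two char values. A small sketch of the difference between chars and code points, using the very code point mentioned above:

public class CodePointDemo {
    public static void main(String[] args) {
        // U+1F4A9 is above U+FFFF, so it is stored as a surrogate pair:
        // two chars, one code point.
        String symbol = new String(Character.toChars(0x1F4A9));

        System.out.println(symbol.length());                           // 2 chars
        System.out.println(symbol.codePointCount(0, symbol.length())); // 1 code point

        // Iterate by code point, not by char, or symbols like this get split in half.
        symbol.codePoints().forEach(cp -> System.out.printf("U+%X%n", cp));  // U+1F4A9
    }
}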

The inclusion of ridiculous and unneeded glyphs undercuts Unicode's importance and belittles its otherwise significant contribution to internationalization. In addition, their presence makes a real choice of full Unicode fonts improbable, if only because of the sheer size such a font must reach and the effort and time wasted on producing glyphs that have no value.

If your app can find alphabets for the languages you want to support, you'll then have to take on the even more difficult challenge of bidirectional scripts — that is, if you have any intention of selling the app in the Middle East and in East Asia. Not only is the bidirectional aspect very hard, but the ligatures required in Arabic are immensely challenging. Only true experts can help you there.
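
The JDK does ship one low-level helper, java.text.Bidi, which at least reports where the left-to-right and right-to-left runs fall in a paragraph. A minimal sketch (the mixed Latin/Arabic sample string is purely illustrative); actual shaping and ligature selection remain the layout engine's problem:

import java.text.Bidi;

public class BidiDemo {
    public static void main(String[] args) {
        // An English product name followed by the Arabic word "marhaba" (hello).
        String mixed = "version 2.0 \u0645\u0631\u062D\u0628\u0627";

        Bidi bidi = new Bidi(mixed, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        System.out.println(bidi.isMixed());   // true: both LTR and RTL runs present

        // Each run must be laid out in its own direction before display.
        for (int i = 0; i < bidi.getRunCount(); i++) {
            System.out.printf("run %d: chars %d..%d at level %d%n",
                    i, bidi.getRunStart(i), bidi.getRunLimit(i), bidi.getRunLevel(i));
        }
    }
}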

Internationalization is so difficult that developers are motivated to support the smallest subset of languages and cultures possible. And even then, the support most often provided is that requiring the least effort. The quagmire of robust internationalization is left to expert groups at companies that can afford them, with the net result that many useful apps don't find their way to speakers of lesser-used languages — and a persistent and important digital divide results.

— Andrew Binstock
Editor in Chief
alb@drdobbs.com
Twitter: platypusguy



Comments:

Anonymous reader
2013-09-11T21:24:47

Localization is a great challenge for developers but also a great opportunity, and one that should be considered seriously. Translation is not the entirety of localization, but it is a good portion of it. If you're a developer, I recommend checking out Ackuna. It's the only free and accurate translation service for developers. It is a crowdsourced translation community, and if you upload your project to the site it automatically parses the strings. Once the translation is complete, you can download your project in the completed language pair. It's pretty simple, and easy to use.

Granted, it's not full localization, but it is a convenient tool in the battle for i18n.

I enjoyed this article, Andrew, and although some don't seem to believe it covers all the difficulties of localization, I think it illustrates, quite well, the inherent difficulty the language barrier imposes on all aspects of localizing software.


AndrewBinstock
2013-08-14T02:52:42

You don't need to log in to view comments, only to post. The second half of your comment contains important points that address many of the immediate, basic needs for i18n — but only for English speakers. And there's the rub.


Anonymous reader
2013-08-14T00:08:00

Not sure why I needed to log in just to view comments. This is an incredibly important subject, but the author makes a mess of it by mixing all kinds of issues instead of keeping it simple.
He confuses internationalization with translation and other issues. This is a US approach. What he does not make clear is that most of the world can deal with English/European languages, but cannot understand US formats.
- Dates in all the world are DD/MM/YY or YYYY-MM-DD.
- Currencies should be separated from amounts.
- Measurements might be in US or Metric.
- Addresses might not require states or zip codes.
- Telephone numbers might include country codes and not adhere to US formats.
- The application should support Unicode from the start.
If developers would consider these points at the beginning, they would overcome the issues that they will run into with internationalization, and provide an upgrade path to full global support.


AndrewBinstock
2013-08-11T06:38:43

Native language (NL) translation is a rather different topic. The problems you lay out are standard ones in NL processing. Not sure why you picked these in particular. There are literally thousands to choose from.
Not sure what you mean by "Fix up the writing so English will be better..."


ubm_techweb_disqus_sso_-3bb4162d60a73e47be0e384e4b8c5419
2013-08-10T11:21:34

One of the biggest problems is the use of American English.
The first irritant is the American use of "on".

In all foreign languages, and also in non-US English-speaking countries, the word "on" means "the opposite of off" or on something physical: the book is on the table. Americans say "I will talk to you on xyz algorithm or on xyz subroutine."

The word to use is one of "about," "concerning," or "apropos."
Then there is the use of "see": see the contents. Here is an example: use cat xyz to see the contents.

We see file xyz and view the contents. See and view are different words (the French distinguish regarder, to look at, from voir, to see). And finally, there are references back to a sentence with many subjects, as in "Abc causes def to react by doing ghi. This phenomenon is bad." The question is: to what does "phenomenon" refer? Does it refer to Abc, to def, or to ghi?

Fix up the writing so the English will be better, thus allowing the translation to be more precise.

Keep smiling.


Anonymous reader
2013-08-07T18:51:15

My favorite gotcha when programming Java servlets is HTTP POST requests. The servlet specification requires that if no charset is specified, ISO-8859-1 is assumed. The HTTP specification requires that clients use the same charset as the referring page, typically UTF-8, and not specify a charset. And I'm pretty sure they default to UTF-8 when in doubt. So every time I write a servlet that handles POST requests, it needs to explicitly override the default behavior.
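
The override itself is a single call, but it has to happen before the first getParameter(); a minimal sketch (the form field name is hypothetical):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class CommentServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Must run before any getParameter() call; once the container has
        // decoded the body as ISO-8859-1, the damage is done.
        if (req.getCharacterEncoding() == null) {
            req.setCharacterEncoding("UTF-8");
        }
        String name = req.getParameter("name");

        resp.setContentType("text/html; charset=UTF-8");
        resp.getWriter().println("Hello, " + name);
    }
}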


