The Java Internationalization API

By Carol A. Jones, January 01, 1998

"Internationalization" is the process of preparing programs to run in other languages. Carol examines Java's Internationalization API and shows how you can use it to design global software.

Dr. Dobb's Journal January 1998: The Java Internationalization API

Global software for the global village

Carol is a senior software engineer at IBM, working on Java development tools. She can be reached at [email protected].

If you are writing Java applets for the Internet, people all over the world will be interacting with your code. But what happens to your user interface if the words you display are much longer in another language? How do you sort words in German or Japanese? What if you need to display a message and the grammar is different in French?

The answers to questions such as these involve "internationalization" -- the process of preparing programs to run in other languages. To enable internationalization for Java programmers, the JDK 1.1 includes the Internationalization API. In this article, I'll examine this API and show you how to use it to design global software.

The Java Internationalization API

The Java Internationalization API is a comprehensive set of APIs for creating multilingual applications. The JDK 1.1 internationalization features include:

Classes for storing and loading language-specific objects.
Services for formatting messages, dates, times, and numbers.
Services for comparing and collating text.
Support for finding character, word, and sentence boundaries.
Support for display, input, and output of Unicode characters.

Central to Java internationalization is the concept of a Locale object, which identifies a specific cultural region, including information about the country or region and its spoken language. Java code that uses a Locale to tailor information for users is called "locale sensitive." For example, displaying a date is a locale-sensitive operation, because dates are formatted differently in almost every country.

Most operating systems have some way for users to indicate their locale. Windows 95 does this through the control panel, under the Regional Settings icon. In Java, you can get the Locale object that matches the user's control-panel setting using myLocale = Locale.getDefault();. You can also create Locale objects for specific places by indicating the language and country you want, such as myLocale = new Locale("fr", "CA"); for "Canadian French."

The strings you pass to the Locale constructor are two-letter language and country codes, as defined by ISO standards.
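As a minimal sketch of both approaches, the following class reads the platform default and builds a Canadian French Locale from its two-letter codes. The class name LocaleDemo and the helper method canadianFrench are hypothetical names chosen for illustration only.

```java
import java.util.Locale;

class LocaleDemo {

    // Builds a Locale from a two-letter language code and a two-letter country code.
    static Locale canadianFrench() {
        return new Locale("fr", "CA");
    }

    public static void main(String[] args) {
        // The default locale reflects whatever the user's platform reports.
        Locale userLocale = Locale.getDefault();
        Locale frCA = canadianFrench();

        System.out.println("Default locale: " + userLocale);
        System.out.println("Language code:  " + frCA.getLanguage());   // fr
        System.out.println("Country code:   " + frCA.getCountry());    // CA
        System.out.println("Display name:   " + frCA.getDisplayName(Locale.ENGLISH));
    }
}
```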

Isolating the User Interface

The first step in making an international Java program is to isolate all elements of your Java code that will need to change in another country. This includes user-interface text -- label text, menu items, shortcut keys, messages, and the like. You might also have images that need to be changed, either because they include text that is part of the image drawing, or because the image doesn't make much sense in another culture.

The ResourceBundle class is an abstract class that provides an easy way to organize and retrieve locale-specific strings or other resources. It stores these resources in an external file, along with a key that you use to retrieve the information. You'll create a ResourceBundle for each locale your Java program supports.

There are two kinds of ResourceBundle classes -- the PropertyResourceBundle and ListResourceBundle. Cliff Berg examined property-resource bundles in "How do I Write an International Application?" ("Java Q&A," DDJ, July 1997, available electronically, see "Resource Center," page 3). In this article, I'll look at the other side of the resource-bundle coin -- the ListResourceBundle.

ListResourceBundle works the same as PropertyResourceBundle, but the file format is actually Java source code that you compile into a Java class. The localized objects can be any Java objects (not just strings), so you can use ListResourceBundles to store image data, numbers, dates, or any other Java objects you need. The code and search algorithms for loading ListResourceBundles are the same as for PropertyResourceBundles, but ListResourceBundles are faster than PropertyResourceBundles; see Listing One.
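To illustrate the point about non-string resources, here is a small sketch of a ListResourceBundle that mixes a string, a number, and an array. The class names, keys, and values are invented for this example, and the bundle is instantiated directly to keep the sketch self-contained; a real program would normally load it through ResourceBundle.getBundle.

```java
import java.util.ListResourceBundle;
import java.util.ResourceBundle;

class BundleObjectsDemo {

    // Hypothetical bundle: keys and values are made up for illustration.
    static class FontResources extends ListResourceBundle {
        private static final Object[][] CONTENTS = {
            {"Bold",        "Fett"},                     // a localized string
            {"DefaultSize", Integer.valueOf(12)},        // a number
            {"Sizes",       new int[] {8, 10, 12, 14}}   // any other Java object
        };

        protected Object[][] getContents() {
            return CONTENTS;
        }
    }

    // Instantiated directly here; getBundle would normally do the loading.
    static ResourceBundle load() {
        return new FontResources();
    }

    public static void main(String[] args) {
        ResourceBundle bundle = load();
        String label = bundle.getString("Bold");               // strings use getString
        Integer size = (Integer) bundle.getObject("DefaultSize"); // other objects use getObject
        int[] sizes = (int[]) bundle.getObject("Sizes");

        System.out.println(label + ", default size " + size
            + ", " + sizes.length + " sizes available");
    }
}
```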

Java loads your resources based on the locale argument to the getBundle method. It searches for matching files with various suffixes, based on the language, country, and any variant or dialect to try to find the best match. Java tries to find a complete match first, then works its way down to the base filename as a last resort. Example 1 illustrates the search order.
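One way to see the names Java will look for is to ask for them. The sketch below uses ResourceBundle.Control, which was added in a later Java release (Java 6) and is not part of JDK 1.1, to list the candidate bundle names for a requested locale. The class name SearchOrderDemo is hypothetical. Note that this only shows the candidates for the requested locale; getBundle also tries names based on the default locale before it falls back to the base bundle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

class SearchOrderDemo {

    // Returns the bundle names tried for the requested locale,
    // from the most specific match down to the base name.
    static List<String> candidateNames(String baseName, Locale locale) {
        ResourceBundle.Control control =
            ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_DEFAULT);
        List<String> names = new ArrayList<String>();
        for (Locale candidate : control.getCandidateLocales(baseName, locale)) {
            names.add(control.toBundleName(baseName, candidate));
        }
        return names;
    }

    public static void main(String[] args) {
        // Swiss German: prints MyStrings_de_CH, MyStrings_de, MyStrings
        for (String name : candidateNames("MyStrings", new Locale("de", "CH"))) {
            System.out.println(name);
        }
    }
}
```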

You should always supply a base resource bundle with no suffixes, so that your program will still work if the user's locale does not match any of the resource bundles you supply. The base bundle can contain the U.S. English strings. Then provide an additional resource bundle for each additional language you want to support.

If you use both the language and country codes when you name resource bundles, you might be more restrictive than you need to be. For example, if you name a resource bundle MyStrings_de_DE.class (also available electronically), those resources will only be loaded on systems that are configured for standard German. A system that is configured for Swiss German will use the default bundle instead. It's best to omit the country code if possible, so that more users will find an appropriate match. If you had named the resource bundle MyStrings_de.class instead, then German-speaking users in both countries would see German strings instead of English ones. Of course, this is only desirable if the words translate the same way for both countries.

Formatting Text

Besides the text itself, the words in your user interface may have to adapt to other kinds of language changes. For example, when you display a date, time, or number, the format is likely to be different from country to country. If you are displaying a message where data is inserted into the message text, the entire sentence structure will probably be different in other languages. The Format classes in the java.text package help you solve these problems.
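As a small illustration of how the same value is written differently from country to country, this sketch formats one number for two locales with NumberFormat. The class name NumberFormatDemo and the sample value are assumptions made for the example.

```java
import java.text.NumberFormat;
import java.util.Locale;

class NumberFormatDemo {

    // Formats a number using the conventions of the given locale.
    static String format(double value, Locale locale) {
        return NumberFormat.getInstance(locale).format(value);
    }

    public static void main(String[] args) {
        double value = 1234567.891;
        System.out.println(format(value, Locale.US));       // 1,234,567.891
        System.out.println(format(value, Locale.GERMANY));  // 1.234.567,891
    }
}
```

The grouping and decimal separators swap roles between the two locales, which is why the value should be formatted at display time rather than stored as text.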

Message Formatting

Because the entire sentence structure of messages can change in other languages, you should never display strings that are built up by concatenating individual words, even if you are translating the individual words. For example, suppose you created a string in English as shown in Example 2(a). The same structure would not work in Spanish; Example 2(b) shows what the Spanish equivalent should be.

Notice that the position of the inserted text is different. To handle this situation, use the MessageFormat class, which provides a means to produce concatenated messages in a language-neutral way. It works by taking an array of objects, formatting each one, and then inserting the formatted strings into a pattern string at the appropriate places. For instance, the code in Example 3(a) produces the string Example 3(b).

The {0} syntax indicates a placeholder where an object will be formatted and substituted into the string. The number matches the index of the array where the appropriate object is stored. When you translate the format string to a different language, you can rearrange placeholders to any place in the string. The placeholders can have additional information about how the substituted data should be formatted. Some of the options are time, date, currency, percent, or integer, with lots of special options for each. If you need finer control over the formatting of a date, time, or numeric values, you can use the SimpleDateFormat and NumberFormat classes.
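The following sketch shows placeholders being rearranged between two patterns while the argument array stays the same. The patterns, the class name MessageDemo, and the helper method render are hypothetical and are not the article's own examples; in a real program the pattern strings would come from a resource bundle.

```java
import java.text.MessageFormat;
import java.util.Locale;

class MessageDemo {

    // Formats the arguments into the pattern using the given locale's rules.
    static String render(String pattern, Locale locale, Object... args) {
        return new MessageFormat(pattern, locale).format(args);
    }

    public static void main(String[] args) {
        // Hypothetical patterns; {0} is the file count, {1} is the disk name.
        String english = "There are {0} files on disk {1}.";
        String spanish = "El disco {1} contiene {0} archivos.";

        Object[] data = { Integer.valueOf(3), "MyDisk" };

        // There are 3 files on disk MyDisk.
        System.out.println(render(english, Locale.US, data));
        // El disco MyDisk contiene 3 archivos.
        System.out.println(render(spanish, new Locale("es", "ES"), data));
    }
}
```

Only the pattern changes between languages; the code that supplies the data does not need to know where each value ends up in the sentence.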

User-Interface Layout

When you build the user-interface components of your Java applet, the translated strings will probably take up a different amount of space than the English ones. They might take less space, but often they will take more; see Figure 1. If you haven't left enough room for longer strings, the user interface might become unreadable. The best way to avoid this is by using the Java layout manager to arrange your user-interface components.

Comparing and Sorting Strings

Often, you'll need to compare strings, perhaps for sorting, or just to check for a certain value. If the strings you are comparing contain natural language text, you can't just compare the numeric value of the characters. Although this works in English, it doesn't work in most other languages. In French, for example, accent differences are sorted from the end of the word, so "pêche" comes before "péché." In German, the letter "o" with an umlaut (ö) is sorted as "oe," so "Töne" comes before "Tofu." In Asian writing systems, there are several different methods of sorting that can be based on pronunciation or on the number of strokes used in a character.

Instead of relying on numerical comparisons, use the Collator class, which performs locale-sensitive String comparison. You can set a Collator's strength property to determine what kinds of differences are considered significant in comparisons. There are four strengths: PRIMARY, SECONDARY, TERTIARY, and IDENTICAL. The meaning of each strength is locale dependent. In English, primary strength means that only actual letter differences are significant, such as "A" and "B"; accents and case are ignored. Secondary strength also treats accent differences as significant, but still ignores differences in case; "A" and "a" are equal, for instance. Tertiary strength also treats case differences as significant. Listing Two shows how to compare strings at the first three strengths for different locales. (Also see StrengthDemo.class and StrengthDemo.java, available electronically.)
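As a quick check of what each strength treats as a difference in English, the sketch below compares a few single-letter strings with a U.S. English Collator. The class name StrengthSketch and the helper compareAt are hypothetical; the accented letter is written as a Unicode escape so the source file's encoding does not matter.

```java
import java.text.Collator;
import java.util.Locale;

class StrengthSketch {

    // Compares two strings with a U.S. English collator at the given strength.
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        String plain = "e";
        String accented = "\u00e9";   // e with an acute accent
        String upper = "E";

        // PRIMARY: accent and case differences are ignored.
        System.out.println(compareAt(Collator.PRIMARY, plain, accented) == 0);   // true
        // SECONDARY: accents now matter, case still does not.
        System.out.println(compareAt(Collator.SECONDARY, plain, accented) != 0); // true
        System.out.println(compareAt(Collator.SECONDARY, plain, upper) == 0);    // true
        // TERTIARY: case matters as well.
        System.out.println(compareAt(Collator.TERTIARY, plain, upper) != 0);     // true
    }
}
```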

For comparing Strings once, the compare method provides the best performance. The compare method only examines as many characters as it needs, which allows it to be faster when doing single comparisons. When sorting a list of Strings, however, it is generally necessary to compare each String multiple times. In this case, the CollationKey class gives better performance.

A CollationKey represents a String as a series of bits that can be compared bitwise under the rules of a specific locale. Comparing two CollationKeys returns the relative order of the Strings they represent. This allows fast comparisons once the keys are computed. The cost of computing the keys is recouped in faster comparisons when strings need to be compared many times. Even for small lists of items, the performance of collation keys can be ten times better than using the Collator's compare method.

Listing Three shows how CollationKeys might be used to sort a list of Strings.

Finding Text Boundaries

The Internationalization API includes classes that help you detect natural language boundaries, including character, word, sentence, and line breaks. Many languages have special rules for determining how a sentence ends, where a line can be broken in the process of line wrapping, or even what is considered a single character.

To appreciate the difficulty of this problem, consider these language differences:

Chinese, Japanese, Korean, and Thai do not necessarily have spaces between words.
Spanish has punctuation at the beginning of a sentence; Thai doesn't use punctuation.
Most languages can have punctuation in the middle of a sentence, such as a decimal point or quotation marks.
Abbreviations that are considered one word can have periods in them.
In some Asian languages, certain sets of characters are not allowed at the beginning of a line; other sets of characters are not allowed at the end of a line.
Many languages have accented characters, which are a base character plus a diacritical mark that together are considered one character.

The class you use to parse natural language text is the BreakIterator class. For each kind of boundary you want to detect, you use a different instance of the class, either the character instance, word instance, sentence instance, or line instance. Then you set the text you want to parse into the iterator, and step through the boundaries much as you would step through an ordinary Enumeration object, calling next() until it returns BreakIterator.DONE; see Listing Four. (Also see BreakDemo.class and BreakDemo.java, available electronically.)

Unicode

Historically, character encoding has been the most difficult area of internationalization. Some standards represent a character as one byte, and some two or more bytes. Passing data back and forth requires complex mappings and conversions.

Java 1.1 makes all of this straightforward, because it uses the Unicode 2.0 character encoding standard. In Unicode, every character occupies two bytes. Ranges of character encodings represent different writing systems or other special symbols. For example, Unicode characters in the range 0x0000 through 0x007F represent the basic Latin alphabet, and characters in the range 0x4E00 through 0x9FFF represent the Han characters used in China, Japan, Korea, Taiwan, and Vietnam.

Since Java uses Unicode internally, it can represent any character of any commonly used language. The DataInputStream and DataOutputStream classes automatically handle the input and output of Unicode characters for you. These classes also include two methods that help you write text more efficiently: readUTF and writeUTF. UTF is a multibyte encoding format, which stores some characters as one byte and others as two or three bytes. If most of your data is ASCII characters, it is more compact than Unicode, but in the worst case, a UTF string can be 50 percent larger than the corresponding Unicode string. Overall, it is fairly efficient.
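To make the size trade-off concrete, here is a sketch that writes strings with writeUTF into an in-memory buffer, counts the bytes, and reads them back with readUTF. The class name UtfDemo and its helper methods are hypothetical. The counts include the two-byte length prefix that writeUTF puts in front of each string.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class UtfDemo {

    // Encodes a string with writeUTF and returns the raw bytes.
    static byte[] encode(String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(text);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Decodes bytes produced by encode back into a string.
    static String decode(byte[] data) {
        try {
            return new DataInputStream(new ByteArrayInputStream(data)).readUTF();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String ascii = "abc";            // one byte per character
        String han = "\u4e2d\u6587";     // two Han characters, three bytes each

        System.out.println(encode(ascii).length);   // 5 = 2-byte length + 3
        System.out.println(encode(han).length);     // 8 = 2-byte length + 6
        System.out.println(decode(encode(han)).equals(han));   // true
    }
}
```

The ASCII string takes one byte per character instead of two, while each Han character takes three bytes instead of two, which is the 50 percent worst case mentioned above.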

Despite the advantages of Unicode, there are some drawbacks: Unicode support is limited on many platforms because of the lack of fonts capable of displaying all the Unicode characters.

Conclusion

With careful planning and with a good understanding of the Internationalization API, you can create global software more easily and at lower cost. But even if you are not localizing your software, these Java classes give you more powerful ways to parse and format text and more efficient ways to compare and sort text; and they will help you create more flexible user-interface layouts.

DDJ

Listing One

import java.util.*;

public class MyStrings_de_DE extends ListResourceBundle {
   public Object[][] getContents() {
    return contents;
   }
   static final Object[][] contents = {
      // LOCALIZE THIS
      {"FontName",  "Schriftart"},
      {"Size",      "Größe"},
      {"Bold",      "Fett"},
      {"Italic",    "Kursiv"},
      {"Color",     "Farbe"},
      {"Red",       "Rot"},
      {"Green",     "Grün"},
      {"Blue",      "Blau"},
      {"Sample",    "Vorschau"}
      // END OF MATERIAL TO LOCALIZE
   };
}

Listing Two

// This sample illustrates use of the Collator class
import java.text.*;
import java.util.*;

class StrengthDemo {

  static Collator collator;
  static int result;
  static String s1 = "pêche";
  static String s2 = "péché";

  public static void main(String argv[]) {

    System.out.println("United States");
    collator = Collator.getInstance(new Locale("en","US"));
    collator.setStrength(Collator.PRIMARY);
    doSorts();
    collator.setStrength(Collator.SECONDARY);
    doSorts();
    collator.setStrength(Collator.TERTIARY);
    doSorts();

    System.out.println();

    System.out.println("France");
    collator = Collator.getInstance(new Locale("fr","FR"));
    collator.setStrength(Collator.PRIMARY);
    doSorts();
    collator.setStrength(Collator.SECONDARY);
    doSorts();
    collator.setStrength(Collator.TERTIARY);
    doSorts();
  }

  // Prints the current strength and how s1 compares to s2 at that strength
  public static void doSorts()
  {
    if (collator.getStrength() == Collator.PRIMARY)
      System.out.println("Primary: ");
    else if (collator.getStrength() == Collator.SECONDARY)
      System.out.println("Secondary: ");
    else if (collator.getStrength() == Collator.TERTIARY)
      System.out.println("Tertiary: ");
    result = collator.compare(s1, s2);
    if (result == 0)
      System.out.println(s1 + " equals " + s2);
    else if (result < 0)
      System.out.println(s1 + " is before " + s2);
    else if (result > 0)
      System.out.println(s1 + " is after " + s2);
  }
}
Output
------
United States
Primary:
pêche equals péché
Secondary:
pêche is after péché
Tertiary:
pêche is after péché

France
Primary:
pêche equals péché
Secondary:
pêche is before péché
Tertiary:
pêche is before péché

Listing Three

//--------------------------------------------------------------------------
// Method to sort a vector of strings. This sort uses CollationKeys, which
// are about 10 times faster than using Collator.compare. This is true even
// including the overhead of setting up the extra arrays, etc.
// It is a simple bubble sort, which is faster than other algorithms
// for very small vectors and for vectors that are almost in order already.
// This sample illustrates use of the CollationKey class.

import java.text.*;
import java.util.*;

class SortDemo {
    static Vector v = new Vector();

    public static void main(String args[]) {
        v.addElement("Dan");
        v.addElement("Alice");
        v.addElement("Liza");
        v.addElement("Edward");
        v.addElement("Zane");
        v = sort(v);
        for (int i=0; i<v.size(); i++)
            System.out.println(v.elementAt(i));
    }

    public static Vector sort (Vector unsorted) {
        CollationKey temp;
        int i,j;
        int size = unsorted.size();
        Vector sorted = new Vector(size);

        // Compute one collation key per string, using the default locale's rules
        Collator collator = Collator.getInstance(Locale.getDefault());
        CollationKey[] keys = new CollationKey[size];
        for (i=0; i<size; i++)
            keys[i] = collator.getCollationKey((String)(unsorted.elementAt(i)));

        // Sort the keys; each comparison is a fast bitwise comparison
        for (i=0; i<size; i++)
            for (j=i; j<size; j++)
                if( keys[i].compareTo( keys[j] ) > 0 ) {
                    temp = keys[j];
                    keys[j] = keys[i];
                    keys[i] = temp;
                }

        // Recover the original strings in sorted order
        for (i=0; i<size; i++)
            sorted.addElement(keys[i].getSourceString());
        return sorted;
    }
}
Output
------
Alice
Dan
Edward
Liza
Zane

Listing Four

// This sample illustrates use of the BreakIterator class
import java.text.*;
import java.util.*;

class BreakDemo {
    static String string =
        "He said \"How tall are you?\" and I said I'm 5\'5\" tall, etc.";
    static BreakIterator boundary;

    public static void main(String args[]) {

        //print each word
        boundary = BreakIterator.getWordInstance(Locale.getDefault());
        boundary.setText(string);
        printBreaks("word");

        //print each sentence
        boundary = BreakIterator.getSentenceInstance(Locale.getDefault());
        boundary.setText(string);
        printBreaks("sentence");
    }

    // Prints each piece of text between consecutive boundaries, skipping single spaces
    static void printBreaks(String type) {
        int start = boundary.first();
        int end = boundary.next();
        while (end != BreakIterator.DONE) {
            String part = string.substring(start,end);
            if (!part.equals(" "))
                System.out.println(part + " is a " + type);
            start = end;
            end = boundary.next();
        }
    }
}
Output
------
He is a word
said is a word
" is a word
How is a word
tall is a word
are is a word
you is a word
? is a word
" is a word
and is a word
I is a word
said is a word
I'm is a word
5'5 is a word
" is a word
tall is a word
, is a word
etc. is a word
He said "How tall are you?"  is a sentence
and I said I'm 5' is a sentence
5" tall, etc. is a sentence