Gregor is a research engineer and teaching assistant at the University of Maribor in Slovenia. He can be contacted at [email protected]
In this era of globalization, internationalization is becoming an increasingly important area of web engineering. The most important aspect of internationalization (or "i18n," where "18" is the number of letters between the "i" and the "n" in "internationalization") is character encoding. Modern web technologies such as Java and XML are well-suited to i18n because of their Unicode character encoding support.
Unicode is an industry standard designed to let text and symbols from all the writing systems of the world be consistently represented and manipulated by software. Unicode realizes this key i18n feature by providing a unique number for every character, thus enabling a single software product or a single web site to be targeted across multiple platforms, languages, and countries without reengineering. The preferred character encoding in web environments is UTF-8, a variable-length Unicode transformation format.
Despite the available i18n technologies, most web pages fail to adequately represent more or less exotic characters. The root of the problem usually lies in complex, multitiered web application architectures. The sociological aspect of the i18n problem shouldn't be neglected either: because most technologies and web content are presented in English, the most common and best-tested character encodings are those for the Latin alphabet (ISO 8859-1 on UNIX and Cp1252 on Windows).
In this article, I address character-encoding-based i18n problems by providing guidelines and Java-based examples on how to enable UTF-8 support in multitiered Java-based web applications. The guidelines I present here are the result of my experience in developing complex multilingual web applications.
Using Unicode and UTF-8
Most traditional character encodings, such as those defined by the ISO 8859 standard, are 8-bit. This means that they can represent at most 256 different characters. For the most part, this character-set size is satisfactory for bilingual computer processing (usually using Roman characters and the local language). However, multilingual software environments require far more than 256 characters. Just think of the World Wide Web, where Cyrillic, Hebrew, Arabic, and Chinese characters, as well as new characters such as the Euro symbol (€), are often required in a single hypertext document.
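The 256-character limit can be observed directly from Java. The following is a minimal sketch rather than a definitive test: the class name EncodingCoverage and the helper canRepresent are hypothetical names chosen for illustration, and only standard java.nio.charset calls are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingCoverage {

    // Returns true if every character of the text can be encoded in the given charset.
    static boolean canRepresent(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Euro sign, a Cyrillic letter, and a Hebrew letter in one string.
        String mixed = "Euro \u20AC, Cyrillic \u0416, Hebrew \u05D0";

        System.out.println("ISO 8859-1 can represent it: "
                + canRepresent(mixed, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 can represent it: "
                + canRepresent(mixed, StandardCharsets.UTF_8));

        // String.getBytes does not fail on unmappable characters; it silently
        // substitutes the charset's replacement byte, which loses data.
        byte[] lossy = "\u20AC".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("Euro sign encoded as ISO 8859-1 becomes: " + (char) lossy[0]);
    }
}
```

Checking with canEncode before converting is one way to detect such data loss early instead of discovering question marks in the rendered page.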
The solution to this problem is the adoption of a universal character encoding -- Unicode. Unicode provides the basis for processing, storing, and interchanging text data in any language in all modern software and information technology protocols. Unicode provides a unique code point -- a number, not a glyph -- for each character. This means that it represents a character in an abstract way and leaves the visual rendering (symbol, font, size, or shape) to the applications that display it, such as web browsers or word processors. In its latest version, 5.0, Unicode defines roughly 100,000 encoded characters.
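To make the idea of a code point concrete, here is a small sketch that lists the code points of a string together with their official Unicode names. The class name CodePoints and the helper notation are hypothetical; String.codePoints (Java 8 and later) and Character.getName (Java 7 and later) are standard library calls.

```java
final class CodePoints {

    // Formats a code point in the conventional U+XXXX notation.
    static String notation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // Latin letter A, the Euro sign, and a musical G clef. The clef lies
        // outside the Basic Multilingual Plane, so Java stores it as two chars.
        String text = "A\u20AC\uD834\uDD1E";

        System.out.println("chars: " + text.length());
        System.out.println("code points: " + text.codePointCount(0, text.length()));

        // Each code point is just a number with a name; no glyph is involved.
        text.codePoints().forEach(cp ->
                System.out.println(notation(cp) + " " + Character.getName(cp)));
    }
}
```

Note that the number of Java char values can differ from the number of code points, which is why code that iterates over text is usually safer working with code points.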
There are several possible representations of Unicode data, defined by the Unicode Transformation Formats (UTFs). A UTF is an algorithmic mapping from every Unicode code point to a unique byte sequence. Several UTFs exist (see Table 1), of which UTF-8 is the most widely used. UTF-8 is a variable-length character encoding able to represent any character in the Unicode standard, yet its encoding of the first 128 characters is byte-for-byte identical to ASCII. (It is not byte-compatible with Latin-1, because characters with values greater than 127 are encoded as multibyte sequences rather than as the single bytes Latin-1 uses.) For these reasons, it is becoming the preferred encoding for e-mail, web pages, and other systems where characters are stored or streamed.
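The variable-length nature of UTF-8 and its relationship to ASCII and Latin-1 can be illustrated with a short sketch. The class name Utf8Lengths and its helpers utf8Length and hex are hypothetical names introduced for this example; the sample characters are arbitrary choices.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Lengths {

    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Renders bytes as space-separated, two-digit hexadecimal values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A (U+0041), e with acute (U+00E9), Euro sign (U+20AC), G clef (U+1D11E).
        String[] samples = {"A", "\u00E9", "\u20AC", "\uD834\uDD1E"};
        for (String s : samples) {
            System.out.println(utf8Length(s) + " byte(s): "
                    + hex(s.getBytes(StandardCharsets.UTF_8)));
        }

        // Pure ASCII markup produces identical bytes in ASCII and UTF-8.
        boolean sameAsAscii = Arrays.equals(
                "<html>".getBytes(StandardCharsets.US_ASCII),
                "<html>".getBytes(StandardCharsets.UTF_8));
        System.out.println("ASCII markup unchanged in UTF-8: " + sameAsAscii);

        // The same character above 127 is one byte in Latin-1 but two in UTF-8.
        System.out.println("Latin-1: " + hex("\u00E9".getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println("UTF-8:   " + hex("\u00E9".getBytes(StandardCharsets.UTF_8)));
    }
}
```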
A Conceptual View of Java Web Apps
Java is well-suited to internationalization because of its native Unicode character support and robust locale-specific functionality. However, because a web application is a system of collaborating components (see Figure 1), several steps are required to enable full Unicode support.
UTF-8 is the preferred encoding form of Unicode for web applications because ASCII markup remains in ASCII, which keeps the files transferred over the Internet small. To enable UTF-8 in Java web apps, it is necessary to ensure that all constituent components are capable of receiving, processing, and outputting UTF-8 encoded data. This requires UTF-8 compliant components, as well as data encoded in UTF-8 (see Figure 2).
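What happens when one component in the chain does not agree on the encoding can be sketched with plain Java I/O. This is an illustration under stated assumptions, not a piece of any particular web framework: the class ExplicitCharsetIo and its write and read helpers are hypothetical, and the byte arrays stand in for data passed between two components.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetIo {

    // "Output" side: encodes text into bytes with an explicitly named charset.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // "Receive" side: decodes bytes into text with an explicitly named charset.
    static String read(byte[] data, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "Price: 10 \u20AC";
        byte[] wire = write(original, StandardCharsets.UTF_8);

        // Both sides agree on UTF-8: the text survives the round trip.
        System.out.println(original.equals(read(wire, StandardCharsets.UTF_8)));

        // The receiver wrongly assumes Latin-1: the Euro sign turns into
        // three unrelated characters.
        System.out.println(original.equals(read(wire, StandardCharsets.ISO_8859_1)));
    }
}
```

Naming the charset at every reader and writer, rather than relying on defaults, is the design choice that keeps each hop in the chain predictable.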
As Figure 1 shows, a common Java web application consists of several components. With respect to UTF-8 support, three types of components exist:
- Components with native UTF-8 support. These components are capable of managing UTF-8 encoded data without any modifications.
- Components that require UTF-8 support to be configured. These components require some configuration or upgrades before they can manage UTF-8 encoded data.
- Components that do not support UTF-8. These components (usually legacy systems) are not capable of managing UTF-8 encoded data. They must be replaced or upgraded before they can manage UTF-8 encoded data.
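One component that can fall into the second category is the Java runtime itself, because on many JVM versions the default charset is taken from the platform rather than fixed to UTF-8. The sketch below, with the hypothetical class name DefaultCharsetCheck and helper isUtf8, reports what the running JVM uses; treating the JVM as a "component" in the sense of the list above is an assumption made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetCheck {

    // Returns true if the given charset is UTF-8.
    static boolean isUtf8(Charset charset) {
        return StandardCharsets.UTF_8.equals(charset);
    }

    public static void main(String[] args) {
        Charset platformDefault = Charset.defaultCharset();
        System.out.println("JVM default charset: " + platformDefault.name());

        if (!isUtf8(platformDefault)) {
            // APIs that fall back to the default charset would not produce
            // UTF-8 here, so the charset should be passed explicitly.
            System.out.println("Default is not UTF-8; name the charset explicitly "
                    + "when creating readers and writers.");
        }
    }
}
```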