Unicode and Java Web Applications

i18n isn't turnkey even with Java's native Unicode support


February 05, 2009
URL:http://www.drdobbs.com/web-development/unicode-and-java-web-applications/213201510

Gregor is a research engineer and teaching assistant at the University of Maribor in Slovenia. He can be contacted at [email protected].


In this era of globalization, internationalization is becoming an increasingly important area of web engineering. The most important aspect of internationalization (or "i18n" as it is referred to, where "18" is the number of letters between the "i" and the "n" in "internationalization") is character encoding. Modern web technologies like Java and XML are well suited to i18n because of their Unicode character encoding support.

Unicode is an industry standard designed to let text and symbols from all the writing systems of the world be consistently represented and manipulated by software. Unicode realizes this key i18n feature by providing a unique number for every character, thus enabling a single software product or a single web site to be targeted across multiple platforms, languages, and countries without reengineering. The preferred character encoding in web environments is UTF-8, a variable-length Unicode character encoding transformation format.

Despite the availability of i18n technologies, most web pages fail to adequately represent more or less exotic characters; the root of the problem usually lies in complex, multitiered web application architectures. The sociological aspect of the problem shouldn't be neglected either. Because most technologies and web content are presented in English, the most common and best-tested character encoding remains the Latin alphabet (ISO 8859-1 on UNIX and Cp1252 on Windows).

In this article, I address character-encoding-based i18n problems by providing guidelines and Java-based examples for enabling UTF-8 support in multitiered Java-based web applications. The guidelines I present here are the result of my experience in developing complex multilingual web applications.

Using Unicode and UTF-8

Most traditional character encodings, such as those defined by the ISO 8859 standard, are 8-bit. This means that they can only represent 256 different characters. For the most part, this character-set size is satisfactory for bilingual computer processing (usually using Roman characters and the local language). However, multilingual software environments require far more than 256 characters. Just think of the World Wide Web, where Cyrillic, Hebrew, Arabic, and Chinese characters, as well as newer symbols such as the Euro sign (€), are often required in a single hypertext document.

The solution to this problem is the adoption of a universal character encoding -- Unicode. Unicode provides the basis for processing, storing, and interchanging text data in any language in all modern software and information technology protocols. Unicode provides a unique code point -- a number, not a glyph -- for each character. This means that it represents a character in an abstract way and leaves the visual rendering (symbol, font, size, or shape) to underlying applications, such as web browsers or word processors. In its latest version, 5.0, Unicode defines more than 100,000 encoded characters.

There are several possible representations of Unicode data, indicated by the Unicode Transformation Format (UTF). A UTF is an algorithmic mapping from every Unicode code point to a unique byte sequence. Several UTFs exist (see Table 1), of which UTF-8 is the most widely used. UTF-8 is a variable-length character encoding able to represent any character in the Unicode standard, and its encoding of the first 128 characters is consistent with ASCII (but not with Latin-1, because the characters above 127 differ from Latin-1). For these reasons, it is becoming the preferred encoding for e-mail, web pages, and other systems where characters are stored or streamed.

Table 1: Properties of common Unicode transformation formats.
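The variable-length property of UTF-8 is easy to observe from Java itself. The following standalone sketch (the class name is mine) prints the UTF-8 byte length of an ASCII letter, an accented Latin letter, and the Euro sign:

```java
public class Utf8Lengths {
    public static void main(String[] args) throws Exception {
        // ASCII characters occupy one byte in UTF-8; accented Latin
        // letters occupy two; the Euro sign occupies three.
        String[] samples = { "A", "é", "€" };
        for (String s : samples) {
            byte[] bytes = s.getBytes("UTF-8");
            System.out.println(s + " -> " + bytes.length + " byte(s)");
        }
    }
}
```

Running it prints "A -> 1 byte(s)", "é -> 2 byte(s)", and "€ -> 3 byte(s)", which is exactly the ASCII compatibility and variable width described above.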

A Conceptual View of Java Web Apps

Java is well suited to internationalization because of its native Unicode character support and robust locale-specific functionality. However, because a web application represents a system of collaborating components (see Figure 1), several steps are required to enable its full Unicode support.

Figure 1: Typical web application.

UTF-8 is the preferred encoding form of Unicode for web applications because ASCII markup remains ASCII, which means smaller file sizes transferred over the Internet. To enable UTF-8 in Java web apps, it is necessary to ensure that all constituent components are capable of receiving, processing, and outputting UTF-8 encoded data. This requires UTF-8 compliant components, as well as data encoded in UTF-8 (see Figure 2).

Figure 2: UTF-8 enabled component and data.

As Figure 1 shows, a common Java web application consists of several components, each with a different degree of UTF-8 support.

Besides components, different types of data and files are manipulated in Java web applications -- XHTML, JSP, and JavaScript files, for instance. Most of them support different encodings and require configuration before they support UTF-8.

Enabling UTF-8 In Java Web Apps

Different Java web application architectures exist; the most common architecture consists of a web server, a servlet container, and a database management system (DBMS). The client part is represented by a web browser (Figure 3).

Figure 3: Typical Java web system consisting of a client and web application.

In Figure 3, a Java web application consists of several processing parts, each of which handles different types of files or data. A fully UTF-8 compliant web application requires that all constituent elements support UTF-8. I now present guidelines for configuring these elements for UTF-8 support.

Web Client Tier. Web browsers represent the client side of a web application. The latest versions of popular browsers (Internet Explorer, Mozilla Firefox, Safari, and Opera) support Unicode. Moreover, the character encoding is detected automatically by default, requiring no end-user configuration. The best way to tell a browser that a file is UTF-8 encoded is to put the character-set information in the HTTP response header:


Server: Apache-Coyote/1.1
Pragma: No-cache
Cache-Control: no-cache
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-Type: text/html;charset=utf-8
Transfer-Encoding: chunked
Date: Thu, 05 Jul 2007 13:10:43 GMT

Web Server Tier. The HTTP response header is defined on the web server tier. Most web servers use the encoding of the operating system, defined in the file.encoding system property. On English systems, this property is usually ISO-8859-1 (the UNIX Latin-1 charset) or Cp1252 (the Windows Latin-1 charset). To ensure UTF-8 support, the file.encoding property has to be redefined during system startup. For example, Apache recommends editing the Tomcat startup script (catalina.bat or catalina.sh) to add the switch -Dfile.encoding=UTF-8 to the call to the Java executable. This ensures that the HTTP response encoding defaults to UTF-8, though it can be overridden in Java servlet code as needed. Moreover, web servers such as Apache 2.0 for Windows NT use UTF-8 for all filename encodings.
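Application code need not depend on the platform default at all: any writer can be constructed with an explicit charset. A minimal standalone sketch (the class name is mine) writes the Slovenian character č through a UTF-8 writer and shows the resulting bytes, which are the same regardless of file.encoding:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class ExplicitUtf8 {
    public static void main(String[] args) throws Exception {
        // Request UTF-8 explicitly instead of relying on the JVM's
        // file.encoding default; the output bytes are then portable.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        Writer w = new OutputStreamWriter(buf, "UTF-8");
        w.write("č");  // U+010D, outside Latin-1's safe ASCII range
        w.close();
        byte[] bytes = buf.toByteArray();
        System.out.printf("%02x %02x%n", bytes[0], bytes[1]);  // c4 8d
    }
}
```

The same principle applies in servlet code, where calling response.setContentType("text/html;charset=UTF-8") before obtaining the writer overrides the server default.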

Web servers process static hypertext documents (XHTML and HTML files), which should include the following tag at the top of the <head> section:

<meta http-equiv="content-type" content="text/html; charset=utf-8">

Besides, (X)HTML documents can invoke JavaScript files or code blocks. UTF-8 encoding is explicitly defined for a JavaScript block or file by including the charset attribute in the script's start tag:


<script src="scriptFile.js" type="text/javascript"
charset="utf-8"></script>

Application Server Tier. Application servers are programs that handle all application operations between users and an organization's backend business applications or databases. With Java, application servers process compiled Java class files and JavaServer Pages (JSP) files.

Compiled Java files do not require any UTF-8 configuration, whereas JSP files enable UTF-8 encoding through a page directive placed at the top of the file that includes the pageEncoding and contentType attributes:


<%@ page contentType="text/html;charset=utf-8"
pageEncoding="utf-8" %>

This page directive should also be used in all JSP files included with the <jsp:include> tag (but not in those included with the <%@ include %> directive). Moreover, if a JSP file contains an (X)HTML <head> tag, it has to include the UTF-8 meta tag:


<meta http-equiv="content-type" content="text/html; charset=utf-8">

If the sendRedirect() method is used to redirect to another JSP file, query string parameters should be encoded with the java.net.URLEncoder.encode() method. Some configuration is also required when submitting (X)HTML forms:


<form action="processData.jsp" method="post"
enctype="multipart/form-data; charset=utf-8">
 ...
</form>

The above form submits its data in UTF-8. However, you will have a problem if you call request.getParameter() in the JSP and the parameter contains special characters. This problem is solved with an encoding filter, which lets you specify the character encoding for all input fields in your JSP pages. In this way, any request.getParameter() call reads data in the proper form. Listing One is a Java servlet-based encoding filter.


package si.unimb.filters;

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class EncodingFilter implements Filter {
   private String encoding = "utf-8";
   public void destroy() {}
   public void doFilter(ServletRequest req, ServletResponse res,
         FilterChain fc) throws IOException, ServletException {
      req.setCharacterEncoding(encoding);
      fc.doFilter(req, res);
   }
   public void init(FilterConfig fc) throws ServletException {
      String encodingParam = fc.getInitParameter("encoding");
      if (encodingParam != null)
         encoding = encodingParam;
   }
}

Listing One

It must be configured in web.xml (Listing Two) to be executed before every request.


<filter>
  <filter-name>EncodingFilter</filter-name>
  <filter-class>
    si.unimb.filters.EncodingFilter
  </filter-class>
  <init-param>
    <param-name>encoding</param-name>
    <param-value>UTF-8</param-value>
  </init-param>
</filter>
<filter-mapping>
  <filter-name>EncodingFilter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>

Listing Two

If a JSP file submits a form through JavaScript with method="GET", multilanguage query string parameters should be encoded using the JavaScript encodeURI() method. Similarly, all JSP files that use standard HTML hyperlink tags (<a href="">) should encode multilanguage query string parameters with encodeURI(). Note that the pageEncoding attribute is supported only by JSP 1.2 or later.
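The server-side counterpart of encodeURI() is java.net.URLEncoder, mentioned above for sendRedirect(). A minimal sketch (the class name, the parameter name "q", and the Slovenian value are illustrative):

```java
import java.net.URLEncoder;

public class QueryEncoding {
    public static void main(String[] args) throws Exception {
        // Percent-encode a non-ASCII parameter value as UTF-8 before
        // appending it to a redirect URL or hyperlink.
        String encoded = URLEncoder.encode("čaj", "UTF-8");
        System.out.println("search.jsp?q=" + encoded);  // q=%C4%8Daj
    }
}
```

The receiving JSP then decodes the parameter correctly, provided the encoding filter from Listing One has set the request encoding to UTF-8.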

Application servers also process properties files, which are commonly used as dictionary files (message bundles). Properties files do not provide a mechanism for indicating their encoding. Therefore, extended characters have to be stored in a form that Java can interpret correctly as Unicode characters: Unicode escapes. A Unicode escape indicates a Unicode character by its hexadecimal code point and is interpreted by Java into that character.
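For illustration, a hypothetical dictionary entry (the key and value here are invented) whose value contains the Slovenian character č would be stored with that character escaped:

```
# The raw value is "četrtek" (Thursday); č (U+010D) becomes \u010d
day.thursday=\u010detrtek
```

Java's ResourceBundle machinery reads the escaped form back as the original character.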

The conversion of extended characters to Unicode escapes can be done on the command line with Java's native2ascii converter, which takes an -encoding switch indicating the encoding of the source file, followed by the names of the source and target files:


$ native2ascii -encoding UTF-8 SourceFile TargetFile
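The core of what native2ascii does to characters above 127 can be sketched in a few lines of Java (the class and method names are mine, and this covers only the escaping direction):

```java
public class ToUnicodeEscapes {
    // Replace every character above 127 with its four-hex-digit
    // Unicode escape, as native2ascii does for extended characters.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c > 127) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
    public static void main(String[] args) {
        // escape("četrtek") yields "\\u010detrtek" (with one backslash)
        System.out.println(escape("četrtek"));
    }
}
```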

Database Tier. Database management systems (DBMSs) require character-set information when a new database or table is created. Several databases support UTF-8 by default; Hypersonic (HSQLDB), for example. Otherwise, the default character set has to be defined explicitly, as in MySQL's configuration file (my.ini):


default-character-set=utf8

Besides, database drivers usually require extra configuration, as when connecting to a MySQL database through a Java Database Connectivity (JDBC) driver:


Class.forName("com.mysql.jdbc.Driver");
Connection db = DriverManager.getConnection(
   "jdbc:mysql://localhost/myDatabase" +
   "?useUnicode=true&characterEncoding=utf-8",
   "username", "password");

Conclusion

Even with native Unicode support in Java, several steps are required to configure a Java-based web application for complete Unicode support.
