Unicode and Java Web Applications


Enabling UTF-8 In Java Web Apps

Different Java web application architectures exist; the most common one consists of a web server, a servlet container, and a database management system (DBMS). The client side is represented by a web browser (Figure 3).

Figure 3: Typical Java web system consisting of a client and web application.

As Figure 3 shows, a Java web application consists of several processing tiers, each of which handles different types of files or data. A fully UTF-8 compliant web application requires that all constituent elements support UTF-8. I now present guidelines for configuring these elements for UTF-8 support.

Web Client Tier. Web browsers represent the client side of a web application. The latest versions of popular browsers (such as Internet Explorer, Mozilla Firefox, Safari, and Opera) support Unicode. Moreover, browsers detect the character encoding automatically by default, so no end-user configuration is required. The best way to tell a browser that a file is UTF-8 encoded is to put the character-set information in the HTTP response header:


Server: Apache-Coyote/1.1
Pragma: No-cache
Cache-Control: no-cache
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-Type: text/html;charset=utf-8
Transfer-Encoding: chunked
Date: Thu, 05 Jul 2007 13:10:43 GMT
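A program that receives such a response can recover the declared encoding from the Content-Type value. The following is a minimal sketch rather than a complete header parser: the class and method names (ContentTypeCharset, charsetOf) are hypothetical, and only the simple charset= parameter form shown above is handled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentTypeCharset {
    // Returns the charset named in a Content-Type value,
    // or the supplied fallback when no charset parameter is present.
    // Charset.forName throws if the name is illegal or unsupported.
    public static Charset charsetOf(String contentType, Charset fallback) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" prefix (8 characters).
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                return Charset.forName(name);
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        System.out.println(charsetOf("text/html;charset=utf-8", latin1)); // UTF-8
        System.out.println(charsetOf("text/html", latin1));               // ISO-8859-1
    }
}
```

The fallback argument makes the "no charset declared" case explicit instead of silently depending on a platform default.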

Web Server Tier. The HTTP response header is defined on the web server tier. Most web servers use the encoding of the operating system, defined in the system property file.encoding. On English systems, this property is usually set to ISO-8859-1 (the Latin-1 charset on Unix-based systems) or Cp1252 (the Windows Latin-1 charset). To ensure UTF-8 support, the file.encoding property has to be redefined at startup. For example, Apache recommends changing the Tomcat startup script (catalina.bat or catalina.sh) to add the switch -Dfile.encoding=UTF-8 to the startup call to the Java executable. This ensures that the HTTP response encoding defaults to UTF-8, though this can be overridden within the Java Servlet code as needed. Moreover, web servers such as Apache 2.0 for Windows NT use UTF-8 for all filename encodings.
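Because the default depends on how the JVM was started, it can be useful to check it at run time and to name the charset explicitly wherever text is turned into bytes. The sketch below assumes nothing beyond the JDK; the class name DefaultCharsetCheck and the helper encodeUtf8 are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetCheck {
    // Encodes text with an explicit charset, so the result
    // does not depend on the file.encoding setting.
    public static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Both values vary with the platform and the JVM startup switches.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        // U+010D (c with caron) takes two bytes in UTF-8.
        System.out.println(encodeUtf8("\u010D").length); // 2
    }
}
```

Running it once with and once without -Dfile.encoding=UTF-8 shows whether the startup switch took effect on a given installation.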

Web servers also serve static hypertext documents (XHTML and HTML files), which should include the following tag at the top of the <head> section:

<meta http-equiv="content-type" content="text/html; charset=utf-8">

In addition, (X)HTML documents can invoke JavaScript files or code blocks. UTF-8 encoding is explicitly defined for a JavaScript block or file by including the charset attribute in the script's start tag:


<script src="scriptFile.js" type="text/javascript"
charset="utf-8"></script>

Application Server Tier. Application servers are programs that handle all application operations between users and an organization's backend business applications or databases. With Java, application servers process compiled Java files and Java Server Pages (JSP) files.

Java files do not require any UTF-8 configuration, whereas JSP files enable UTF-8 encoding with a page directive at the top of the file that includes the pageEncoding and contentType attributes:


<%@ page contentType="text/html;charset=utf-8"
pageEncoding="utf-8" %>

This page directive should be used in all JSP files that are included with the <jsp:include> tag (but not in files pulled in with the <%@ include %> directive). Moreover, if a JSP file contains an (X)HTML <head> tag, it has to include the UTF-8 meta tag:


<meta http-equiv="content-type" content="text/html;
charset=utf-8">

If the sendRedirect() method is used in a JSP to redirect to another JSP file, query string parameters should be encoded with the java.net.URLEncoder.encode() method. Some configuration is also required when submitting (X)HTML forms:


<form action="processData.jsp" method="post"
enctype="multipart/form-data; charset=utf-8">
 ...
</form>

The form above submits its data in UTF-8. However, you will have a problem if you call request.getParameter in the JSP and the parameter contains special characters. This problem is solved with an encoding filter, which lets you specify the character encoding for all input fields in your JSP pages. That way, every request.getParameter call reads the data correctly. Listing One is a Java Servlet-based encoding filter.


package si.unimb.filters;

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class EncodingFilter implements Filter {
   // Default request encoding; the "encoding" init parameter overrides it.
   private String encoding = "utf-8";

   public void destroy() {}

   public void doFilter(ServletRequest req, ServletResponse res,
         FilterChain fc) throws IOException, ServletException {
      // Set the encoding before any request parameter is read.
      req.setCharacterEncoding(encoding);
      fc.doFilter(req, res);
   }

   public void init(FilterConfig fc) throws ServletException {
      String encodingParam = fc.getInitParameter("encoding");
      if (encodingParam != null) {
         encoding = encodingParam;
      }
   }
}

Listing One

The filter must be configured in web.xml (Listing Two) so that it runs on every request.


<filter>
  <filter-name>EncodingFilter</filter-name>
  <filter-class>
    si.unimb.filters.EncodingFilter
  </filter-class>
  <init-param>
    <param-name>encoding</param-name>
    <param-value>UTF-8</param-value>
  </init-param>
</filter>
<filter-mapping>
  <filter-name>EncodingFilter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>

Listing Two
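To see why the filter matters, it helps to look at what happens when UTF-8 bytes are decoded with a different charset. The sketch below uses only the JDK and does not involve a servlet container; it assumes the container would otherwise fall back to ISO-8859-1, and the class name and sample text are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RequestDecodingDemo {
    // Turns raw request bytes into text using the given charset,
    // which is roughly what happens when parameters are parsed.
    public static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        String typed = "I\u0161\u010Di";                      // text entered in the form
        byte[] sent = typed.getBytes(StandardCharsets.UTF_8); // bytes a UTF-8 page submits

        String withoutFilter = decode(sent, StandardCharsets.ISO_8859_1);
        String withFilter = decode(sent, StandardCharsets.UTF_8);

        System.out.println(withoutFilter.equals(typed)); // false
        System.out.println(withFilter.equals(typed));    // true
    }
}
```

Decoding with the wrong charset turns each two-byte character into two unrelated characters, which is the kind of garbling that calling setCharacterEncoding before the first getParameter call is meant to prevent.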

If a JSP file submits a form through JavaScript with method="GET", multilanguage query string parameters should be encoded with the JavaScript encodeURI method. Similarly, all JSP files that use standard HTML hyperlink tags (<a href="">) should encode multilanguage query string parameters with encodeURI. Note that the pageEncoding attribute is supported only by JSP version 1.2 or later.
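On the server side, the java.net.URLEncoder.encode() call mentioned earlier for redirects can be sketched as follows. The helper name withParam, the page name result.jsp, and the parameter q are hypothetical; only URLEncoder itself comes from the JDK.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class RedirectUrlBuilder {
    // Builds a redirect target with one query parameter,
    // percent-encoding the name and value as UTF-8.
    public static String withParam(String page, String name, String value) {
        try {
            return page + "?" + URLEncoder.encode(name, "UTF-8")
                    + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is available on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Prints result.jsp?q=I%C5%A1%C4%8Di
        System.out.println(withParam("result.jsp", "q", "I\u0161\u010Di"));
    }
}
```

In a JSP or servlet, the returned string would be the argument passed to sendRedirect(). Note that URLEncoder writes spaces as plus signs, which is the form-encoding convention for query strings.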

Application servers also process properties files, which are commonly used for dictionary files (message bundles). Properties files do not provide a mechanism for indicating their encoding. They therefore have to be written in a form that Java can interpret correctly as Unicode characters, namely with Unicode escapes. A Unicode escape denotes a Unicode character value and is interpreted by Java as that character. Here is an example of a dictionary file with keys (left), values in the local language (middle), and their escaped representations (right). Extended characters are defined with hexadecimal code points:
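As a hedged illustration of how such escapes are read back (with a hypothetical key and value, not the article's own dictionary file), the sketch below loads one escaped entry through java.util.Properties and checks that Java turns the escapes into the intended characters.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class EscapedBundleDemo {
    // Loads properties-file text; escapes in the text are
    // converted to the characters they stand for.
    public static Properties load(String fileText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(fileText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // The doubled backslashes keep the escapes literal in this Java string,
        // so the text matches what an escaped properties file would contain.
        String fileText = "label.search=I\\u0161\\u010Di\n";
        String value = load(fileText).getProperty("label.search");

        System.out.println(value.length());                   // 4
        System.out.println(value.equals("I\u0161\u010Di"));   // true
    }
}
```

The file itself stays pure ASCII, which is why it survives any platform default encoding.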

The conversion of extended characters to Unicode escapes can be done on the command line with the Java native2ascii converter, which takes an -encoding switch to indicate the encoding of the source file, followed by the name of the source file and the name of the target file:


$ native2ascii -encoding UTF-8 SourceFile TargetFile
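The kind of transformation this performs can be sketched in a few lines of Java. This is an illustration of the idea under simple assumptions, not the tool's actual implementation; the class name AsciiEscaper and method toEscapes are hypothetical.

```java
public class AsciiEscaper {
    // Copies ASCII characters unchanged and replaces every other
    // UTF-16 code unit with a backslash, the letter u, and four hex digits.
    public static String toEscapes(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints the escaped form of a hypothetical dictionary entry.
        System.out.println(toEscapes("label.search=I\u0161\u010Di"));
    }
}
```

Working on UTF-16 code units means a character outside the Basic Multilingual Plane comes out as two escapes (a surrogate pair), which is the form Java expects when it reads the file back.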

Database Tier. Database management systems (DBMSs) require character-set information when a new database or table is created. Several databases, such as Hypersonic (HSQLDB), support UTF-8 by default. Otherwise, the default character set has to be defined explicitly, for example in MySQL's configuration file (my.ini):


default-character-set=utf8

In addition, database drivers usually require extra configuration, for example when connecting to a MySQL database through a Java Database Connectivity (JDBC) driver:


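// Load the MySQL driver (the original listing named the HSQLDB driver,
// which does not match the jdbc:mysql URL below).
Class.forName("com.mysql.jdbc.Driver");
Connection db = DriverManager.getConnection(
   "jdbc:mysql://localhost/myDatabase?useUnicode=true&characterEncoding=utf-8",
   "username", "password");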

Conclusion

Even with native Unicode support in Java, several steps are required to configure a Java-based web application for complete Unicode support.

