Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Design

Unicode and Software Globalization


MAR94: Unicode and Software Globalization

Software-design guidelines for international application development

David is a senior consultant with Cap Gemini America. He specializes in Windows, NT, and OS/2 development. You can contact him on CompuServe at 70323,3510.


Developers have traditionally produced software for the domestic U.S. market, then remarketed it overseas after a few simple modifications. This practice, however, is fast becoming unacceptable as more and more non-U.S. users expect software to "speak" to them in their native tongue and conventions. Furthermore, non-U.S. computer markets are growing rapidly, forcing software developers to begin thinking internationally. Currently, a variety of character standards and proprietary extensions are being used to address application internationalization. In many cases, however, the result is often inefficient code that's difficult to maintain, hard to port, and expensive to produce--all problems the Unicode standard addresses.

The Unicode Standard

Unicode is an international character code standard which replaces single-byte ASCII and ANSI, multibyte ANSI, and nearly every other character-code standard currently defined. Unicode contains a one-to-one mapping of codes for ASCII and the ISO 8859/1 Latin character set used by Microsoft Windows, NextStep, and others. Additionally, Unicode contains a nearly complete superset of all fixed-size and multibyte standards currently in use and provides for symbol sets such as mathematical operators, technical symbols, geometric shapes, and dingbats. The initial release of Unicode, in fact, contained 26,000 characters. More importantly, when producing code, the Unicode character standard provides a global solution for application localization, which solves many of the problems associated with multibyte support. Specifically, it provides the ability to prepare your application code for supporting nearly all languages with little or no modification, at least in terms of data storage. (See the accompanying text box entitled, "The Unicode Consortium.")

The good news is that it's relatively simple to modify most applications to transparently support multiple character standards. I now support Unicode whenever possible in my applications. The degree of difficulty for supporting Unicode depends on which and how many character standards were supported by the original code, what future requirements are expected, and how much support is provided by the underlying graphics engine and operating system.

Microsoft's NT operating system uses Unicode internally to represent all characters and strings. In addition to long filenames and extended attributes, NT supports Unicode characters in filenames. Consequently, it's possible for a single filename to contain a mix of characters from hundreds of modern languages and even a few archaic ones. Fortunately, Unicode prevents our having to switch between character standards to process such filenames. This simplifies many string operations.

For example, in the common problem of splitting a filename off the end of a fully qualified path, you typically start at the rightmost character and traverse back to the first backslash, colon, or the beginning of the string; see Example 1.

For standards which use a variable number of bytes per character (such as Shift-JIS), it's impossible to cycle backwards through a string. You can't determine if any single byte is the first or second byte of a multibyte character or if it's a single-byte character. Consequently, algorithms must be rewritten to start at the left and work toward the right. But with Unicode (or other fixed-size character-code standards), you can traverse backwards, saving you from having to rewrite code for international support.

Implementing Unicode Support

A basic technique for implementing Unicode is transparent-character support. This is a technique I used to develop multiple tape backup applications for NT. The goal of the project was to write and maintain a single, common set of source code such that Unicode would be supported if the underlying operating system provided sufficient support, and any other single-byte character set would be supported otherwise. This support had to be as portable and transparent as possible. Code which must specifically process strings of a known character standard, therefore, had to be isolated to a few specific files and code sections.

In order to achieve this goal, the following conditions had to be met:

  • Transparent characters, macros, and functions had to be used.
  • Operations which expect a specific byte size had to be avoided or encapsulated.
I've discovered a number of nonobvious stumbling blocks in implementing transparent-character support, particularly if you're converting a non-Unicode application to either Unicode or transparent support. Also, you must consider how to program today for a non-Unicode environment, while preparing for an eventual migration to Unicode.

Multibyte character sets may be supported as well, but you must employ the same localization techniques you previously used. Those techniques need not be modified; you just follow the simple rules outlined here. However, localization for multibyte character sets may reduce the efficiency of your code.

Code written following the guidelines in Figure 1 will transparently support Unicode, ASCII, or any other fixed-size character standard for all major languages and cultures with only a minimal localization effort. The basic rule of transparent-character support is that the size of a character code is determined at compile time. If a preprocessor macro named UNICODE is defined in your compiler's command-line arguments and the platform supports Unicode, your application is compiled to support Unicode. Otherwise your application will support any 1-byte character set supported by the platform. You need only to define another macro and modify your header files to support another 2- or 4-byte character standard if your platform provides sufficient support for that standard. If you do not compile for Unicode, the particular character standard supported is determined by the strings contained in your application's resources and the default code page selected by the user. This, of course, presumes that all strings which require language translation are placed in resource files. All modules should be compiled with Unicode explicitly enabled or disabled and you should avoid mixing modules compiled for different character sizes whenever possible.

Programs written to support multiple-character standards transparently require a transparent-character type. Many assumptions typically made about character variables are no longer valid. You must take care to ensure that code is truly transparent and, if multiplatform support is required, portable. The solution provided by transparent-character support is based on never presuming the size of a character. Since a transparent character may be of any fixed size, a new type is used, TCHAR, which represents a character of unknown (but fixed) size. Other special types include:

  • LPTSTR, a pointer to a NULL-terminated transparent character string.
  • LPTCHAR, a pointer to a transparent character or array.
Figure 2 provides definitions of these types for Unicode and ASCII/ANSI.

You can't use the standard C char type to store transparent characters, because char is the smallest atomic type, which is usually a byte. Neither can you use the ANSI wchar_t type, since it's a "wide" character type that will always be at least two bytes in size--too large for ASCII and other single-byte standards.

Consequently, the first steps performed when converting your code for transparent character support are:

  1. Determine which data elements of type char, char *, or any other type derived from those are text (and not binary) data.
  2. Change all of those to use TCHAR, LPTSTR, or LPTCHAR.
These steps can be trivialized if you adapt and rigidly enforce the following coding standard early on in your product's development style: Define a type, such as BYTE, and always use that to store binary data, and never store any nontext data in a TCHAR. In the absence of such a standard, these steps can be tedious and time consuming.

Common Problems with

Pointer Arithmetic

While most character-pointer arithmetic is still valid, a number of widely accepted coding practices will no longer work when using transparent characters. These practices must be avoided or you may experience many wasted hours correcting them.

One of the most common invalid assumptions is that the length of a string plus one is equal to the number of bytes required to store it. This can be a particularly annoying and sometimes difficult problem to deal with when porting existing applications. You must learn to differentiate between string lengths and buffer sizes, or serious programming errors will result. These errors can be very difficult to resolve.

The next steps are usually easy, but can be difficult if a string length is stored for later use or passed to other functions. The steps are:

  • Search for all calls to strlen(), and, if the return value is used to determine a byte size, multiply by sizeof (TCHAR).
  • Search for all memory-allocation calls (malloc, GlobalAlloc, and so on), and change those which allocate space for characters based on a character count to use a byte length.
Array indexing and general pointer arithmetic to determine an offset usually work fine, so changes are not normally required: *(achPtr+nIndex)=chSomeVal;. In this example, the contents of the nIndexth entry of the array, achPtr, would be assigned the correct value for any fixed-size character standard. As a general rule, whenever dealing with operations involving transparent characters and arrays, simply keep in mind that the size of each element may be larger than a byte.

Another common violation is using a character as an index into a 256-element array. Since a Unicode character is two bytes in length, you must either increase the size of the array to 65536, or modify your algorithm. Also, remember that subtracting two pointers yields the number of bytes between those pointers, not the number of characters. Always multiply by sizeof (TCHAR) when calculating the byte size of a transparent character buffer. Likewise, always divide by sizeof (TCHAR) when calculating the number of transparent characters which can fit in a specified number of bytes.

Wide-character Functions

Just as Unicode and transparent characters required new types, a set of new functions is needed to process them. It's critical that, for any function employed to process single-byte characters and strings, you have an equivalent wide-character function. Windows NT and the Microsoft C libraries provide a good starting set of routines. For the Microsoft C library, all wide-character string routines utilize one of two simple naming conventions to distinguish them from standard ASCII variants. In the first convention, as specified by the ANSI C standard, all ASCII string functions beginning with str have wide-character equivalents which begin with wcs. For example, the wide-character version of strlen is wcslen, which, in both cases returns the number of nonzero characters in a string.

The other naming convention used is to identify wide-character functions by either prefixing or embedding a w in the function name. For example, the wide version of printf is wprintf, while vswprintf is the wide version of vsprintf. The header files included with the Microsoft C compiler for NT provide a complete list of these functions. The Microsoft Win32 API uses a different naming convention to distinguish ASCII and wide-character variants of the API procedures. All ASCII/ANSI versions of the Win32 API end with the letter "A." Similarly, all Unicode (or wide) versions end with a "W." Therefore, CreateWindowA expects 1-byte characters, while CreateWindowW expects Unicode characters. This is an excellent naming convention in that it is easy both to remember the name of any character-specific routine, and to identify all character-specific (that is, nontransparent) code at a glance.

Transparent-character Macros

What makes the Win32 naming convention particularly appealing is that, with only a few exceptions, whenever you use a Win32 API procedure without an A or a W suffix--that is, when you use the same name as the one used for Windows 3 applications--you are, by default, supporting transparent characters. Consequently, code modifications to Windows API procedures usually aren't required to globalize an existing application. The exceptions are those Win32 API file procedures considered obsolete but provided for backwards compatibility to Windows 3; OpenFile and _lopen, for example. Consequently, you must use the Win32 CreateFile family of functions to support Unicode filenames. For C library functions, however, things aren't so easy. The Win32 naming convention replaces wcs with tcs and the prefixed or embedded w with t to create transparent equivalents. To convert your code for transparent-character support, therefore, you must replace all calls to these string functions to use the transparent name instead. This usually entails a great deal of work.

There is an alternate method which requires less work for most existing applications. Instead of modifying the source code, you create a header file, which maps the standard string function names to the wide-character equivalents when compiling for Unicode, as in Example 2. By doing this for each standard string function used in your code, you can often reduce the number of changes required. However, this technique effectively removes all of the standard character functions when you compile for Unicode. If you need to process ASCII data, you'll have to write your own ASCII-specific replacements. For each of these functions--including all those which use filenames, atoi(), and some others--you must implement the wide version yourself, or avoid using them.

Mixing and ConvertingCharacter Standards

There are times when code must be written to support multiple character standards, either for backwards compatibility or for transferring data between platforms. Care must be taken when mixing transparent coding techniques with character-specific data types, or the resulting code will be very difficult to maintain.

Code that supports a specific character type is not transparent and should be isolated from transparent code, preferably by placing the code in a separate function. One method of handling this is to create two versions of your function--one that accepts Unicode strings and another for ASCII strings--along with a transparent mapping macro, as described earlier. An alternate method is to write a single function that takes and returns transparent characters. On entry to this function, the TCHAR parameters are converted to the required character standard and processed. Any resulting characters are then converted back to TCHAR, and the function returns; see Example 3.

Also, notice how the routines use the standard ANSI C functions wcstombs and mbstowcs, which convert single- or multibyte strings to wide-character strings and vice versa, using the current code page for the multibyte string. Keep in mind that a Unicode string converted using wcstombs may contain multibyte characters, which can complicate processing.

Reading and WritingUnicode Text Files

Unicode contains a special character code, called the "byte-order mark," which has a value of 0xfeff and is used to identify a text file or data stream as a collection of Unicode characters. It is important to use this mark whenever reading or writing text files. Additionally, this mark provides the information necessary to transfer data between Big- and Little-endian systems. Whenever a Unicode text file is created, or a stream of Unicode text is passed to another application or system, it is important to ensure that the first character written is the byte-order mark. Likewise, whenever you open a text file, or receive a stream of text, you should check for this mark. Then, you will need to convert the text as it is read to transparent characters before your application can use it. For this reason, you should encapsulate all file operations in a simple API.

The byte-order mark value was chosen because the possibility of 0xfeff occurring in the first two bytes of any non-Unicode text stream is highly unlikely. Also, its mirror image, 0xfffe, which is not a Unicode character, provides useful information for transferring data between Big- and Little-endian systems, since their byte orders are reversed. Therefore, if a text stream starts with 0xfffe, there is a good chance that it contains byte-swapped Unicode text, so you will need to reverse the byte order of each character as it is read.

The Future of Unicode

The Unicode standard is still new and not yet complete. The few issues that remain to be resolved will not significantly affect the development of transparent code. However, another character standard which defines a superset of the Unicode standard is under development--ISO 10646. Fortunately, applications developed to meet the guidelines presented in this article will require little or no modification to support this standard as well. ISO 10646 comes in two flavors: two-byte characters, which are essentially the same as Unicode, and 4-byte characters, which will eventually represent every character used in every known language. However, it's doubtful that ISO 10646 will ever become a common standard; it will likely be limited to special-purpose systems.

Unicode will probably become the predominant standard for all future systems. Unicode systems are currently under development by Microsoft, Apple, Novell, Next, Taligent, Metaphor, and others. By using transparent-character support in your applications, you will be able to exploit the benefits of these systems as they become available and quickly enter new language markets with a minimum of effort, while maintaining compatibility with current systems and standards.

Example 1: Splitting off the filename from a pathname.

for (pszTemp=szFName+strlen(szFName)-1;
     pszTemp>szFName && *pszTemp != '\\' &&  *pszTemp !=':';
     --szTemp);

Figure 1: Guidelines for transparent-character support.

1. Place all strings that require language translation in your application's resources.
2. Declare all text (not binary) data using TCHAR, LPTSTR, LPTCHAR.
3. Use transparent functions and macros for all TCHAR-type data.
4. Avoid using ASCII string functions which have no wide-character equivalent.
5. Check calls to <I>strlen() </I>and multiply by <I>sizeof(TCHAR) </I>if result is used as a byte size.
6. Ensure all memory allocations for strings are based on byte size, not character count.
7. Check all transparent-character pointer arithmetic for validity; avoid using characters as indexes to 256-element arrays; remember that subtracting two character pointers yields a byte size, not a character count; always divide or multiply by <I>sizeof(TCHAR) </I>to convert between character counts and byte sizes.
8. Isolate operations on specific character-code standards from transparent code.
9. Use <I>wcstombs() </I>and <I>mbstowcs() </I>to convert between Unicode and ASCII/ANSI-specific character codes.
10. Use the byte-order mark whenever processing text files and data streams.
11. Define UNICODE macro when compiling transparent modules for Unicode support.
12. Avoid mixing modules compiled to support different default character-code standards.

The Unicode Consortium

The Unicode Consortium is responsible for defining and promoting the Unicode character standard. Consortium members include all the major international computer systems providers: Adobe, Apple, Borland, DEC, Go, Hewlett Packard, IBM, Lotus, Microsoft, Next, Novell, Sun, Symantec, Taligent, Unisys, WordPerfect, Xerox, and others.

Unicode derives its name from its three main characteristics: universal--it addresses the needs of world languages; uniform--it uses fixed-width codes for efficiency; and unique--duplication of character codes is minimized.

The Unicode Consortium was formally incorporated in 1991 as a joint effort between a number of the major computer software and hardware companies. However, the Unicode standard first began at Xerox in 1985 when Huan-mei Liao, Nelson Ng, Dave Opstad, and Lee Collins began working on a database to map the relationships between identical Japanese and Chinese characters. This lead to a technique called "Han Unification," which Unicode uses to minimize the codes it requires.

At the same time, discussions on the development of a universal character set got underway at Apple. In September of 1987, Joe Becker of the Xerox group and Mark Davis of Apple began discussions on multilingual issues with a new type of character encoding as a major topic. Later that year, Becker coined the term "Unicode."

The following year saw Collins working at Apple on Davis's new character-encoding proposals. By February Collins had incorporated the basic architecture for Unicode, and Becker presented it to a /usr/group international subcommittee meeting in Dallas. Apple decided to incorporate Unicode support into TrueType.

In February of 1989, bimonthly meetings to discuss Unicode began between these companies and Sun, Adobe, Claris, and the Pacific Rim Connections joined in. In addition, Glenn Wright of Sun began the [email protected] on Internet for Unicode discussions. These discussions lead to the decision to incorporate all composite characters in existing ISO standards. In 1990, there was a flurry of activity, culminating in December, when Becker presented Unicode at a UNIX International meeting. Cooperating with Apple on TrueType, Microsoft became interested in Unicode and assigned Asmus Freytag to attend meetings. By March all non-Han work had been completed, and work began on cross mappings and order. Meanwhile, Wright and Mike Kernaghan of Metaphor started the process of incorporating the Unicode Consortium, and the work continues today.

Much effort has been expended to meet the demands of each language represented by Unicode and each company and country involved in its development. International politics have often played a significant part in the definition of this standard. The result is a highly consistent and easily usable standard which will be around for a long time.

For more information, or to become a member of the Unicode Consortium, contact: Unicode Consortium Inc., 1965 Charleston Road, Mountain View, CA 94043. (Phone 415-961-4189, fax 415-966-1637, or Internet [email protected].)

--D.V.C.

Figure 2: Type definitions for transparent characters.

#ifdef   UNICODE             //  TCHAR's are Unicode
typedef   wchar_t     TCHAR;
typedef   wchar_t  * LPTSTR;
typedef   wchar_t  * LPTCHAR;
#else                       // TCHARS are ASCII/ANSI
typedef   char        TCHAR;
typedef   char     * LPTSTR;
typedef   char     * LPTCHAR;
#endif

Example 2: Portion of header file which maps the standard string function names to the wide-character equivalents.

#ifdef     UNICODE
#define     strlen     wcslen
#define     sprintf     swprintf
#endif

Example 3: Converting between transparent and ASCII characters.

VOID MyFunction ( LPTSTR pszXparent, int cchMaxBuf )
{
    char * pszMbyte;
    /* convert transparent parameter to ASCII if necessary */
#ifdef UNICODE
    /* allocate a buffer for the ASCII conversion of pszXparent */
    pszMbyte = malloc ( cchMaxBuf * sizeof (TCHAR) );
    _wcstombs (pszXparent, psAscii, cchMaxBuf * sizeof (TCHAR) );
#else
    /* string is already ASCII/ANSI so just use it directly */
    pszMbyte = pszXparent;
#endif
    /* perform ASCII-specific operations.... */
        . . .
    /* then convert the result back to a tranparent string... */
#ifdef UNICODE
    mbstowcs ( pszMbyte, pszXparent, cchMaxBuf );
    free ( pszMbyte );
#endif
    return;
}


Copyright © 1994, Dr. Dobb's Journal


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.