From Multibyte Coding System character set nightmares to tricky Western SQL database support, from GUIs that assume Latin font metrics to poorly planned parallel development, myriad mishaps can occur in what looks like a simple localization exercise.
You've probably heard horror stories about how hard it is to modify software to work in Japanese or other Asian languages. On the other hand, perhaps you've encountered claims that such a process is a simple matter of extracting strings and translating them. In fact, there are many potential procedural and technical pitfalls in building international software, and there is no substitute for thinking carefully about the problem and formulating a plan of attack.
How Not To Internationalize
Here's a cautionary tale about a company (we'll call it Acme Products) that did not take an organized approach to internationalization.
The trouble began when the company decided to enter the Asian marketplace. Acme hired a new vice president for Asia, Clive, who in turn hired a new sales and marketing staff. Clive's team members spent all their time and energy firming up distribution relationships in Japan. They made no attempt to verify that the engineering was being done to provide them a product to deliver in Japan. It wasn't long before they had closed the first big deal.
Clive arrived at the corporate office about a month before the next major release of Acme's product was due to be deployed, just after feature freeze. Not only was making changes to the new release unthinkable, but all of the developers were already working flat-out on the release. They had no time to discuss a Japanese version of the previous release, let alone build one. Clive had no choice but to bring in new people to build the Japanese version. With no access to the over-occupied original architects or developers, they used the previous release as a black box in their attempt to fulfill their mandate: "Make it work in Japan very quickly."
Clive wasn't a technical person. As far as he knew, this was a "localization" problem: all the English strings in the product needed to be in Japanese. No big deal.
Given that direction, the team set out to do a strict localization: Find the strings and translate them, and nothing else. They didn't anticipate what, if any, problems the code might exhibit when faced with Japanese text, and they didn't consider any new features.
Acme's development manager, Sam, didn't want to hear about this project. It had come out of nowhere, with a budget that was not under his control. He told his people to make a source snapshot of the product and "throw it over the wall" to the "outsiders" on the Japanese project. Sam's job was done.
Acme's programming shop was not accustomed to parallel software development on any scale. The release manager, Rhonda, was in charge of source control in conjunction with her release group. They used a source-management system with a single line of development from release to release and handled patches somewhat informally. They weren't prepared to handle an outside team making significant changes to the code in parallel with their ongoing development and maintenance process. To set up a real parallel development branch would have been quite costly the first time, especially since the Japanese project was off-site. Since this was neither in the budget nor the plan, no one in Rhonda's group wanted to hear about it. Rhonda ignored the Japanese project.
The intrepid Japanese development team got down to work anyway. Their main problem was strings, or so they thought. They had heard somewhere that message catalogs or resources were the "right" way to handle these, but catalogs or resources looked like too much work for the time they had allocated. The code was in C and had constructs such as:

    static char *strx = "A typical string: our Hovercraft is Full of Eels";
The compiler wouldn't let them replace that string with a function call, because a C static initializer must be a constant expression. Other strings were in performance-sensitive spots where the cost of a function call would be deadly. They simply didn't have time to deal with the changes required, so they made a fateful decision: They built scripts to replace the strings in-line, creating a mutant version of the code base with all the English replaced by Japanese.
In order to put Japanese strings into the source, they had to get some Japanese strings. That wasn't too hard. Plenty of folks "know Japanese." The team didn't consider whether those people were qualified to translate technical terms. After all, they were in a hurry, and one team member's boyfriend could knock off the work in the evenings in no time flat.
Once the translation was done, the developers stuck it into the code without further thought. No one edited it.
The Acme Japanese development team started having problems with testing. To begin with, Acme had few, if any, formalized testing procedures. There was a quality assurance team, and the members of that team had a body of folk wisdom for their release testing. There were some automated scripts, but these all assumed English text in the user interface. Plus, the scripts were implemented in a testing tool that didn't support Japanese. The team struggled with the difficulty of obtaining, installing and maintaining localized Japanese versions of the operating systems for testing.
Finally, they delivered the first beta version to Japan. Soon after, the defect reports began to arrive. The team had expected some minor visual blemishes and idiomatic inaccuracies. They hadn't expected the following:
Storage corruption. Pointers and other data were corrupted due to buffer overflow.
Ungainly dialog boxes. Acme's product had a Windows GUI, which, like many GUIs, had a population of dialog boxes. Some were carefully designed to fit into 800x600 pixels. Imagine the team's surprise when they discovered that the dialog boxes didn't fit into 800x600 on Japanese Windows, even without any Japanese text!
Mysterious database errors. The application worked with a SQL database. Down in the depths of the product, the code set was specified by an obscure parameter that determined the character encoding for communications with the database. In this product, the parameter depended on a default setting that didn't support Japanese.
The Japanese team had little database expertise, and Acme's database experts were far too busy with the new release to be bothered. Not only did the team not know about the environmental parameter that controlled the character encoding used on the client, they also didn't know that there was a critical parameter that had to be set when the database was installed and configured. As a result, they went along using a "Western European" database and could not understand the strange misbehaviors and errors that resulted when they tried to store Japanese text in some fields. Furthermore, they just didn't understand why fields that were always long enough in English weren't long enough in Japanese.
Poor translation. The translators weren't qualified, and no one had checked their work.
Missing features. The product collected first and last names, but the developers had not made provisions for the pronounceable versions of Japanese names.
Against the odds, the team eventually delivered a product. It was late, it was missing critical features, and it had bugs. (Other than that, it was fine.) When mediocre sales followed, the top managers all blamed each other. Clive, the Asia vice president, blamed Sam and Rhonda in development and release management for failing to support his effort, while Sam and Rhonda felt that the mediocre sales proved what they had been saying all along: there is no real money to be made in Asia.
The Japanese development team tried to merge their work into the source for the next release, but without a code branch in a source management tree (a basic capability of a good source management tool), this was impossible. In any case, the in-line Japanese strings couldn't be checked in as replacements for the English strings, and the Japanese version was now in permanent limbo. Japanese customers who encountered defects had to wait for someone to manually port the fix from the main English code base to the Japanese version.
In spite of the difficulties, Clive did not give up. He doggedly scoured for sales opportunities. And he found them. However, he had a huge problem: the new customers required features from the new release. And some of them were in China instead of Japan. Once again, the weary team suited up, resigned to starting the entire process over again. And here we leave them, slogging along and cursing their fate, the perpetual outsiders. The developers hate them, and they aren't too popular with Clive either because of the product's poor performance. Who would want to work on a team like this one?
What Went Wrong?
The Acme project suffered from a series of problems, and each problem fed the next. That doesn't mean you are safe if you repeat only part of their example; any one of these mistakes can get you into big trouble. To avoid internationalization problems, consider your organizational structure, process and technical requirements. Here are some of the specific mistakes in the Acme story:
Bad organizational structure. You can't get a good result from internationalization if you don't treat it as part of your core business. If you ghettoize your international initiatives, they will suffer from the lack of communication, coordination and buy-in. It's a good idea to have a separate marketing and sales vice president focused on the specific regional and cultural issues.
Failure to coordinate. Because the internationalization process wasn't integrated into the overall development plan, the Asian requirements, which merited the initial attention of the architects and developers, weren't included in the release.
Failure to understand the requirements. As with many other foreign markets, Japan has unique needs. For example, Acme's product required support for pronounceable Japanese strings, called furigana. These must, in some cases, be displayed atop a string in a typesetting convention called ruby text. The product also needed support for the Input Method Editor that allows the input of ideographic characters, as well as support for the Japanese rules for breaking lines. Since no one took the time to analyze these potential requirements, they were all left out. The first omission was disastrous: Without furigana, Acme's customer service representatives couldn't pronounce the customers' names.
Failure to appreciate the technical challenges. Even taking English software to Europe can yield snags. Acme had a much harder problem in aiming for Japan. Here are a few of the biggest obstacles they slammed into:
Their GUIs embodied English grammar and needed significant rearrangement.
The GUIs assumed Latin font metrics and thus didn't fit on the screen in Asia.
In Asian languages, text is represented in the Multibyte Coding System (MBCS). Acme's code used some of the common clichés for strings that corrupt MBCS text. See the sidebar for some gory details.
Acme's product used a database and foundered on the complexities of the SQL database's international support.
The code sorted strings by pure numerical order.
The code used third-party components that didn't work for international text.
The code included latent defects, especially storage defects, that are common to C or C++ code. Without excellent test coverage or a good storage defect detection product, these problems lie low. Heaps are surprisingly tolerant of some mistakes, but changing the lengths of strings is a good way to change the allocation pattern and convert latent defects into blatant defects.
In short, there is no such thing as "localizing" a body of code that has never been internationalized before. This is always an internationalization project. It might be a small one or a large one, but it is never absent. Once the code has been internationalized and the changes integrated into the code base, then there is the possibility of pure localization to adapt it to additional countries.
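To make one of those string clichés concrete, here is a minimal sketch. It assumes the text is encoded in Shift-JIS, one common Japanese multibyte encoding (the Acme story does not say which encoding was involved), and the helper names are hypothetical. In Shift-JIS, the second byte of a two-byte character can have the same value as an ASCII backslash, so a byte-by-byte search for a path separator can land in the middle of a character. A scan that steps over whole characters avoids that.

```c
#include <assert.h>
#include <stddef.h>

/* Shift-JIS lead bytes fall in the ranges 0x81-0x9F and 0xE0-0xFC. */
static int sjis_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/*
 * Hypothetical helper: return a pointer to the last backslash that is a
 * real path separator, or NULL if there is none. A trail byte that merely
 * has the value 0x5C is skipped along with its lead byte.
 */
const char *sjis_last_backslash(const char *s)
{
    const char *last = NULL;

    assert(s != NULL);
    while (*s != '\0') {
        if (sjis_is_lead_byte((unsigned char)*s) && s[1] != '\0') {
            s += 2;              /* step over the whole two-byte character */
        } else {
            if (*s == '\\')
                last = s;        /* a genuine single-byte backslash */
            s += 1;
        }
    }
    return last;
}
```

Given the bytes for C:\ followed by a two-byte character whose trail byte is 0x5C and then a.txt, a plain byte search would report the trail byte as the last separator; the function above reports the backslash after the drive letter instead.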
Failure to account for parallel development. The team made no allowances for proper source management during parallel development. Internationalization projects are textbook examples of the importance of source management. They involve many small changes to a large body of code, and they must often occur simultaneously with other development efforts.
Weak translation and editing. Acme is something of an extreme case. Many companies setting out to localize at least manage to hire a professional translator. However, even a professional translator does not guarantee success. What was missing from this picture? Editing. To speak in bald economic terms, translators are often paid by the word. They have an incentive to rush. And, like the rest of us, they make mistakes. Translators are a species of writer, and any good writer knows the adage, "He who proofs his own copy has a fool for an editor."
Leaving strings in place. The quickest way to localize code is to take the source and replace the English strings with strings in another language. It may be quick, but it essentially dooms any attempt to maintain a single source that works in multiple countries.
Testing. Acme's problems with testing were like Acme's problems with source management. A relatively informal development process that worked "well enough" for one country didn't work well at all for a multilingual product. Acme wasn't prepared to have international testers show up and do productive work.
A Spoonful of Process
To succeed with internationalization, modify your development process to accommodate the effort. You'll find that internationalization turns up at all phases of the project. This may sound daunting, but keep in mind that a small amount of effort and attention early on will save you a lot of work down the line.
Successful internationalization starts with good communication among all the participants. If there is a specific international sales and marketing team, its management should establish strong lines of communication to the development management team. The international team has to understand the development process, including the flow of requirements into design and the schedule for releases. A successful international initiative must be launched in the entire company, not just in an international group, subsidiary or division. It has to be sold to everyone, and the budget has to make realistic provisions for everyone's efforts.
Ideally, internationalization should be the responsibility of the core development group. They know the code best. If international support is a full-fledged requirement for a normal release, then the developers can ensure that the required changes are integrated into the architecture. This is an important part of giving the entire enterprise an international outlook. Even if the core developers can't do the work, it is a good idea for their managers to have some oversight over the internationalization developers.
If, for reasons of budget or schedule, internationalization has to live outside of the core development organization, it is especially important to foster strong communications between the international sales and marketing team, the international development team and the core development team.
If you want your code to support international deployment, say so at the very beginning. It is much easier to build U.S.-only code than international code. Schedule-pressed developers will never allow for international support if you don't make it part of the requirements. Don't listen if someone tells you that international support costs nothing as long as you take it into account at the outset. That's not true. It will cost something. Keeping it in mind from the outset, however, will make it cost much less.
Internal requirements take two major forms: the language-neutral requirements and the specific target requirements. Examples of language-neutral requirements are:
The code shall support localization with few or no modifications to executable code. This includes dialog layouts.
The code shall operate properly with MBCS international text (or with Unicode, another way of dealing with MBCS languages).
The code shall operate correctly with text in several different languages.
The code shall use locale-sensitive sorting, currency formatting and date formatting.
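As an illustration of the sorting requirement, here is a minimal sketch using the standard C library. It compares strings with strcoll(), which follows the collation rules of the current LC_COLLATE locale, instead of strcmp(), which compares raw byte values. The function names are hypothetical, and the resulting order depends entirely on which locales are installed on the system.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: compare two strings under the current collation rules. */
static int compare_collated(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcoll(*sa, *sb);
}

/* Hypothetical helper: sort an array of strings in locale collation order. */
void sort_for_display(const char **items, size_t count)
{
    assert(items != NULL);
    qsort(items, count, sizeof items[0], compare_collated);
}
```

A program would typically call setlocale(LC_COLLATE, "") once at startup to pick up the user's environment before sorting. In the default "C" locale, strcoll() behaves like strcmp(), so uppercase letters sort ahead of lowercase ones.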
The second flavor is specific requirements for specific target markets. Here are some examples:
Japanese furigana, as previously described.
Chinese elaborate numbers. In addition to Latin digits, the Chinese script includes a set of elaborate ideographic characters for numbers. These are used on financial documents in much the same way the spelled-out numbers are used on checks in English.
Line breaking. For each Asian language, there is a set of rules for inserting line breaks into text. These have nothing to do with word boundaries.
Dates and calendars. Japan, Korea and the Muslim world all have alternative calendars that are in common use. Depending on the context in which you are presenting or soliciting a date, you may have to work with an alternative calendar.
Numbers that identify people. The U.S. has nnn-nn-nnnn Social Security numbers. Other countries have a wide variety of identification schemes with varying formats.
In an ideal world, your development team would already include a few international aces, thus obviating the need for a separate group of experts. But perhaps your company has international developers in a special group or prefers to outsource the work. Acme took that difficulty and exacerbated it even further. To work successfully with parallel development teams, you need good communication and strong source control.
Internationalization projects often involve outsourcing. Specialized expertise is needed, often in conjunction with hurry-up schedules. Some consultants will educate your core developers as part of the process. Some wont. Stick with the first kind.
Once your code is internationalized, it has to be localized. That is the process of producing translations of strings and other materials for each target. Unless you work for an enterprise that is so vast that it can afford to have a staff of professional translators, use an outside vendor for localization. Your international developers (in-house or outsourced) have the job of making sure that the necessary materials are easily identified, handed to the localization vendor and reintegrated after translation.
Once the requirements are set, the next step is architecture. There are entire books that discuss the various alternatives (Ken Lunde's CJKV Information Processing, O'Reilly & Associates, 1999, is one), so I'll restrict myself to a single example. The most basic question is where to put the strings. In many cases, there are three alternatives: Windows resource files or Java resource bundles, some other sort of message catalog file or a database.
When deciding among these, it's important to question whether you need to access strings from more than one language at a time. If you are implementing a Web server, for example, you may need to grab the strings that apply to a particular client, since different clients use different languages. Some message catalog systems enforce a single, static language selection. One of those systems would be a very poor choice in a Web server. On some operating systems, the run-time library has a single language setting (the locale) for an entire process, and there is no thread-safe way to change it for the life of the process. On such a system, you must either ensure that a particular process handles only requests for a particular language, or you have to substitute a thread-safe mechanism for the system mechanism.
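One possible shape for such a thread-safe substitute is sketched below. The idea is to pass the language explicitly with every lookup, so nothing depends on process-wide state. All of the names, the two sample catalogs and the fallback-to-English rule are assumptions made for illustration, and the sketch assumes the source file is saved as UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum message_id { MSG_GREETING, MSG_FAREWELL, MSG_COUNT };

struct catalog {
    const char *language;          /* language tag, such as "en" or "ja" */
    const char *text[MSG_COUNT];   /* NULL marks an untranslated message */
};

/* Read-only tables, so concurrent lookups need no locking. */
static const struct catalog catalogs[] = {
    { "en", { "Hello", "Goodbye" } },
    { "ja", { "こんにちは", NULL } },
};

/*
 * Hypothetical lookup: the caller names the language on every call.
 * Unknown languages and untranslated messages fall back to English.
 */
const char *lookup_message(const char *language, enum message_id id)
{
    size_t i;

    assert(language != NULL);
    assert((int)id >= 0 && (int)id < MSG_COUNT);
    for (i = 0; i < sizeof catalogs / sizeof catalogs[0]; i++) {
        if (strcmp(catalogs[i].language, language) == 0 &&
            catalogs[i].text[id] != NULL)
            return catalogs[i].text[id];
    }
    return catalogs[0].text[id];   /* English is the first catalog */
}
```

A Web server built this way would take the language from each request and call lookup_message("ja", MSG_GREETING) for one client and lookup_message("en", MSG_GREETING) for the next, even on different threads at the same moment.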
Another important consideration is performance. If you use a database to store strings, beware of introducing a database query latency into a performance-sensitive code path. One good strategy is to store a time stamp in the database that records the last time that the strings were updated. Clients can then maintain a local cache of strings and only retrieve them from the database when they change.
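The time-stamp strategy can be sketched roughly as follows. The struct fake_db here is an in-memory stand-in for the real database, and every name is hypothetical; a real client would replace db_updated_at() and db_fetch() with actual queries. The point is that the hot path (cache_get) touches the database only on a cache miss, while the cheap time-stamp check (cache_refresh) runs elsewhere, for example on a timer.

```c
#include <assert.h>
#include <stddef.h>

#define STRING_COUNT 2

/* In-memory stand-in for the string database. */
struct fake_db {
    long updated_at;                  /* bumped whenever any string changes */
    const char *text[STRING_COUNT];
    int fetches;                      /* counts simulated round trips */
};

static long db_updated_at(const struct fake_db *db)
{
    return db->updated_at;
}

static const char *db_fetch(struct fake_db *db, int id)
{
    db->fetches++;
    return db->text[id];
}

struct string_cache {
    long loaded_at;                   /* database time stamp the cache reflects */
    const char *text[STRING_COUNT];   /* NULL means "not cached yet" */
};

void cache_init(struct string_cache *cache)
{
    int i;

    assert(cache != NULL);
    cache->loaded_at = -1;
    for (i = 0; i < STRING_COUNT; i++)
        cache->text[i] = NULL;
}

/* Run off the hot path: drop cached strings if the database has changed. */
void cache_refresh(struct string_cache *cache, const struct fake_db *db)
{
    int i;
    long stamp;

    assert(cache != NULL && db != NULL);
    stamp = db_updated_at(db);
    if (stamp != cache->loaded_at) {
        for (i = 0; i < STRING_COUNT; i++)
            cache->text[i] = NULL;
        cache->loaded_at = stamp;
    }
}

/* Hot path: goes to the database only when the string is not cached. */
const char *cache_get(struct string_cache *cache, struct fake_db *db, int id)
{
    assert(cache != NULL && db != NULL);
    assert(id >= 0 && id < STRING_COUNT);
    if (cache->text[id] == NULL)
        cache->text[id] = db_fetch(db, id);
    return cache->text[id];
}
```

The sketch glosses over string ownership and locking; a production cache would copy the fetched text and protect the table if several threads share it.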
Internationalization first shows up in the schedule as the features chosen in the architecture are developed. So far, we have simply added more development tasks to the schedule. Later on, things get more complicated, and many of these things are the actual localization of the code.
Translation and localization take time. Before you can start translating, you have to have a set of strings to translate. You have to freeze the strings. If you can't freeze them altogether, you have to start to track the changes to strings, so that you can send incremental jobs to the translator(s) to keep up with developments. One way to expedite translation is to start with a glossary. Extract a list of important words and phrases from your code and documentation. Send it off to your best (and perhaps, most expensive) translator. Have it reviewed by an editor. When the time comes to send the bulk of the text off for translation, have the translator work from the glossary.
Once you send strings off to the translators, it takes them some time to complete the work. In the meantime, you may be wondering how well all your internationalization changes have worked. "Gee," you may think, "too bad we can't start testing this stuff yet." Well, you can. You can pseudo-translate to find and flush out many defects before the translations are available.
In pseudo-translation, you pretend to localize the product. You take the original translatable materials, and you decorate each string with (for example) a few Chinese characters. Then you test the product, looking to make sure that the Chinese characters appear, correctly, in all the right places.
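A pseudo-translation pass can be as simple as the following sketch, which wraps each string in a pair of Chinese characters plus brackets. The choice of markers, the UTF-8 encoding and the function name are all assumptions for illustration; a real tool would run over your resource files or message catalogs instead of individual C strings.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* UTF-8 bytes for two Chinese characters (U+4E2D and U+6587) as markers. */
#define PSEUDO_PREFIX "\xE4\xB8\xAD["
#define PSEUDO_SUFFIX "]\xE6\x96\x87"

/*
 * Hypothetical helper: write a decorated copy of source into out.
 * Returns 0 on success, or -1 if out_size is too small for the result.
 */
int pseudo_translate(const char *source, char *out, size_t out_size)
{
    int written;

    assert(source != NULL && out != NULL);
    written = snprintf(out, out_size, "%s%s%s",
                       PSEUDO_PREFIX, source, PSEUDO_SUFFIX);
    return (written >= 0 && (size_t)written < out_size) ? 0 : -1;
}
```

Because the original text stays readable between the markers, testers can still navigate the product. Any string that shows up on screen without the markers was never extracted, and any marker that appears mangled points at code that mishandles multibyte text. The decorated strings are also longer than the originals, which helps expose fixed-size buffers and tight dialog layouts.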
Of course, you must leave room in the schedule for late translation fixes. In spite of the best efforts of translators and editors, mistakes and misinterpretations will turn up in the late stages of testing. Allow time to send them off for corrections.
A Business Imperative
Now that you know the truth about internationalization (it isn't easy, it isn't simply a localization project, and you can't accomplish it on a shoestring), you can focus on the rewards of successfully translating your software for use in other countries. International Data Corporation estimates that worldwide business-to-business e-commerce will grow to $30 billion by 2001, while by 2002, non-English speakers will make up more than 50 percent of the world's online population.
With more than half of the world's Internet users predicted to be non-native English speakers by 2002, going global is not merely a business advantage in the 21st century; it is a business imperative.
Using Unicode to get you there is the most efficient and promising way to ensure your worldwide engineering process is effective, affordable and rewarding.