Bookmarks | Tackling a Daunting Task (Web Techniques, Sep 2000)


Unicode: A Primer
By Tony Graham
M&T Books, 2000, 475pp.
$24.99

Unicode is one such topic. At first glance, it's deceptively simple. Like ASCII, Unicode is a specification for character encodings. Fundamentally, this means that for every character, Unicode provides a corresponding number. The number 65, for example, represents a capital A, 66 represents a capital B, and so on.
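The number-to-character mapping can be seen directly in most modern languages. The following is a minimal sketch in Python, chosen here only for illustration (it is not taken from the book); the helper name `code_points` is hypothetical.

```python
# Minimal sketch: look up the number (code point) assigned to each character.
# `code_points` is a hypothetical helper name used only for this illustration.

def code_points(text):
    """Return the list of Unicode code points for the characters in text."""
    return [ord(ch) for ch in text]

print(code_points("AB"))    # the numbers behind capital A and capital B
print(chr(65), chr(66))     # and back from numbers to characters
```

For the characters that ASCII also defines, the numbers are the same in both, which is part of what makes Unicode look so simple at first glance.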

Unlike ASCII, Unicode is meant to be a universal character set, addressing languages that do not use the Latin alphabet, such as many Asian languages. Unfortunately, this isn't as easy as it may seem. Supporting all written languages requires more than maintaining a mapping between a number and a character. For example, some languages are written from right to left, rather than left to right. The Unicode standard must somehow incorporate this and a lot of other information. On top of all this, Unicode attempts to be backward-compatible with many of the existing character sets for different languages.
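One piece of that extra information is the writing direction attached to each character. As a rough sketch, Python's standard `unicodedata` module exposes this property; the helper name `direction_class` and the sample characters are assumptions made for this illustration, not examples from the book.

```python
import unicodedata

def direction_class(ch):
    """Return the Unicode bidirectional category of a single character."""
    return unicodedata.bidirectional(ch)

# Sample characters chosen for illustration only.
samples = {
    "A": "Latin capital A",
    "\u05d0": "Hebrew alef",
    "\u0627": "Arabic alef",
}

for ch, label in samples.items():
    # "L" is left-to-right, "R" is right-to-left, "AL" is right-to-left Arabic.
    print(f"U+{ord(ch):04X} {label}: {direction_class(ch)}")
```

Display software can use these categories to decide the order in which a run of characters is drawn.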

While Unicode is an extremely important subject in this age of globalization, books on Unicode are few and far between. For this reason alone, Tony Graham deserves applause for his ambitiously titled Unicode: A Primer.

Text in Context

One of the book's strengths is the context Graham sets for why Unicode is important and how it's relevant. One of the obvious difficulties of developing software or Web sites in multiple languages is translation. As it turns out, however, translation tends to be the least of developers' problems.

Maintaining software text in different languages generally means extracting all of the text from the source code into separate files, and letting translators maintain the different texts. For this to work properly, the software must know how to handle these different languages internally. And that's where the real difficulty begins.

Prior to the Unicode standard, developers were forced to use a variety of character sets, often a different one for each language. Some of these character sets used 8-bit numbers (like ASCII), some used 16-bit, and some used both. Additionally, how software handled a character set depended on the language. In the aforementioned case of right-to-left languages, for example, software had to be smart enough to display the characters in the proper direction.
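The difference in character width is easy to observe. The sketch below uses Shift_JIS as one example of a legacy character set that mixes one-byte and two-byte characters; the choice of encodings and the helper name `encoded_length` are assumptions for illustration only.

```python
def encoded_length(text, encoding):
    """Return how many bytes text occupies in the given encoding."""
    return len(text.encode(encoding))

# A Latin letter fits in one byte in ASCII.
print(encoded_length("A", "ascii"))

# In Shift_JIS, a kanji takes two bytes while a Latin letter still takes one,
# so a single string can mix both widths.
print(encoded_length("\u65e5", "shift_jis"))
print(encoded_length("A\u65e5", "shift_jis"))

# In a fixed 16-bit form (UTF-16, big-endian, no byte-order mark), both
# characters take two bytes each.
print(encoded_length("A\u65e5", "utf-16-be"))
```

Code that assumes one byte per character works in the first case and breaks in the others, which is the kind of per-language handling the paragraph above describes.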

Taking all of these unique character-set characteristics into account means far more work for the developer than handling translations. Graham emphasizes this difficulty by describing his own tribulations as a programmer creating multilingual software. Before Unicode was available, Graham had to develop multiple versions of the software using a variety of applications and operating systems, a time-consuming and expensive proposition. Unicode, however, allowed Graham to support all of the different languages from one common code base, using one common set of software tools.

To help the reader understand the various subtleties of Unicode, Graham begins by distinguishing a character from a glyph. A character is the abstract meaning of a particular shape, while a glyph is the visual representation of a character. A font is simply a collection of glyphs for a set of characters.

Once again, the distinction is deceptively simple. Things can be a glyph of a character in one context, and a character in another. For example, take the letter "R" in the English language. "R" usually only means one thing—the letter "R"—no matter what font you use. However, in mathematics, different visual representations of "R" (such as an upper or lower case "R") have distinct meanings. So when read in the context of a mathematical equation, these different glyphs are actually separate characters as well.
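The mathematical case can be made concrete. As a sketch under the assumption that the double-struck and black-letter forms of "R" are representative examples (they are not necessarily the ones Graham uses), Python's `unicodedata` module shows that Unicode gives these shapes their own code points and names; `fold_to_plain` is a hypothetical helper name.

```python
import unicodedata

# An ordinary "R", a lowercase "r", and two mathematical variants that
# Unicode encodes as separate characters rather than as font choices.
for ch in ("R", "r", "\u211d", "\u211c"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

def fold_to_plain(ch):
    """Apply NFKC compatibility normalization, which discards font-style
    distinctions and maps the styled letter back to the plain one."""
    return unicodedata.normalize("NFKC", ch)

# The double-struck R is a distinct character, yet it folds to a plain "R"
# when compatibility differences are ignored.
print(fold_to_plain("\u211d"))
```

The same shape difference is therefore a matter of glyphs in running English text and a matter of characters in an equation.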

Once readers understand Unicode's many complexities, they'll appreciate the reasons behind Unicode's design, and they'll have a better understanding of how to use it. Graham devotes large chunks of his book to explaining how Unicode is supported by XML, HTML, and a variety of operating systems and programming languages. His programming language coverage is extensive, with summaries and examples for mainstream languages such as C++ and Java, and lesser-known ones like Document Style Semantics and Specification Language (DSSSL).

Graham's discussion of Unicode and the Web is very good. He jokes that, in its early days, the World Wide Web was actually the "Western European Web." However, Unicode was quickly adopted as the standard character set for HTML and related software, allowing the Web to be truly world wide. More interesting is the interaction between XML and Unicode. While Unicode is supposed to be a conceptual layer underneath XML, it provides some of the same features, such as the ability to denote the language used in a file. Graham identifies these similarities, and explains how to prevent conflicts between them.

Room for Improvement

While the book shines at times, it can be difficult to follow. The introduction and first chapters are good examples of this. In some places, Graham is remarkably clear; in others, he's cryptic and confusing, relying on obscure terminology he has yet to define. Again, he's not entirely to blame. Not only is Unicode conceptually challenging, but the naming systems used by various standards committees make it even more difficult to explain. For example, the ISO standard that corresponds to Unicode is officially named ISO/IEC 10646-1:2000. This is only one of many ISO standards of which the reader must keep track.

The book's greatest weakness is the discussion of the character set itself, which could have benefited from more detailed and thorough examples. These weaknesses are somewhat offset by the breadth of Graham's coverage. Overall, he does an adequate job of introducing technically competent individuals to Unicode. It's no primer, but it's not bad either.


Eugene writes, programs, and consults on a freelance basis. He is currently writing a book on the history of free software, entitled Software, Money, and Liberty: How Source Code Became Free. You can reach him at eekim@eekim.com.

