Book Review: Unicode Explained


UnixReview.com
November 2006

Reviewed by Cameron Laird

Unicode Explained
Jukka K. Korpela
O'Reilly, 2006
0-596-10121-X
678 pages, $59.99

Why is Unicode so hard?

For good reasons: its complications have complications, and it's hard to isolate any part small enough to understand that isn't deeply coupled to much else. Three broad themes that illustrate this difficulty are:

  • Like rocket science or networking, Unicode has lots of pieces. Those of us who "learn by doing" typically have to pull together a Unicode-savvy application, a useful font, some sort of "input method", knowledge of a human language other than our native one, and perhaps a reconfiguration of the operating system and/or the keyboard, before we can see a working example of Unicode doing something useful. Imagine how different Little League baseball would be if all the players had to be competent in all skills before their first practice.
  • While there is a single "Unicode standard", it depends on dozens of other specifications, standards, and definitions, all linked in complicated ways. The standards rarely make good tutorials, and occasionally are impenetrable even as references, so entry in this domain involves navigation through a maze of primary texts and commentaries on them, with occasional inconsistencies across dimensions of time, treatment, and author, despite the best efforts of very smart and hard-working people. In some cases, simply understanding how to read a particular document — Is it advisory? Does it still apply? Is it intended to be specialized? — is a challenge.
  • Unicode exhibits politics in the vernacular sense — the kinds of disputes that motivate people who command armies. While 1s and 0s usually excite only computing insiders, Unicode codifies decisions that inspire passions among "civilians": the correct way to write the Tibetan language, whether English in India is the same language as English in Australia, and which reformations of Chinese characters are implicitly valid are the kinds of questions that simply cannot be answered on a purely technical basis.

There's good news, though: Jukka Korpela's Unicode Explained makes Unicode comprehensible. I've been working occasionally with Unicode for almost a decade, but I find I understand parts of it much better now that I've read his book.

Unicode Explained isn't unique in its values; several introductions to Unicode have been assembled by passionate, deeply informed authors who handle the topic's difficulties fairly and with insight. Among these, Unicode Explained deserves attention as the most recent and the one that exhibits the most scholarly refinements. Over and over, Korpela "goes the extra mile" for readers by introducing the specific details and concepts crucial to understanding. Rather than a glib syllogism about how typographic unification can go to excess, he presents specific examples from Scandinavian languages, possessive punctuation, and speech synthesis (is it obvious that "Charles I ..." is about the first in a sequence of kings, and that "I" is neither a pronoun nor an initial?) to make his point. He is careful to keep HTML, CSS, and XML explicitly separate in all their manifestations. The entire book is dense with this sort of illuminating substance.

An introduction to Unicode is different from one on SQLite, say, or even a topic as broad as cryptography, because the subject of Unicode is so unavoidably incoherent. Unicode deals with human languages and their typographic representations and must expand to all the messiness we humans achieve. A good author on Unicode can't be just a formal prodigy in a bounded subject like chess, for instance. Instead, he must be experienced in all sorts of esoterica. Korpela appears to have devoted himself to the subject, with Unicode Explained the helping hand he generously offers those of us who merely use Unicode.

Conclusion

My recommendation, then, if you work at all outside the ASCII table or standard Latin alphabet, is to keep a copy of Unicode Explained at your desk. It's a wonderful reference for such common questions as:

  • When should I use UTF-8, UCS-4, ISO-8859-1, UTF-16, and so on? You can read his answers for yourself, in the free online sample (Chapter 3).
  • How do I encode mathematical subscripts?
  • What keystrokes tell Emacs I want a '\xe5' in my text?
  • Who uses IPA?
  • Where are free fonts available?
  • Why does acceptance of Unicode in Web applications constitute such a security hazard?
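
As a small taste of the first question, here is a sketch of my own (not taken from the book) showing how the single character from the Emacs question, U+00E5, is serialized under several of the encodings named above. It relies only on Python's built-in codecs. The helper name encoded_bytes is hypothetical, and utf-32-be stands in for UCS-4 on the assumption that the two coincide for a code point like this one.

```python
# Illustration only: byte sequences for one character under several encodings.
import unicodedata

CH = "\xe5"  # U+00E5, the character written '\xe5' in the list above


def encoded_bytes(text):
    """Return a mapping of encoding name to the bytes `text` produces in it."""
    # Big-endian variants are used so that no byte-order mark is prepended.
    encodings = ["iso-8859-1", "utf-8", "utf-16-be", "utf-32-be"]
    return {name: text.encode(name) for name in encodings}


result = encoded_bytes(CH)

print(unicodedata.name(CH))
for name, raw in result.items():
    # Show each byte as two hex digits, separated by spaces.
    print(f"{name:>10}: {' '.join(f'{b:02x}' for b in raw)}")
```

The same character occupies one byte in ISO-8859-1, two in UTF-8 and UTF-16, and four in UTF-32, which is a large part of why the "which encoding?" question has no single answer.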

There are a very few places where Unicode Explained is confusing or misleading. Korpela, for instance, doesn't distinguish mathematicians from physicists, which leads to errors in explaining the symbols the former use.

These missteps are minor, though. If you read, write, or program with human languages other than English (or perhaps Hawaiian or a very few others), you'll do well to keep Unicode Explained at hand.

Cameron is vice president of the Phaseit, Inc., consultancy, specializing in high-reliability and high-performance applications managed by high-level languages. He has reviewed more than 50 books for UnixReview.com, and has a lifelong interest in, and involvement with, several human languages apart from English.

 

