
The Software Project and Unicode

Dr. Dobb's Journal August 1997: Programmer's Bookshelf

Michael, a software engineer for Hamilton Software, can be contacted at [email protected]. Laurence is a freelance software engineer and author. He can be contacted at [email protected].

Dynamics of Software Development
Jim McCarthy
Microsoft Press, 1995
184 pp., $24.95
ISBN 1-55615-823-8

The Unicode Standard, Version 2.0
The Unicode Consortium
Addison Wesley, 1996
944 pp., $59.00
ISBN 0-201-48345-9

Upon sitting down to read Jim McCarthy's Dynamics of Software Development, I expected a humorous perspective on structured analysis, ISO 9000, and software requirement specifications. But McCarthy doesn't bother to bore us with such dull topics. Instead, he lays out 54 rules of the game he calls the "software project."

In reality, the rules are just a gimmick. McCarthy's explanation of how to play the game so everyone wins is more important. He gives us valuable insight from his experiences as project manager for Microsoft Visual C++ 1.0. Throughout the book we are treated to gems such as:

The visionary leader will conceive of a future reality that must be created by the effort of the community, while the demagogue will perceive a need to remove something from the current situation. The visionary will harness the communal psychic energy toward a common goal, something that will require the delay of gratification; the demagogue will move to immediately sate the baser instincts he or she has excited.

No, the book is not as esoteric as this quote suggests.

McCarthy writes from the perspective of a program manager who wants to be team captain -- not boss, friend, or parent. Of course, the program manager must put his role into perspective.

Before the program manager can be worth anything to the team, he or she must be thoroughly disabused of the notion that he or she has any direct control.

Dynamics of Software Development grew out of a talk entitled "21 Rules of Thumb for Shipping Great Software," which McCarthy used to give at customer sites. He expanded the list to 54 rules, labeling it a game because games are fun. The end result of the game is intellectual property (software), and it is much easier to create intellectual property when you are having fun. Likewise, reading a book about a game is much easier than, say, reading a book on structured analysis or project management. After reading his book, however, I suspect that McCarthy has never read a book on project management: His approach is so fresh that it could only have evolved directly from his experiences.

Nevertheless, I didn't agree with some of what McCarthy writes -- in particular, rule 4, "Don't flip the bozo bit." McCarthy's point is that project managers shouldn't get it stuck in their heads that someone is a bozo. But face it, Jim, there really are bozos in life, so deal with it. However, his "bozo bit" perspective will help me deal much better with them in the future. McCarthy explains what the bozo bit costs when it is flipped. This is important because for many people, it is very hard to clear the bozo bit once it is flipped.

Also, the blanket statement that "Most Software Sucks" is a bit extreme. If I believed that, I would not spend my waking hours writing code. The only time that software sucks is when it causes users to lose work. Boy does that suck! Most programmers write code that does what they intend, and that is good.

Still, I agree with most of McCarthy's book. He relays much that is not obvious, yet fundamental and true. He lets us know what it is like to be on his team, without burdening us with the technical aspects of the day-to-day coding and project management details. Dynamics of Software Development is easy to read, provides valuable insight into the software-development process, and is especially important to people who haven't had the pleasure of being on a software team.

-- M.E.F.


Every Dr. Dobb's Journal reader is familiar with the structure of -- and issues surrounding -- the venerable 7-bit ASCII. It copes well with representing text in most European languages, so it satisfies the needs of most information transfers. But because it fails to support languages that don't use the Latin alphabet, the Unicode Consortium has been working to design and implement a 16-bit character-encoding scheme that will support non-Latin scripts. (The Consortium, a nonprofit organization, includes Apple, DEC, HP, IBM, Justsystem, Microsoft, NCR, NeXT, Novell, SGI, Sybase, Unisys, and The Research Libraries Group.)

My first reaction when thumbing through The Unicode Standard, Version 2.0 was culture shock. I suspect every other programmer who lacks a degree in Arabic, Chinese, Cyrillic, Thai, Tibetan, and a dozen more languages will also be overwhelmed. Unicode defines codes and text-processing rules for almost every written language. Scripts can be broadly divided into three classes: ideographic (symbols represent ideas), syllabic (symbols represent syllables), and alphabetic (symbols represent phonemes).

Since alphabetic and syllabic scripts require a small set of characters, their code spaces are compact: Our Western alphabet, with its 26 letters, easily embeds in ASCII's mere 7 bits. Ideographic scripts, on the other hand, require a massive number of codes: Of the total 65,536 possible codes, the Chinese-Japanese-Korean (CJK) ideograms are allocated a code space ranging from 0x4E00 through 0x9FFF (20,992 codes!). In comparison, all of the General Scripts (Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Thai, Lao, Tibetan, Georgian, and nine Indian subcontinent languages) occupy codes 0x0000 through 0x1FFF, with almost half of these codes still unallocated. By extending the number of characters that ASCII supports, Unicode is ironically going against the 5000-year trend of writing systems evolving to rely on as few symbols as possible (as in alphabetic systems).

Unicode also includes nonlanguage scripts, such as mathematical and scientific symbols, the Zapf Dingbats characters, block and line graphic characters, and the like. Most of these are grouped in the Symbols Area range (0x2000 through 0x2FFF).

These defined ranges are about the only easily understood aspects of the standard. Due to its ambitious scope and the ambiguities of and differences among human languages, the standard is complex. I'll enumerate a couple of issues to demonstrate that Unicode is definitely not just "ASCII, but double the width."

Whereas ASCII has ten codes for the ten digits, Unicode has scattered code ranges representing digits in the various scripts, some of which do not even have a representation for a zero digit. Identical glyphs can also map to more than one code. The angstrom character, for example, appears twice: once in the Latin-1 extension (code 0x00C5, called "Latin capital letter A with ring above") and once in the Symbols Area group (code 0x212B, called "angstrom sign"). Simple punctuation marks such as spaces, commas, and full stops come in different language flavors, too.
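Both points can be checked with a few lines of code. The following is a minimal sketch using Python's standard unicodedata module, which is an assumption of this example rather than anything supplied with the book, and which tracks a later version of the character database than Unicode 2.0. The sample digits (Arabic-Indic and Devanagari) are chosen purely for illustration.

```python
import unicodedata

# The digit "3" as written in three scripts: ASCII, Arabic-Indic, Devanagari.
threes = ["\u0033", "\u0663", "\u0969"]
digit_values = [unicodedata.digit(ch) for ch in threes]

# The two codes that share the angstrom glyph.
a_ring = "\u00C5"
angstrom = "\u212B"
names = (unicodedata.name(a_ring), unicodedata.name(angstrom))

# Compared as raw codes, the two characters are different...
same_code = a_ring == angstrom
# ...but canonical normalization maps the angstrom sign onto 0x00C5.
same_after_nfc = unicodedata.normalize("NFC", angstrom) == a_ring

print(digit_values, names, same_code, same_after_nfc)
```

A routine that compares strings code by code will therefore treat the two angstrom forms as different text unless it normalizes them first.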

Finding the right code can also be a problem: The Symbols Area group has a block for currency symbols, but this block covers only a small number of currencies. Other currency symbols (if they are included at all) are found in the block of the script they belong to.

The list of questionable design decisions goes on and on. Single-digit superscripts, for example, have their own block (0x2070-0x2079), but this block excludes the superscripts for 1, 2, and 3, because these are already present (as noncontiguous codes) in the Latin-1 extension to ASCII. Compound that with issues of byte ordering, script directionality, combining marks (diacritics), canonical code sequences, and the Consortium's consideration of including nonliving scripts like Egyptian hieroglyphics and Sumerian cuneiform, and you end up with a standard of Babel-like complexity.
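The superscript quirk is easy to see when the codes are listed in digit order. This sketch again assumes Python's unicodedata module (based on a later database version than 2.0); the code values themselves are the ones described above.

```python
import unicodedata

# Candidate codes: three from the Latin-1 range, the rest from 0x2070-0x2079.
codes = [0x00B9, 0x00B2, 0x00B3, 0x2070] + list(range(0x2074, 0x207A))

# Map each superscript's digit value (0..9) to its code.
superscripts = {unicodedata.digit(chr(code)): code for code in codes}

# Listing the codes in digit order exposes the jump between blocks.
in_digit_order = [hex(superscripts[d]) for d in range(10)]
print(in_digit_order)
```

Superscript zero and superscript four sit in the same block, while one, two, and three live far away and are not even adjacent to one another in numeric order, so a simple "base code plus digit" calculation does not work.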

The Unicode Standard, Version 2.0 attempts to make all this comprehensible. Chapter 2 lays the foundations for understanding this complex standard. It explains, among other things, the interaction between the encoding and the text processes (algorithms), delves into the ten Unicode design principles (16-bit, full encoding, character versus glyph, semantics, plain text, logical order, unification, dynamic composition, equivalence sequences, and convertibility), and explains the complexities of combining characters. For example, a "Latin small letter e" character (0x0065) can be combined with a "combining acute accent" character (0x0301) to produce an "e with acute accent" (the legal alternative being to simply use the single code 0x00E9).
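The e-acute example translates directly into code. The sketch below is illustrative only: it uses Python's unicodedata.normalize, and the form names "NFC" and "NFD" are that library's labels for composing and decomposing, not terms taken from this review.

```python
import unicodedata

decomposed = "\u0065\u0301"   # "e" followed by a combining acute accent
precomposed = "\u00E9"        # the single-code alternative

# The two spellings differ in length and compare unequal code by code.
lengths = (len(decomposed), len(precomposed))
naive_equal = decomposed == precomposed

# Composing the sequence yields the single code; decomposing reverses it.
composed = unicodedata.normalize("NFC", decomposed)
expanded = unicodedata.normalize("NFD", precomposed)

print(lengths, naive_equal, composed == precomposed, expanded == decomposed)
```

Any search or comparison routine has to pick one of the two spellings and convert to it before comparing, or it will miss matches that look identical on screen.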

Chapter 3 deals with the critical issue of conforming to the standard. Any system claiming to be Unicode conforming will have to handle the following issues by the book: byte order, character semantics and properties, combining and decomposable characters, surrogates, canonical ordering, combining Jamo behavior, and bidirectional text.
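Two of these issues, byte order and surrogates, can be sketched briefly. The example assumes Python's built-in "utf-16-be" and "utf-16-le" codecs and its codecs module; the sample characters are arbitrary choices, and the code value 0x10000 is used only as the first value that no longer fits in 16 bits.

```python
import codecs

text = "\u00E9"

# The same 16-bit code serialized in the two possible byte orders.
big_endian = text.encode("utf-16-be")
little_endian = text.encode("utf-16-le")

# The byte order mark lets a reader work out which order was used.
bom_be = codecs.BOM_UTF16_BE
bom_le = codecs.BOM_UTF16_LE

# A code beyond the 16-bit range is stored as a pair of surrogate codes.
beyond_16_bit = chr(0x10000)
pair = beyond_16_bit.encode("utf-16-be")
high = int.from_bytes(pair[:2], "big")
low = int.from_bytes(pair[2:], "big")

print(big_endian, little_endian, bom_be, bom_le, hex(high), hex(low))
```

A conforming reader has to handle both byte orders and has to treat a surrogate pair as one character, not as two.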

Chapter 4 explains character properties. An ASCII database (UNIDATA2.TXT) on the CD-ROM that accompanies the book lists each character (by code and name) and its associated properties.
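To give a feel for what such properties look like, here is a small sketch that queries Python's built-in copy of the character database through the unicodedata module. It does not read UNIDATA2.TXT itself, the database version is later than 2.0, and the helper name describe is hypothetical.

```python
import unicodedata

def describe(ch):
    """Collect a few per-character properties (hypothetical helper)."""
    return {
        "code": f"0x{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),      # general category
        "bidi": unicodedata.bidirectional(ch),     # directionality class
        "combining": unicodedata.combining(ch),    # combining class
    }

# A Latin letter, a combining mark, and a right-to-left Hebrew letter.
samples = [describe(ch) for ch in ("A", "\u0301", "\u05D0")]
for entry in samples:
    print(entry)
```

The combining mark reports a nonzero combining class, and the Hebrew letter reports a right-to-left direction; these are the kinds of properties that rendering and sorting code has to consult.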

Chapter 5 is going to be the lifeline for any unfortunate mortal tasked with implementing a Unicode-conforming subsystem: "Implementation Guidelines" describes the complexities (again!) of rendering, searching, sorting, normalizing, editing, and "transcoding" Unicode text. While Chapter 5 is the second-longest noncatalog chapter of the book, most programmers will need far more information to fully implement the standard. Chapter 5 will be useful for software houses doing feasibility studies to determine whether they should tackle this in-house.

Piling complexity on complexity, Chapter 6 additionally contains a wealth of details, exceptions, and miscellaneous rules, as it explores each character block of the standard.

The bulk of the book -- 523 pages, in fact -- is taken up by Chapter 7, "Code Charts," which catalogs all currently defined Unicode characters, organized by group and block. Since a large proportion of the standard is concerned with CJK issues, and since I am not a sinologist, this part of the book was mostly indecipherable to me.

The Unicode Standard, Version 2.0 is required (but tough!) reading for any programmer involved with converting applications, languages, or operating systems to support Unicode. The general index is poorly done, although there is a good (and very necessary) glossary at the end.

For other application programmers, though, the day is near when our string searching/sorting or text-to-integer routines will be inappropriate in a world where everyone talks Unicode. The simplest approach will be to rely on your operating system or language to do the work for you. Java's char and String types and associated methods, for example, are defined to build on Unicode instead of ASCII.
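A small sketch shows how an ASCII-era routine breaks and how a Unicode-aware runtime helps. Python is used here instead of Java purely for illustration; the function ascii_only_to_int and the sample strings are hypothetical.

```python
def ascii_only_to_int(text):
    """Text-to-integer routine that assumes the ten ASCII digit codes."""
    value = 0
    for ch in text:
        offset = ord(ch) - ord("0")
        if not 0 <= offset <= 9:
            raise ValueError(f"not an ASCII digit: {ch!r}")
        value = value * 10 + offset
    return value

arabic_indic = "\u0663\u0664\u0665"   # the digits 3, 4, 5 in Arabic-Indic form

# The hand-written routine rejects digits from another script.
try:
    ascii_only_to_int(arabic_indic)
    ascii_routine_ok = True
except ValueError:
    ascii_routine_ok = False

# The language's own conversion accepts Unicode decimal digits.
builtin_value = int(arabic_indic)

# A plain code-by-code sort places the accented word after "zebra".
by_code_point = sorted(["zebra", "\u00E9clair", "apple"])

print(ascii_routine_ok, builtin_value, by_code_point)
```

The sort result is a reminder that ordering by code value is not the same as ordering by any language's alphabet; proper collation needs locale-aware support from the operating system or a library.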

Is the Internet, and more specifically the World Wide Web, proof that our culturally diverse planet can communicate using English as the Esperanto for the world, or do we really need Unicode? The marketplace, as usual, will be the judge.


Copyright © 1997, Dr. Dobb's Journal
