
Jocelyn Paine

Dr. Dobb's Bloggers

Unicode and the Shavian Alphabet II

July 27, 2010

In Unicode and the Shavian Alphabet, I wrote about an incompatibility between two online tools: shavian.org's transliterator, which converts English into the Shaw alphabet, and Pīnyīn.info's converter, which translates characters into their Unicode numbers. To summarise: the Shaw alphabet, also known as the Shavian alphabet, was invented in a competition to design an alphabet in which English is spelled as it sounds. I used it as an alien programming language in a cartoon, generating my text with shavian.org's transliterator. I then tried to convert the transliteration into Unicode numbers by pasting it into Pīnyīn.info's converter. But the result had the wrong codes, and twice too many of them. Thomas Thurman, author of shavian.org's transliteration script, mailed me to explain why:

With reference to your column at http://www.drdobbs.com/blog/archives/2010/06/unicode_and_the.html : the reason the translator at http://www.pinyin.info/tools/converter/chars2uninumbers.html choked on the Shavian characters you gave it is because all Shavian characters have codepoints above 0xFFFF, and therefore (if you're using UTF-16, which the pinyin.info translator appears to be) they won't fit in a single word and will have to be represented using surrogate pairs. Wikipedia has a reasonable coverage of surrogate pairs: http://en.wikipedia.org/wiki/Surrogate_pair , but briefly, it's a way to represent a Unicode character whose codepoint is too high by using a pair of otherwise illegal characters, both of whose codepoints are low enough. Hence the effect you noted of having "the wrong codes, and twice too many of them".

The fault is presumably with the pinyin.info translator, which shouldn't give out surrogate pairs unless explicitly asked, but it does go to show that, as Wikipedia puts it, "code is often not tested thoroughly with surrogate pairs. This leads to persistent bugs, and potential security holes, even in popular and well-reviewed application software", or as you put it, "computing still is not mature".

Thomas (author of the transliterator script on shavian.org).
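Thomas's explanation can be checked with a little arithmetic. The following is only a minimal sketch in Python, not the code of either website: the function names are my own, and I am assuming the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves). It uses U+10450, the first code point in the Shavian block, as the example.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf16_code_units(text):
    """List the 16-bit units a UTF-16 program sees: two per Shavian letter."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def code_points(text):
    """List the actual Unicode code points: one per Shavian letter."""
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    sample = "\U00010450\U00010451"      # two Shavian letters
    print([hex(u) for u in utf16_code_units(sample)])
    print([hex(c) for c in code_points(sample)])
```

For the two-letter sample, `utf16_code_units` returns four numbers (0xD801, 0xDC50, 0xD801, 0xDC51) where `code_points` returns two (0x10450, 0x10451), which matches the symptom of wrong codes in double the quantity. A converter that wanted real code points could recombine each pair with something like `from_surrogate_pair`.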

 


Jocelyn Paine
popx@j-paine.org
