
Walter Bright

Dr. Dobb's Bloggers

Time for Unicode?

October 20, 2008


The first programming languages used a very restrictive character set. FORTRAN used a character set of only 48 characters. The ASCII character set offers 128 characters. Languages like C took full advantage of it, finding an appropriate and intuitive use for most of them. Then things went into reverse as C tried to accommodate more restrictive character sets by standardizing trigraphs, and later digraphs. For example, the trigraphs used ??< and ??> to represent { and }, and the digraphs used <% and %>. These were treated with the enthusiasm one might reserve for a dead rat in a deli display case.
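To make the trigraph mechanism concrete, here is a minimal sketch of the substitution an ISO C compiler performs before it tokenizes anything. It is written in Python purely for illustration; the names `TRIGRAPHS` and `replace_trigraphs` are made up for this example. Digraphs such as <% and %> are left out because they are alternative token spellings handled by the lexer, not a blind text substitution.

```python
import re

# The nine ISO C trigraphs, keyed by the character that follows "??".
TRIGRAPHS = {
    "=": "#", "(": "[", "/": "\\", ")": "]", "'": "^",
    "<": "{", "!": "|", ">": "}", "-": "~",
}

# Matches "??" followed by one of the nine valid third characters.
_TRIGRAPH_RE = re.compile(r"\?\?([=(/)'<!>-])")


def replace_trigraphs(source):
    """Replace each ??x trigraph with the single character it stands for."""
    return _TRIGRAPH_RE.sub(lambda m: TRIGRAPHS[m.group(1)], source)


if __name__ == "__main__":
    # prints: int main(void) { return 0; }
    print(replace_trigraphs("int main(void) ??< return 0; ??>"))
```

Because the substitution happens everywhere, including inside string literals, a harmless-looking sequence such as "??!" in a message silently turns into "|", which is part of why trigraphs were so unloved.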

With the D programming language, we continuously run up against the problem that ASCII has reached its expressivity limits. Trying to come up with a sensible character or character pair for a particular need is frustrating, as "all the good ones are taken" and unattractive ones like the C digraphs are what's left.

But then there's Unicode. Programming language minds, intellects vast and cool, regard this Unicode with envious eyes(!). There are plenty of characters that fit the bill nicely. There are the chevrons « and », which could serve as another set of brackets and relieve the overburdened, ambiguous ( ). There are the dot-product and cross-product characters · and ×, which would make lovely infix operator tokens for math libraries. The Greek letters would be great for math variable names.
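As a rough illustration of how far this can go today, the sketch below uses Python, which accepts Unicode letters in identifiers but has no · or × operator tokens. It is not D syntax, and the function names `rotate`, `dot`, and `cross` are chosen only for this example. Greek variable names work directly; the two products have to fall back on named functions.

```python
import math


def rotate(x, y, θ):
    """Rotate the point (x, y) counterclockwise by the angle θ in radians."""
    return (x * math.cos(θ) - y * math.sin(θ),
            x * math.sin(θ) + y * math.cos(θ))


def dot(a, b):
    """Dot product of two equal-length vectors; stands in for a · b."""
    return sum(p * q for p, q in zip(a, b))


def cross(a, b):
    """Cross product of two 3-vectors; stands in for a × b."""
    ax, ay, az = a
    bx, by, bz = b
    return (ay * bz - az * by,
            az * bx - ax * bz,
            ax * by - ay * bx)


if __name__ == "__main__":
    π = math.pi
    print(rotate(1.0, 0.0, π / 2))
    print(dot((1, 2, 3), (4, 5, 6)))
    print(cross((1, 0, 0), (0, 1, 0)))
```

With dedicated operator tokens, the last two calls could read as `a · b` and `a × b`, which is the kind of notation the paragraph above has in mind.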

Alas, Unicode has a downside. Not all editors will display Unicode, and those that do often make it hard to enter Unicode characters. A language designer might say, "That's OK, we'll just pick a digraph or trigraph for those programmers who cannot edit Unicode source code." I think, though, that the C experience with trigraphs and digraphs shows this to be a failed path.

The D programming language has already driven stakes in the ground, saying it will not support 16-bit processors, processors that don't have 8-bit bytes, and processors with crippled, non-IEEE floating point. Is it time to drive in another stake and say the time for Unicode has come? Do your programming tools support Unicode source code?

What do you think?

 
