September, 2004: Unicode in Perl

Simon is a freelance programmer and author, whose titles include Beginning Perl (Wrox Press, 2000) and Extending and Embedding Perl (Manning Publications, 2002). He's the creator of over 30 CPAN modules and a former Parrot pumpking. Simon can be reached at simon@simon-cozens.org.


Two hundred and fifty-five characters really ought to be enough for anyone. I've lost count of how many times I've heard this statement, or similar sentiments, expressed when it comes to dealing with Unicode and the more general question of character encodings.

However, this kind of ASCII-centric thinking is becoming a liability. As Harald Tveit Alvestrand put it in RFC 1766, "There are a number of languages spoken by human beings in this world," and the Unicode Standard was designed to make it easy for data from all kinds of environments, languages, and scripts to play nicely together.

Until Unicode came along, the world was in a mess (at least in terms of data processing). Anyone who wanted to represent any kind of non-Latin character had to cobble together their own set of important characters to live in the top 128 codepoints freed up when we all moved from 7-bit ASCII to 8 bits. Unfortunately, when everyone has their own idea of what character 160 means—depending on whether they're coming from ASCII extensions for Hebrew, for Cyrillic, or for the plain old European accents defined in ISO 8859-1—data interchange is impossible.

To make things worse, the Chinese, Japanese, and Koreans got involved with data processing and soon realized that the 128 spare codepoints just weren't enough to put a dent in their data processing needs. With over 2000 kanji characters in general use in Japan, plus two alphabets of about 85 characters each, 255 characters start to look a bit piffling.

If you need more than 255 characters, you're not going to be able to store each character in a single 8-bit byte. Going to 16-bit bytes was not an option, so they devised any number of encoding mechanisms to shoehorn huge numbers of codepoints into 8-bit bytes—EUC, JIS, Shift-JIS, Big-5, and many others. Many of these try to maintain compatibility with ASCII by keeping the semantics of the bottom 128 characters and using the top half as "shift" characters, which introduce a wider character. Now we not only have many incompatible assignments of codepoints to characters, we have multiple incompatible ways of representing "wide" characters (more than a single byte) on disk or in memory.

Unicode came along to sort all this out. It introduced a single mapping between codepoint and character for every written script on Earth—the Unicode character set. It also proposed a number of standard ways to lay out these characters when they get bigger than a single byte—the UTF-8, UTF-16, and UTF-32 standards (as well as some extra ones, like UTF-7, that nobody seriously uses).
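
Jumping ahead a little to the Encode module, which we'll meet properly below, here's a minimal sketch of how the same codepoint—U+263A, a white smiling face—is laid out in bytes under each of those schemes:

use Encode qw(encode);
my $smiley = "\x{263a}";
for my $enc ("UTF-8", "UTF-16BE", "UTF-32BE") {
    my @bytes = unpack "C*", encode($enc, $smiley);
    printf "%-9s %s\n", $enc, join " ", map { sprintf "%02X", $_ } @bytes;
}
# UTF-8     E2 98 BA
# UTF-16BE  26 3A
# UTF-32BE  00 00 26 3A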

Perl caught the Unicode bandwagon pretty early, thanks in part to Larry Wall's foresight (not to mention his love of Japan and its language), but many of Perl's programmers aren't on board yet. This month, I'm going to try to turn you from an ASCII-phile to a Unicode-aware programmer.

Generating and Munging Unicode Data

First, though, how do we create and deal with Unicode characters? Perl tries to make this as natural as possible. For instance, where chr and ord could previously deal only with values up to 255, they can now deal with values up to 4,294,967,295, at least on my poor old 32-bit computer.
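
For example, a quick sketch:

my $char = chr(0x263A);        # a single character, codepoint U+263A
printf "U+%04X\n", ord($char); # prints "U+263A"
print length($char), "\n";     # 1 - one character, however many bytes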

Similarly, string escapes have been extended to deal with characters higher than \xFF. However, to stay compatible with old programs, which may say "\x0dabc" and expect the escape to stop after two hex digits, Perl requires you to surround any longer character code with curly braces, like so:

print "\x{263a}\n"; # Prints a white smiling face

Now it may not be immediately apparent that codepoint 263a is a white smiling face, so Perl provides the charnames pragma, which allows you to specify characters by name, using the \N escape:

use charnames ':full';
print "\N{WHITE SMILING FACE}\n";

Since the names themselves may not be that easy to find unless you have a copy of the Unicode Standard on hand—and may be a little unwieldy even if you do—you can also specify a short name consisting of the script name and the character name. For instance:

use charnames ':short';
print "\N{katakana:sa}\N{katakana:i}\N{katakana:mo}\N{katakana:n}";

This will print out my name in Japanese. Of course, if I'm handling lots of Japanese, it gets rather tedious to type katakana: every time, so we can also say:

use charnames qw(katakana);
print "\N{sa}\N{i}\N{mo}\N{n}\n";

Now we have a bunch of Unicode data to deal with. What can we do with it? Well, the first thing to note is that we can do anything we'd usually do in Perl; nothing changes just because Unicode data has appeared on the scene.

True, we're dealing with characters that are now wider than a single byte, but that's OK. Perl does the right thing with them:

print length("\N{sa}\N{i}\N{mo}\N{n}"); # prints 4

One neat extra thing that we can do with Unicode data is to use extended regular expressions. For instance, the Unicode Standard defines a set of properties that each character may have, and we can use regular expressions to match these properties. I deal with a kanji dictionary, which contains kanji headwords, followed by a mixture of codes and indexes that mean very little to me, and phonetic readings in the katakana and hiragana scripts. We'll see later how I read in the dictionary, but I can extract the hiragana readings like this:

while (<KANJIDIC>) {
     my @readings = /(\p{Hiragana}+)/g;        # every run of hiragana on the line
     /(\p{Han}+)/ and print "$1: @readings\n"; # the kanji headword, then its readings
}

"Han" is the property descriptor for a Chinese kanxi or Japanese kanji character. For a full list of Unicode properties, see the Unicode Standard.

Perl's Unicode Support

Perl's own support for Unicode has developed and matured over the years, after a pretty shaky start; the very nature of what Perl offers in terms of Unicode has changed along the way. Writing with the benefit of hindsight, I can tell you what Perl can do at the moment—regardless of what it was supposed to do all along—and largely ignore the motivations and the little hacks in between, and talk about the real world.

But first, a bit of history so we are clear on what's possible with particular Perl versions. The first Perl release to support any kind of Unicode data was Perl 5.6.0. You could generate Unicode characters as we've previously discussed, and you could print that data out, more or less, but there was no other way of getting Unicode data from files or other sources into your application as Unicode. This was a bit useless, really. It also didn't help that Perl didn't have a clear strategy for what happened when Unicode data hit non-Unicode data, and it's here that an important distinction arises, which we'll look at in a second.

These problems were mostly sorted out through 5.6.1 and gone by Perl 5.6.2, but the problem of getting Unicode data into Perl still remained. Work began in 5.6.1 or so to fix this using the Encode module, and this has only been usable since around Perl 5.8.2. So while it is possible to do some Unicode-related work in 5.6.2 if you're careful, real Unicode applications ought to be based on 5.8.2 and above.

The Big Lie

I've been claiming that Perl now supports Unicode, but to be honest, that's a bit of a lie. Perl supports data encoded in the UTF-8 representation and knows what to do with it if that data is Unicode. It doesn't ever know whether that data really does represent Unicode or not.

Let's suppose we're dealing with a string of Japanese data (as I reasonably often do) and let's further suppose we know nothing about Unicode at all. We're just an ordinary Perl 5 application merrily handling Japanese text, which is encoded in the EUC encoding often used for UNIX-based Japanese data processing:

my $hello = "\272\243\306\374\244\317\241\242\300\244\263\246";
print $hello, "\n"; # Prints "Hello world" on an EUC terminal
print length($hello); # 12 bytes

Now we want to play in the Unicode world and add our familiar smiley face to the end of our "hello world" greeting:

my $smiley = "\N{WHITE SMILING FACE}";
print $hello . $smiley;

At this point, we have a problem. Perl has absolutely no idea that this data is Japanese EUC. It could be in any legacy encoding under the sun. And now we want to append a Unicode string to the end. What's Perl going to do?

Well, there's very little it can do. It knows that the string on the right is Unicode data, but it can't assume very much about the string on the left. What it does do is rely on a flag that marks a string as being represented internally as UTF-8. It further assumes that when it sees a string that isn't represented as UTF-8, this should be treated as ISO-8859-1. Since our Japanese data isn't ISO-8859-1, madness will soon ensue.

Perl will "upgrade" the string to UTF-8, but it doesn't know how to convert it to Unicode. What we end up with is some UTF-8-encoded Japanese EUC data, not UTF-8-encoded Unicode data, and this is no good to man or beast.

Encode—Dealing with Legacy Data

So what can we do about legacy data that isn't ISO 8859-1? At this point, the Encode module becomes useful. We can't tell Perl what encoding a string is in, but we can ask Perl to translate everything to Unicode and use that as a lingua franca—one of the things Unicode was precisely designed to do.

Let's take that same EUC string—the Japanese for "Hello, world":

my $hello = "\272\243\306\374\244\317\241\242\300\244\263\246";

and use Encode to translate it from Japanese-EUC into Unicode:

use Encode;
my $hello_uni = decode("euc-jp", $hello);

Where before we were dealing with the string as a binary sequence of bytes, we're now dealing with it as Unicode characters. This is not just our useless EUC-UTF-8 mix, but real, honest-to-goodness Unicode.

print length($hello);     # 12 - still the raw EUC bytes
print length($hello_uni); # 6 - characters, not bytes

At this point, all of our Unicode slicing-and-dicing, including Unicode-aware regular expressions, will work properly on $hello_uni.
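
For instance, a quick sketch reusing $hello_uni from above—the \p{Han} property now matches the kanji in the decoded string, character by character:

my @kanji = $hello_uni =~ /(\p{Han})/g;
print scalar(@kanji), " kanji characters\n"; # prints "4 kanji characters"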

Once we've finished munging our data, of course, we might want to put it back into the EUC format we began with. Once again, Encode helps out with the predictably named encode routine:

open OUT, ">sliced-hello.euc" or die $!;
print OUT encode("euc-jp", $hello_uni);

To find out what encodings Encode supports, you can say:

use Encode;
print "$_\n" for Encode->encodings(":all"); # one encoding name per line

So, for instance, we might want to create ourselves a Unicode transcoder—that is, something that takes data in one encoding and spits it out in another. This is something I end up doing rather often, so I came up with the following program:

#!/usr/bin/perl
use Encode;
my ($from, $to) = splice(@ARGV,0,2); ($from && $to) or usage();
while (<>) {
    my $unicode;
    eval { $unicode = decode($from, $_) };
    if ($@) { $@ =~ /unknown encoding/i ? usage() : die $@ }
    eval { print encode($to, $unicode) };
    if ($@) { $@ =~ /unknown encoding/i ? usage() : die $@ }
}

sub usage {
    die qq{
usage: $0 <from> <to> [<file> ...]

This acts as a filter, encoding data from the first character set to the second. Available character sets are:

}, map { sprintf("\t%s\n", $_) } Encode->encodings(":all");
}
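
Assuming you save this as transcode.pl (the filename is mine; pick your own), you can use it as a filter—here, turning the kanji dictionary mentioned earlier into UTF-8:

% perl transcode.pl euc-jp utf8 kanjidic > kanjidic.utf8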

However, there's an even neater way to do things. If you have Encode available, you also have the PerlIO module, which hooks into Perl's IO streams to control how file access is done. PerlIO is a mechanism for adding filters onto a filehandle: one that automatically strips newlines, for instance, or reads gzipped files, or even bypasses standard IO altogether and reads files directly into memory with mmap. Encode hooks into this PerlIO framework to read and write files through character set encoding or decoding. For instance, to read a Russian file from a Windows computer, using the koi8-r encoding, you can say:

open IN, "<:encoding(koi8-r)", "russian.txt" or die $!;

And to write it out again for use on a Russian Mac running System 9, you would say:

open OUT, ">:encoding(MacCyrillic)", "russian.mac" or die $!;
while (<IN>) { print OUT $_ }

So, if you're content with something a little simpler, you can slim your transcoder down to:

#!/usr/bin/perl -p
# -p wraps the body in an implicit read-print loop
BEGIN {
    binmode(STDIN, ":encoding(" . shift(@ARGV) . ")");
    binmode(STDOUT, ":encoding(" . shift(@ARGV) . ")");
}
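
The BEGIN block shifts the two encoding names off @ARGV before the implicit loop starts reading, so (again assuming it's saved as transcode.pl) you'd invoke it as a pure filter on standard input:

% perl transcode.pl euc-jp utf8 < kanjidic > kanjidic.utf8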

Dealing with the Outside World

The final piece of the Unicode puzzle comes when you need to send UTF-8 data to, or receive it from, the outside world—files, or other applications that may or may not know anything about Unicode themselves, such as databases, which just store the data without caring about its semantics.

To store data as Unicode is easy enough—you just do it. If you write to a filehandle with data that Perl thinks contains Unicode characters—that is, has the UTF-8 flag set—Perl will write the UTF-8 representation of the string to the file:

open OUT, ">smiley.txt" or die $!;
use charnames ':full';
print OUT "\N{WHITE SMILING FACE}\n";

This will work just fine, but Perl will issue a warning when any multibyte characters are emitted:

Wide character in print at smile.pl line 3.

In order to tell Perl that it's OK to send the output as UTF-8, you can set a flag on the filehandle:

binmode(OUT, ":utf8");

Similarly, if you have a file that contains UTF-8 data that you want to recognize as such, you can set the same flag using binmode again on the input filehandle:

open IN, "smiley.txt" or die $!; 
$a = <IN>; chomp $a; 
print length $a # 3 bytes - not marked as Unicode
close IN;

open IN, "smiley.txt" or die $!; 
binmode IN, ":utf8"; 
$a = <IN>; chomp $a; 
print length $a # 1 character - marked as Unicode

Indeed, this is the usual and best way of getting Unicode data into a Perl application. Unfortunately, files are not the only places where you might receive UTF-8-encoded strings. We might read data from a socket, or receive it via DBI from a database, or as I had to do recently, read it from the middle of another binary file.

This last case is particularly interesting—you can't read the whole binary file as though it were UTF-8 because you really want to treat it as a stream of bytes; however, when you get to the part representing a string, you need Perl to treat it as a UTF-8 string, and work character-wise.

The way to do this is to use utf8 as just another encoding. You have data that you know is in UTF-8 and you want Perl to turn it into Unicode data, so you say:

use Encode;

# String is packed with the length first
my $len;
read(BIN, $len, 4);
my $len_bytes = unpack("N", $len);

# Now read the string
my $str;
read(BIN, $str, $len_bytes);

# Make a UTF8-aware copy
my $utf8 = decode("utf8", $str);

There is another way to do this, which is a little messier, but I recommend it nonetheless. Encode can optionally export a subroutine called _utf8_on. As its name implies, this is an internal routine, in that it directly messes with Perl's internal representation of the string, turning on the bit that says "this data is UTF-8." I prefer it, however, because it is efficient, self-documenting, and easier to understand than trying to work out what decode("utf8", $str) is decoding from and into what.
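
A minimal sketch, replacing the decode step above—bearing in mind that _utf8_on modifies its argument in place and does no validity checking, so the bytes really must be well-formed UTF-8:

use Encode qw(_utf8_on);
# $str holds the raw bytes read from BIN above, which we already know
# to be well-formed UTF-8. _utf8_on flips the internal flag in place;
# no copying and, beware, no validation.
_utf8_on($str);
print length($str), "\n"; # now counts characters rather than bytes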

Finally, you may have to deal with situations where you don't want to end up with your Unicode data as Unicode. For instance, you have a bunch of database records about your company's contacts in Eastern Europe that you need to have inserted into your master contacts database. Unfortunately, even though you are an educated and progressive programmer, and have stored everything correctly in Unicode, Headquarters is full of people for whom ISO-8859-1 is a recent advance over 7-bit ASCII. What will you do for your friend in Cžrny?

Here is a problem where you are guaranteed to lose information. You want to represent a character that simply can't be represented in the character set you have to deal with. Your choice is how much information you want to lose. If you take the obvious approach, and say:

print decode("iso-8859-1", "C\x{17e}rny");

Encode will helpfully substitute a replacement character for the letter that cannot be represented, and you'll end up with "C?rny." This is acceptable so far, but you should be thankful that you're not dealing with completely non-Latin alphabets, as mail to the Korean city of ??? is guaranteed not to arrive.
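
If a question mark loses too much, note that encode also takes an optional third "check" argument controlling what happens to unrepresentable characters; the FB_HTMLCREF fallback, for instance, swaps in an HTML character reference instead (a sketch, assuming a 5.8-era or later Encode):

use Encode qw(encode FB_HTMLCREF);
print encode("iso-8859-1", "C\x{17e}rny", FB_HTMLCREF); # prints "C&#382;rny"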

If you need to lose less information, you could try the wonderful Text::Unidecode module, which tries to turn Unicode strings into "plain text." For example:

use Text::Unidecode;
print unidecode("\x{d478}\x{c0b0}"); # pusan

It's not perfect, but it's certainly better than a stream of question marks. When you still need to communicate with ASCII dinosaurs, Text::Unidecode will give them pretty much what they deserve.

Unicode for All!

Thankfully, though, the world is getting more and more Unicode aware. As we move into global business, working with more countries, languages, and scripts, the importance of Unicode will continue to grow. Most of the time, it takes very few changes to make an application aware of the possibility of Unicode text or to deal with that text when it arises, so there's really no excuse for not making your code Unicode compliant—do it now, and it'll save time and effort later.

TPJ

