
Parlez-Vous Java? (Web Techniques, Sep 2000)

Web Techniques: Sidebar


ASCII, Unicode, and UTF-8

How do your programs store characters? For years, the answer was ASCII, but that's no longer true. Internally, Java uses Unicode and stores each character as a 16-bit value. Unicode is a standard character set that represents glyphs for nearly every known language, plus a number of extra symbols.
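As a small illustration of the 16-bit internal form, the sketch below prints the width of a Java char and the numeric value stored for two characters. The class name CharWidthSketch and the helper codeUnit are invented for this example; Character.SIZE is part of the standard library.

```java
class CharWidthSketch {
    // Returns the numeric Unicode value held in a Java char.
    // (Hypothetical helper, named here only for illustration.)
    static int codeUnit(char c) {
        return c;
    }

    public static void main(String[] args) {
        // Number of bits used to represent a char value.
        System.out.println(Character.SIZE);                          // prints 16
        // An ASCII letter keeps its familiar ASCII value.
        System.out.println(codeUnit('A'));                           // prints 65
        // The euro sign needs more than 8 bits.
        System.out.println(Integer.toHexString(codeUnit('\u20ac'))); // prints 20ac
    }
}
```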

For external data, Java uses an encoding scheme known as UTF-8. In this scheme, the leading bits of a character's first byte determine how many bytes the character occupies. (Remember that there are eight bits in a byte, numbered from 0 to 7, with bit 7 the most significant.) This variable-length pattern lets you store data efficiently without giving up versatility. For Java's 16-bit characters, three forms are possible: one-byte, two-byte, and three-byte.
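To see the variable length in practice, here is a minimal sketch that asks the standard library to encode three one-character strings and reports how many bytes each one takes. It assumes a Java version that provides java.nio.charset.StandardCharsets (Java 7 or later, so newer than this article); the class name Utf8LengthSketch and the method utf8Length are made up for the example.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthSketch {
    // Number of bytes the string occupies once encoded as UTF-8.
    // (Hypothetical helper, named here only for illustration.)
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));      // ASCII letter: prints 1
        System.out.println(utf8Length("\u00e9")); // e with acute accent: prints 2
        System.out.println(utf8Length("\u20ac")); // euro sign: prints 3
    }
}
```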

One-byte characters: If bit number 7 of the first byte is set to 0, then the character is made up of only one byte, and the remaining seven bits hold its value.

Two-byte characters: If the first three bits (numbers 7, 6, and 5) are set to 110 (binary), then the character consists of two bytes. In this case, the second byte must begin with 10 (binary). That leaves 11 bits to define the character: five in the first byte and six in the second.

Three-byte characters: If the character requires more than 11 significant bits, the UTF-8 scheme requires three bytes to store it. In this case, the first byte must start with 1110 (binary), and the next two bytes each start with 10 (binary). That leaves four bits in the first byte and six in each of the other two, which is enough to store the full 16 bits.
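The three cases above can be sketched as a hand-written encoder for a single 16-bit char. This is an illustration of the bit patterns only, not how the Java library does it: the names Utf8EncodeSketch, encode, and toBinary are invented here, and the sketch treats every 16-bit value on its own, so it makes no attempt to handle surrogate pairs.

```java
class Utf8EncodeSketch {
    // Encodes one 16-bit char using the one-, two-, or three-byte form.
    static byte[] encode(char c) {
        if (c < 0x80) {
            // Up to 7 significant bits: 0xxxxxxx
            return new byte[] { (byte) c };
        } else if (c < 0x800) {
            // Up to 11 significant bits: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),
                (byte) (0x80 | (c & 0x3F))
            };
        } else {
            // Up to 16 significant bits: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F))
            };
        }
    }

    // Formats a byte as eight binary digits so the leading bits are visible.
    static String toBinary(byte b) {
        String s = Integer.toBinaryString(b & 0xFF);
        return "00000000".substring(s.length()) + s;
    }

    public static void main(String[] args) {
        char[] samples = { 'A', '\u00e9', '\u20ac' };
        for (char c : samples) {
            StringBuilder line = new StringBuilder();
            for (byte b : encode(c)) {
                line.append(toBinary(b)).append(' ');
            }
            System.out.println(line.toString().trim());
        }
        // prints:
        // 01000001
        // 11000011 10101001
        // 11100010 10000010 10101100
    }
}
```

Reading the output from left to right, the first byte of each line starts with 0, 110, or 1110, and every following byte starts with 10.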

The UTF-8 scheme has several advantages. All 7-bit ASCII files are already proper UTF-8 files, so you don't have to convert any existing data. In addition, the bit patterns make it easy to tell, from its starting bits alone, whether a byte begins a sequence, continues a sequence, or is a complete character by itself. (Any byte that's part of a sequence but doesn't begin it starts with 10 (binary); any byte that does not start with 10 (binary) either begins a sequence or is a single byte.)
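That recognition rule is short enough to write down directly. The sketch below classifies a byte by its starting bits and then uses the same test to count characters in a byte array without decoding them. The names Utf8ByteKindSketch, kind, and countCharacters are hypothetical, and the code assumes its input is already well-formed UTF-8.

```java
class Utf8ByteKindSketch {
    // Classifies a byte by its leading bits.
    static String kind(byte b) {
        int v = b & 0xFF;
        if ((v & 0x80) == 0x00) {
            return "single";        // 0xxxxxxx: a complete one-byte character
        }
        if ((v & 0xC0) == 0x80) {
            return "continuation";  // 10xxxxxx: belongs to a sequence
        }
        return "lead";              // anything else begins a sequence
    }

    // Counts characters by counting every byte that does not start with 10.
    static int countCharacters(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if ((b & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 'A', then a two-byte character, then a three-byte character.
        byte[] data = {
            0x41,
            (byte) 0xC3, (byte) 0xA9,
            (byte) 0xE2, (byte) 0x82, (byte) 0xAC
        };
        for (byte b : data) {
            System.out.println(kind(b));
        }
        System.out.println(countCharacters(data)); // prints 3
    }
}
```

Because a plain ASCII byte always falls in the "single" case, a file containing only 7-bit ASCII passes through this logic unchanged, one character per byte.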

