A Simple XML Parser

By Sebastien Andrivet, July 01, 1999

HTML has shown the power of a portable display markup language. XML is now extending that power to data with arbitrarily complex structures.

July 1999/A Simple XML Parser/Sidebar

Unicode and Alternate Character Encodings

If you open an XML document, you may think it's just another ASCII file with 8-bit characters. But, like HTML, XML is in fact specified in terms of Unicode (16-bit) characters. The reason you can open an XML file just like an ASCII file is its encoding, or transformation format.

The Unicode Standard is a fixed-width, uniform encoding scheme for written characters and text. It is modeled on the ASCII character set, but uses 16-bit encoding to support full multilingual text (from English to Japanese and even Tibetan).

ISO/IEC 10646 is an International Standard that also defines a multilingual encoding scheme, but it uses either a 31-bit format (called UCS-4) or a 16-bit format (called UCS-2). Since Unicode 1.1, the two standards have been aligned so that their character repertoires and code values are equivalent. Currently, ISO 10646 has no characters that require more than 16 bits.

To enable use of Unicode in an 8-bit environment, a transformation format called UTF-8 has been developed. UTF-8 maintains transparency for all the ASCII code values (0..127). For example, the character 'A' in Unicode (code 0x0041) is encoded in UTF-8 as the single byte 0x41, just as in plain ASCII. Byte values of 0x80 and higher are used to encode the other Unicode characters. This transformation produces two or three bytes for each non-ASCII 16-bit character. UTF-8 thus provides all the power of Unicode with the advantage of compatibility with eight-bit ASCII.
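The transformation described above can be sketched in a few lines. This is a minimal illustration, not code from the article: the function name `utf8_encode_16bit` is hypothetical, the sketch handles only 16-bit code values, and it makes no special case for the surrogate range.

```python
def utf8_encode_16bit(code: int) -> bytes:
    """Encode one 16-bit Unicode code value as UTF-8 (one to three bytes).

    Illustrative sketch: the name is hypothetical and surrogate values
    (0xD800..0xDFFF) are not treated specially.
    """
    if not 0 <= code <= 0xFFFF:
        raise ValueError("expected a 16-bit code value")
    if code < 0x80:
        # 0xxxxxxx: ASCII values pass through unchanged
        return bytes([code])
    if code < 0x800:
        # 110xxxxx 10xxxxxx: two bytes carry 11 bits
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    # 1110xxxx 10xxxxxx 10xxxxxx: three bytes carry 16 bits
    return bytes([
        0xE0 | (code >> 12),
        0x80 | ((code >> 6) & 0x3F),
        0x80 | (code & 0x3F),
    ])
```

Calling `utf8_encode_16bit(0x0041)` yields the single byte 0x41, while a non-ASCII value such as 0x00E9 yields two bytes, both of which are 0x80 or higher.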

There are other transformation formats, and an XML parser can recognize several of them. The encoding used for an XML document is indicated in the XML declaration at the top of the document, as in <?xml version="1.0" encoding="UTF-16"?>.
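As a rough sketch of how a parser might read that declaration, the function below looks first for a byte order mark and then for the encoding attribute. The name `sniff_encoding` is hypothetical, and the approach is deliberately simplified: it assumes that a UTF-16 document begins with a byte order mark and that any other document uses an ASCII-compatible encoding for its declaration. When neither clue is present it falls back to UTF-8, the default in XML 1.0.

```python
import re

# Matches the encoding attribute inside a leading XML declaration.
_DECLARATION = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def sniff_encoding(data: bytes) -> str:
    """Guess the encoding name of an XML document from its first bytes.

    Simplified sketch (hypothetical helper): handles byte order marks and
    ASCII-compatible declarations only.
    """
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"          # UTF-8 byte order mark
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "UTF-16"         # UTF-16 byte order mark, either byte order
    match = _DECLARATION.match(data)
    if match:
        return match.group(1).decode("ascii")
    return "UTF-8"              # XML 1.0 default when nothing is declared
```

A real parser would go on to check that the declared name is one it supports before decoding the rest of the document.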
