Unicode and Alternate Character Encodings
If you open an XML document in a text editor, you may think it is just another ASCII file with one byte per character. But, like HTML, XML is in fact specified in terms of Unicode characters. The reason many XML files can be opened just like ASCII files is their encoding, or transformation format: the most common one, UTF-8, stores ASCII characters as the same single bytes that ASCII uses.
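The Unicode Standard is a uniform encoding scheme for written characters and text: it assigns every character a unique number, called a code point. It is modeled on the ASCII character set, whose 128 codes keep the same values in Unicode, but it covers a far larger range in order to support full multilingual text (from English to Japanese and even Tibetan). Unicode was originally designed as a fixed-width 16-bit encoding; since version 2.0, its code space has run from U+0000 to U+10FFFF, and the 16-bit form (UTF-16) represents code points above U+FFFF as pairs of 16-bit units called surrogates.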
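To make the idea of a character code concrete, here is a minimal Python sketch that prints the Unicode code of one Latin, one Japanese, and one Tibetan character, and checks that the ASCII codes keep their values. The sample characters and the helper name code_of are chosen purely for illustration.

```python
# Sample characters chosen for illustration only.
samples = {
    "Latin capital letter A": "A",
    "Japanese kanji (sun, day)": "\u65e5",
    "Tibetan letter KA": "\u0f40",
}


def code_of(ch: str) -> str:
    """Format the Unicode code of a single character as U+XXXX."""
    return f"U+{ord(ch):04X}"


for label, ch in samples.items():
    print(f"{label}: {code_of(ch)}")

# Decode each of the 128 ASCII byte values and confirm that the
# resulting character has the same numeric code in Unicode.
ascii_preserved = all(ord(bytes([i]).decode("ascii")) == i for i in range(128))
print("ASCII codes preserved:", ascii_preserved)
```

All three sample characters have codes that fit in 16 bits, which is why they print as four hexadecimal digits.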
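ISO/IEC 10646 is an International Standard that also defines a multilingual character set, the Universal Character Set (UCS), with a 31-bit format (called UCS-4) and a 16-bit format (called UCS-2). Since Unicode 1.1, the two standards have been kept synchronized, so they assign the same characters to the same code points. The code points that fit in 16 bits form the Basic Multilingual Plane (BMP), which holds most characters in common use. Early versions of both standards assigned no characters outside the BMP, but later versions do, so 16 bits are no longer enough to represent every character directly.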
To enable use of Unicode in an 8-bit environment, a transformation format called UTF-8 has been developed. UTF-8 maintains transparency for all the ASCII code values (0..127). For example, the character 'A' in Unicode (code point U+0041) is encoded in UTF-8 as the single byte 0x41, exactly as in plain ASCII. Byte values of 0x80 and above are used only in multi-byte sequences that encode the other Unicode characters. Such a sequence is two or three bytes long for a character in the 16-bit range, and four bytes long for a character beyond it. UTF-8 thus provides all the power of Unicode with the advantage of compatibility with byte-oriented software built for ASCII.
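The ASCII transparency of UTF-8 is easy to check in practice. The following Python sketch encodes three characters and prints the resulting bytes in hexadecimal; the sample characters and the helper name utf8_hex are illustrative choices, not part of any standard.

```python
# Three sample characters: one ASCII, two non-ASCII (illustrative choices).
samples = ["A", "\u00e9", "\u20ac"]  # 'A', e with acute accent, euro sign


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated hexadecimal pairs."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")

# Prints:
# U+0041 -> 41
# U+00E9 -> C3 A9
# U+20AC -> E2 82 AC
```

The ASCII character comes out as the same single byte it has in ASCII, while each non-ASCII character becomes a sequence in which every byte is 0x80 or above, so a program that only understands ASCII never mistakes part of such a sequence for an ASCII character.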
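There are other transformation formats and character encodings, and an XML parser can recognize several of them; every XML parser is required to support at least UTF-8 and UTF-16. The encoding used for an XML document is indicated in the XML declaration at the top of the document, as in <?xml version="1.0" encoding="UTF-16"?>. If a document has neither an encoding declaration nor a byte order mark, and the parser is not told the encoding by some external means, it assumes UTF-8.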
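As a rough illustration of how the encoding named at the top of a document guides the parser, the Python sketch below serializes the same small document in three encodings and parses each one with the standard library's xml.etree.ElementTree module. The element name greeting, the sample text, and the helper make_document are hypothetical, and ISO-8859-1 is included only as an example of a non-Unicode encoding that this particular parser happens to accept.

```python
import xml.etree.ElementTree as ET

TEXT = "caf\u00e9"  # sample text with one non-ASCII character (U+00E9)


def make_document(encoding: str) -> bytes:
    """Build a small XML document and serialize it in the given encoding."""
    source = (
        f'<?xml version="1.0" encoding="{encoding}"?>'
        f"<greeting>{TEXT}</greeting>"
    )
    return source.encode(encoding)


for encoding in ("UTF-8", "UTF-16", "ISO-8859-1"):
    raw = make_document(encoding)
    # The parser receives raw bytes and must work out the encoding itself.
    root = ET.fromstring(raw)
    print(f"{encoding:<10} {len(raw):3d} bytes -> {root.text!r}")
```

The byte sequences differ from one encoding to the next, yet the parser recovers the same text from each of them.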