Elimination of Text Corruption in XML


Ashish Arora is an engineering graduate from NIT, Trichy in India. At Yahoo! he builds platforms for the Yahoo! Properties. Subramanian Narayanan is an engineering graduate in computer science from PSG Tech, Coimbatore, India. Subramanian is part of the feed processing platform team at Yahoo! and works on distributed processing using Hadoop.


Feeds have become a standard way of sharing content on the web. Formats like RSS and Atom provide the feeds that we use at Yahoo! to syndicate content from the many content partners we work with.

When processing XML data acquired from our partners and feed providers, we sometimes find the markup text or HTML text (in particular character entity references, ISO characters, or non-Unicode characters) to be corrupted. If this data is passed on to the Front-end as is, it results in a bad experience for the user. To avoid this situation, we use an algorithm that processes the content to eliminate the corruption.

Recognizing Types of Data Corruption

The following data sets in an XML document will result in a bad user experience:

  • Double encoded text (for example, &amp;#39; Yahoo rocks! &amp;#39;)
  • Character entity references (for example, &mdash;)
  • Windows characters in the range of 128-159 (for example, &#151;)
  • ANSI Hex characters (for example, \x85)

Here is how each of these affects the user experience.

Double Encoded Text. When the XML containing the double encoded text is rendered in the front end, we have the following result:

  • Input XML: Subbu&amp;#39;s book
  • Expected result: Subbu's book
  • Front-end page: Subbu&#39;s book

where &#39; is a numeric character reference corresponding to an apostrophe.
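To see how double encoding arises, here is a minimal Python sketch using the standard library's html module. It is only an illustration, not the code we run in production: the variable names are ours, and Python happens to write the apostrophe as the hexadecimal reference &#x27; rather than the decimal form.

```python
import html

original = "Subbu's book"

# A feed producer escapes the text once, which is correct...
once = html.escape(original)
# ...and then something upstream escapes the already-escaped text again.
twice = html.escape(once)

print(once)   # Subbu&#x27;s book
print(twice)  # Subbu&amp;#x27;s book

# A consumer decodes one level, as any XML parser would, and is left
# with a visible reference instead of an apostrophe.
print(html.unescape(twice))  # Subbu&#x27;s book
```

Decoding a second time would recover the apostrophe, but decoding blindly is not safe in general, as the later sections explain.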

Character Entity References. XML predefines only five entities: &amp; &quot; &apos; &lt; &gt;. All other character entity references that are valid in HTML (like &mdash; and &eacute;) are not recognized by XML parsers unless a DTD declares them, and the Front-end systems currently have difficulty rendering them. In some cases, the document gets truncated abruptly at such a reference. These references often appear in feeds because they are valid in HTML. However, when they are provided as XML character data rather than inside a CDATA section, the XML parser is not able to parse them.
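The behavior is easy to reproduce. The sketch below assumes Python's xml.etree.ElementTree as a stand-in for the XML parsers in a feed pipeline; the helper name parses_as_xml is ours.

```python
import xml.etree.ElementTree as ET

def parses_as_xml(fragment: str) -> bool:
    """Return True if the fragment is well-formed inside a dummy <t> element."""
    try:
        ET.fromstring(f"<t>{fragment}</t>")
        return True
    except ET.ParseError:
        return False

# One of the five predefined entities: accepted.
print(parses_as_xml("Tom &amp; Jerry"))  # True
# An HTML-only named entity: rejected as undefined.
print(parses_as_xml("caf&eacute;"))      # False
# The same character as a numeric character reference: accepted.
print(parses_as_xml("caf&#233;"))        # True
```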

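Windows Characters In the Range of 128-159. Characters in the range of 128-159 are not ASCII. They are printable only in the Windows-1252 code page; in Unicode the same values are control characters, so they don't render well in most browsers:

  • Input XML: Subbu &#151; book
  • Front-end page: Subbu[] book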

[] above represents the square that browsers show when they don't have an appropriate character set.
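The mismatch can be checked directly. This short Python sketch takes the value 151 and interprets it both ways, first as a Unicode code point and then as a Windows-1252 byte; it is an illustration only, and the variable names are ours.

```python
import unicodedata

code = 151

# Interpreted as a Unicode code point, 151 is a C1 control character.
as_unicode = chr(code)
print(unicodedata.category(as_unicode))  # Cc

# Interpreted as a Windows-1252 byte, 151 is a dash at U+2014.
as_cp1252 = bytes([code]).decode("cp1252")
print(f"U+{ord(as_cp1252):04X}", unicodedata.name(as_cp1252))  # U+2014 EM DASH
```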

ANSI Hex Characters. The Front-end machines do not parse ANSI hex escapes such as \x85 properly. We need to normalize these to valid Unicode characters to comply with our globalization standard, which requires all text to be in Unicode.

Not As Simple As It Seems

Dealing with this problem isn't always entirely straightforward. There are a couple of considerations that create challenges to implementing a fix for these encoding issues.

First, the complete set of characters that needs to be corrected contains around 400 individual characters. We need a single, uniform and scalable solution to tackle the entire data set. It's also difficult to detect and eliminate double encoded characters. For example:

  • N&amp;#39;V --> Here &amp;#39; is double encoded.
  • Arun &amp;Subbu --> Here &amp;Subbu is not double encoded.

It's important that any system we build recognizes the difference between the first and second example above. As such, the system will need to be able to detect when an encoded ampersand is followed by a character entity.

Building a System To Fix Text Corruption

Use a normalized data set and map all characters to it. For all the categories in the data set (character entity references, Windows characters in the range of 128-159, ANSI Hex characters), we need to have a unique mapping data set for uniform conversion. The mapping set we use is that of numeric character references. For example:

  • &copy; -> &#169; (HTML symbol to numeric character reference)
  • &#151; -> &#8212; (Windows character to numeric character reference)
  • \x97 -> &#8212; (hex to numeric character reference)

Why did we choose numeric character references? Any XML parser can parse them, and once parsed they render without character corruption. They are also valid in HTML, which means they can be reused further down the pipeline without further translation.
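As a rough sketch of this normalization step, the Python below derives the mappings from the standard library instead of loading a hand-built map: html.entities.name2codepoint for the HTML names and the cp1252 codec for the Windows values. That substitution, and every function name here, is an assumption made for illustration; it is not the implementation described in this article.

```python
import re
from html.entities import name2codepoint

# XML already understands these five, so they are left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

def cp1252_code_point(byte_value):
    """Unicode code point for a Windows-1252 byte, or None if unassigned."""
    try:
        return ord(bytes([byte_value]).decode("cp1252"))
    except UnicodeDecodeError:
        return None

def _named(match):
    name = match.group(1)
    if name in XML_PREDEFINED or name not in name2codepoint:
        return match.group(0)  # not ours to rewrite
    return f"&#{name2codepoint[name]};"

def _decimal(match):
    value = int(match.group(1))
    if 128 <= value <= 159:
        code_point = cp1252_code_point(value)
        if code_point is not None:
            return f"&#{code_point};"
    return match.group(0)

def _hex_escape(match):
    value = int(match.group(1), 16)
    if value < 0x21:
        return match.group(0)  # leave control bytes and space alone
    code_point = cp1252_code_point(value)
    return match.group(0) if code_point is None else f"&#{code_point};"

def normalize(text):
    """Rewrite HTML names, Windows references and \\xNN escapes as decimal references."""
    text = re.sub(r"&([A-Za-z][A-Za-z0-9]*);", _named, text)
    text = re.sub(r"&#([0-9]+);", _decimal, text)
    return re.sub(r"\\x([0-9A-Fa-f]{2})", _hex_escape, text)

print(normalize("&copy; Yahoo!"))      # &#169; Yahoo!
print(normalize("Subbu &#151; book"))  # Subbu &#8212; book
print(normalize(r"Subbu \x97 book"))   # Subbu &#8212; book
```

An explicit map is easier to audit and to extend with partner-specific quirks, while deriving it from the codec avoids transcription mistakes in several hundred entries.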

Detecting and removing double encoding. Double encoding is detected by the following regular expressions, each of which looks for an encoded ampersand followed by the body of an entity:

  • &amp;(\\w*?);
  • &amp;(#\\d*?);
  • &amp;(#[xX][0-9a-fA-F]*?);

The patterns match double-encoded references such as &amp;amp; or &amp;#38;. However, they also match other text such as &amp;Guru;. Hence, before we do the replacement, we check whether the result would be a valid XML entity. Let's take two example strings:

  • It&amp;#39;s my book

    The pattern &amp;#39; is matched by the regex &amp;(#\\d*?);. The amp; is removed from the matched pattern, resulting in &#39;, which is detected as being a valid XML entity; hence, it replaces &amp;#39;.

    Result - It&#39;s my book

  • Arun&amp;Guru; are my friends

    Here, &amp;Guru; is matched by the pattern &amp;(\\w*?);. Replacing &amp; with & results in &Guru;. But since &Guru; is not a valid XML entity, it does not replace &amp;Guru;.

    Result - Arun&amp;Guru; are my friends
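The two cases can be sketched in Python as follows. The three patterns are folded into one expression, and the validity check is a simple lookup against html.entities.name2codepoint. Both choices are simplifications of ours for the sake of a short example; the validation we actually use is the XML-parser approach described next.

```python
import re
from html.entities import name2codepoint

# An encoded ampersand followed by the body of an entity reference.
DOUBLE_ENCODED = re.compile(r"&amp;(#[0-9]+|#[xX][0-9a-fA-F]+|\w+);")

def looks_like_entity(body):
    """Stand-in validity check based on a lookup table, not an XML parser."""
    if body.startswith("#"):
        digits = body[1:]
        code = int(digits[1:], 16) if digits[0] in "xX" else int(digits)
        return 0 < code <= 0x10FFFF
    return body == "apos" or body in name2codepoint

def remove_double_encoding(text):
    def replace(match):
        body = match.group(1)
        # Drop the "amp;" only when what remains is a real entity.
        return f"&{body};" if looks_like_entity(body) else match.group(0)
    return DOUBLE_ENCODED.sub(replace, text)

print(remove_double_encoding("It&amp;#39;s my book"))
# It&#39;s my book
print(remove_double_encoding("Arun&amp;Guru; are my friends"))
# Arun&amp;Guru; are my friends
```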

How is XML entity validation done on the matched pattern? The matched pattern is wrapped inside an XML document whose doctype declaration includes the DTDs that define the XML entities. The wrapped document is then parsed: if parsing succeeds, the matched pattern is a valid XML entity, and if it fails, it is not.
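A minimal sketch of this wrapping technique in Python is shown below. It assumes xml.etree.ElementTree as the parser and builds the entity declarations from html.entities.name2codepoint as an internal DTD subset; the element name t and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# XML predefines these itself, so the DTD below does not redeclare them.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Internal DTD subset declaring each HTML 4 entity as a numeric reference.
DOCTYPE = "<!DOCTYPE t [{}]>".format(
    "".join(
        f'<!ENTITY {name} "&#{code};">'
        for name, code in sorted(name2codepoint.items())
        if name not in PREDEFINED
    )
)

def is_valid_entity(candidate: str) -> bool:
    """Wrap the candidate in a tiny document and let the parser decide."""
    try:
        ET.fromstring(f"{DOCTYPE}<t>{candidate}</t>")
        return True
    except ET.ParseError:
        return False

print(is_valid_entity("&#39;"))    # True
print(is_valid_entity("&mdash;"))  # True
print(is_valid_entity("&Guru;"))   # False
```

Letting the parser decide means numeric references are checked against XML's own rules for legal characters, which a plain name lookup would not cover.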

Conclusion

When taking content from third-party sites it's easy to get corruptions of the text that break the way the text is rendered to the user. We've presented the two steps you need to implement in your systems to stop the common problems of invalid XML characters and double encoding. By implementing these changes to your system, you will create a much more robust ingestion pipeline for your content.

Notes

The character map in full is listed below:


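{ "characterMap": {
"&nbsp;": "&#160;",
"&iexcl;": "&#161;",
"&cent;": "&#162;",
"&pound;": "&#163;",
"&curren;": "&#164;",
"&yen;": "&#165;",
"&brvbar;": "&#166;",
"&sect;": "&#167;",
"&uml;": "&#168;",
"&copy;": "&#169;",
"&ordf;": "&#170;",
"&laquo;": "&#171;",
"&not;": "&#172;",
"&shy;": "&#173;",
"&reg;": "&#174;",
"&macr;": "&#175;",
"&deg;": "&#176;",
"&plusmn;": "&#177;",
"&sup2;": "&#178;",
"&sup3;": "&#179;",
"&acute;": "&#180;",
"&micro;": "&#181;",
"&para;": "&#182;",
"&middot;": "&#183;",
"&cedil;": "&#184;",
"&sup1;": "&#185;",
"&ordm;": "&#186;",
"&raquo;": "&#187;",
"&frac14;": "&#188;",
"&frac12;": "&#189;",
"&frac34;": "&#190;",
"&iquest;": "&#191;",
"&Agrave;": "&#192;",
"&Aacute;": "&#193;",
"&Acirc;": "&#194;",
"&Atilde;": "&#195;",
"&Auml;": "&#196;",
"&Aring;": "&#197;",
"&AElig;": "&#198;",
"&Ccedil;": "&#199;",
"&Egrave;": "&#200;",
"&Eacute;": "&#201;",
"&Ecirc;": "&#202;",
"&Euml;": "&#203;",
"&Igrave;": "&#204;",
"&Iacute;": "&#205;",
"&Icirc;": "&#206;",
"&Iuml;": "&#207;",
"&ETH;": "&#208;",
"&Ntilde;": "&#209;",
"&Ograve;": "&#210;",
"&Oacute;": "&#211;",
"&Ocirc;": "&#212;",
"&Otilde;": "&#213;",
"&Ouml;": "&#214;",
"&times;": "&#215;",
"&Oslash;": "&#216;",
"&Ugrave;": "&#217;",
"&Uacute;": "&#218;",
"&Ucirc;": "&#219;",
"&Uuml;": "&#220;",
"&Yacute;": "&#221;",
"&THORN;": "&#222;",
"&szlig;": "&#223;",
"&agrave;": "&#224;",
"&aacute;": "&#225;",
"&acirc;": "&#226;",
"&atilde;": "&#227;",
"&auml;": "&#228;",
"&aring;": "&#229;",
"&aelig;": "&#230;",
"&ccedil;": "&#231;",
"&egrave;": "&#232;",
"&eacute;": "&#233;",
"&ecirc;": "&#234;",
"&euml;": "&#235;",
"&igrave;": "&#236;",
"&iacute;": "&#237;",
"&icirc;": "&#238;",
"&iuml;": "&#239;",
"&eth;": "&#240;",
"&ntilde;": "&#241;",
"&ograve;": "&#242;",
"&oacute;": "&#243;",
"&ocirc;": "&#244;",
"&otilde;": "&#245;",
"&ouml;": "&#246;",
"&divide;": "&#247;",
"&oslash;": "&#248;",
"&ugrave;": "&#249;",
"&uacute;": "&#250;",
"&ucirc;": "&#251;",
"&uuml;": "&#252;",
"&yacute;": "&#253;",
"&thorn;": "&#254;",
"&yuml;": "&#255;",
"&fnof;": "&#402;",
"&Alpha;": "&#913;",
"&Beta;": "&#914;",
"&Gamma;": "&#915;",
"&Delta;": "&#916;",
"&Epsilon;": "&#917;",
"&Zeta;": "&#918;",
"&Eta;": "&#919;",
"&Theta;": "&#920;",
"&Iota;": "&#921;",
"&Kappa;": "&#922;",
"&Lambda;": "&#923;",
"&Mu;": "&#924;",
"&Nu;": "&#925;",
"&Xi;": "&#926;",
"&Omicron;": "&#927;",
"&Pi;": "&#928;",
"&Rho;": "&#929;",
"&Sigma;": "&#931;",
"&Tau;": "&#932;",
"&Upsilon;": "&#933;",
"&Phi;": "&#934;",
"&Chi;": "&#935;",
"&Psi;": "&#936;",
"&Omega;": "&#937;",
"&alpha;": "&#945;",
"&beta;": "&#946;",
"&gamma;": "&#947;",
"&delta;": "&#948;",
"&epsilon;": "&#949;",
"&zeta;": "&#950;",
"&eta;": "&#951;",
"&theta;": "&#952;",
"&iota;": "&#953;",
"&kappa;": "&#954;",
"&lambda;": "&#955;",
"&mu;": "&#956;",
"&nu;": "&#957;",
"&xi;": "&#958;",
"&omicron;": "&#959;",
"&pi;": "&#960;",
"&rho;": "&#961;",
"&sigmaf;": "&#962;",
"&sigma;": "&#963;",
"&tau;": "&#964;",
"&upsilon;": "&#965;",
"&phi;": "&#966;",
"&chi;": "&#967;",
"&psi;": "&#968;",
"&omega;": "&#969;",
"&thetasym;": "&#977;",
"&upsih;": "&#978;",
"&piv;": "&#982;",
"&bull;": "&#8226;",
"&hellip;": "&#8230;",
"&prime;": "&#8242;",
"&Prime;": "&#8243;",
"&oline;": "&#8254;",
"&frasl;": "&#8260;",
"&weierp;": "&#8472;",
"&image;": "&#8465;",
"&real;": "&#8476;",
"&trade;": "&#8482;",
"&alefsym;": "&#8501;",
"&larr;": "&#8592;",
"&uarr;": "&#8593;",
"&rarr;": "&#8594;",
"&darr;": "&#8595;",
"&harr;": "&#8596;",
"&crarr;": "&#8629;",
"&lArr;": "&#8656;",
"&uArr;": "&#8657;",
"&rArr;": "&#8658;",
"&dArr;": "&#8659;",
"&hArr;": "&#8660;",
"&forall;": "&#8704;",
"&part;": "&#8706;",
"&exist;": "&#8707;",
"&empty;": "&#8709;",
"&nabla;": "&#8711;",
"&isin;": "&#8712;",
"&notin;": "&#8713;",
"&ni;": "&#8715;",
"&prod;": "&#8719;",
"&sum;": "&#8721;",
"&minus;": "&#8722;",
"&lowast;": "&#8727;",
"&radic;": "&#8730;",
"&prop;": "&#8733;",
"&infin;": "&#8734;",
"&ang;": "&#8736;",
"&and;": "&#8743;",
"&or;": "&#8744;",
"&cap;": "&#8745;",
"&cup;": "&#8746;",
"&int;": "&#8747;",
"&there4;": "&#8756;",
"&sim;": "&#8764;",
"&cong;": "&#8773;",
"&asymp;": "&#8776;",
"&ne;": "&#8800;",
"&equiv;": "&#8801;",
"&le;": "&#8804;",
"&ge;": "&#8805;",
"&sub;": "&#8834;",
"&sup;": "&#8835;",
"&nsub;": "&#8836;",
"&sube;": "&#8838;",
"&supe;": "&#8839;",
"&oplus;": "&#8853;",
"&otimes;": "&#8855;",
"&perp;": "&#8869;",
"&sdot;": "&#8901;",
"&lceil;": "&#8968;",
"&rceil;": "&#8969;",
"&lfloor;": "&#8970;",
"&rfloor;": "&#8971;",
"&lang;": "&#9001;",
"&rang;": "&#9002;",
"&loz;": "&#9674;",
"&spades;": "&#9824;",
"&clubs;": "&#9827;",
"&hearts;": "&#9829;",
"&diams;": "&#9830;",
"&quot;": "&#34;",
"&amp;": "&#38;",
"&lt;": "&#60;",
"&gt;": "&#62;",
"&OElig;": "&#338;",
"&oelig;": "&#339;",
"&Scaron;": "&#352;",
"&scaron;": "&#353;",
"&Yuml;": "&#376;",
"&circ;": "&#710;",
"&tilde;": "&#732;",
"&ensp;": "&#8194;",
"&emsp;": "&#8195;",
"&thinsp;": "&#8201;",
"&zwnj;": "&#8204;",
"&zwj;": "&#8205;",
"&lrm;": "&#8206;",
"&rlm;": "&#8207;",
"&ndash;": "&#8211;",
"&mdash;": "&#8212;",
"&lsquo;": "&#8216;",
"&rsquo;": "&#8217;",
"&sbquo;": "&#8218;",
"&ldquo;": "&#8220;",
"&rdquo;": "&#8221;",
"&bdquo;": "&#8222;",
"&dagger;": "&#8224;",
"&Dagger;": "&#8225;",
"&permil;": "&#8240;",
"&lsaquo;": "&#8249;",
"&rsaquo;": "&#8250;",
"&euro;": "&#8364;",
"&#x21;": "&#33;",
"&#x22;": "&#34;",
"&#x23;": "&#35;",
"&#x24;": "&#36;",
"&#x25;": "&#37;",
"&#x26;": "&#38;",
"&#x27;": "&#39;",
"&#x28;": "&#40;",
"&#x29;": "&#41;",
"&#x2A;": "&#42;",
"&#x2B;": "&#43;",
"&#x2C;": "&#44;",
"&#x2D;": "&#45;",
"&#x2E;": "&#46;",
"&#x2F;": "&#47;",
"&#x30;": "&#48;",
"&#x31;": "&#49;",
"&#x32;": "&#50;",
"&#x33;": "&#51;",
"&#x34;": "&#52;",
"&#x35;": "&#53;",
"&#x36;": "&#54;",
"&#x37;": "&#55;",
"&#x38;": "&#56;",
"&#x39;": "&#57;",
"&#x3A;": "&#58;",
"&#x3B;": "&#59;",
"&#x3C;": "&#60;",
"&#x3D;": "&#61;",
"&#x3E;": "&#62;",
"&#x3F;": "&#63;",
"&#x40;": "&#64;",
"&#x41;": "&#65;",
"&#x42;": "&#66;",
"&#x43;": "&#67;",
"&#x44;": "&#68;",
"&#x45;": "&#69;",
"&#x46;": "&#70;",
"&#x47;": "&#71;",
"&#x48;": "&#72;",
"&#x49;": "&#73;",
"&#x4A;": "&#74;",
"&#x4B;": "&#75;",
"&#x4C;": "&#76;",
"&#x4D;": "&#77;",
"&#x4E;": "&#78;",
"&#x4F;": "&#79;",
"&#x50;": "&#80;",
"&#x51;": "&#81;",
"&#x52;": "&#82;",
"&#x53;": "&#83;",
"&#x54;": "&#84;",
"&#x55;": "&#85;",
"&#x56;": "&#86;",
"&#x57;": "&#87;",
"&#x58;": "&#88;",
"&#x59;": "&#89;",
"&#x5A;": "&#90;",
"&#x5B;": "&#91;",
"&#x5C;": "&#92;",
"&#x5D;": "&#93;",
"&#x5E;": "&#94;",
"&#x5F;": "&#95;",
"&#x60;": "&#96;",
"&#x61;": "&#97;",
"&#x62;": "&#98;",
"&#x63;": "&#99;",
"&#x64;": "&#100;",
"&#x65;": "&#101;",
"&#x66;": "&#102;",
"&#x67;": "&#103;",
"&#x68;": "&#104;",
"&#x69;": "&#105;",
"&#x6A;": "&#106;",
"&#x6B;": "&#107;",
"&#x6C;": "&#108;",
"&#x6D;": "&#109;",
"&#x6E;": "&#110;",
"&#x6F;": "&#111;",
"&#x70;": "&#112;",
"&#x71;": "&#113;",
"&#x72;": "&#114;",
"&#x73;": "&#115;",
"&#x74;": "&#116;",
"&#x75;": "&#117;",
"&#x76;": "&#118;",
"&#x77;": "&#119;",
"&#x78;": "&#120;",
"&#x79;": "&#121;",
"&#x7A;": "&#122;",
"&#x7B;": "&#123;",
"&#x7C;": "&#124;",
"&#x7D;": "&#125;",
"&#x7E;": "&#126;",
"&#x80;": "&#8364;",
"&#x82;": "&#8218;",
"&#x83;": "&#402;",
"&#x84;": "&#8222;",
"&#x85;": "&#8230;",
"&#x86;": "&#8224;",
"&#x87;": "&#8225;",
"&#x88;": "&#710;",
"&#x89;": "&#8240;",
"&#x8A;": "&#352;",
"&#x8B;": "&#8249;",
"&#x8C;": "&#338;",
"&#x8E;": "&#381;",
"&#x91;": "&#8216;",
"&#x92;": "&#8217;",
"&#x93;": "&#8220;",
"&#x94;": "&#8221;",
"&#x95;": "&#8226;",
"&#x96;": "&#8211;",
"&#x97;": "&#8212;",
"&#x98;": "&#732;",
"&#x99;": "&#8482;",
"&#x9A;": "&#353;",
"&#x9B;": "&#8250;",
"&#x9C;": "&#339;",
"&#x9E;": "&#382;",
"&#x9F;": "&#376;",
"&#xA0;": "&#160;",
"&#xA1;": "&#161;",
"&#xA2;": "&#162;",
"&#xA3;": "&#163;",
"&#xA4;": "&#164;",
"&#xA5;": "&#165;",
"&#xA6;": "&#166;",
"&#xA7;": "&#167;",
"&#xA8;": "&#168;",
"&#xA9;": "&#169;",
"&#xAA;": "&#170;",
"&#xAB;": "&#171;",
"&#xAC;": "&#172;",
"&#xAD;": "&#173;",
"&#xAE;": "&#174;",
"&#xAF;": "&#175;",
"&#xB0;": "&#176;",
"&#xB1;": "&#177;",
"&#xB2;": "&#178;",
"&#xB3;": "&#179;",
"&#xB4;": "&#180;",
"&#xB5;": "&#181;",
"&#xB6;": "&#182;",
"&#xB7;": "&#183;",
"&#xB8;": "&#184;",
"&#xB9;": "&#185;",
"&#xBA;": "&#186;",
"&#xBB;": "&#187;",
"&#xBC;": "&#188;",
"&#xBD;": "&#189;",
"&#xBE;": "&#190;",
"&#xBF;": "&#191;",
"&#xC0;": "&#192;",
"&#xC1;": "&#193;",
"&#xC2;": "&#194;",
"&#xC3;": "&#195;",
"&#xC4;": "&#196;",
"&#xC5;": "&#197;",
"&#xC6;": "&#198;",
"&#xC7;": "&#199;",
"&#xC8;": "&#200;",
"&#xC9;": "&#201;",
"&#xCA;": "&#202;",
"&#xCB;": "&#203;",
"&#xCC;": "&#204;",
"&#xCD;": "&#205;",
"&#xCE;": "&#206;",
"&#xCF;": "&#207;",
"&#xD0;": "&#208;",
"&#xD1;": "&#209;",
"&#xD2;": "&#210;",
"&#xD3;": "&#211;",
"&#xD4;": "&#212;",
"&#xD5;": "&#213;",
"&#xD6;": "&#214;",
"&#xD7;": "&#215;",
"&#xD8;": "&#216;",
"&#xD9;": "&#217;",
"&#xDA;": "&#218;",
"&#xDB;": "&#219;",
"&#xDC;": "&#220;",
"&#xDD;": "&#221;",
"&#xDE;": "&#222;",
"&#xDF;": "&#223;",
"&#xE0;": "&#224;",
"&#xE1;": "&#225;",
"&#xE2;": "&#226;",
"&#xE3;": "&#227;",
"&#xE4;": "&#228;",
"&#xE5;": "&#229;",
"&#xE6;": "&#230;",
"&#xE7;": "&#231;",
"&#xE8;": "&#232;",
"&#xE9;": "&#233;",
"&#xEA;": "&#234;",
"&#xEB;": "&#235;",
"&#xEC;": "&#236;",
"&#xED;": "&#237;",
"&#xEE;": "&#238;",
"&#xEF;": "&#239;",
"&#xF0;": "&#240;",
"&#xF1;": "&#241;",
"&#xF2;": "&#242;",
"&#xF3;": "&#243;",
"&#xF4;": "&#244;",
"&#xF5;": "&#245;",
"&#xF6;": "&#246;",
"&#xF7;": "&#247;",
"&#xF8;": "&#248;",
"&#xF9;": "&#249;",
"&#xFA;": "&#250;",
"&#xFB;": "&#251;",
"&#xFC;": "&#252;",
"&#xFD;": "&#253;",
"&#xFE;": "&#254;",
"&#xFF;": "&#255;",
"\\x21": "&#33;",
"\\x22": "&#34;",
"\\x23": "&#35;",
"\\x24": "&#36;",
"\\x25": "&#37;",
"\\x26": "&#38;",
"\\x27": "&#39;",
"\\x28": "&#40;",
"\\x29": "&#41;",
"\\x2A": "&#42;",
"\\x2B": "&#43;",
"\\x2C": "&#44;",
"\\x2D": "&#45;",
"\\x2E": "&#46;",
"\\x2F": "&#47;",
"\\x30": "&#48;",
"\\x31": "&#49;",
"\\x32": "&#50;",
"\\x33": "&#51;",
"\\x34": "&#52;",
"\\x35": "&#53;",
"\\x36": "&#54;",
"\\x37": "&#55;",
"\\x38": "&#56;",
"\\x39": "&#57;",
"\\x3A": "&#58;",
"\\x3B": "&#59;",
"\\x3C": "&#60;",
"\\x3D": "&#61;",
"\\x3E": "&#62;",
"\\x3F": "&#63;",
"\\x40": "&#64;",
"\\x41": "&#65;",
"\\x42": "&#66;",
"\\x43": "&#67;",
"\\x44": "&#68;",
"\\x45": "&#69;",
"\\x46": "&#70;",
"\\x47": "&#71;",
"\\x48": "&#72;",
"\\x49": "&#73;",
"\\x4A": "&#74;",
"\\x4B": "&#75;",
"\\x4C": "&#76;",
"\\x4D": "&#77;",
"\\x4E": "&#78;",
"\\x4F": "&#79;",
"\\x50": "&#80;",
"\\x51": "&#81;",
"\\x52": "&#82;",
"\\x53": "&#83;",
"\\x54": "&#84;",
"\\x55": "&#85;",
"\\x56": "&#86;",
"\\x57": "&#87;",
"\\x58": "&#88;",
"\\x59": "&#89;",
"\\x5A": "&#90;",
"\\x5B": "&#91;",
"\\x5C": "&#92;",
"\\x5D": "&#93;",
"\\x5E": "&#94;",
"\\x5F": "&#95;",
"\\x60": "&#96;",
"\\x61": "&#97;",
"\\x62": "&#98;",
"\\x63": "&#99;",
"\\x64": "&#100;",
"\\x65": "&#101;",
"\\x66": "&#102;",
"\\x67": "&#103;",
"\\x68": "&#104;",
"\\x69": "&#105;",
"\\x6A": "&#106;",
"\\x6B": "&#107;",
"\\x6C": "&#108;",
"\\x6D": "&#109;",
"\\x6E": "&#110;",
"\\x6F": "&#111;",
"\\x70": "&#112;",
"\\x71": "&#113;",
"\\x72": "&#114;",
"\\x73": "&#115;",
"\\x74": "&#116;",
"\\x75": "&#117;",
"\\x76": "&#118;",
"\\x77": "&#119;",
"\\x78": "&#120;",
"\\x79": "&#121;",
"\\x7A": "&#122;",
"\\x7B": "&#123;",
"\\x7C": "&#124;",
"\\x7D": "&#125;",
"\\x7E": "&#126;",
"\\x80": "&#8364;",
"\\x82": "&#8218;",
"\\x83": "&#402;",
"\\x84": "&#8222;",
"\\x85": "&#8230;",
"\\x86": "&#8224;",
"\\x87": "&#8225;",
"\\x88": "&#710;",
"\\x89": "&#8240;",
"\\x8A": "&#352;",
"\\x8B": "&#8249;",
"\\x8C": "&#338;",
"\\x8E": "&#381;",
"\\x91": "&#8216;",
"\\x92": "&#8217;",
"\\x93": "&#8220;",
"\\x94": "&#8221;",
"\\x95": "&#8226;",
"\\x96": "&#8211;",
"\\x97": "&#8212;",
"\\x98": "&#732;",
"\\x99": "&#8482;",
"\\x9A": "&#353;",
"\\x9B": "&#8250;",
"\\x9C": "&#339;",
"\\x9E": "&#382;",
"\\x9F": "&#376;",
"\\xA1": "&#161;",
"\\xA2": "&#162;",
"\\xA3": "&#163;",
"\\xA4": "&#164;",
"\\xA5": "&#165;",
"\\xA6": "&#166;",
"\\xA7": "&#167;",
"\\xA8": "&#168;",
"\\xA9": "&#169;",
"\\xAA": "&#170;",
"\\xAB": "&#171;",
"\\xAC": "&#172;",
"\\xAD": "&#173;",
"\\xAE": "&#174;",
"\\xAF": "&#175;",
"\\xB0": "&#176;",
"\\xB1": "&#177;",
"\\xB2": "&#178;",
"\\xB3": "&#179;",
"\\xB4": "&#180;",
"\\xB5": "&#181;",
"\\xB6": "&#182;",
"\\xB7": "&#183;",
"\\xB8": "&#184;",
"\\xB9": "&#185;",
"\\xBA": "&#186;",
"\\xBB": "&#187;",
"\\xBC": "&#188;",
"\\xBD": "&#189;",
"\\xBE": "&#190;",
"\\xBF": "&#191;",
"\\xC0": "&#192;",
"\\xC1": "&#193;",
"\\xC2": "&#194;",
"\\xC3": "&#195;",
"\\xC4": "&#196;",
"\\xC5": "&#197;",
"\\xC6": "&#198;",
"\\xC7": "&#199;",
"\\xC8": "&#200;",
"\\xC9": "&#201;",
"\\xCA": "&#202;",
"\\xCB": "&#203;",
"\\xCC": "&#204;",
"\\xCD": "&#205;",
"\\xCE": "&#206;",
"\\xCF": "&#207;",
"\\xD0": "&#208;",
"\\xD1": "&#209;",
"\\xD2": "&#210;",
"\\xD3": "&#211;",
"\\xD4": "&#212;",
"\\xD5": "&#213;",
"\\xD6": "&#214;",
"\\xD7": "&#215;",
"\\xD8": "&#216;",
"\\xD9": "&#217;",
"\\xDA": "&#218;",
"\\xDB": "&#219;",
"\\xDC": "&#220;",
"\\xDD": "&#221;",
"\\xDE": "&#222;",
"\\xDF": "&#223;",
"\\xE0": "&#224;",
"\\xE1": "&#225;",
"\\xE2": "&#226;",
"\\xE3": "&#227;",
"\\xE4": "&#228;",
"\\xE5": "&#229;",
"\\xE6": "&#230;",
"\\xE7": "&#231;",
"\\xE8": "&#232;",
"\\xE9": "&#233;",
"\\xEA": "&#234;",
"\\xEB": "&#235;",
"\\xEC": "&#236;",
"\\xED": "&#237;",
"\\xEE": "&#238;",
"\\xEF": "&#239;",
"\\xF0": "&#240;",
"\\xF1": "&#241;",
"\\xF2": "&#242;",
"\\xF3": "&#243;",
"\\xF4": "&#244;",
"\\xF5": "&#245;",
"\\xF6": "&#246;",
"\\xF7": "&#247;",
"\\xF8": "&#248;",
"\\xF9": "&#249;",
"\\xFA": "&#250;",
"\\xFB": "&#251;",
"\\xFC": "&#252;",
"\\xFD": "&#253;",
"\\xFE": "&#254;",
"\\xFF": "&#255;",
"&#128;": "&#8364;",
"&#130;": "&#8218;",
"&#131;": "&#402;",
"&#132;": "&#8222;",
"&#133;": "&#8230;",
"&#134;": "&#8224;",
"&#135;": "&#8225;",
"&#136;": "&#710;",
"&#137;": "&#8240;",
"&#138;": "&#352;",
"&#139;": "&#8249;",
"&#140;": "&#338;",
"&#142;": "&#381;",
"&#145;": "&#8216;",
"&#146;": "&#8217;",
"&#147;": "&#8220;",
"&#148;": "&#8221;",
"&#149;": "&#8226;",
"&#150;": "&#8211;",
"&#151;": "&#8212;",
"&#152;": "&#732;",
"&#153;": "&#8482;",
"&#154;": "&#353;",
"&#155;": "&#8250;",
"&#156;": "&#339;",
"&#158;": "&#382;",
"&#159;": "&#376;"
}}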

