Google Open Sources Gumbo HTML5 parser

Google's developer team confirmed this week that it has open-sourced the Gumbo HTML parser, a C library that implements the HTML5 parsing algorithm.

NOTE: A parser receives source program instructions, interactive online commands, and other defined sequential inputs (including markup tags) and breaks them down into component parts so that programming engines, such as those inside a compiler, can process them.
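
To make the note above concrete, here is a minimal sketch in C of the "break it down into component parts" step for markup. It is not Gumbo's code and it does not implement the HTML5 parsing algorithm; the names `Token`, `TokenKind`, and `tokenize_markup` are hypothetical and exist only for this illustration. It simply splits a string into tag tokens and text tokens.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for this sketch; not part of Gumbo. */
typedef enum { TOKEN_TAG, TOKEN_TEXT } TokenKind;

typedef struct {
    TokenKind kind;
    size_t start;   /* byte offset of the token's first character */
    size_t length;  /* number of bytes in the token */
} Token;

/* Splits `input` into tag tokens ("<...>") and text tokens (everything
 * between tags). Writes at most `max_tokens` entries into `out` and
 * returns the number of tokens written. */
size_t tokenize_markup(const char *input, Token *out, size_t max_tokens)
{
    assert(input != NULL && out != NULL);

    size_t len = strlen(input);
    size_t count = 0;
    size_t i = 0;

    while (i < len && count < max_tokens) {
        size_t start = i;
        TokenKind kind = TOKEN_TEXT;

        if (input[i] == '<') {
            const char *close = strchr(input + i, '>');
            if (close != NULL) {
                /* A complete tag: consume up to and including '>'. */
                kind = TOKEN_TAG;
                i = (size_t)(close - input) + 1;
            } else {
                /* Unterminated '<': treat the rest of the input as text. */
                i = len;
            }
        } else {
            /* Text runs until the next '<' or the end of the input. */
            const char *open = strchr(input + i, '<');
            i = (open != NULL) ? (size_t)(open - input) : len;
        }

        out[count].kind = kind;
        out[count].start = start;
        out[count].length = i - start;
        count++;
    }
    return count;
}
```

Calling `tokenize_markup("<p>Hi <b>there</b></p>", tokens, 8)` yields six tokens that alternate between tags and text. A real HTML5 parser goes much further, handling attributes, comments, character references, and error recovery.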

Google's wider motives with this move are (one hopes) openly philanthropic.

If other browser developers follow Google's lead, all HTML5 code could end up being parsed in the same way.

For its part, Google has already explained that one of the big accomplishments of the HTML5 standard was the standardization of the HTML parsing algorithm, which means that all browsers will see the same HTML document in the same way.

"So far, most implementations of this algorithm have either been tied to specific browsers or rendering engines, or they've been written in specific scripting languages. This makes it hard to write quick one-off tools to manipulate and clean up HTML if you don't happen to be working in a language that already has an HTML5-compatible parsing library," said Jonathan Tang, of Google's search features team.

"Gumbo seeks to provide a simple library that can serve as a basic building block for linters, refactoring tools, templating languages, page analysis, and other small programs that need to manipulate HTML. It's written in pure C for ease of interfacing with other languages, and has no outside dependencies. Gumbo was built from the start to support source locations and correlating nodes in the parse tree with positions in the original text," added Tang.

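Tang's point about source locations can be illustrated with a short sketch in C. The idea is that a tool holding a node's byte offset in the original text can report a line and column for it. The `SourcePos` type and `position_at` function below are hypothetical names made up for this example; they are not Gumbo's API, and Gumbo may well track positions differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical position record for this sketch; not Gumbo's own type. */
typedef struct {
    size_t line;    /* 1-based line number */
    size_t column;  /* 1-based column, counted in bytes */
    size_t offset;  /* 0-based byte offset into the text */
} SourcePos;

/* Converts a byte offset into a line/column pair by scanning the text
 * from the beginning and counting newline characters. */
SourcePos position_at(const char *text, size_t offset)
{
    assert(text != NULL && offset <= strlen(text));

    SourcePos pos = { 1, 1, offset };
    for (size_t i = 0; i < offset; i++) {
        if (text[i] == '\n') {
            pos.line++;
            pos.column = 1;
        } else {
            pos.column++;
        }
    }
    return pos;
}
```

A linter built this way could take the offset of a suspicious tag and print a message pointing at its line and column. Rescanning from the start on every lookup is the simplest choice for a sketch; a tool that does many lookups would more likely record positions while parsing.
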