Visual Content Tool To "Understand" Web In Context

Silicon Valley-based Diffbot Corporation is courting web developers with its visual content and layout recognition (VCLR) technology. Diffbot provides an API that exposes a "visual understanding" of web content in machine-readable form.

The company says it has "categorized the web" into approximately 20 different page types, which can then be visually analyzed using layout and contextual cues. Its developers say that this technology has been built to "perceive context" in the same way that humans do; that is, by recognizing common page layout elements (such as headlines, bylines, and article bodies), contextual keywords, and content changes buried deep within pages.

The resulting proposition for developers is an invitation to build applications that call the Diffbot API to "follow websites" and reflect any changes in the application itself. Essentially, this is a tool for aggregating web content and tailoring it for personalization inside a developer's own application, which may present it in a new structure or pair it with a call to action.

Diffbot cofounder Mike Tung said that he came up with the idea while at college when he realized that he wanted to be instantly notified when new assignments were posted on class websites.

"Diffbot is an incredibly sophisticated tool for developers to rapidly build compelling applications around web content," said Sky Dayton, founder of EarthLink and Boingo, and investor in Diffbot. "The more developers use Diffbot, the more it learns about and adds structure to data on the web. This technology is becoming the basis for a new kind of web experience enhanced by machine interpretation of content."

Diffbot applies natural language processing, cross-referencing content against Wikipedia to determine relevance by context and deliver keyword tags. For example, Diffbot can determine that an article about Barack Obama is related to "politics" even though the word doesn't appear in the article, or that an article about a new computer is about Apple the technology company and not apple the fruit.
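To make the idea of context-based tagging concrete, here is a toy sketch in Python. It is not Diffbot's method or API: the `SENSES` table and the `suggest_tags` function are hypothetical, and the hand-written table merely stands in for the Wikipedia cross-referencing described above. Each known term lists candidate topics with cue words, and the topic whose cues overlap most with the article text wins.

```python
# Toy illustration only. SENSES is a hand-written, hypothetical stand-in for
# a real knowledge base such as Wikipedia; it is not part of any Diffbot API.
SENSES = {
    "apple": [
        {"topic": "technology", "cues": {"computer", "iphone", "mac", "software"}},
        {"topic": "food", "cues": {"fruit", "orchard", "pie", "juice"}},
    ],
    "obama": [
        {"topic": "politics", "cues": {"president", "senate", "election"}},
    ],
}


def suggest_tags(text):
    """Return topic tags for known terms, picking the sense whose cue
    words overlap most with the other words in the text."""
    # Crude tokenization: split on whitespace, strip punctuation, lowercase.
    words = [w.strip(".,!?\"'()").lower() for w in text.split()]
    vocab = set(words)
    tags = []
    for word in words:
        senses = SENSES.get(word)
        if not senses:
            continue
        # Score each candidate sense by how many of its cues appear in the text.
        best = max(senses, key=lambda s: len(s["cues"] & vocab))
        if best["topic"] not in tags:
            tags.append(best["topic"])
    return tags


if __name__ == "__main__":
    print(suggest_tags("Apple unveiled a new computer today."))
    print(suggest_tags("She baked an apple pie with fruit from the orchard."))
    print(suggest_tags("Obama spoke to the Senate."))
```

Note that the last example yields a "politics" tag even though that word never appears in the sentence, which mirrors the behavior the article describes. A production system would need far richer context than a cue-word count.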

Diffbot's technology consists of two types of APIs: one for extracting headline and article content, including pictures, and one for following changes or updates made to any web page. Diffbot allows developers to build applications that can extract and analyze information displayed on an article page; understand keywords and phrases in the context of the larger article; and generate tags to allow developers to categorize, sort, and personalize content. It can also generate an RSS feed, enabling an application to follow anything on the Internet.
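The "follow" idea can be illustrated with a minimal Python sketch. This does not use or reproduce Diffbot's API; `changed_lines` is a hypothetical helper that compares two text snapshots of a page using the standard library's `difflib` and reports what is new. How the snapshots are fetched and stored is left out and would be up to the application.

```python
import difflib


def changed_lines(old_text, new_text):
    """Return the lines that appear in new_text but not in old_text.

    Hypothetical helper for illustration; a real service would also need
    to fetch pages, strip markup, and ignore noise such as ads or timestamps.
    """
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff prefixes added lines with "+ "; drop the two-character prefix.
    return [line[2:] for line in diff if line.startswith("+ ")]


if __name__ == "__main__":
    yesterday = "Assignments\nHomework 1 due Monday"
    today = "Assignments\nHomework 1 due Monday\nHomework 2 due Friday"
    for line in changed_lines(yesterday, today):
        print("New on page:", line)
```

An application could run a comparison like this on a schedule and turn each new line into a notification or feed item, much like the class-website use case that inspired the product.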
