
After XML, JSON: Then What?


The need for a format to serialize data is as old as networking itself. In the early days of data processing, the problem was attacked by use of binary protocols — that is, protocols with data that was not human readable. These were frequently custom-defined on an ad hoc basis. The sender and receiver had to agree on where fields were located and what they contained in order to exchange data. These schemes eventually gave way, in part, to emerging standards such as ASN.1.


One of the most successful of these early protocols came from the UNIX world in the mid-1980s, as servers needed some way to exchange data. The resulting XDR format, proposed by Sun, solved one key problem; namely, how to exchange binary data when systems used different endian schemes — a constant problem in the UNIX heyday. XDR was quietly successful and is still found today in NFS and other protocols, as well as in modern products such as Mozilla's SpiderMonkey JavaScript engine where it's used for serializing compiled JavaScript.
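To make the endian point concrete, here is a minimal sketch of the idea behind XDR: every integer goes on the wire as four big-endian bytes, whatever the byte order of the sending machine. It uses only Python's standard struct module; the function names are hypothetical and this is not a full XDR implementation.

```python
import struct


def encode_xdr_int(value: int) -> bytes:
    """Encode a signed 32-bit integer as 4 big-endian bytes, the XDR wire order."""
    return struct.pack(">i", value)


def decode_xdr_int(data: bytes) -> int:
    """Decode 4 big-endian bytes back into a signed 32-bit integer."""
    return struct.unpack(">i", data)[0]


def encode_xdr_string(text: str) -> bytes:
    """Encode a string as a 4-byte length followed by the bytes,
    zero-padded so the total is a multiple of 4 bytes."""
    raw = text.encode("ascii")
    padding = (4 - len(raw) % 4) % 4
    return struct.pack(">I", len(raw)) + raw + b"\x00" * padding


# The same value laid out in the two byte orders a 1980s network might mix.
wire_form = encode_xdr_int(1)           # what XDR puts on the wire
little_endian = struct.pack("<i", 1)    # what a little-endian host holds in memory

print(wire_form.hex())
print(little_endian.hex())
print(decode_xdr_int(wire_form))
```

Because both sides agree on the wire order, neither needs to know the other's native layout; each converts to and from its own representation at the boundary.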

By the mid-1990s, under pressure from the rapidly growing Internet, new standards were needed. XDR, for example, was not human-readable, and it was generally felt that a human-readable representation in keeping with SGML (the markup superset from which HTML is derived) would be a good thing. This turned out to be XML, and by the end of the century it was already in wide use. All major languages had XML libraries, and the format was used whenever and wherever any kind of human-readable representation of data was required. It proved so popular that it moved into areas for which it was never intended, such as text markup (in DocBook, for example). The addition of secondary XML technologies, such as XSLT, enabled this.

However, for all its popularity, XML has several significant drawbacks. The first one is the complexity of schemata, which require specialized skills to implement correctly. The second, and by far the biggest factor, is performance. XML is wordy and slow to process. A senior architect at a financial services firm told me recently that in order to optimize the performance of their key business logic servers, they'd done a deep analysis of what exactly was happening with each transaction. They discovered, to their dismay, that almost 50% of their server CPU cycles were consumed encoding and decoding XML. Other organizations have surely recognized, at various points, the significant processing overhead that XML imposes.
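The wordiness is easy to see with a small, admittedly artificial comparison. The sketch below serializes one hypothetical record (the field names and values are invented for illustration) as XML and as compact JSON using Python's standard library. It compares size only; it says nothing about the CPU cost the architect measured, which depends on the parser and the workload.

```python
import json
import xml.etree.ElementTree as ET

# A hypothetical record; names and values are illustrative only.
record = {"id": "42", "symbol": "ACME", "price": "19.99"}


def to_xml(rec: dict) -> str:
    """Serialize the record as one XML element with a child element per field."""
    root = ET.Element("trade")
    for key, value in rec.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


def to_json(rec: dict) -> str:
    """Serialize the record as JSON without optional whitespace."""
    return json.dumps(rec, separators=(",", ":"))


xml_text = to_xml(record)
json_text = to_json(record)

print(xml_text)
print(json_text)
print(len(xml_text), len(json_text))
```

Every field name appears twice in the XML form, once in the opening tag and once in the closing tag, which is where most of the extra bytes come from.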

Predictably, a smaller alternative emerged over the last few years as the JavaScript revolution has reshaped software development: JSON. Standard JSON can be read as JavaScript and it has the additional benefit of being widely supported with various tools and libraries. However, as its use has been extended to new areas, such as databases, it's become clear that it lacks some desirable traits. Two of them: it has no date data type, and it doesn't support comments. These shortcomings have already led to variants, such as BSON, the binary JSON format devised by 10gen and used in place of JSON in their MongoDB NoSQL database.
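Both gaps show up quickly in practice. The following sketch uses Python's standard json module; the timestamp and the helper names are hypothetical. It shows that a date value has to be smuggled through as a string by convention (ISO 8601 here, which is my choice, not something JSON prescribes), and that a strict parser rejects a document containing a comment.

```python
import json
from datetime import datetime, timezone

# A hypothetical timestamp used only for illustration.
stamp = datetime(2013, 5, 1, 12, 0, tzinfo=timezone.utc)


def try_dump(obj):
    """Return the JSON text, or a description of the error if encoding fails."""
    try:
        return json.dumps(obj)
    except TypeError as exc:
        return f"TypeError: {exc}"


def try_load(text):
    """Return the parsed value, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# 1. No date type: the encoder has nothing to map a datetime onto.
date_result = try_dump({"created": stamp})

# Common workaround: agree on a string convention and convert at both ends.
encoded = json.dumps({"created": stamp.isoformat()})
decoded = json.loads(encoded)
restored = datetime.fromisoformat(decoded["created"])

# 2. No comments: a strict parser rejects this document outright.
commented = '{\n  // retry limit\n  "retries": 3\n}'
comment_result = try_load(commented)

print(date_result)
print(encoded, restored == stamp)
print(comment_result)
```

The workaround works, but only because sender and receiver share an out-of-band agreement about which strings are dates, which is exactly the kind of ad hoc arrangement a serialization format is supposed to remove.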

Frustration with JSON has spurred examination and proposal of entirely new schemes. Perhaps one of the most interesting is TOML from Tom Preston-Werner, a cofounder of GitHub. It has the brevity of JSON, although it uses a different notational scheme, one akin to configuration files, with key-value pairs specified one per line and grouped under bracketed section names. It's a take-off on the format of .ini files first popularized by Microsoft, but with many conveniences added in. While there are already libraries in several languages supporting it, it's not clear if TOML will gain sufficient traction. TOML is by no means the only alternative under development. For example, Protocol Buffers is a low-overhead, high-speed data exchange format, of particular appeal to C and C++ programmers, that was developed at Google and is widely used there.

In my estimation, the one standard that seems to have almost all the desirable features is YAML. While a big standard (some 80 pages), it is remarkably concise in practice and highly readable. It borrows Python's use of whitespace to indicate the start and end of blocks and subblocks. YAML mostly avoids quotation marks, brackets, braces, and open/close tags, which enhances its readability. It also supports references (anchors and aliases, in YAML's terminology), which are ways to refer to a previously defined element. So, if an element is repeated later in a YAML document, you can simply refer to it using a shorthand name. Finally, YAML supports all the standard data types and can map easily to lists, hashes, or simply individual data items.
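A short sketch shows the indentation-based blocks and the reference mechanism together. It assumes the third-party PyYAML package is installed (it is not part of Python's standard library), and the document contents are hypothetical. The `&defaults` marker names an element; `*defaults` refers back to it later.

```python
import yaml  # third-party PyYAML package, assumed to be installed

# A hypothetical document; every key and value is illustrative.
document = """
defaults: &defaults
  adapter: postgres
  port: 5432

development:
  settings: *defaults
  hosts:
    - alpha
    - beta
"""

data = yaml.safe_load(document)

# The alias expands to the element that the anchor named.
print(data["development"]["settings"])

# Indented blocks become hashes; dash-prefixed items become lists.
print(data["development"]["hosts"])

# Scalars are typed without any quoting or annotation.
print(type(data["defaults"]["port"]).__name__)
```

I use safe_load rather than load here because it restricts the parser to plain data types, which is the sensible default for documents you did not write yourself.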

YAML is widely supported by libraries in all the principal languages. Its biggest drawback seems to be political rather than technical; namely, that it has not gained the kind of mindshare that would give it the wide acceptance any such protocol needs. Still, if your goal is to elegantly solve the problem of data serialization, especially for internal use, YAML might be exactly the solution you're looking for.

— Andrew Binstock
Editor in Chief
alb@drdobbs.com
Twitter: platypusguy

