The need for a format to serialize data is as old as networking itself. In the early days of data processing, the problem was attacked by use of binary protocols that is, protocols with data that was not human readable. These were frequently custom-defined on an ad hoc basis. The sender and receiver had to agree on where fields were located and what they contained in order to exchange data. These schemes eventually gave way, in part, to emerging standards such as ASN.1.
White PapersMore >>
- Mobile Content Management: What You Really Need to Know
- How to Prep and Modernize IT For Cloud Computing
By the mid-1990s, under pressure from the rapidly growing Internet, new standards were needed. XDR, for example, was not human-readable and it was generally felt that a human-readable representation that was in keeping with SGML the markup superset from which HTML is derived would be a good thing. This turned out to be XML. And by the end of the century, it was already in wide use. All major languages had XML libraries and the format was used whenever and wherever any kind of human representation of data was required. It proved so popular that it moved into areas it was never intended to be, such as text mark-up (in DocBook, for example). The addition of secondary XML technologies, such as XSLT, enabled this.
However, for all its popularity, XML has several significant drawbacks. The first one is the complexity of schemata, which require specialized skills to implement correctly. The second, and by far the biggest factor, is performance. XML is wordy and slow to process. A senior architect at a financial services firm told me recently that in order to optimize the performance of their key business logic servers, they'd done a deep analysis of what exactly was happening with each transaction. They discovered, to their dismay, that almost 50% of their server CPU cycles were consumed encoding and decoding XML. Other organizations have surely recognized, at various points, the significant processing overhead that XML imposes.
date data type and it doesn't support comments. These shortcomings have already led to variants, such as BSON, the binary JSON format devised by 10gen and used in their MongoDB NoSQL database, instead of JSON.
Frustration with JSON has spurred examination and proposal of entirely new schemes. Perhaps one of the most interesting is TOML from Tom Preston-Werner, a cofounder of GitHub. It has the brevity of JSON, although it uses a different notational scheme, that's akin to configuration files with key-value pairs specified one per line and grouped by bracketed item names. It's a take-off on the format of .ini files first popularized by Microsoft, but with many conveniences added in. While there are already libraries in several languages supporting it, it's not clear if TOML will gain sufficient traction. TOML is by no means the only alternative under development. For example, Protocol Buffers is a low-overhead, high-speed data exchange format of particular appeal to C and C++ programmers, that was developed at Google and is widely used there.
In my estimation, the one standard that seems to have almost all the desirable features is YAML. While a big standard (some 80 pages), it is remarkably concise in practice and highly readable. It borrows Python's use of whitespace to indicate the start and end of blocks and subblocks. YAML mostly avoids quotation marks, brackets, braces, and open/close-tags, which enhances its readability. It also contains references, which are ways to refer to a previously defined element. So, if an element is repeated later in a YAML document, you can simply refer to the element using a short-hand name. Finally, YAML supports all the standard data types and can map easily to lists, hashes, or simply individual data items.
YAML is widely supported by libraries in all the principal languages. Its biggest drawback seems to be political rather than technical; namely, that it has not gained the kind of mindshare that would give it the wide acceptance any such protocol needs. Still, if your goal is to elegantly solve the problem of data serialization, especially for internal use, YAML might be exactly the solution you're looking for.