World's Largest Database: The Web of Data
While 'Big Data' has been a hot button for venture capitalists and the trade press, there have been several years of progress in the 'Biggest Data' arena. The Semantic Web meme, with some prodding by the World Wide Web Consortium (W3C), has morphed into the Linked Data meme. And the notion of building the world's largest database has gained traction and continues to move forward.
The basic premise is to augment search by making data, even raw data, accessible within the Web browsing experience. Moving from a web of linked documents to a web of linked data provides a key building block for Web 3.0. A useful resource for data modelers and database geeks, "Open Conceptual Data Models", describes the transition as moving from
a Web linked at the document-to-document level, to one linked at the entity-to-entity level.
There were growing Linked Data, Open Data, and Web 3.0 movements by 2008 when Bob DuCharme and I chaired the LinkedData Planet conference. In picking a speaker lineup, we included a keynote by W3C Director Sir Tim Berners-Lee. It was Berners-Lee who had started the ball rolling on tying together the concepts of linked data and the next-generation web. In 2006 he'd written a paper about linked data design issues and he followed that with a campaign of public advocacy. He had written:
The Semantic Web isn't just about putting data on the web. It is about making links, so that a person or machine can explore the web of data. With linked data, when you have some of it, you can find other, related, data.
Two W3C specifications provide vital information to developers who want to publish and exploit Linked Data. The Resource Description Framework (RDF) provides a data model for web information, representing it as a directed, labeled graph. The unit of the RDF model is the triple: subject, predicate, object.
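As a minimal sketch of the triple model, plain Python tuples will do; the URIs below are illustrative examples (the FOAF vocabulary is real, the entity URIs are made up):

```python
# Each RDF statement is a triple: (subject, predicate, object).
# Subjects and predicates are URIs; objects may be URIs or literals.
triples = [
    ("http://example.org/person/alice", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("http://example.org/person/alice", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/person/bob"),
    ("http://example.org/person/bob", "http://xmlns.com/foaf/0.1/name", "Bob"),
]

# The triples form a directed, labeled graph: subjects and objects are
# nodes, and each predicate labels an edge between them.
for s, p, o in triples:
    print(f"{s} --[{p}]--> {o}")
```

Because every statement has the same three-part shape, independently published triples that share URIs merge naturally into one graph, which is what makes the model suit a distributed web of data.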
SPARQL provides a query language and protocol for operations on RDF data sets. SPARQL can return query results as RDF graphs or as tabular result sets. Because working with RDF and SPARQL is like working directly with SQL, sites that publish linked data might choose to offer APIs. The proliferation of SQL was due in part to a proliferation of tools that wrapped SQL and provided a higher level of abstraction. Simple, easily understood APIs will likely do the same for RDF and SPARQL.
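To convey the flavor of SPARQL matching without a triple-store dependency, here is a toy matcher for a single triple pattern, where Python's None plays the role of a SPARQL variable; the prefixed names in the data are shorthand for full URIs and are made up for illustration:

```python
# A SPARQL basic graph pattern matches triples against a pattern that
# mixes fixed terms with variables. Here None stands in for a variable.
def match(triples, pattern):
    """Return the triples matching an (s, p, o) pattern; None matches anything."""
    return [t for t in triples
            if all(term is None or term == part
                   for term, part in zip(pattern, t))]

triples = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", "Bob"),
]

# Roughly analogous to: SELECT ?s ?o WHERE { ?s foaf:name ?o }
names = match(triples, (None, "foaf:name", None))
print(names)  # [('ex:alice', 'foaf:name', 'Alice'), ('ex:bob', 'foaf:name', 'Bob')]
```

Real SPARQL engines join many such patterns, but the core operation is this kind of pattern match over the graph.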
Recipe for Linked Data
In his New York keynote address and in his writings, Tim Berners-Lee has explained Linked Data basics, as have Tom Heath (Talis) and others.
First, think in terms of creating typed links; RDF's typed links have an advantage over untyped HTML links. Use Uniform Resource Identifiers (URIs) as the names or identifiers for specific entities, and use HTTP URIs so those names can be looked up. Dereferencing a URI over HTTP retrieves a serialized byte stream, a description of the entity. This approach makes for machine-readable links that have explicit meaning and yield information when the RDF is retrieved or queried with SPARQL. Last, but not least, include links to the URIs of related items.
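Dereferencing a Linked Data URI is ordinary HTTP with content negotiation: the client asks for an RDF serialization via the Accept header. A sketch using Python's standard library follows; the request is only constructed here, not sent, and the URI shown is DBpedia's well-known identifier for Tim Berners-Lee:

```python
import urllib.request

# A Linked Data client requests an RDF description of an entity by
# setting the Accept header on an ordinary HTTP request; the server
# typically responds with RDF or redirects to a describing document.
uri = "http://dbpedia.org/resource/Tim_Berners-Lee"
req = urllib.request.Request(uri, headers={"Accept": "text/turtle"})

print(req.full_url)
print(req.get_header("Accept"))
# To actually dereference: urllib.request.urlopen(req).read()
```

The same URI can thus serve as both a name for the entity and an address from which to retrieve data about it.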
Kingsley Idehen points out that this data access mechanism is likely to become as ubiquitous as using Open Database Connectivity (ODBC) to access SQL databases. He said:
A Web of Linked Data is simply about the application of Web architecture to the time-tested concept of “Data Access by Reference.” It is about publication, derivation, and dissemination of structured data that exposes discrete entities that exploit the prowess of the HTTP protocol for Naming and Name Resolution. Linked Data is not different (conceptually) to Open Database Connectivity (ODBC) or Java Database Connectivity (JDBC) when exploring the subject of “Data Access by Reference.”
Free, Linked Open Data
Linked data is seen as an enabling technology for solutions to technology and architecture issues related to open data and open government, in part because of its effect of reducing data silos. The W3C sponsored a Linking Open Data (LOD) project to identify existing open data sets that were candidates for conversion to RDF and publication as Linked Data. By May 2009, the Web of Data had grown to 142 million RDF links, but the best was yet to come.
Sir Tim's campaigning produced endorsement of the Open Data and Linked Data concepts at the highest levels in the US and UK governments. Both the US and UK governments established government portals that published a large number of linked data sets. Nigel Shadbolt, a professor of Computer Science and member of the Web Science Research Group at the University of Southampton, is a key figure in the UK effort. He's worked with Sir Tim Berners-Lee to develop the technology behind the UK's data-sharing portal (data.gov.uk).
Data.gov.uk publishes a variety of data sets that are freely available under the UK Open Government Licence (OGL). Much of the data is fine-grained, both geographically and temporally, covering locations in the UK and worldwide. Because the data volume is large and growing, developers are likely to start by downloading the metadata for the data sets, which is available in CSV and JSON formats.
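As a sketch of working with such dataset metadata, assume a minimal JSON structure; the field names and URLs below are illustrative, not the portal's actual schema:

```python
import json

# Hypothetical excerpt of dataset metadata, in the spirit of what a
# data portal might publish; the field names are assumptions.
metadata_json = """
[
  {"title": "Road Traffic Counts", "format": "CSV",
   "url": "http://example.gov.uk/data/traffic.csv"},
  {"title": "School Locations", "format": "JSON",
   "url": "http://example.gov.uk/data/schools.json"}
]
"""

datasets = json.loads(metadata_json)

# Filter the catalog before deciding which data sets to download.
csv_sets = [d["title"] for d in datasets if d["format"] == "CSV"]
print(csv_sets)  # ['Road Traffic Counts']
```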
Besides providing a portal to data, the data.gov.uk site also provides menu options for finding Apps and Linked Data. The site publishes linked data from public agencies, government departments, Members of Parliament, and other government sectors, with one or more SPARQL endpoints for each sector.
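A SPARQL endpoint is queried over HTTP, conventionally by passing the query text in a `query` parameter. A sketch that builds such a request URL with the standard library; the endpoint address is a made-up placeholder, not a real data.gov.uk endpoint:

```python
from urllib.parse import urlencode

# The SPARQL Protocol sends a query to an endpoint over HTTP; for a GET
# request, the query text travels URL-encoded in the "query" parameter.
endpoint = "http://example.gov.uk/sparql"  # hypothetical endpoint address
query = "SELECT ?s WHERE { ?s a ?type } LIMIT 10"

url = endpoint + "?" + urlencode({"query": query})
print(url)
```

Sending a GET request to such a URL would return the query's result set, typically as SPARQL XML or JSON results.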
The US portal (data.gov) also publishes linked data under a menu option labeled Semantic Web. Professor James Hendler and his developers at the Tetherless World Constellation at Rensselaer Polytechnic Institute have developed a variety of apps that demonstrate the use of linked data sets. If you want to know which public companies have filed for bankruptcy or who's among the most frequent visitors to the White House, you can find the information at data.gov. By May 2010, the US government had published about 400 data sets at data.gov, consisting of 6.4 billion RDF triples.
In addition to the US and UK portals, national libraries in countries such as Germany and Hungary have started publishing linked data sets. By September 2010, the Linked Open Data cloud had grown to 203 linked datasets, 25 billion RDF triples, and 395 million RDF links.
If there is near-universal adoption of Linked Data, we will create a global database: the Web of Data. It will be the world's largest collection of data achieved through a systematic effort to provide granular access to data items. The data will be distributed across time zones, much of it in the cloud, including the computing clouds of governments such as the US and UK.