Extracting meaning from the mounds of data available on the Web is no easy task. Take the case of Data.gov, where the Obama Administration has posted at least 272,000 sets of raw data from its departments, agencies, and offices.
“Data.gov mandates that all information is accessible from the same place, but the data is still in a hodgepodge of different formats using differing terms, and therefore challenging at best to analyze and take advantage of,” explained James Hendler, the Tetherless World Research Constellation professor of computer and cognitive science at Rensselaer Polytechnic Institute. “We are developing techniques to help people mine, mix, and mash-up this treasure trove of data, letting them find meaningful information and interconnections.
“An unfathomable amount of data resides on the Web,” Hendler said. “We want to help people get as much mileage as possible out of that data and put it to work for all mankind.”
The Rensselaer team has figured out how to find relationships among the billions of bits of government data. The team pulls pieces from different places on the Web, using technology that helps computers and software understand the data, and then combines those pieces in new and imaginative ways as “mash-ups,” which mix data from two or more sources and present the result in easy-to-use, visual forms.
By combining data from different sources, data mash-ups identify new, sometimes unexpected relationships. The approach makes it possible to put all that information buried on the Web to use and to answer myriad questions.
“We think the ability to create these kinds of mash-ups will be invaluable for students, policy makers, journalists, and many others,” said Deborah McGuinness, another constellation professor in Rensselaer’s Tetherless World Research Constellation. “We’re working on designing simple yet robust Web technologies that allow someone with absolutely no expertise in Web Science or semantic programming to pull together data sets from Data.gov and elsewhere and weave them together in a meaningful way.”
While the Rensselaer approach makes government data more accessible and useful to the public, it also means government agencies can share information more readily.
“The inability of government agencies to exchange their data has been responsible for a lot of problems,” said Hendler. “For example, the failure to detect and scuttle preparations for 9/11 and the ‘underwear bomber’ were both attributed in a large part to information-sharing failures.”
The website developed by Hendler, McGuinness, Peter Fox (the third professor in the Tetherless World Research Constellation), and their students provides stunning examples of what this approach can accomplish. It also has video presentations and step-by-step do-it-yourself tutorials for those who want to mine the treasure trove of government data for themselves.
Hendler started Rensselaer’s Data-Gov project in June 2009, one month after the government launched Data.gov, because he saw the new program as an opportunity to demonstrate the value of Semantic Web languages and tools. Hendler and McGuinness are both leaders in Semantic Web technologies, sometimes called Web 3.0, and were two of the first researchers working in that field.
Using Semantic Web representations, multiple data sets can be linked even when their underlying structures, or formats, differ. Once data is converted from its original format into these representations, it becomes accessible to any number of standard Web technologies.
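To make the idea of a shared representation concrete, here is a minimal sketch, not the Rensselaer team's actual tooling. It assumes two small, made-up sources (one CSV, one JSON) that use different field names, and converts both into subject-predicate-object triples, the basic shape of Semantic Web data. The namespace, the property names, and the sample values are all invented for illustration, and plain Python tuples stand in for a real RDF library.

```python
import csv
import io
import json

# Hypothetical namespace and property names, invented for this sketch.
EX = "http://example.org/vocab#"

# Source A: a CSV file with its own column names (made-up sample data).
CSV_TEXT = "site_id,ozone_ppb\nS1,41\nS2,37\n"

# Source B: a JSON document with different field names (made-up sample data).
JSON_TEXT = '[{"station": "S1", "state": "NY"}, {"station": "S2", "state": "OH"}]'


def csv_to_triples(text):
    """Turn each CSV row into a (subject, predicate, object) triple."""
    triples = set()
    for row in csv.DictReader(io.StringIO(text)):
        subject = EX + "site/" + row["site_id"]
        triples.add((subject, EX + "ozonePpb", int(row["ozone_ppb"])))
    return triples


def json_to_triples(text):
    """Turn each JSON record into a triple about the same kind of subject."""
    triples = set()
    for record in json.loads(text):
        subject = EX + "site/" + record["station"]
        triples.add((subject, EX + "state", record["state"]))
    return triples


def describe(triples, subject):
    """Collect every predicate/object pair known about one subject."""
    return {p: o for s, p, o in triples if s == subject}


# Once both sources are triples, combining them is a plain set union.
graph = csv_to_triples(CSV_TEXT) | json_to_triples(JSON_TEXT)
print(describe(graph, EX + "site/S1"))
```

The point of the sketch is that neither source needs to know about the other's layout: each is converted once, and facts about the same subject identifier merge automatically.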
One of the Rensselaer demonstrations deals with data from CASTNET, the Environmental Protection Agency’s Clean Air Status and Trends Network. CASTNET measures ground-level ozone and other pollutants at stations all over the country, but the CASTNET data set gives only the readings from the monitoring sites, not their locations. The Rensselaer team located a different data set that described the location of every site. By linking the two, along with historic data from the sites, using RDF (Resource Description Framework), a Semantic Web language, the team generated a map that combines data from all the sets and makes them easily visible.
This mash-up, which pairs raw ozone and visibility readings from the EPA site with separate geographic data on where the readings were taken, had never been done before. This demo and several others developed by the Rensselaer team are now available from the official US Data.gov site.
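The linking step in the CASTNET demonstration can be sketched roughly as follows. This is an illustration under stated assumptions, not the team's code: the site identifiers, coordinates, ozone values, and property names are all hypothetical, and a small pattern-matching helper over Python tuples stands in for an RDF store and query language. It shows how readings that carry no location can be joined to a separate location data set through a shared site identifier, producing records ready to plot on a map.

```python
# Hypothetical vocabulary and made-up values; this is not real CASTNET data.
EX = "http://example.org/castnet#"

# Data set 1: readings only, with no location information.
readings = [
    (EX + "site/A", EX + "ozonePpb", 41),
    (EX + "site/B", EX + "ozonePpb", 37),
]

# Data set 2: site locations from a separate source.
locations = [
    (EX + "site/A", EX + "lat", 42.73),
    (EX + "site/A", EX + "lon", -73.68),
    (EX + "site/B", EX + "lat", 40.00),
    (EX + "site/B", EX + "lon", -83.02),
    (EX + "site/C", EX + "lat", 35.10),  # a site with no reading
    (EX + "site/C", EX + "lon", -106.60),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def map_points(triples):
    """Join readings to coordinates through the shared site identifier."""
    points = []
    for site, _, ozone in match(triples, p=EX + "ozonePpb"):
        lat = match(triples, s=site, p=EX + "lat")
        lon = match(triples, s=site, p=EX + "lon")
        if lat and lon:  # skip sites whose location is unknown
            points.append({
                "site": site,
                "lat": lat[0][2],
                "lon": lon[0][2],
                "ozone_ppb": ozone,
            })
    return sorted(points, key=lambda pt: pt["site"])


merged = readings + locations
for point in map_points(merged):
    print(point)
```

In a real deployment the join would more likely be expressed as a query against an RDF store, but the principle is the same: the shared identifier is what lets the two data sets be combined.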
The aim is not to create an endless procession of mash-ups, but to provide the tools and techniques that allow users to make their own mash-ups from different sources of data, the Rensselaer researchers say. To help make this happen, the researchers have taught a short course showing government data providers how to build mash-ups themselves, so that they can create their own data visualizations to release to the public.
The same techniques can be applied to data from other sources. For example, public safety data can show a user which local areas are safe, where crimes are most likely to occur, which intersections are accident-prone, how close the nearest hospitals are, and other information that may inform a decision on where to shop, where to live, or even which areas to avoid at night. In an effort McGuinness is leading at Rensselaer along with collaborators at NIH, the team is exploring how to make medical information accessible to both the general public and policy makers to help them explore policies and their potential impact on health. For example, one may want to explore how taxation or smoking policies relate to smoking prevalence and related health costs.
The Semantic Web describes techniques that allow computers to understand the meaning, or “semantics,” of information so that they can find and combine information and present it in usable form.
“Computers don’t understand; they just store and retrieve,” explained Hendler. “Our approach makes it possible to do a targeted search and make sense of the data, not just using keywords. This next version of the Web is smarter. We want to be sure electronic information is increasingly useful and available.”
“Also, we want to make the information transparent and accountable,” added McGuinness. “Users should have access to the metadata, the data describing where the data came from and how and when it was derived, as well as the base information so that end users can make better informed decisions about when to rely on the information.”