Richard Keller is a senior research computer scientist and group lead for the information sharing and integration group at NASA. He recently spoke with Dr. Dobb's editor-in-chief Jonathan Erickson.
Dr. Dobb's: What's the fundamental problem with data integration?
Keller: Because many large organizations generate data in a distributed and organic fashion, they can amass hundreds or thousands of separate, seemingly disconnected data sources in varying formats. Those data sources will remain disconnected unless their contents are properly documented and annotated.
Dr. Dobb's: Will standards help?
Keller: Standards and organizational policies go some way toward supporting data interoperability, but data standards can be difficult to legislate and are onerous and expensive to institute. Also, many companies focus on setting data formatting standards but neglect standards requiring any type of semantic metadata, which are even more important when it comes to robust integration.
Dr. Dobb's: What's the key to semantic integration?
Keller: The key to automated integration is to be rigorous about capturing semantic metadata. If you describe the meaning of the data, then you can automate the process of recognizing connections across data sources and allow them to be married together properly.
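As an illustration of Keller's point, here is a minimal sketch (the data sources, column names, and concept URIs are all invented for the example) of how column-level semantic annotations let software discover a join point between two sources whose column names do not match:

```python
# Each source annotates its columns with a shared concept identifier.
# The column names differ ("site_id" vs. "station"), but the semantic
# annotations agree, so the connection can be found automatically.
SOURCE_A = {
    "columns": {"site_id": "http://example.org/onto#MonitoringSite",
                "temp_c": "http://example.org/onto#AirTemperature"},
}
SOURCE_B = {
    "columns": {"station": "http://example.org/onto#MonitoringSite",
                "rainfall_mm": "http://example.org/onto#Precipitation"},
}

def joinable_columns(a, b):
    """Return (column_in_a, column_in_b) pairs that denote the same concept."""
    by_concept = {concept: col for col, concept in a["columns"].items()}
    return [(by_concept[c], col)
            for col, c in b["columns"].items() if c in by_concept]

print(joinable_columns(SOURCE_A, SOURCE_B))  # [('site_id', 'station')]
```

Without the annotations, nothing in the column names alone would suggest that `site_id` and `station` refer to the same thing; with them, the match is mechanical.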
Dr. Dobb's: Does the W3C's SPARQL solve any problems?
Keller: SPARQL provides a standardized language for querying ontologies. To the extent that this standard achieves widespread adoption, it will consolidate the marketplace and increase tool interoperability within the semantic web space. But the challenges in semantic integration lie in the construction, not in the querying, of ontologies.
Dr. Dobb's: What's the next big hurdle to achieving semantic integration?
Keller: At the core of semantic integration is the problem of ontology mapping. An ontology map provides information on how to translate the objects, attributes, and relations of one ontology model into those of another. This is the essence of semantic integration: we need to understand how, and under what circumstances, the data in one source can be combined with the data in another, and ontology maps provide the basis for that translation. When the ontologies for two sources are similar in their conceptual structure, mapping is easy. But when the underlying data models are conceptually disparate, things get complicated. Ontology mapping has been a focus of the semantic web community for some time, but it remains a challenging problem: what form the maps should take, how to generate them automatically, how to use them efficiently, and so on.
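In its simplest form, an ontology map of the kind Keller describes can be written as a declarative translation table. The sketch below (with invented class and attribute names) shows one class and two attributes being mapped, plus a function that rewrites an instance from one vocabulary into the other:

```python
# Hypothetical map from ontology A's vocabulary into ontology B's.
ONTOLOGY_MAP = {
    "classes": {"Sensor": "Instrument"},                 # A class -> B class
    "attributes": {("Sensor", "loc"): "position",        # A attr  -> B attr
                   ("Sensor", "reading"): "measurement"},
}

def translate(instance, omap):
    """Rewrite an instance of ontology A into the vocabulary of ontology B."""
    cls = instance["class"]
    out = {"class": omap["classes"].get(cls, cls)}
    for attr, value in instance.items():
        if attr == "class":
            continue
        # Unmapped attributes pass through unchanged.
        out[omap["attributes"].get((cls, attr), attr)] = value
    return out

record = {"class": "Sensor", "loc": "pad-39A", "reading": 21.4}
print(translate(record, ONTOLOGY_MAP))
# {'class': 'Instrument', 'position': 'pad-39A', 'measurement': 21.4}
```

The hard cases Keller alludes to are precisely the ones this table cannot express: mappings that are conditional, many-to-one, or that require restructuring relations rather than renaming them.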
More broadly, I think the challenge for making semantic integration work in the marketplace is to make it quicker and easier to specify data semantics. Currently, specifying semantics using ontologies is a somewhat arcane and tedious process. If I create a dataset, then I have to see clearly that there will be significant benefits down the road if I expend the time and effort necessary to provide semantic metadata. Although we are starting to see some good tools on the market to make this process easier, the cost/benefit calculations are not yet sufficiently favorable to support widespread adoption of this approach.
Dr. Dobb's: Can you tell us about the SemanticIntegrator project?
Keller: SemanticIntegrator is a project focused on developing an architecture to support semantic integration of NASA data assets. NASA has accumulated many thousands of datasets generated as part of our work in science, engineering, aeronautics, and space exploration. If NASA could provide a means of connecting and combining disparate but related datasets, then we would have a powerful capability that might enable scientists and engineers to gain additional insight based on the gestalt of the integrated data. We have used our architecture to demonstrate integration applications in both earth sciences and exploration. And we have demonstrated the utility of this approach for integrating NASA data with data from other federal agencies, such as the USDA, EPA, and NOAA.
Our approach integrates information sources based on the use of ontologies plus explicit integration rules. For each data source, we develop an ontology that captures the semantics of the underlying data. In addition, we write a software wrapper that exposes the underlying data source as if it were a semantic web resource. Turning the data sources into semantic web resources enables them to be queried using a common, semantic query language such as W3C's SPARQL. The wrapper takes semantic queries as input and dispatches native data source queries (e.g., in traditional SQL) against the actual data sources. In addition to the data source ontologies, an integrating ontology is developed to capture the customer's view of relevant data and relationships across the various sources. To access integrated data, a client application queries the integrated ontology. Using a set of ontology translation rules, this query is mapped into a set of separate queries against the data source ontologies. The results are then translated back into the integrating ontology language and presented to the client application.
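The wrapper idea in this architecture can be sketched in a few lines. The example below is a simplified, hypothetical illustration (the schema, rule format, and ontology terms are all invented, and a real wrapper would accept SPARQL rather than a concept name): a query phrased in integrating-ontology terms is mapped, via a translation rule, into native SQL against the underlying source, and the results are relabeled in ontology terms on the way back:

```python
import sqlite3

# Underlying data source with its own native schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE obs (stn TEXT, t_degc REAL)")
db.executemany("INSERT INTO obs VALUES (?, ?)",
               [("KSC-01", 28.5), ("KSC-02", 31.0)])

# Translation rule: integrating-ontology terms -> this source's SQL vocabulary.
RULE = {"concept": "AirTemperature",
        "sql": "SELECT stn, t_degc FROM obs WHERE t_degc > ?",
        "result_fields": ["MonitoringSite", "AirTemperature"]}

def semantic_query(concept, threshold):
    """Dispatch a native SQL query and relabel the results in ontology terms."""
    if concept != RULE["concept"]:
        raise ValueError("no translation rule for " + concept)
    rows = db.execute(RULE["sql"], (threshold,)).fetchall()
    return [dict(zip(RULE["result_fields"], row)) for row in rows]

print(semantic_query("AirTemperature", 30.0))
# [{'MonitoringSite': 'KSC-02', 'AirTemperature': 31.0}]
```

A client application never sees `obs` or `t_degc`; it works entirely in the vocabulary of the integrating ontology, which is what makes it possible to fan one semantic query out across many differently structured sources.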