SIMILE: Rich Internet Collections

David Karger is a professor at MIT and a Principal Investigator on the SIMILE Project, an effort that seeks to enhance interoperability among digital assets, schemata/vocabularies/ontologies, metadata, and services. We talked to Professor Karger recently about the project.

DDJ: What is SIMILE?

DK: SIMILE is a collaboration between the MIT Libraries, the World Wide Web Consortium, and my Haystack research group. We're working on tools to help individuals, communities, and institutions create and utilize rich information collections. We make substantial use of "Semantic Web" standards to ease data integration, although we are careful to keep them encapsulated inside the system where they won't disturb our users.

DDJ: Why is it important?

DK: Collections arise everywhere, as one of the standard approaches to interacting with large amounts of information. They are created by individual enthusiasts (Topher's breakfast cereal collection and my own collection of Israeli Folk Dance videos), communities (photos at Flickr and web pages at Delicious), companies (books at Amazon, movies at IMDB), and institutions (the MIT Libraries catalog). Collections attempt to bring order to some portion of the information universe. What makes them useful is that they tend to have meaningful structure, and their applications can exploit that structure to help you use the collections. Movies have release years, actors, writers, and producers, and you can navigate IMDB effectively using these properties. Prices, authors, and subjects help you navigate books at Amazon. Sensor sizes, ISO options, and battery life help you explore cameras. Tags organize Flickr and Delicious.

But at every scale, it is harder to create, maintain, and utilize these collections than it should be. Things that are easy to describe ("Users should be able to filter based on who is in the photograph, group by location taken, and chronologically sort each group") require lots of special-purpose system building, way beyond the capabilities of most individuals, and taxing the energy of even large organizations. One of the hardest problems is how much variety there is in the data. Traditional libraries solved the problem by mapping all the catalog data into one standard form. That won't work in the face of this new tidal wave of information.
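To make the quoted requirement concrete, here is a minimal Python sketch of "filter by who is in the photograph, group by location, sort each group chronologically." The records, field names (`people`, `location`, `taken`), and the `browse` function are all hypothetical and are not part of any SIMILE tool. The operation itself is short; the system building described above lies in everything around it, such as storage, a web interface, and data that does not arrive in one tidy shape.

```python
from collections import defaultdict
from datetime import date

# Hypothetical photo records; the field names are illustrative assumptions.
photos = [
    {"file": "a.jpg", "people": ["Ann", "Bob"], "location": "Boston", "taken": date(2006, 5, 1)},
    {"file": "b.jpg", "people": ["Ann"], "location": "Paris", "taken": date(2005, 7, 14)},
    {"file": "c.jpg", "people": ["Bob"], "location": "Boston", "taken": date(2006, 1, 9)},
    {"file": "d.jpg", "people": ["Ann"], "location": "Boston", "taken": date(2004, 3, 2)},
]


def browse(records, person):
    """Keep photos showing `person`, group them by location,
    and sort each group from oldest to newest."""
    groups = defaultdict(list)
    for rec in records:
        if person in rec["people"]:
            groups[rec["location"]].append(rec)
    for recs in groups.values():
        recs.sort(key=lambda r: r["taken"])
    return dict(groups)


result = browse(photos, "Ann")
for location, recs in result.items():
    print(location, [r["file"] for r in recs])
```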

Things get even trickier when you want to break collection boundaries -- when you want to search over multiple collections simultaneously, or pull information from multiple collections together for some new purpose. How can you make an interface that makes sense for simultaneously searching MIT Libraries' collection of books, the Getty museum's collection of art, and Topher's breakfast cereal collection for information about fine art in food marketing? When your search returns "hits" on all three sites, how can you see the results in a way that makes sense? How can you see the connections between the results from separate sources -- e.g., that a book from MIT talks about a painting at the Getty? It's unlikely the site owners have coordinated their collections to fit together, so how can they be mashed up after the fact? Historically, the only answer has been to do lots of special-purpose systems engineering. But given the exploding variety of collections and usages of them, we need a different answer now.
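As a rough illustration of the cross-collection problem, the Python sketch below maps records from three sources with different field names onto one shared shape and then searches them together. Every record, field name, and the `FIELD_MAPS` table is invented for the example. Note that this is the hand-written, per-source mapping that the answer above describes as special-purpose engineering; it is not a description of how SIMILE works.

```python
# Hypothetical records from three sources, each with its own field names.
library_books = [{"title": "Art in Advertising", "subject": "food marketing"}]
museum_works = [{"name": "Still Life with Fruit", "medium": "oil on canvas"}]
cereal_boxes = [{"brand": "Crunchy Os", "box_art": "fine art parody"}]

# Hand-written mapping from each source's field names to shared ones (assumed).
FIELD_MAPS = {
    "library": {"title": "label", "subject": "description"},
    "museum": {"name": "label", "medium": "description"},
    "cereal": {"brand": "label", "box_art": "description"},
}


def normalize(source, record):
    """Rename a record's fields to the shared names and tag its source."""
    mapping = FIELD_MAPS[source]
    out = {"source": source}
    for key, value in record.items():
        out[mapping.get(key, key)] = value
    return out


def search(records, term):
    """Case-insensitive substring match on the shared fields."""
    term = term.lower()
    return [
        r for r in records
        if term in r.get("label", "").lower()
        or term in r.get("description", "").lower()
    ]


merged = (
    [normalize("library", r) for r in library_books]
    + [normalize("museum", r) for r in museum_works]
    + [normalize("cereal", r) for r in cereal_boxes]
)
hits = search(merged, "art")
for hit in hits:
    print(hit["source"], "-", hit["label"])
```

Each new source needs another entry in the mapping table, which is why this approach becomes costly as the variety of collections grows.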

DDJ: Can you give a short example of an application of the project?

DK: One of the tools we're currently working on is called "Exhibit." This is a tool that lets anyone take a collection of anything they care about and put it on the web as a rich, interactive, Web 2.0-style site without doing any programming. All you do is put up a file containing your collection and a web page describing how you want it to look. The result may be pretty much what you'd expect of a Web 2.0 site these days -- until you realize that it avoids the whole team of database engineers and 3-tier web application developers, and lets you do it all yourself! Examples from my personal life (I'm a theoretical computer scientist of limited programming skills) include:

    My aforementioned Folk Dance video collection,
    A nice presentation about U.S. presidents that shows off several visualizations,
    And naturally, we like to collect all the press articles that mention us.

It's really easy to make your own exhibit, and I hope you give it a try -- there are tutorials on our site.

DDJ: How do you envision the project impacting the greater web community?

DK: The Web has been incredibly successful at making huge amounts of new information available to many people. But it still has a long way to go in depth and breadth. Regarding depth, there's plenty of awareness of the "deep web" -- stuff that doesn't show up on the web search engines because it is buried in special-purpose databases. We think some of our tools can help bring that information to light. As for breadth, while the Web has made it much easier for people to contribute textual information through tools like blogs and wikis, it's still not really possible for the lay person to contribute rich structured information collections. We think our tools can dramatically lower the barriers for a broader group of contributors to share the rich structured content they know.

It's very exciting for us that we are already seeing our tools used in really cool ways. For example, somebody is showing the history of Japanese literature, and a Romanian real estate broker is listing available rentals. These days, everybody on the Web is a collector!

We think it would be really exciting if these tools enabled the data in these collections to start moving around more. People like to call that remixing, or web of data, or mash-ups; this is stuff the microformat and Semantic Web/RDF communities talk about. We think it's so desirable that it's inevitable, and we hope that we can create a set of tools and ideas that make it easier and more fun for people to share.

DDJ: Where can our readers learn more about the project?

DK: Our web site describes all our projects. There's a lot there, so let me recommend some actions people can take. Play with the demos for Timeline, Exhibit, and Longwell. People who want to try the tools seem to have quick success by copying and tinkering with the demos for Timeline and Exhibit. If you have a larger collection, you'll want to look into Longwell, which can handle big collections. All these tools are part of a vibrant open-source project and are licensed under a BSD-style license, so feel free to use them and come join the fun.

