With us today is Dave Kellogg, CEO of Mark Logic, a company that focuses on XML as a content platform.
DDJ: Dave, you're a database guy from way back when. What's the biggest difference between some of the XML-based databases we're seeing today and some of the DBMSs of a few years ago?
DK: Indeed, I've been working with databases since 1983 -- back when relational databases were revolutionary. We've come a long way since then.
In today's XML database systems, such as our MarkLogic, you essentially find a mix of native XML handling, full-text search engines, and state-of-the-art DBMS features like time-based queries, large-scale alerting, and large-scale clustering.
In essence, because RDBMSs date back to the 1970s and pure XML databases only date back to around 2000, the XML database vendors get the coveted chance to "start over" in designing a database system. So we can quickly incorporate a lot of the features put in RDBMSs over the past few decades while at the same time optimizing for XML.
In the mainstream database market, in my estimation, things haven't changed that much. The three main vendors continue to advance technology at a fairly slow rate, and tend to view every new challenge as "yet another feature" that needs to be added to the DBMS. As I like to say: When your only model's a data table, every problem looks like another column.
In the database world I think all the exciting stuff is happening at the edges: Stream databases like Streambase or Skyler, analytic databases like Aster Data Systems, XML databases like MarkLogic, data warehouse appliances like Netezza or DATAllegro, and parallelizing database alternatives like MapReduce and Hypertable.
DDJ: Dr. Dobb's has covered XQuery, but it's been a few months. Anything new on the "searching XML" front?
DK: Well, XQuery is taking some time to take off, but I firmly believe that in the fullness of time it will replace SQL. Why? Because XQuery was the DBMS community's chance to start over and they took it. XQuery is superior to SQL for a number of reasons. It's a full programming language, not just a data manipulation language. It handles XML natively, and XML is indeed becoming more and more pervasive. With adoption of Microsoft Office 2007 and ODF-based documents, every organization will find itself with an explosion of XML and a desire to do more with it. Furthermore, while content has traditionally been seen as static, Web 2.0 applications and user-generated content create an environment where documents inherently must evolve over time, and XML is widely seen as the right format.
What's more, XQuery can work equally well against both XML-wrapped data (e.g., purchase orders) and XML content/documents (e.g., articles or books). So it's more universal than SQL.
I often say two things in this regard. First, our kids will find any data/content distinction totally arbitrary and a historical artifact of the mainframe/minicomputer era. Second, our kids will think of SQL the way that we think of COBOL. ("Daddy, do you mean you used a database language that assumed all data was stored in tables and didn't natively understand XML?" "Yes, Muffin, and I used to have to sew my own clothes, too!")
As far as what's new in searching XML: While products like MarkLogic have incorporated full-text search with XQuery from inception, the formal standard for full-text search extensions was made a candidate recommendation only fairly recently. In fact, it wasn't that long ago, in January 2007, that XQuery 1.0 itself was (finally) approved by the W3C.
DDJ: Another topic that Dr. Dobb's has recently examined was DITA, the "Darwin Information Typing Architecture". What are your thoughts on this?
DK: Yes, DITA is indeed gaining traction, typically inside the technical publications community, but not only there. DITA as a technology is about an XML standard and a set of tools for doing XML-based publishing. But there's more to DITA than that. To properly move to DITA requires organizations to move to topic-based authoring -- basically jettisoning the whole notion of specific deliverables when making content and focusing on the content itself.
Instead of focusing on the operations manual or the training materials or the maintenance procedures or the help files, with DITA you focus on the content as a series of topics and then build the eventual deliverables by recombining them. It's all about reuse and repurposing, and that is where DITA saves money. And those savings get magnified as soon as the problem goes multi-dimensional: (N books) x (M audiences) x (Z languages) x (P delivery channels).
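To make the multiplication concrete, here is a small Python sketch. The dimension values are hypothetical and chosen only for illustration; the interview gives no real figures.

```python
from itertools import product

# Hypothetical dimensions, for illustration only.
books = ["operations manual", "training guide", "maintenance procedures"]
audiences = ["operator", "technician"]
languages = ["en", "fr", "de", "ja"]
channels = ["print", "web", "help system"]

# Each deliverable is one combination of book, audience, language, and channel.
deliverables = list(product(books, audiences, languages, channels))


def deliverable_count(n_books, n_audiences, n_languages, n_channels):
    """Number of distinct deliverables: N x M x Z x P."""
    return n_books * n_audiences * n_languages * n_channels


total = deliverable_count(len(books), len(audiences), len(languages), len(channels))
print(total)  # 3 * 2 * 4 * 3 = 72
```

Even with these small numbers there are 72 deliverables, which is why maintaining shared topics once, rather than editing each deliverable separately, can pay off quickly.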
DDJ: A favorite topic around here is mobility. What role does mobility play when it comes to XML and the databases we're seeing today?
DK: Yes, I love mobility, too. From a search/content perspective, I think of mobility as simply "more context." Most of what we do at Mark Logic could largely be described as helping put content in context. What does that mean, other than the soundbite? Let me give an example. Putting content in context might mean:
- Getting a nurse the right information for a patient with a given set of symptoms of a given age/ethnicity with a given medical history.
- Providing a pilot the right procedure for a birdstrike when landing at a high-elevation airport at night.
- Helping a soldier find the right procedure to repair a broken armament when the standard procedure has failed and danger is imminent.
That's context. Again in cliché form: getting the right information to the right person at the right time. Put differently, knowing who you are and what you are trying to do (i.e., role and task awareness) so we can get you the right slice of the content to meet your needs.
With location, we're just adding more context -- now I can know who you are, what you're trying to do, and where you're trying to do it as context for a query. That's why we're in the midst of adding geospatial indexing to our product, so that every query can contain not only XPath and full-text constraints but location-based ones as well.
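As a rough illustration of a query that combines a structural path, a full-text term, and a location constraint, here is a minimal Python sketch. It is not MarkLogic's API or XQuery: the sample document, the element and attribute names, and the `search` and `haversine_km` helpers are all assumptions made up for this example. It also scans every element linearly, where a real system would presumably use indexes.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical sample content; element and attribute names are invented.
DOC = """
<procedures>
  <procedure id="p1" lat="39.86" lon="-104.67">
    <title>Birdstrike on approach</title>
    <body>Night landing checklist for a high-elevation airport.</body>
  </procedure>
  <procedure id="p2" lat="51.47" lon="-0.45">
    <title>Birdstrike on approach</title>
    <body>Standard landing checklist.</body>
  </procedure>
  <procedure id="p3" lat="39.85" lon="-104.66">
    <title>Fuel leak</title>
    <body>Ground handling steps.</body>
  </procedure>
</procedures>
"""


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def search(root, path, term, lat, lon, radius_km):
    """Return ids of elements that match the path, contain the term, and lie within the radius."""
    hits = []
    for node in root.findall(path):            # structural (path) constraint
        text = " ".join(node.itertext()).lower()
        if term.lower() not in text:            # full-text constraint (simple substring match)
            continue
        distance = haversine_km(lat, lon, float(node.get("lat")), float(node.get("lon")))
        if distance <= radius_km:               # location constraint
            hits.append(node.get("id"))
    return hits


root = ET.fromstring(DOC)
results = search(root, ".//procedure", "birdstrike", 39.85, -104.67, 50.0)
print(results)  # ['p1']
```

Only `p1` satisfies all three constraints: `p2` mentions a birdstrike but is far from the query location, and `p3` is nearby but covers a different topic.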
DDJ: Can you suggest a website that readers might go to for more information on these topics?