XML, SQL, and C

Source Code Accompanies This Article. Download It Now.


Inside the Programs

These programs are all built on top of the jkweb.a library. This library is notable for its speed, simplicity, and somewhat unconventional brace placement. The library adheres to some simple object-oriented conventions. In general, every .c file has a corresponding .h file, and the pair together forms a "module." The primary data structure shares the module's name, and publicly available functions begin with the module name. The code makes little use of global variables and generally avoids static variables, which makes multithreading easier and discourages side effects. Most data structures begin with a "next" field so that they can easily be made into singly linked lists.
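The conventions above can be sketched with a small made-up module. The "gene" module, its fields, and its function names are assumptions invented for illustration; they are not taken from jkweb.a.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical "gene" module; by the convention described above,
 * this would be split across gene.h and gene.c. */
struct gene {
    struct gene *next;  /* First field, so elements chain into a list. */
    char *name;
    int start, end;
};

/* Allocate a new gene with a private copy of name. NULL if out of memory. */
struct gene *geneNew(const char *name, int start, int end)
{
    size_t nameSize = strlen(name) + 1;
    struct gene *g = malloc(sizeof(*g));
    if (g == NULL)
        return NULL;
    g->name = malloc(nameSize);
    if (g->name == NULL) {
        free(g);
        return NULL;
    }
    memcpy(g->name, name, nameSize);
    g->next = NULL;
    g->start = start;
    g->end = end;
    return g;
}

/* Push el onto the front of *pList. A NULL el is ignored. */
void geneAddHead(struct gene **pList, struct gene *el)
{
    if (el == NULL)
        return;
    el->next = *pList;
    *pList = el;
}

/* Count the elements in list. */
int geneCount(const struct gene *list)
{
    int count = 0;
    for (; list != NULL; list = list->next)
        ++count;
    return count;
}

/* Free every element and set *pList to NULL. */
void geneFreeList(struct gene **pList)
{
    struct gene *el = *pList, *next;
    for (; el != NULL; el = next) {
        next = el->next;
        free(el->name);
        free(el);
    }
    *pList = NULL;
}
```

All state lives in the list the caller owns, so nothing here depends on global or static variables.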

Four library modules in particular are heavily used by the code. The hash module handles all symbol tables. The dtdParse module reads in DTDs. The xp module is a streaming XML parser. The xap module is a light wrapper around xp that stores the results of your startTag callback in a stack.

The xp module started out life as a drop-in replacement for the Expat library. Expat is a fine streaming parser, and in truth probably handles some rare wrinkles of XML and Unicode that xp does not. However, Expat depends on other modules that are hard to find on some platforms. Because of this and also because I wanted to get a little more speed when dealing with huge XML files, I wrote xp, which is 30 percent faster than Expat.

The first version of autoXml (www.linuxjournal.com/article/5949) suffered from the same limitation as DOM parsers: the whole XML file had to be loaded into memory. To work around this, I hacked the recursion out of xp and added parameters that let it seek to a particular tag type before executing any of its callbacks, then return as soon as that tag has finished parsing. Subsequent calls to xp start where the previous call left off. This enables the iterative reading of data structures from an XML file shown in Listing Three.
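The seek-and-resume idea can be illustrated with a deliberately naive scanner over an in-memory string. This is not the xp API: the names toyScan and toyScanNext are hypothetical, and the sketch ignores nested same-named tags, comments, and CDATA. It only shows how keeping a position between calls lets a caller pull one element at a time.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical toy scanner: remembers where the previous call stopped. */
struct toyScan {
    const char *pos;  /* Next unread character of the XML text. */
};

/* Copy the next <tag ...>...</tag> element at or after s->pos into buf.
 * Returns 1 on success. Returns 0 if no further element is found, or if
 * the tag name or the element is too long for the buffers. */
int toyScanNext(struct toyScan *s, const char *tag, char *buf, size_t bufSize)
{
    char openTag[64], closeTag[64];
    const char *start = s->pos, *end;
    size_t openLen, len;

    if (strlen(tag) + 4 > sizeof(openTag))
        return 0;
    snprintf(openTag, sizeof(openTag), "<%s", tag);
    snprintf(closeTag, sizeof(closeTag), "</%s>", tag);
    openLen = strlen(openTag);

    /* Seek forward to an opening tag with exactly this name. */
    for (;;) {
        start = strstr(start, openTag);
        if (start == NULL)
            return 0;
        if (start[openLen] == '>' || start[openLen] == ' ')
            break;
        start += openLen;  /* Longer name such as <genes>; keep looking. */
    }

    end = strstr(start, closeTag);
    if (end == NULL)
        return 0;
    end += strlen(closeTag);

    len = (size_t)(end - start);
    if (len >= bufSize)
        return 0;
    memcpy(buf, start, len);
    buf[len] = '\0';

    s->pos = end;  /* The next call resumes here. */
    return 1;
}
```

A caller would loop on toyScanNext until it returns 0, handling one element per iteration, so memory use is bounded by the largest single element rather than by the whole document.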

The code in autoDtd and sqlToXml is straightforward, and I refer you to the commented source code. AutoXml is relatively straightforward, too, but it is a C code generator written in C, so there are some heavily escaped lines such as:


  fprintf(f, "fprintf(f, \" %s=\\\"%%%c\\\"\", obj->%s);\n",
          att->name, fAttType(att->type), att->name);
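To see what such an escaped line produces, the sketch below runs the same format string through snprintf into a buffer. The function names emitAttWriter and toyAttType are hypothetical, and the mapping from type names to conversion characters is an assumed stand-in for fAttType, whose real behavior is not shown in this article.

```c
#include <stdio.h>
#include <string.h>

/* Assumed stand-in for fAttType: pick a printf conversion character
 * from a type name. The type names here are invented for illustration. */
static char toyAttType(const char *typeName)
{
    if (strcmp(typeName, "INT") == 0)
        return 'd';
    if (strcmp(typeName, "FLOAT") == 0)
        return 'f';
    return 's';
}

/* Write one line of generated C into out, using the same format string
 * as the escaped fprintf in the article. Returns snprintf's result. */
int emitAttWriter(char *out, size_t outSize,
                  const char *attName, const char *typeName)
{
    return snprintf(out, outSize,
                    "fprintf(f, \" %s=\\\"%%%c\\\"\", obj->%s);\n",
                    attName, toyAttType(typeName), attName);
}
```

For an attribute named id of the assumed type INT, the buffer holds the generated line fprintf(f, " id=\"%d\"", obj->id); followed by a newline. Each layer of generation strips one level of escaping: \\\" in the generator becomes \" in the generated source, which becomes a plain quote in the final XML output.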

The xmlToSql program is relatively complex, requiring 1100 lines of code beyond the library routines. Most of this code builds data structures, based on the DTD and stats files, that represent the tables and fields and, if necessary, the parentToChild tables. The parent/child relationships are found in the parentKeys member of the table structure. This member is actually a list, because the same child can be found under multiple parents in XML.

Once the table and field structures are built, xmlToSql calls the xap module to stream through the XML. The start-tag callback clears the string values associated with each field, then fills in some of these strings from the attributes. The end-tag callback fills in the text field, then concatenates all of the field strings into a single line. It looks up the concatenated line in a hash that returns the primary key for that row if the row already exists. If the row does not exist, the callback creates a new primary key, stores it in the hash, and writes the line to a tab-separated file. The end-tag callback then looks up the parent in the xap stack and fills in the foreign key field in the parent with the primary key. By the time the parent's end-tag callback is executed, the children's end-tag callbacks have filled in all of its foreign key fields.
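The row lookup performed by the end-tag callback can be sketched as a small chained hash table that maps a concatenated row to its primary key. This is a minimal illustration, not the jkweb.a hash module: the rowIds structure, the rowIdsLookup function, the bucket count, and the hash function are all assumptions. Cleanup is omitted for brevity.

```c
#include <stdlib.h>
#include <string.h>

#define ROW_BUCKETS 1024

/* One stored row and the primary key assigned to it. */
struct rowEl {
    struct rowEl *next;
    char *line;
    int id;
};

/* Hypothetical row-to-key table. A zero-initialized struct is empty. */
struct rowIds {
    struct rowEl *buckets[ROW_BUCKETS];
    int lastId;  /* Most recently assigned primary key; keys start at 1. */
};

/* Simple multiplicative string hash, reduced to a bucket index. */
static unsigned rowHashFunc(const char *s)
{
    unsigned h = 5381;
    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h % ROW_BUCKETS;
}

/* Return the primary key for line, assigning a new one if the row has
 * not been seen. *isNew is set to 1 when the caller should write the
 * row to the output file, 0 otherwise. Returns -1 if out of memory. */
int rowIdsLookup(struct rowIds *ids, const char *line, int *isNew)
{
    unsigned bucket = rowHashFunc(line);
    struct rowEl *el;
    size_t size;

    for (el = ids->buckets[bucket]; el != NULL; el = el->next) {
        if (strcmp(el->line, line) == 0) {
            *isNew = 0;
            return el->id;
        }
    }
    size = strlen(line) + 1;
    el = malloc(sizeof(*el));
    if (el == NULL)
        return -1;
    el->line = malloc(size);
    if (el->line == NULL) {
        free(el);
        return -1;
    }
    memcpy(el->line, line, size);
    el->id = ++ids->lastId;
    el->next = ids->buckets[bucket];
    ids->buckets[bucket] = el;
    *isNew = 1;
    return el->id;
}
```

Because identical rows map to the same key, repeated child elements are written once and every parent that contains them records the same foreign key.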

Conclusion

This suite of XML tools has been interesting to write. Together they take most of the tedium out of XML programming, and make XML files almost as easy to work with as the simple line-oriented text formats UNIX programmers have been happily grepping and awking through for 30 years. If you would like to extend the programs, please do so. Please drop me an e-mail at kent@soe.ucsc.edu so that I can hook you up with the CVS repository, which will let other people use your improvements, too.

