Dr. Dobb's | XML, SQL, and C | October 30, 2006

XML, SQL, and C

Integrating data from XML sources into relational databases is easier when tools like autoXml, AutoDtd, sqlToXml, and xmlToSql do the work for you.

October 30, 2006
URL:http://www.drdobbs.com/database/xml-sql-and-c/193402895

Jim is a researcher at the University of California Santa Cruz. He can be contacted at kent@soe .ucsc.edu.

XML is an increasingly popular format for exchanging data. It handles optional fields and hierarchical data structures well, is readable by humans as well as computers, is portable across a wide variety of platforms, and it can even handle recursive data structures. In my own field of bioinformatics, XML has become almost as common as the venerable tab-separated file for exchanging information between databases. Still, integrating data from XML sources into our relational databases and our largely C code base sometimes seemed to involve more work than it should. Consequently, I was motivated to write the four tools—autoXml, AutoDtd, sqlToXml, and xmlToSql—presented in this article. The complete source code for the programs and libraries is available electronically from DDJ (see www.ddj.com/code/) and from me (www.soe.ucsc.edu/~kent).

XML files contain structured data, and the C language, of course, can represent structured data as well. However, to get from the XML to C representations is not a simple task. At a minimum, you must write several lines of code for each field of each data structure. Converting the parent/child relationships from XML to C is especially problematic. Using stream-oriented parsers requires that your callback procedures maintain some sort of context stack so that the child can be linked into the parent. With tree-oriented parsers (such as DOM), the relationships are reflected in the parsed tree. Unfortunately, the whole tree must fit in memory, which is not possible with larger databases.

The autoXml program automates mapping between C and XML data structures. Its input is an XML document type definition (DTD), and it outputs the equivalent data structures in C. The code converts these structures to and from XML. The full tree can be loaded, or just subtrees corresponding to a particular tag and its children. The C structure can contain numeric as well as string fields.

AutoDtd creates a DTD from an example XML file. This comes in handy because not all XML files have a DTD (autoXml requires one). AutoDtd figures out whether elements are required or optional; are lists or nonempty lists; and whether the fields are integer, floating point, or string, just by examining the example file.

Another common task is saving data stored in a relational database in XML format. The sqlToXml tool produces descriptive and readable XML guided by a simple tree definition that describes the parent/child relationships within the database. Currently, sqlToXml only works with MySQL, but it shouldn't be too much trouble to adapt it to other relational databases.

Finally, xmlToSql converts an XML document into a reasonably normalized relational database. The output is a directory full of tab-separated files and SQL table creation statements, so it should work with any relational database. The normalized relational representation is typically an order of magnitude smaller than the XML representation of the data.

All four programs are built using the GCC compiler linking in with the jkweb.a library. The programs, which you can incorporate into commercial or open-source tools without licenses, are regularly built on Linux and Mac OS X, and should port with a minimum of effort to other UNIX-like platforms. You'll need to have MySQL installed to build the sqlToXml program.

The programs run from the command line and print a usage summary if run with no parameters. You can use stdin/stdout as filename parameters if you wish to use them as part of a pipeline.

Using autoXml

Listing One is an XML document that describes a couple of geometrical shapes. You could use autoXml to generate a module called "poly" that handles reading/writing this type of XML. Listing Two is a routine that uses the poly module. This routine reads-in the XML, shrinks all the shapes by 50 percent, and writes the results to a new XML file.

<SHAPES>
  <POLYGON color="red" name="triangle">
    <DESCRIPTION>A 3-4-5 right triangle</DESCRIPTION>
    <PT-TWO-D x="0" y="0" />
    <PT-TWO-D x="3.0" y="0" />
    <PT-TWO-D x="0" y="4" />
  </POLYGON>
  <POLYGON color="black" name="square">
    <DESCRIPTION>Your basic black unit square</DESCRIPTION>
    <PT-TWO-D x="0" y="0" />
    <PT-TWO-D x="1" y="0" />
    <PT-TWO-D x="1" y="1" />
    <PT-TWO-D x="0" y="1" />
  </POLYGON>
</SHAPES>

Listing One

#include "poly.h"
void shrinkShape(char *in, char *out)
{
struct polyShapes *shapes = polyShapesLoad(in);
struct polyPolygon *polygon;
FILE *f;

/* Loop through all points of all polygons, scaling by 50% */
for (polygon = shapes->polyPolygon; polygon != NULL; polygon = polygon->next
    {
    struct polyPtTwoD *point;
    for (point = polygon->polyPoint; point != NULL; point = point->next)
        {
        point->x *= 0.5;
        point->y *= 0.5;
        }
    printf("Shrunk %s\n", polygon->description->text);
    }

/* Save result. */
f = fopen(out, "w");
polyShapesSave(shapes, 0, f);
fclose(f);
}

Listing Two

Because XML allows dashes and other characters in its identifiers that are illegal in C, some mapping of names is required when going between C and XML. AutoXml converts the identifiers to a mixed-case style, where capital letters are used to separate words within an identifier. Thus PT-TWO-D becomes ptTwoD. AutoXml also prepends a prefix to each function and struct name. By default, the prefix is the same as the module name. Because field names are not in a global namespace in C, no prefix is applied there. The text between tags is available in a field named "text." In case of name conflicts, you can name it something else with the -textField command-line option.

In Listing Two, the entire XML file is loaded into memory before processing. In my work, though, I frequently encounter XML files that won't all fit in memory. For instance, XML dumps of the National Center for Biotechnology Information's dbSNP database of human genetic variations is over 11 GB after compression with gzip! Listing Three shows how to process the sample XML file one polygon at a time.

void processPolysInLimitedMemory(char *in)
{
struct xap *xap = xapOpen(in);
struct polyPolygon *polygon;

while ((polygon = polyPolygonNext(xap)) != NULL)
    {
    /* Do some processing to polygon here... */
    polyPolygonFree(&polygon);
    }
xapFree(&xap);
}

Listing Three

Using autoDtd

The autoDtd program creates a DTD file from an XML file example. The program also generates an "attributed tree" and a summary of the actual data in the file.

Listing Four (available electronically) shows the DTD autoDtd generates from the XML file in Listing One. It starts out by defining XML entities to represent numerical data. Most XML parsers will treat these entities as synonyms for strings (CDATA). However, autoDtd and xmlToSql know to map them to numerical types in C and SQL, respectively. The DTD continues with ELEMENT and ATTLIST definitions. The elements correspond to the tags in the XML file. The parenthesized lists after the element name describe the legal child tags. The child tags followed by "?" are optional. "*" indicates that any number of children of this type can follow. "+" indicates there must be at least one child of this type. The ATTLIST describes the attributes within a tag. Note that the x attribute is of type %float; and the y attribute is of type %int;. This is because of the value 3.0 for the x attribute in one of the points in the example, while there are no decimal points in the y values.

Listing Five (available electronically) shows the "attributed tree" output of autoDtd. This is a denser, more readable representation of the same data structures. There is a line for each type of tag in the XML file. Children are indented in lines following their parent. Attributes, if any, are on the same line following the tag name. The data type, if any, is indicated by a letter before the tag or attribute name, with "$" representing strings, "#" representing integers, and "%" representing floating-point numbers. Tag names may be followed by the characters ?, *, or + with the same meaning they have in DTD files.

The stats output of autoDtd (Listing Six, available electronically) summarizes the contents of the XML file. It is also used by the xmlToSql program to decide which fields to index. The stats output begins with a comment that describes the format.

Using sqlToXml

The sqlToXml program is a convenient way to output all or part of a relational database as XML. I describe how to use the program here. For a more detailed description, see the sqlToXml.doc file in the program sources. The inputs to sqlToXml are a MySQL database, an "object dump" (.OD) file, and optionally a SQL statement to limit which parts of the database are dumped. The output is a subset of XML, which is particularly easy to parse.

Listing Seven (available electronically) shows an example of a sqlToXml dump for a simple relational database containing polygon and point tables. Each row in a table is output as a single tag in the XML. Each column in the table is represented as an attribute. Each tag is output in a single line, with child tags indented relative to parents. When there are no children, the closing tag is incorporated into the start tag.

The outermost tag in the XML file does not correspond to a table. By default, sqlToXml fills it in using the name of the database. The next tag, "polygon," corresponds to the "master table" being dumped. The dump has one record for each row in the master table. Optionally, a SQL statement can limit this to just a subset of rows.

Listing Eight (available electronically) is a .OD file that might produce a dump like this. The master table appears on the first line. In subsequent lines, fields are indented relative to the table they belong to. By default, sqlToXml outputs all fields. Fields only need to be mentioned if they link to a child table, or if you want to hide them. In this .OD file, I hide a bounding-box field and link to the point table via the id field. The format of a linking field is first the field name (in the parent table) followed by table.field in the child table. If you want the XML output to use a tag name different from table name, you can add an "as" clause. The children can themselves have children, though in this case, the hierarchy is only two deep.

Using xmlToSql

xmlToSql converts an XML file into a directory full of tab-separated files and SQL statements for loading a relational database. The program is fast and efficient. On a 334 megabase XML dump I recently received from a gene imaging project, xmlToSql ran in 59 seconds. The tab-separated output size was a mere 23 megabases. xmlToSql's input is an XML file, and DTD and .Stats files produced by autoDtd on the same XML file. xmlToSql does need to be run on a machine with enough RAM to store the entire output database in memory. Because the output is so much smaller than the input, this generally should not be a problem.

Like autoXml, xmlToSql converts XML tag names into mixed-case names legal in both C and SQL. If necessary, it will also rename tag names that correspond to reserved words in either language. If the conversion results in a name that is already used, the program appends successive integers to the name until it is unique.

In general, xmlToSql maintains a one-to-one relationship between tables and tags, and between columns and attributes. There are a few exceptions. xmlToSql needs each table to contain a field it can use as a primary key. If there is an integer attribute or a string attribute 12 characters or less long, and this attribute is unique in each instance of the tag, xmlToSql uses it as the primary key. Otherwise, xmlToSql creates a new integer field ID to use as the primary key. xmlToSql also creates a "foreign key" field in a parent table to link to the child table. This foreign key has the same name as the child table when possible. This field contains 0 or the empty string to indicate a missing optional child. If the parent can have more than one child of this type, instead of a child field in the parent, the program creates a parentToChild table. This table has one field for the parent and one for the child.

The other exception to the one-to-one rule is for nonattributed tags that surround short stretches of text. Rather than creating a separate table for these, xmlToSql treats them as attributes of the parent. By default, tags surrounding text no more than 32 bytes long are turned into attributes, but this is adjustable from the command line.

Inside the Programs

These programs are all built on top of the jkweb.a library. This library is notable for its speed, simplicity, and somewhat unconventional brace placement. The library adheres to some simple object-oriented conventions. In general, for every .c file there is a corresponding .h file, which collectively are a "module." The primary data structure shares the module name and publicly available functions begin with the module name. The code makes little use of global variables and generally avoids static variables to encourage multithreading and discourage side effects. Most data structures begin with a "next" field so that they can easily be made into singly linked lists.

Four library modules in particular are heavily used by the code. The hash module handles all symbol tables. The dtdParse module reads in DTDs. The xp module is a streaming XML parser. The xap module is a light wrapper around xp that stores the results of your startTag callback in a stack.

The xp module started out life as a drop-in replacement for the Expat library. Expat is a fine streaming parser, and in truth probably handles some rare wrinkles of XML and Unicode that xp does not. However, Expat depends on other modules that are hard to find on some platforms. Because of this and also because I wanted to get a little more speed when dealing with huge XML files, I wrote xp, which is 30 percent faster than Expat.

The first version of autoXml (www.linuxjournal.com/article/5949) suffered the same limitations that DOM parsers do—the whole XML had to be loaded in memory. To work around this, I hacked the recursion out of xp, and added parameters to let it seek to a particular tag type before executing any of its callbacks, and then return completely when that tag is finished parsing. Subsequent calls to xp start out where the previous call left off. This enables the iterative reading of data structures from an XML file in Listing Three.

The code in autoDtd and sqlToXml is straightforward, and I refer you to the commented open source. AutoXml is relatively straightforward, too, but it is a C code generator written in C, so there are some heavily escaped lines such as:


  fprintf(f, "fprintf(f, \" %s=\\\"%%%c\\\"\", obj->%s);\n", 
 att->name, fAttType(att->type), att->name);

The xmlToSql program is relatively complex, requiring 1100 lines of code beyond the library routines. Most of this code is concerned with building up data structures based on the DTD and stats files to represent the tables and fields, and if necessary the parentToChild tables. The parent/child relationships are found in the parentKeys member of the table structure. This element is actually a list, because the same child can be found under multiple parents in XML. Once the table and field structures are built up, xmlToSql calls the xap module to stream through the XML. The start-tag callback clears out the string values associated with each field, and then fills in some of these strings from the attributes. The end-tag callback fills in the text field, and then concatenates all of the field strings together into a line. It looks up the concatenated fields in a hash that returns the primary key for that row if it already exists. If the result doesn't exist, it creates a new primary key, stores this back in the hash, and outputs the line to a tab-separated file. The end-tag callback then looks up the parent in the xap stack and fills in the foreign key field in the parent with the primary key. By the time the parent's end-tag callback is executed, the children's end-tag callbacks have filled in all of the foreign key fields.

Conclusion

This suite of XML tools has been interesting to write. Together they take most of the tedium out of XML programming, and make XML files almost as easy to work with as the simple line-oriented text formats UNIX programmers have been happily grepping and awking through for 30 years. If you would like to extend the programs, please do so. Please drop me an e-mail at [email protected], so that I can hook you up with the CVS depository, thus allowing other people to use your improvements, too.