Open-RJ/D, 100-Percent D

By Matthew Wilson, May 01, 2005

Matthew discusses Open-RJ and the D programming language.

May, 2005: Open-RJ/D, 100-Percent D

Matthew Wilson is a software development consultant for Synesis Software, creator of the STLSoft libraries, and author of Imperfect C++ (Addison-Wesley, 2004). He can be contacted at http://imperfectcplusplus.com/.

In this installment of "Positive Integration," I look at the Open-RJ/D mapping. Open-RJ is an open-source project [1] (created by Greg Peet and myself) that implements a reader for the Record-JAR structured text format described in The Art of UNIX Programming [2]. An Open-RJ database consists of one or more records, each of which contains zero or more "fields," each of which is two strings: name + value, separated by a colon. Records are separated by lines that begin with %%. D is a systems programming language that merges many of the best features of C, C++, and other advanced languages.

With this particular library-language combination, I consider how mapping an existing library to another language might be the wrong thing to do when providing a multilanguage library. As an alternative, I've implemented the Open-RJ/D Library entirely in D. This month, I examine the design decisions behind this course of action, and the costs and benefits of the approach. All the code described here is available in Version 1.3.1 of Open-RJ and will also be part of the D Standard Library from Version 0.118 onwards.

Open-RJ/D

In Version 1.2.1 of Open-RJ, there is a classic mapping for the D language. It consists of an interface file and main mapping file (see Table 1). The interface file is just a list of declarations—enums, structures, functions, and so on—for the Open-RJ C Library expressed in D. The main mapping file is a thin object-oriented layer over the underlying C Library, proving Database, Record, and Field classes, along with exception types. However, in Version 1.3.1, the implementation of Open-RJ/D is all D, and does not link to the C Library at all.

There were two reasons for considering a 100-percent pure D implementation of Open-RJ.

First was the potential for reducing the amount of code. In terms of source size, object size, and ease of building, a pure D implementation was likely to be smaller and easier to manage. But by how much?
Second, the Open-RJ Library seemed almost ideally suited to the D language. An Open-RJ database is essentially a sequence of lines of text, within which fields are defined and organized into records. One area in which D performs well is in the manipulation of strings, which pretty much covers the complexity of parsing an Open-RJ database. Coupled with the fact that D is a garbage-collecting language and uses exceptions as its default error-handling paradigm (at least for OOP; it also supports procedural programming with error return codes), it would mean you would have very little (mental or coding) overhead when handling parsing errors. Further, D's use of garbage collection means that the custom memory facilities in the C Library can simply be dropped out of the 100-percent D implementation, saving a little complexity.

(In truth, there was a third factor. Walter Bright, D's creator, was keen to have an Open-RJ implementation in the D Standard Library, but not so keen to have the mapped Open-RJ C Library in there. Since one tends to want more users rather than fewer, I was naturally swayed by this political factor to at least examine the other two.)

Code Sizes

Table 1 lists the files involved in the Open-RJ C Library and the D mapping for Version 1.2.1. It includes the size of each source file along with the resultant compiled object size. Table 2 lists the files involved for Open-RJ/D in Version 1.3.1. Because the D mappings in 1.3.1 are independent of the C Library, those files are not included, though they are of course still required for the other supported languages.

From the tables, you can see that the 100-percent pure D implementation is significantly smaller, going from eight files to one, and from 110,779 bytes to 27,373 bytes (~24 percent). That's clearly a win for simplicity and brevity.

In these times of open source, the inclination is for all source files to carry appropriate identifying information, including author, owner, license, homepage URL, and so on. Thus, all of the aforementioned files carry about 2400 bytes of comments containing this information. But even when you take this into account for something more properly approaching the lines of code measured, you still end up with about 91.5K versus 25K (~27 percent).

Of arguably greater interest to D users are the relative object code sizes. Here, however, the advantage is much reduced: The 100-percent D version saves only 20 percent in code size. You can infer a couple of things from this. First, the implementation of the Open-RJ C Library is pretty tight. Second, the design decision to make the Open-RJ String structure binary-compatible with D was a wise one because the compiled form of the interface file is only 338 bytes. Since the 1.3.1 pure D implementation does not support custom memory management and restricts itself to reading databases from memory, you can probably estimate the object size saving to be around 10-15 percent.

All in all, I'd say that neither the source code nor the object code sizes are overwhelmingly persuasive in the pursuit of 100-percent pure D, though they're certainly encouraging. Rather, it's that the management is dramatically simplified. The impact on the building of the D Standard Library is now the simple addition of rudimentary entries in the UNIX and Win32 makefiles, rather than the much more involved task of handling the .c + .h + .D and sublibraries.

Dry Spot

There's a fine principal in software engineering of only defining things once, variously known as DRY ("Don't Repeat Yourself" [4]) and SPOT ("Single Point Of Truth" [2]). Whatever you call it, the common sense of it is apparent: As soon as there is more than a single point of definition, there's a certain inevitable discomfiture waiting for you when the definitions diverge.

When writing libraries that are intended to work on several platforms, with several compilers, and be mapped to several languages, violating this principle is not a trivial matter. However, in this case, the single point of truth is the definition of the Open-RJ format itself, which is (conceptually, at least) different from the implementation of the Open-RJ C Library. It is based on the Record-JAR format ([1], [5]) but adds the ability to extend lines with trailing backslashes. Although Open-RJ is a useful reference implementation, providing an entirely separate implementation point in the form of Open-RJ/D does not, in principle, violate the format.

Notwithstanding that theoretical perspective, it remains important that the two implementations (along with any others that are "pure") are regularly cross tested, as dialecticism is highly undesirable. Currently, the only points of difference between the two implementations are in the manner of reporting badly formed databases, and not in the types of content that they deem (un)acceptable. The reason for this is that the C implementation has to preprocess the content to convert line-end sequences ('\n' or "\r\n") into the null character '\0', and to coalesce fields that have been line extended by trailing backslash ('\\'). Since Open-RJ/D Database instances are instantiated from memory in the form of char[] (an array of characters, wherein lines are separated by embedded line-end sequences) or char[][] (an array of lines, already split with line sequences removed), some of the lower-level error-reporting mechanisms are simply moot as far as the std.openrj module is concerned.

This decoupling, which is always attractive in principal, is appropriate in D because you can go from a file name to contents in D in a single statement, as in:

char[]     chars    = cast(char[])std.file.read(fileName);
Database  database  = new Database(chars, flags);

In C, you tend to be more forgiving of coupling to stdio to avoid the tedious and resource-loss-risky call sequences involving fopen(), malloc(), ORJ_CreateDatabaseFromMemoryA(), free(), and ORJ_FreeDatabaseA(). Far better to opt for ORJ_ReadDatabaseA() + ORJ_FreeDatabaseA(), and take your coupling lumps.

The Implementation

Listing 1 shows the definition of the Field class. Those familiar with Open-RJ will not be surprised at the simplicity. A Field instance has two private string variables, m_name and m_value, initialized in the constructor, and it makes them available to clients via the name() and value() accessor methods. It also has a private m_record member, which is a reference to a Record instance. This member is not initialized in the constructor because the owning record does not yet exist when its Field instances are created. The Record type, defined within the same module (std.openrj), is able to see the private members of Field and sets this for all its fields within its constructor. The only complexity in the Field class is the comparison operators. To facilitate sorting of arrays of fields—in response to ORJ_FLAG.ORDER_FIELDS; see Listing 2—the Field type must define the int opCmp(Object); method. Because we can't meaningfully compare a Field to anything other than a Field, we ensure it's the right type and pass it off to opCmp(Field);, which evaluates the fields based on their respective names and values.

Listing 3 (available at http://www.cuj.com/code/) contains the definition of the Record class, which has several interesting methods. In the constructor (constructors in D are called this(), irrespective of the class name), the given fields are copied via .DUp into m_fields. If the database is being opened with the ORJ_FLAG.ORDER_FIELDS flag, then this array is sorted. Also, the fields are enumerated, with D's foreach construct, and the associative array m_values are instantiated with the names and values of the fields. (Note that only the first field with a given name is so recorded.) All the other methods are for accessing the fields contained in the record. You can get a copy of the array via the fields() accessor method. Access to elements at a given index is via the subscript operator (opIndex(size_type)). You can query whether a field of a given name exists with the hasField() method. Testing for the presence of a field and retrieving it is done via findField(). Accessing an element with the assumption of its presence is done via getField(). A convenient alternative to getField() is provided by the char[] overload of the subscript operator (opIndex(char[])).

The last two methods are the most interesting because they facilitate the use of a Record instance with D's foreach construct, either as:

Record record = . . . 
foreach(Field field; record)
{
  ... do something with field instance
}

Record record = ... 
foreach(char[] name, char[] value; record)
{
  ... do something with name and value of the field
}

That just leaves us with the Database class in Listing 4 (available at http://www.cuj.com/code/). As with Record and Field, all nonconstructor methods are nonmutating accessors. Records may be accessed via subscript (opIndex(size_type)), via the records array (records()), or via foreach (opApply(...Record...)). They may also be selectively accessed via the getRecordsContainingField(char[] fieldName) and getRecordsContainingField(char[] fieldName, char[] fieldValue) methods, which select records based on field name and field name plus value, respectively.

The Database class also provides access to all fields for all records via arrays (fields()) and foreach (opApply(...Field...)). The complexity in the Database class is mostly in the constructor, or more specifically, in the init_() method that both constructors call. Although it's not a trivial amount of code, the conversion of an array of lines into a database is carried out in just 90 lines of code. Contrasted with the several hundred lines of code in the C implementation, D is incontestably better suited to this kind of work. Slicing operations—s[1 .. s.length - 1]—make it strip elements and compare substrings in place. Array concatenation—s = s1 ~ s2—makes merging continued lines and appending to the field/record arrays a simple matter.

Client Code

Using the Open-RJ/D mapping is a breeze. Enumerate the records and fields in a database:

char[]    chars     = cast(char[])std.file.read(fileName);
Database  database  = new Database(chars, flags);
printf("Records (%u)\n", database.numRecords);
foreach(Record record; database)
{
  printf("  Record\n");
  foreach(Field field; record.fields)
  {
    printf("    Field: %.*s=%.*s\n", field.name, field.value);
  }
}

Enumerate all the fields in a database:

printf("Fields (%u)\n", database.numFields);
foreach(Field field; database)
{
  printf("    Field: %.*s=%.*s\n", field.name, field.value);
}

Extracting the value of a field from a record:

char[] value = record["Name"];

Showing the names of all the records in a database:

Record[]  records = database.getRecordsContainingField("Name");
printf("Names:\n")
foreach(Record record; records)
{
  printf("  %.*s\n", record["Name"]);
}

Performance

There's little point in doing a comparison between implementations without considering relative performance. Table 3 shows the performance differences in loading a Record-JAR file and enumerating its contents for both implementations with a small file (11.3 KB) and a large file (1.3 MB). I must admit the results surprised me. Although I didn't expect the 100-percent D implementation to significantly outperform the C Library/mapping implementation, I did expect it to at least be on par. And with a small file, we see that it is. However, with the large file, it's clear that the implementation of the C Library affords much better performance. Since the times to load the file and instantiate the database, as well as to enumerate the instantiated databases and sum all the field name and value lengths, are taken separately, you cannot attribute the performance difference to, say, the two-allocation strategy of the C Library. Nor can you necessarily suggest that the D Standard Library's std.file.read() function is too slow. To be honest, I am rather stumped at the source of the performance advantage, and this will have to be subjected to further study, about which I'll report at a later date.

Whatever the cause, it's fair to say that for most Record-JAR files, the performance difference will not be significant, so the performance hit won't mitigate against the code size advantages of the 100-percent D implementation.

Acknowledgments

Thanks to Bjorn Karlsson, Greg Peet, Garth Lancaster, Kris Bell, and Walter Bright for their usual bashings and bruisings.

References

[1] http://www.openrj.org/.
[2] Raymond, Eric. The Art of UNIX Programming. Addison-Wesley, 2003.
[3] Digital Mars (http://www.digitalmars.com/d/).
[4] Hunt, Andy and Dave Thomas. The Pragmatic Programmer. Addison-Wesley, 2000.
[5] http://www.faqs.org/docs/artu/ch05s02.html.

1 2 3 4 5 6 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Open-RJ/D, 100-Percent D

Open-RJ/D

Code Sizes

Dry Spot

The Implementation

Client Code

Performance

Acknowledgments

References

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Open-RJ/D, 100-Percent D

Open-RJ/D

Code Sizes

Dry Spot

The Implementation

Client Code

Performance

Acknowledgments

References

Related Reading

News

Commentary

Slideshow

Video

Most Popular

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content