Dr. Dobb's | Open-RJ/D, 100-Percent D

Open-RJ/D, 100-Percent D

Matthew discusses Open-RJ and the D programming language.

May 01, 2005
URL:http://www.drdobbs.com/open-rjd-100-percent-d/184401961

May, 2005: Open-RJ/D, 100-Percent D

Matthew Wilson is a software development consultant for Synesis Software, creator of the STLSoft libraries, and author of Imperfect C++ (Addison-Wesley, 2004). He can be contacted at http://imperfectcplusplus.com/.

In this installment of "Positive Integration," I look at the Open-RJ/D mapping. Open-RJ is an open-source project [1] (created by Greg Peet and myself) that implements a reader for the Record-JAR structured text format described in The Art of UNIX Programming [2]. An Open-RJ database consists of one or more records, each of which contains zero or more "fields," each of which is two strings: name + value, separated by a colon. Records are separated by lines that begin with %%. D is a systems programming language that merges many of the best features of C, C++, and other advanced languages.

With this particular library-language combination, I consider how mapping an existing library to another language might be the wrong thing to do when providing a multilanguage library. As an alternative, I've implemented the Open-RJ/D Library entirely in D. This month, I examine the design decisions behind this course of action, and the costs and benefits of the approach. All the code described here is available in Version 1.3.1 of Open-RJ and will also be part of the D Standard Library from Version 0.118 onwards.

Open-RJ/D

In Version 1.2.1 of Open-RJ, there is a classic mapping for the D language. It consists of an interface file and main mapping file (see Table 1). The interface file is just a list of declarations—enums, structures, functions, and so on—for the Open-RJ C Library expressed in D. The main mapping file is a thin object-oriented layer over the underlying C Library, proving Database, Record, and Field classes, along with exception types. However, in Version 1.3.1, the implementation of Open-RJ/D is all D, and does not link to the C Library at all.

There were two reasons for considering a 100-percent pure D implementation of Open-RJ.

First was the potential for reducing the amount of code. In terms of source size, object size, and ease of building, a pure D implementation was likely to be smaller and easier to manage. But by how much?
Second, the Open-RJ Library seemed almost ideally suited to the D language. An Open-RJ database is essentially a sequence of lines of text, within which fields are defined and organized into records. One area in which D performs well is in the manipulation of strings, which pretty much covers the complexity of parsing an Open-RJ database. Coupled with the fact that D is a garbage-collecting language and uses exceptions as its default error-handling paradigm (at least for OOP; it also supports procedural programming with error return codes), it would mean you would have very little (mental or coding) overhead when handling parsing errors. Further, D's use of garbage collection means that the custom memory facilities in the C Library can simply be dropped out of the 100-percent D implementation, saving a little complexity.

(In truth, there was a third factor. Walter Bright, D's creator, was keen to have an Open-RJ implementation in the D Standard Library, but not so keen to have the mapped Open-RJ C Library in there. Since one tends to want more users rather than fewer, I was naturally swayed by this political factor to at least examine the other two.)

Code Sizes

Table 1 lists the files involved in the Open-RJ C Library and the D mapping for Version 1.2.1. It includes the size of each source file along with the resultant compiled object size. Table 2 lists the files involved for Open-RJ/D in Version 1.3.1. Because the D mappings in 1.3.1 are independent of the C Library, those files are not included, though they are of course still required for the other supported languages.

From the tables, you can see that the 100-percent pure D implementation is significantly smaller, going from eight files to one, and from 110,779 bytes to 27,373 bytes (~24 percent). That's clearly a win for simplicity and brevity.

In these times of open source, the inclination is for all source files to carry appropriate identifying information, including author, owner, license, homepage URL, and so on. Thus, all of the aforementioned files carry about 2400 bytes of comments containing this information. But even when you take this into account for something more properly approaching the lines of code measured, you still end up with about 91.5K versus 25K (~27 percent).

Of arguably greater interest to D users are the relative object code sizes. Here, however, the advantage is much reduced: The 100-percent D version saves only 20 percent in code size. You can infer a couple of things from this. First, the implementation of the Open-RJ C Library is pretty tight. Second, the design decision to make the Open-RJ String structure binary-compatible with D was a wise one because the compiled form of the interface file is only 338 bytes. Since the 1.3.1 pure D implementation does not support custom memory management and restricts itself to reading databases from memory, you can probably estimate the object size saving to be around 10-15 percent.

All in all, I'd say that neither the source code nor the object code sizes are overwhelmingly persuasive in the pursuit of 100-percent pure D, though they're certainly encouraging. Rather, it's that the management is dramatically simplified. The impact on the building of the D Standard Library is now the simple addition of rudimentary entries in the UNIX and Win32 makefiles, rather than the much more involved task of handling the .c + .h + .D and sublibraries.

Dry Spot

There's a fine principal in software engineering of only defining things once, variously known as DRY ("Don't Repeat Yourself" [4]) and SPOT ("Single Point Of Truth" [2]). Whatever you call it, the common sense of it is apparent: As soon as there is more than a single point of definition, there's a certain inevitable discomfiture waiting for you when the definitions diverge.

When writing libraries that are intended to work on several platforms, with several compilers, and be mapped to several languages, violating this principle is not a trivial matter. However, in this case, the single point of truth is the definition of the Open-RJ format itself, which is (conceptually, at least) different from the implementation of the Open-RJ C Library. It is based on the Record-JAR format ([1], [5]) but adds the ability to extend lines with trailing backslashes. Although Open-RJ is a useful reference implementation, providing an entirely separate implementation point in the form of Open-RJ/D does not, in principle, violate the format.

Notwithstanding that theoretical perspective, it remains important that the two implementations (along with any others that are "pure") are regularly cross tested, as dialecticism is highly undesirable. Currently, the only points of difference between the two implementations are in the manner of reporting badly formed databases, and not in the types of content that they deem (un)acceptable. The reason for this is that the C implementation has to preprocess the content to convert line-end sequences ('\n' or "\r\n") into the null character '\0', and to coalesce fields that have been line extended by trailing backslash ('\\'). Since Open-RJ/D Database instances are instantiated from memory in the form of char[] (an array of characters, wherein lines are separated by embedded line-end sequences) or char[][] (an array of lines, already split with line sequences removed), some of the lower-level error-reporting mechanisms are simply moot as far as the std.openrj module is concerned.

This decoupling, which is always attractive in principal, is appropriate in D because you can go from a file name to contents in D in a single statement, as in:

char[]     chars    = cast(char[])std.file.read(fileName);
Database  database  = new Database(chars, flags);

In C, you tend to be more forgiving of coupling to stdio to avoid the tedious and resource-loss-risky call sequences involving fopen(), malloc(), ORJ_CreateDatabaseFromMemoryA(), free(), and ORJ_FreeDatabaseA(). Far better to opt for ORJ_ReadDatabaseA() + ORJ_FreeDatabaseA(), and take your coupling lumps.

The Implementation

Listing 1 shows the definition of the Field class. Those familiar with Open-RJ will not be surprised at the simplicity. A Field instance has two private string variables, m_name and m_value, initialized in the constructor, and it makes them available to clients via the name() and value() accessor methods. It also has a private m_record member, which is a reference to a Record instance. This member is not initialized in the constructor because the owning record does not yet exist when its Field instances are created. The Record type, defined within the same module (std.openrj), is able to see the private members of Field and sets this for all its fields within its constructor. The only complexity in the Field class is the comparison operators. To facilitate sorting of arrays of fields—in response to ORJ_FLAG.ORDER_FIELDS; see Listing 2—the Field type must define the int opCmp(Object); method. Because we can't meaningfully compare a Field to anything other than a Field, we ensure it's the right type and pass it off to opCmp(Field);, which evaluates the fields based on their respective names and values.

Listing 3 (available at http://www.cuj.com/code/) contains the definition of the Record class, which has several interesting methods. In the constructor (constructors in D are called this(), irrespective of the class name), the given fields are copied via .DUp into m_fields. If the database is being opened with the ORJ_FLAG.ORDER_FIELDS flag, then this array is sorted. Also, the fields are enumerated, with D's foreach construct, and the associative array m_values are instantiated with the names and values of the fields. (Note that only the first field with a given name is so recorded.) All the other methods are for accessing the fields contained in the record. You can get a copy of the array via the fields() accessor method. Access to elements at a given index is via the subscript operator (opIndex(size_type)). You can query whether a field of a given name exists with the hasField() method. Testing for the presence of a field and retrieving it is done via findField(). Accessing an element with the assumption of its presence is done via getField(). A convenient alternative to getField() is provided by the char[] overload of the subscript operator (opIndex(char[])).

The last two methods are the most interesting because they facilitate the use of a Record instance with D's foreach construct, either as:

Record record = . . . 
foreach(Field field; record)
{
  ... do something with field instance
}

Record record = ... 
foreach(char[] name, char[] value; record)
{
  ... do something with name and value of the field
}

That just leaves us with the Database class in Listing 4 (available at http://www.cuj.com/code/). As with Record and Field, all nonconstructor methods are nonmutating accessors. Records may be accessed via subscript (opIndex(size_type)), via the records array (records()), or via foreach (opApply(...Record...)). They may also be selectively accessed via the getRecordsContainingField(char[] fieldName) and getRecordsContainingField(char[] fieldName, char[] fieldValue) methods, which select records based on field name and field name plus value, respectively.

The Database class also provides access to all fields for all records via arrays (fields()) and foreach (opApply(...Field...)). The complexity in the Database class is mostly in the constructor, or more specifically, in the init_() method that both constructors call. Although it's not a trivial amount of code, the conversion of an array of lines into a database is carried out in just 90 lines of code. Contrasted with the several hundred lines of code in the C implementation, D is incontestably better suited to this kind of work. Slicing operations—s[1 .. s.length - 1]—make it strip elements and compare substrings in place. Array concatenation—s = s1 ~ s2—makes merging continued lines and appending to the field/record arrays a simple matter.

Client Code

Using the Open-RJ/D mapping is a breeze. Enumerate the records and fields in a database:

char[]    chars     = cast(char[])std.file.read(fileName);
Database  database  = new Database(chars, flags);
printf("Records (%u)\n", database.numRecords);
foreach(Record record; database)
{
  printf("  Record\n");
  foreach(Field field; record.fields)
  {
    printf("    Field: %.*s=%.*s\n", field.name, field.value);
  }
}

Enumerate all the fields in a database:

printf("Fields (%u)\n", database.numFields);
foreach(Field field; database)
{
  printf("    Field: %.*s=%.*s\n", field.name, field.value);
}

Extracting the value of a field from a record:

char[] value = record["Name"];

Showing the names of all the records in a database:

Record[]  records = database.getRecordsContainingField("Name");
printf("Names:\n")
foreach(Record record; records)
{
  printf("  %.*s\n", record["Name"]);
}

Performance

There's little point in doing a comparison between implementations without considering relative performance. Table 3 shows the performance differences in loading a Record-JAR file and enumerating its contents for both implementations with a small file (11.3 KB) and a large file (1.3 MB). I must admit the results surprised me. Although I didn't expect the 100-percent D implementation to significantly outperform the C Library/mapping implementation, I did expect it to at least be on par. And with a small file, we see that it is. However, with the large file, it's clear that the implementation of the C Library affords much better performance. Since the times to load the file and instantiate the database, as well as to enumerate the instantiated databases and sum all the field name and value lengths, are taken separately, you cannot attribute the performance difference to, say, the two-allocation strategy of the C Library. Nor can you necessarily suggest that the D Standard Library's std.file.read() function is too slow. To be honest, I am rather stumped at the source of the performance advantage, and this will have to be subjected to further study, about which I'll report at a later date.

Whatever the cause, it's fair to say that for most Record-JAR files, the performance difference will not be significant, so the performance hit won't mitigate against the code size advantages of the 100-percent D implementation.

Acknowledgments

Thanks to Bjorn Karlsson, Greg Peet, Garth Lancaster, Kris Bell, and Walter Bright for their usual bashings and bruisings.

References

[1] http://www.openrj.org/.
[2] Raymond, Eric. The Art of UNIX Programming. Addison-Wesley, 2003.
[3] Digital Mars (http://www.digitalmars.com/d/).
[4] Hunt, Andy and Dave Thomas. The Pragmatic Programmer. Addison-Wesley, 2000.
[5] http://www.faqs.org/docs/artu/ch05s02.html.

May, 2005: Open-RJ/D, 100-Percent D

Listing 1

class Field
{
/// \name Construction
private:
  this(char[] name, char[] value)
  in
  {
    assert(null !== name);
    assert(null !== value);
  }
  body
  {
    m_name  = name;
    m_value = value;
  }
/// \name Attributes
public:
  char[]  name()
  {
    return m_name;
  }
  char[]  value()
  {
    return m_value;
  }
  Record record()
  {
    return m_record;
  }
/// \name Comparison
public:
  int opCmp(Object rhs)
  {
    Field f = cast(Field)(rhs);
    if(null === f)
    {
      throw new InvalidTypeException("Attempt to compare a Field 
                                with an instance of another type");
    }
    return opCmp(f);
  }
public:

  int opCmp(Field rhs)
  {
    int res;
    if(this === rhs)
    {
      res = 0;
    }
    else
    {
      res = std.string.cmp(m_name, rhs.m_name);
      if(0 == res)
      {
        res = std.string.cmp(m_value, rhs.m_value);
      }
    }
    return res;
  }
// Members
private:
  char[]  m_name;
  char[]  m_value;
  Record  m_record;
}

May, 2005: Open-RJ/D, 100-Percent D

Listing 2

/** Flags that moderate the creation of Databases */
public enum ORJ_FLAG
{
    ORDER_FIELDS          = 0x0001
  , ELIDE_BLANK_RECORDS   = 0x0002
}

/** General error codes */
public enum ORJRC
{
    SUCCESS            =   0      
  , CANNOT_OPEN_JAR_FILE            
  , NO_RECORDS                  
  , OUT_OF_MEMORY               
  , BAD_FILE_READ               
  , PARSE_ERROR                 
  , INVALID_INDEX               
  , UNEXPECTED                  
  , INVALID_CONTENT               
}

/** Parsing error codes */
public enum ORJ_PARSE_ERROR
{
    SUCCESS             = 0   
  , RECORD_SEPARATOR_IN_CONTINUATION      
  , UNFINISHED_LINE               
  , UNFINISHED_FIELD              

  , UNFINISHED_RECORD             
}

May, 2005: Open-RJ/D, 100-Percent D

Table 1: Code involved in Open-RJ/D with Version 1.2.1. (a) C Library; (b) D mapping.

File	Purpose	Source Size	Object Size
(a)
/include/openrj/openrj.h	Main include file	28,116	—
/include/openrj/openrj_memory.h	Custom memory functions	3,389	—
/include/openrj/openrj_assert.h	Assertions	3,282	—
/src/orjapi.c	Main implementation	36,956	7,525
/src/orjmem.c	Custom memory functions	3,319	443
/src/orjstr.c	Error strings	5,927	1,209
(b)
std/openrj.d	Main mapping implementation	17,194	7,743
std/c/openrj.d	C library interface file	12,596	338
Total		110,779	17,258

May, 2005: Open-RJ/D, 100-Percent D

Table 2: Code involved in Open-RJ/D with Version 1.3.1. D implementation.

File	Purpose	Source Size	Object Size
std/openrj.d	Implementation	27,373	13,751
Total		27,373	13,751
Total as % of 1.2.1		24.7%	79.7%

May, 2005: Open-RJ/D, 100-Percent D

Table 3: Relative performance between 100-percent D and C Library plus D mapping. Times in microseconds.

Small file (11630 bytes)				Large file (1356423 bytes)
C Library + D mapping		100% D		C Library + D mapping		100% D
Loading	Enumerating	Loading	Enumerating	Loading	Enumerating	Loading	Enumerating
2186.2	51.8	2210.6	55	209555.8	2887.4	485452	6786
Relative:	1	01.1%	106.2%	Relative:		231.7%	235.0%