Matthew discusses Open-RJ and the D programming language.
May 01, 2005
URL:http://www.drdobbs.com/open-rjd-100-percent-d/184401961
Matthew Wilson is a software development consultant for Synesis Software, creator of the STLSoft libraries, and author of Imperfect C++ (Addison-Wesley, 2004). He can be contacted at http://imperfectcplusplus.com/.
In this installment of "Positive Integration," I look at the Open-RJ/D mapping. Open-RJ is an open-source project [1] (created by Greg Peet and myself) that implements a reader for the Record-JAR structured text format described in The Art of UNIX Programming [2]. An Open-RJ database consists of one or more records, each of which contains zero or more "fields," each of which is two strings: name + value, separated by a colon. Records are separated by lines that begin with %%. D is a systems programming language that merges many of the best features of C, C++, and other advanced languages.
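To make the format concrete, here is a minimal sketch of a reader for the structure just described, written in Python purely for illustration. It is not the Open-RJ or std.openrj code. The function name parse_record_jar is hypothetical, the sketch ignores line continuation and all error reporting, and it assumes that trailing fields with no closing %% line still form a record, which a real implementation may treat differently.

```python
def parse_record_jar(text):
    """Split Record-JAR text into records; each record is a list of (name, value) pairs."""
    records = []
    current = []
    for line in text.splitlines():
        if line.startswith("%%"):
            # A separator line closes the record collected so far (possibly empty).
            records.append(current)
            current = []
        elif line.strip():
            # Only the first colon separates name from value, so values may contain colons.
            name, _, value = line.partition(":")
            current.append((name.strip(), value.strip()))
    if current:
        # Assumption: fields after the last separator still count as a record.
        records.append(current)
    return records


sample = "Name: Open-RJ\nLanguage: C\n%%\nName: Open-RJ/D\nLanguage: D\n%%\n"
records = parse_record_jar(sample)
```

Representing a record as a list of pairs, rather than a dictionary, keeps the fields in file order and allows a field name to appear more than once.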
With this particular library-language combination, I consider how mapping an existing library to another language might be the wrong thing to do when providing a multilanguage library. As an alternative, I've implemented the Open-RJ/D Library entirely in D. This month, I examine the design decisions behind this course of action, and the costs and benefits of the approach. All the code described here is available in Version 1.3.1 of Open-RJ and will also be part of the D Standard Library from Version 0.118 onwards.
In Version 1.2.1 of Open-RJ, there is a classic mapping for the D language. It consists of an interface file and main mapping file (see Table 1). The interface file is just a list of declarations (enums, structures, functions, and so on) for the Open-RJ C Library expressed in D. The main mapping file is a thin object-oriented layer over the underlying C Library, providing Database, Record, and Field classes, along with exception types. However, in Version 1.3.1, the implementation of Open-RJ/D is all D, and does not link to the C Library at all.
There were two reasons for considering a 100-percent pure D implementation of Open-RJ.
(In truth, there was a third factor. Walter Bright, D's creator, was keen to have an Open-RJ implementation in the D Standard Library, but not so keen to have the mapped Open-RJ C Library in there. Since one tends to want more users rather than fewer, I was naturally swayed by this political factor to at least examine the other two.)
Table 1 lists the files involved in the Open-RJ C Library and the D mapping for Version 1.2.1. It includes the size of each source file along with the resultant compiled object size. Table 2 lists the files involved for Open-RJ/D in Version 1.3.1. Because the D mappings in 1.3.1 are independent of the C Library, those files are not included, though they are of course still required for the other supported languages.
From the tables, you can see that the 100-percent pure D implementation is significantly smaller, going from eight files to one, and from 110,779 bytes to 27,373 bytes (about 25 percent of the original size). That's clearly a win for simplicity and brevity.
In these times of open source, the inclination is for all source files to carry appropriate identifying information, including author, owner, license, homepage URL, and so on. Thus, all of the aforementioned files carry about 2400 bytes of comments containing this information. But even when you take this into account for something more properly approaching the lines of code measured, you still end up with about 91.5K versus 25K (~27 percent).
Of arguably greater interest to D users are the relative object code sizes. Here, however, the advantage is much reduced: The 100-percent D version saves only 20 percent in code size. You can infer a couple of things from this. First, the implementation of the Open-RJ C Library is pretty tight. Second, the design decision to make the Open-RJ String structure binary-compatible with D was a wise one because the compiled form of the interface file is only 338 bytes. Since the 1.3.1 pure D implementation does not support custom memory management and restricts itself to reading databases from memory, you can probably estimate the object size saving to be around 10-15 percent.
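The percentages quoted in this section can be reproduced from the figures in Tables 1 and 2. The following Python sketch is only a worked check of that arithmetic. The variable names are made up for the purpose, and the 2400-byte header is the approximate per-file figure given in the text, so the adjusted ratio is an estimate rather than a measurement.

```python
# Source and object sizes in bytes, copied from Tables 1 and 2.
source_1_2_1 = [28116, 3389, 3282, 36956, 3319, 5927, 17194, 12596]
object_1_2_1 = [7525, 443, 1209, 7743, 338]
source_1_3_1 = 27373
object_1_3_1 = 13751
HEADER_BYTES = 2400  # approximate identifying comments per source file


def percent(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


source_ratio = percent(source_1_3_1, sum(source_1_2_1))
object_ratio = percent(object_1_3_1, sum(object_1_2_1))

# Discount the identifying comments: eight headers in 1.2.1, one in 1.3.1.
adjusted_old = sum(source_1_2_1) - HEADER_BYTES * len(source_1_2_1)
adjusted_new = source_1_3_1 - HEADER_BYTES
adjusted_ratio = percent(adjusted_new, adjusted_old)
```

Here source_ratio and object_ratio correspond to the last row of Table 2, and adjusted_ratio corresponds to the comparison of roughly 91.5K against 25K.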
All in all, I'd say that neither the source code nor the object code sizes are overwhelmingly persuasive in the pursuit of 100-percent pure D, though they're certainly encouraging. Rather, it's that the management is dramatically simplified. The impact on the building of the D Standard Library is now the simple addition of rudimentary entries in the UNIX and Win32 makefiles, rather than the much more involved task of handling the .c + .h + .D and sublibraries.
There's a fine principle in software engineering of only defining things once, variously known as DRY ("Don't Repeat Yourself" [4]) and SPOT ("Single Point Of Truth" [2]). Whatever you call it, the common sense of it is apparent: As soon as there is more than a single point of definition, there's a certain inevitable discomfiture waiting for you when the definitions diverge.
When writing libraries that are intended to work on several platforms, with several compilers, and be mapped to several languages, violating this principle is not a trivial matter. However, in this case, the single point of truth is the definition of the Open-RJ format itself, which is (conceptually, at least) different from the implementation of the Open-RJ C Library. It is based on the Record-JAR format ([1], [5]) but adds the ability to extend lines with trailing backslashes. Although Open-RJ is a useful reference implementation, providing an entirely separate implementation point in the form of Open-RJ/D does not, in principle, violate the format.
Notwithstanding that theoretical perspective, it remains important that the two implementations (along with any others that are "pure") are regularly cross-tested, as dialecticism is highly undesirable. Currently, the only points of difference between the two implementations are in the manner of reporting badly formed databases, and not in the types of content that they deem (un)acceptable. The reason for this is that the C implementation has to preprocess the content to convert line-end sequences ('\n' or "\r\n") into the null character '\0', and to coalesce fields that have been line extended by trailing backslash ('\\'). Since Open-RJ/D Database instances are instantiated from memory in the form of char[] (an array of characters, wherein lines are separated by embedded line-end sequences) or char[][] (an array of lines, already split with line-end sequences removed), some of the lower-level error-reporting mechanisms are simply moot as far as the std.openrj module is concerned.
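The kind of preprocessing involved (splitting on line-end sequences and coalescing lines extended by a trailing backslash) can be sketched as follows. This is a hypothetical Python illustration, not the algorithm used by the C Library or by std.openrj. The name coalesce_continued_lines is an assumption, as are the choices to join continued text with no separator and to keep an unfinished continuation silently, where a real implementation would presumably report an error.

```python
def coalesce_continued_lines(text):
    """Split text into lines (LF or CRLF) and join lines extended by a trailing backslash."""
    lines = text.replace("\r\n", "\n").split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a final line-end sequence does not start another line

    logical_lines = []
    pending = ""  # text carried over from lines that ended in a backslash
    for line in lines:
        if line.endswith("\\"):
            pending += line[:-1]  # drop the backslash and keep accumulating
        else:
            logical_lines.append(pending + line)
            pending = ""
    if pending:
        logical_lines.append(pending)  # unfinished continuation: kept, not reported
    return logical_lines


sample = "Name: Open-RJ\r\nDescription: a reader for \\\r\nthe Record-JAR format\n%%\n"
lines = coalesce_continued_lines(sample)
```

The result is an array of logical lines, which is the same shape as the char[][] form that a Database can be constructed from.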
This decoupling, which is always attractive in principle, is appropriate in D because you can go from a file name to contents in D in a single statement, as in:

    char[] chars = cast(char[])std.file.read(fileName);
    Database database = new Database(chars, flags);

In C, you tend to be more forgiving of coupling to stdio to avoid the tedious and resource-loss-risky call sequences involving fopen(), malloc(), ORJ_CreateDatabaseFromMemoryA(), free(), and ORJ_FreeDatabaseA(). Far better to opt for ORJ_ReadDatabaseA() + ORJ_FreeDatabaseA(), and take your coupling lumps.
    Record record = . . .
    foreach(Field field; record)
    {
      ... do something with field instance
    }

or

    Record record = ...
    foreach(char[] name, char[] value; record)
    {
      ... do something with name and value of the field
    }

That just leaves us with the Database class in Listing 4 (available at http://www.cuj.com/code/). As with Record and Field, all nonconstructor methods are nonmutating accessors. Records may be accessed via subscript (opIndex(size_type)), via the records array (records()), or via foreach (opApply(...Record...)). They may also be selectively accessed via the getRecordsContainingField(char[] fieldName) and getRecordsContainingField(char[] fieldName, char[] fieldValue) methods, which select records based on field name and field name plus value, respectively. The Database class also provides access to all fields for all records via arrays (fields()) and foreach (opApply(...Field...)).

The complexity in the Database class is mostly in the constructor, or more specifically, in the init_() method that both constructors call. Although it's not a trivial amount of code, the conversion of an array of lines into a database is carried out in just 90 lines of code. Contrasted with the several hundred lines of code in the C implementation, D is incontestably better suited to this kind of work. Slicing operations (s[1 .. s.length - 1]) make it easy to strip elements and compare substrings in place. Array concatenation (s = s1 ~ s2) makes merging continued lines and appending to the field/record arrays a simple matter.
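The selective access just described amounts to a filter over the records. The sketch below shows the idea in Python with a hypothetical helper, records_containing_field, whose optional third argument stands in for the two getRecordsContainingField overloads. It is an illustration of the selection logic only, not the std.openrj implementation, and the sample data is made up.

```python
def records_containing_field(records, field_name, field_value=None):
    """Return the records that have a field called field_name.

    If field_value is given, that field's value must match it as well.
    """
    selected = []
    for record in records:
        for name, value in record:
            if name == field_name and (field_value is None or value == field_value):
                selected.append(record)
                break  # one matching field is enough; do not add the record twice
    return selected


# Each record is a list of (name, value) pairs.
database = [
    [("Name", "Open-RJ"), ("Language", "C")],
    [("Name", "Open-RJ/D"), ("Language", "D")],
    [("Comment", "a record with no Name field")],
]
named = records_containing_field(database, "Name")
in_d = records_containing_field(database, "Language", "D")
```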
Enumerating all the records in a database, and the fields of each record:

    char[] chars = cast(char[])std.file.read(fileName);
    Database database = new Database(chars, flags);

    printf("Records (%u)\n", database.numRecords);
    foreach(Record record; database)
    {
      printf(" Record\n");
      foreach(Field field; record.fields)
      {
        printf(" Field: %.*s=%.*s\n", field.name, field.value);
      }
    }

Enumerate all the fields in a database:

    printf("Fields (%u)\n", database.numFields);
    foreach(Field field; database)
    {
      printf(" Field: %.*s=%.*s\n", field.name, field.value);
    }

Extracting the value of a field from a record:

    char[] value = record["Name"];

Showing the names of all the records in a database:

    Record[] records = database.getRecordsContainingField("Name");
    printf("Names:\n");
    foreach(Record record; records)
    {
      printf(" %.*s\n", record["Name"]);
    }

There's little point in doing a comparison between implementations without considering relative performance. Table 3 shows the performance differences in loading a Record-JAR file and enumerating its contents for both implementations with a small file (11.3 KB) and a large file (1.3 MB). I must admit the results surprised me. Although I didn't expect the 100-percent D implementation to significantly outperform the C Library/mapping implementation, I did expect it to at least be on par. And with a small file, we see that it is. However, with the large file, it's clear that the implementation of the C Library affords much better performance. Since the times to load the file and instantiate the database, as well as to enumerate the instantiated databases and sum all the field name and value lengths, are taken separately, you cannot attribute the performance difference to, say, the two-allocation strategy of the C Library. Nor can you necessarily suggest that the D Standard Library's std.file.read() function is too slow. To be honest, I am rather stumped at the source of the performance advantage, and this will have to be subjected to further study, about which I'll report at a later date.
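The relative figures in Table 3 are simply the 100-percent D timings divided by the corresponding C Library/mapping timings. As a quick check of that arithmetic, here is a small Python sketch. The dictionary layout and names are my own, and only the numbers come from the table.

```python
# Timings copied from Table 3, as (loading, enumerating) pairs.
timings = {
    "small": {"c_mapping": (2186.2, 51.8), "pure_d": (2210.6, 55.0)},
    "large": {"c_mapping": (209555.8, 2887.4), "pure_d": (485452.0, 6786.0)},
}


def relative_percent(pure_d, c_mapping):
    """Express each pure-D timing as a percentage of the C Library + mapping timing."""
    return tuple(round(100.0 * d / c, 1) for d, c in zip(pure_d, c_mapping))


relative = {
    size: relative_percent(row["pure_d"], row["c_mapping"])
    for size, row in timings.items()
}
```

A value just above 100 means near parity, as with the small file. The values for the large file show the 100-percent D version taking more than twice as long for both loading and enumerating.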
Whatever the cause, it's fair to say that for most Record-JAR files, the performance difference will not be significant, so the performance hit won't militate against the code size advantages of the 100-percent D implementation.
Thanks to Bjorn Karlsson, Greg Peet, Garth Lancaster, Kris Bell, and Walter Bright for their usual bashings and bruisings.
    class Field
    {
    /// \name Construction
    private:
      this(char[] name, char[] value)
      in
      {
        assert(null !== name);
        assert(null !== value);
      }
      body
      {
        m_name  = name;
        m_value = value;
      }

    /// \name Attributes
    public:
      char[] name()   { return m_name; }
      char[] value()  { return m_value; }
      Record record() { return m_record; }

    /// \name Comparison
    public:
      int opCmp(Object rhs)
      {
        Field f = cast(Field)(rhs);
        if(null === f)
        {
          throw new InvalidTypeException("Attempt to compare a Field with an instance of another type");
        }
        return opCmp(f);
      }
    public:
      int opCmp(Field rhs)
      {
        int res;
        if(this === rhs)
        {
          res = 0;
        }
        else
        {
          res = std.string.cmp(m_name, rhs.m_name);
          if(0 == res)
          {
            res = std.string.cmp(m_value, rhs.m_value);
          }
        }
        return res;
      }

    // Members
    private:
      char[] m_name;
      char[] m_value;
      Record m_record;
    }

    /** Flags that moderate the creation of Databases */
    public enum ORJ_FLAG
    {
        ORDER_FIELDS        = 0x0001
      , ELIDE_BLANK_RECORDS = 0x0002
    }

    /** General error codes */
    public enum ORJRC
    {
        SUCCESS = 0
      , CANNOT_OPEN_JAR_FILE
      , NO_RECORDS
      , OUT_OF_MEMORY
      , BAD_FILE_READ
      , PARSE_ERROR
      , INVALID_INDEX
      , UNEXPECTED
      , INVALID_CONTENT
    }

    /** Parsing error codes */
    public enum ORJ_PARSE_ERROR
    {
        SUCCESS = 0
      , RECORD_SEPARATOR_IN_CONTINUATION
      , UNFINISHED_LINE
      , UNFINISHED_FIELD
      , UNFINISHED_RECORD
    }

Table 1: Files in the Open-RJ C Library (a) and the D mapping (b) for Version 1.2.1. Sizes are in bytes.

File | Purpose | Source Size | Object Size
(a) Open-RJ C Library | | |
/include/openrj/openrj.h | Main include file | 28,116 |
/include/openrj/openrj_memory.h | Custom memory functions | 3,389 |
/include/openrj/openrj_assert.h | Assertions | 3,282 |
/src/orjapi.c | Main implementation | 36,956 | 7,525
/src/orjmem.c | Custom memory functions | 3,319 | 443
/src/orjstr.c | Error strings | 5,927 | 1,209
(b) D mapping | | |
std/openrj.d | Main mapping implementation | 17,194 | 7,743
std/c/openrj.d | C library interface file | 12,596 | 338
Total | | 110,779 | 17,258

Table 2: Files in Open-RJ/D for Version 1.3.1. Sizes are in bytes.

File | Purpose | Source Size | Object Size
std/openrj.d | Implementation | 27,373 | 13,751
Total | | 27,373 | 13,751
Total as % of 1.2.1 | | 24.7% | 79.7%

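Table 3: Times for loading a Record-JAR file and enumerating its contents, for both implementations.

File | Implementation | Loading | Enumerating
Small file (11630 bytes) | C Library + D mapping | 2186.2 | 51.8
Small file (11630 bytes) | 100% D | 2210.6 | 55
Small file (11630 bytes) | 100% D relative to C Library + D mapping | 101.1% | 106.2%
Large file (1356423 bytes) | C Library + D mapping | 209555.8 | 2887.4
Large file (1356423 bytes) | 100% D | 485452 | 6786
Large file (1356423 bytes) | 100% D relative to C Library + D mapping | 231.7% | 235.0%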