Open-RJ and Ch

By Matthew Wilson, November 01, 2004

Matthew takes a look at the Open-RJ library, along with its mapping to Ch and C++.NET.


Matthew Wilson is a software development consultant for Synesis Software, creator of the STLSoft libraries, and author of Imperfect C++ (Addison-Wesley, 2004). He can be contacted at http://imperfectcplusplus.com/.

The first six installments of this column featured recls—a platform-independent library that provides recursive filesystem searching—and demonstrated techniques for integrating such a C/C++ library with C++ ("normal" classes and STL sequences), C#, COM, D, Java, and Ruby by implementing recls mappings for each of those languages. This month, I take a break from recls, looking instead at the Open-RJ open-source library (http://www.openrj.org/), along with its mapping to the Ch and C++.NET languages.

Open-RJ, a new open-source project that Greg Peet and I recently launched, is a reader for the Record-Jar structured text-file format, implemented in C. (Record-Jar is described in The Art of UNIX Programming, by Eric Raymond, Addison-Wesley, 2003.) Each Record-Jar file consists of a Database, which is composed of one or more Records, each of which is composed of one or more Fields. A field is a simple pair of Name and Value, separated by the first colon on the field line. All unenclosed whitespace is ignored. All blank lines are ignored. Lines may be extended by use of a trailing backslash. Records are separated by lines whose first two characters are %%, and these lines also act as comments. Here is an example of a Record-Jar file:

%% Pets Database

Name:    Barney
Species: Dog
Breed:   Bijon \
      Frieze
%% 

Name:    Samson
Species: Dog
Breed:   Ridgeback
%% 

Name:    Fluffy Kitten
Species: Cat
%%

There are four records in this file—the three pets and the empty record delimited by the first comment. Barney's breed is split over two lines, but the whitespace preceding "Frieze" is elided, yielding "Bijon Frieze". (By the way, I only categorize Barney as a dog as a favor to my mum. He's not really what I'd class as a proper dog!) One of the options that can be specified to the library is to elide blank records, the other being to sort the fields in a record by name. There are no constraints whatsoever regarding the fields that comprise a record. Different records may contain different fields, and even several fields with the same name. It's entirely up to the users of the library.
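To make the field rule concrete, here is a minimal sketch in C of splitting a field line at its first colon and recognizing a record separator. It is not the Open-RJ implementation: the names slice_t, trim_range(), split_field(), and is_record_separator() are invented for illustration, and the sketch only trims leading and trailing whitespace, ignoring line continuation.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A string slice: a pointer and a length, not NUL-terminated.
 * (Hypothetical type, for illustration only.) */
typedef struct
{
    const char *ptr;
    size_t      len;
} slice_t;

/* Trims leading and trailing whitespace from the range [begin, end). */
slice_t trim_range(const char *begin, const char *end)
{
    slice_t s;

    while (begin < end && isspace((unsigned char)*begin))
        ++begin;
    while (end > begin && isspace((unsigned char)end[-1]))
        --end;

    s.ptr = begin;
    s.len = (size_t)(end - begin);
    return s;
}

/* Splits "Name: Value" at the FIRST colon, so the value may itself
 * contain colons. Returns 1 on success, 0 if there is no colon. */
int split_field(const char *line, slice_t *name, slice_t *value)
{
    const char *colon;

    assert(line != NULL && name != NULL && value != NULL);

    colon = strchr(line, ':');
    if (colon == NULL)
        return 0;

    *name  = trim_range(line, colon);
    *value = trim_range(colon + 1, line + strlen(line));
    return 1;
}

/* Returns 1 if the first two characters of the line are %%. */
int is_record_separator(const char *line)
{
    assert(line != NULL);
    return line[0] == '%' && line[1] == '%';
}
```

Given the line "Species: Dog", split_field() yields the name "Species" and the value "Dog". A value such as "10:30" survives intact because only the first colon is treated as the separator.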

The API has two main functions: ORJ_ReadDatabaseA() and ORJ_FreeDatabaseA(). (The library currently supports ANSI forms only, hence the "A" postfix.) They have the following signatures:

ORJ_CALL(ORJRC) ORJ_ReadDatabaseA(
  /*[in]*/ char const *jarName
, /*[in]*/ IORJAllocator *ator
, /*[in]*/ unsigned flags
, /*[out]*/ ORJDatabaseA const
                       **pdatabase
, /*[out]*/ ORJError *error);
ORJ_CALL(ORJRC) ORJ_FreeDatabaseA(
  /*[in]*/ ORJDatabaseA const
                      *database);

ORJ_ReadDatabaseA() takes the filename, a pointer to an optional allocator "interface," and flags that modify its behavior. If the file can be accessed and is correctly formatted, then it returns a pointer to an ORJDatabaseA structure that is allocated internally. If the function fails for any reason, it fills in the optional ORJError structure to provide information—type, file, line—on any parsing errors.

Listing 1 presents the structures for the API. The ORJStringA structure represents a string as a length and a pointer. The ORJFieldA structure consists of two string structures, for the name and the value, along with some members used internally to the API. The ORJRecordA structure is a counted array of fields, consisting of a number of fields and a pointer to (an array of) fields. The ORJDatabaseA structure is similarly a counted array of records, with a number of records and a pointer to (an array of) records. However, it also provides access to all fields in the database, since some client code may want to ignore the record structure and just deal with fields. The structure also contains the number of lines in the database and the allocator pointer provided to ORJ_ReadDatabaseA().
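For readers without Listing 1 to hand, the shape of these structures can be sketched as follows. This is an approximation built only from the description above: the demo_ type names, the member names, and the two helper functions are assumptions, and the real definitions (including the internal members of ORJFieldA and the allocator pointer) are in openrj.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for the Open-RJ structures; names and members are assumed. */
typedef struct
{
    size_t      len;   /* number of characters */
    const char *ptr;   /* first character */
} demo_string_t;

typedef struct
{
    demo_string_t name;
    demo_string_t value;
} demo_field_t;

typedef struct
{
    size_t              numFields;
    const demo_field_t *fields;    /* counted array of fields */
} demo_record_t;

typedef struct
{
    size_t               numLines;
    size_t               numFields;
    const demo_field_t  *fields;    /* every field, ignoring records */
    size_t               numRecords;
    const demo_record_t *records;   /* counted array of records */
} demo_database_t;

/* Returns the first field in the record with the given name, or NULL. */
const demo_field_t *demo_find_field(const demo_record_t *record, const char *name)
{
    size_t i;
    size_t n;

    assert(record != NULL && name != NULL);

    n = strlen(name);
    for (i = 0; i < record->numFields; ++i)
    {
        const demo_field_t *f = &record->fields[i];

        if (f->name.len == n && memcmp(f->name.ptr, name, n) == 0)
            return f;
    }
    return NULL;
}

/* Counts the records that contain at least one field with the given name. */
size_t demo_count_records_with(const demo_database_t *db, const char *name)
{
    size_t i;
    size_t count = 0;

    assert(db != NULL && name != NULL);

    for (i = 0; i < db->numRecords; ++i)
    {
        if (demo_find_field(&db->records[i], name) != NULL)
            ++count;
    }
    return count;
}
```

Because every level is a counted array, client code walks the database with plain index loops and needs no iterator functions from the library.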

The ORJ_ReadDatabaseA() function creates a single contiguous block containing all the information in the database. This is done in two steps. First, the file contents are read into an allocated block, and then processed to coalesce extended lines, remove comments, and translate linefeeds into the null character ('\0'). This process also counts the number of lines, fields, and records, and the block is reallocated with enough space to incorporate all the string, field, record, and database structures in one block. Because the library is concerned solely with reading extant Record-Jar files, the structures do not change once created, so this two-step allocation avoids heap fragmentation (and potentials for memory leaks), and the ORJ_FreeDatabaseA() function simply frees the single block. Further, since most uses of the library are likely to be in reading configuration data, it's probable that they'll occur when the application is starting up and running only its single main thread, in which case the reallocation is simply an inexpensive expansion of the original allocated block.
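The two-step, single-block idea can be illustrated with a much smaller problem: turning a text buffer into a table of lines. The sketch below is not Open-RJ's code, and its layout (header, then line array, then characters) is an assumption chosen for simplicity. The names line_t, line_table_t, line_table_create(), and line_table_destroy() are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct
{
    size_t      len;
    const char *ptr;
} line_t;

typedef struct
{
    size_t  numLines;
    line_t *lines;
} line_table_t;

/* Builds one heap block laid out as:
 *   [ line_table_t | line_t[numLines] | text, with '\n' replaced by '\0' ]
 * Returns NULL if memory cannot be allocated. */
line_table_t *line_table_create(const char *text)
{
    size_t        textLen;
    size_t        numLines = 0;
    size_t        headSize;
    size_t        i;
    char         *chars;
    char         *block;
    char         *p;
    line_table_t *table;
    line_t       *lines;

    assert(text != NULL);
    textLen = strlen(text);

    /* Step 1: copy the raw text, translate linefeeds, and count lines. */
    chars = malloc(textLen + 1);
    if (chars == NULL)
        return NULL;
    memcpy(chars, text, textLen + 1);
    for (i = 0; i < textLen; ++i)
    {
        if (chars[i] == '\n')
        {
            chars[i] = '\0';
            ++numLines;
        }
    }
    if (textLen > 0 && text[textLen - 1] != '\n')
        ++numLines; /* final line without a trailing linefeed */

    /* Step 2: grow the same block so it can also hold the structures. */
    headSize = sizeof(line_table_t) + numLines * sizeof(line_t);
    block = realloc(chars, headSize + textLen + 1);
    if (block == NULL)
    {
        free(chars);
        return NULL;
    }
    memmove(block + headSize, block, textLen + 1); /* shift text to the tail */

    table = (line_table_t *)block;
    lines = (line_t *)(block + sizeof(line_table_t));
    p     = block + headSize;
    for (i = 0; i < numLines; ++i)
    {
        lines[i].ptr = p;
        lines[i].len = strlen(p);
        p += lines[i].len + 1;
    }
    table->numLines = numLines;
    table->lines    = lines;
    return table;
}

/* Everything lives in one block, so one free() releases it all. */
void line_table_destroy(line_table_t *table)
{
    free(table);
}
```

The payoff is in line_table_destroy(): since no structure owns a separate allocation, there is nothing to walk and nothing to forget when freeing.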

Currently, Open-RJ is mapped to C++ (regular classes), Ch, C++.NET, D, Ruby, and STL. Like recls, I plan to map it to other languages and technologies in due course. In this article, I look at the Ch and C++.NET mappings; others may be covered in future columns.

Lessons Learned

My experience with mapping recls has naturally been reflected in the design of Open-RJ. Put another way, Open-RJ has learned from my mistakes with recls. There are several notable examples.

recls represents a string as a pair of pointers, which was, in hindsight, a mistake. I think I was too hung up on mapping to STL, and should have considered other languages. Open-RJ represents a string as a pointer and a length. Since D represents its strings as slices of character arrays (which lends high efficiency when processing text-based information), I defined the Open-RJ string structure to be binary compatible with D's: a 4-byte length followed by a pointer to the first element in an array. This considerably simplified the Open-RJ-D mapping.

Open-RJ uses the widely recognized BSD license, rather than the more restrictive Synesis Software Standard Source License, which is currently used for recls. (I also plan to change to BSD for recls and STLSoft in the not-too-distant future.)

The organization of the library header file directories is better than it currently is with recls. The C++ and STL mapping header files are located under the main <openrj-root>/include/ directory, under the cpp and stl subdirectories, respectively. This means that one now includes in the following form:

	#include <openrj/cpp/database.hpp>

rather than with recls where the <recls-root>/Mappings/Cpp has to be included in the search paths along with <recls-root>/include. This is clearly an improvement in usability. I plan to evolve recls in the same vein, and the next release will contain the first step in this regard: New headers in the <recls-root>/include... subdirectories will include the original <recls-root>/Mappings headers via #include "" of the relative paths. This will provide for an easier migration to the new form, and a subsequent release will swap the roles and content of the old/new headers before the old ones are finally retired. Only at that point will the changes break existing code, so there'll be a lot of lead-in time.
The corollary to this is that the C-API headers are also made relative; that is, they are located in <openrj-root>/include/openrj, so including the Open-RJ root header is done by:

	#include <openrj/openrj.h>

Ch

Ch is an embeddable C interpreter from SoftIntegration (http://www.softintegration.com/) that provides a scriptable form of C. It is freely available for commercial and noncommercial use in its Standard Edition, and the Professional Edition is freely available for academic use.

To understand how to extend Ch, you need to look at its runtime architecture. Essentially, Ch scripts run in the Ch scriptspace, and connect to compiled C/C++ libraries by loading them dynamically and invoking their functions. What is unusual to me, given my experience in mapping to other languages over the last few years, is that the transformation between the two execution contexts occurs within a script, rather than a compiled component. Let me illustrate with a look at how we implement the ORJ_ReadDatabaseA() function for Ch.

The first important point, which is both surprising and pleasing, is that the Ch script client code is able to include the same header file as the library; for example, openrj.h. This required only two changes to this root header.

I'm not going to show a listing for the Ch test script, since it looks almost exactly like the Open-RJ C test program (available in the Open-RJ distribution). It's just pure C, without any need for additional #defines or other code constructs.

So, the Ch script sees a call to a function, marked extern, which does not reside within its known runtime. How does it find that function? The answer—function files.

A Ch function file, with extension .chf, is a Ch script file that implements a named external function as a script. For simple functions, it's possible to skip the back end entirely and just emulate the functionality with the script. You might choose to do this, for example, for a simple function that looks up an element in a structure. But the main game, which I'll focus on here, is in making a bridge between the real functions implemented in compiled code and the scripting context. The mechanism for this is by using the dlopen(), dlclose(), dlsym(), and dlerror() functions. This might seem obvious after the fact, but it's certainly a break from how other languages interface with compiled code, from a syntactic perspective, at least. Listing 2 shows the implementation of the ORJ_ReadDatabaseA() mapping function, which is defined in the file ORJ_ReadDatabaseA.chf. Ch uses the name of the function file to map the function. The function file must be in the current directory, or be located in the Ch path (which I'll explain shortly).

The implementation of the ORJ_ReadDatabaseA() mapping function shown in Listing 2 looks almost entirely as one might write such a function in C. The syntax and semantics of the standard UNIX dynamic library functions dlopen(), dlsym(), and dlclose() are evident. The difference is in the Ch extension function dlrunfun(). This is a variadic function whose arguments comprise the function to call in the dynamic library "libopenrjch.dl": the return type, the function name, and the arguments to the function. If the function "returns" void, then the return type parameter should be specified as NULL. By specifying the function name, which is in Ch scriptspace, the Ch engine is able to validate the arguments to be passed to the compiled function.

Where Ch extensions differ from those for other languages is that the functions in the dynamically loaded library are not those in the mapped library. Because they are called directly from the scripting environment, they need a standard signature. Listing 3 is the implementation for the ORJ_ReadDatabaseA_chdl() function. (The postfix "_chdl" seems to be the convention for Ch mappings.)

You can plainly see that the scriptspace passes arguments to the dynamic library in a custom variadic argument form. These are easily retrieved on the compiled side via the Ch_VaArg() macro. Once the arguments are retrieved, they're then simply passed to the actual library function. There's really nothing more to it. For the Open-RJ-Ch mapping, I've only mapped five functions—ORJ_ReadDatabaseA(), ORJ_FreeDatabaseA(), ORJ_GetErrorStringA(), ORJ_GetParseErrorStringA(), and ORJ_Record_FindFieldByNameA()—so I've done them all manually. However, for the recls-Ch mapping there are 35 functions, and doing this all manually didn't seem the least bit attractive. Hence, I've defined macros to do the boilerplate, as in:

#define IMPLEMENT_3_void(F, T0, T1, T2)  \
                                     \
    EXPORTCH void F##_chdl(void *arg_)  \
    {                                   \
        va_list     ap;                 \
        T0          arg0_;              \
        T1          arg1_;              \
        T2          arg2_;              \
                                        \
        Ch_VaStart(ap, arg_);           \
                                        \
        arg0_   =   Ch_VaArg(ap, T0);   \
        arg1_   =   Ch_VaArg(ap, T1);   \
        arg2_   =   Ch_VaArg(ap, T2);   \
                                        \
        Ch_VaEnd(ap);                   \
                                        \
        F(arg0_, arg1_, arg2_);         \
    }

In short, the recls-Ch mapping dynamic library file simply contains a set of function-definition macros, such as:

IMPLEMENT_1_void(Recls_CloseDetails, 
  recls_info_t)
IMPLEMENT_2_R(recls_rc_t, 
  Recls_CopyDetails, 
  recls_info_t, recls_info_t*)
IMPLEMENT_2_R(recls_rc_t, 
  Recls_OutstandingDetails, 
  hrecls_t, recls_uint32_t*)
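The same boilerplate-by-macro idea can be tried out in standard C, without Ch. The sketch below swaps Ch_VaStart(), Ch_VaArg(), and Ch_VaEnd() for the standard stdarg facilities, so it illustrates the pattern rather than the Ch API. The names DEMO_IMPLEMENT_2_R, thunk_t, call_thunk(), add_ints(), and bounded_length() are all hypothetical.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Generates F_thunk(), a function with a uniform signature that unpacks
 * two arguments from a va_list, forwards them to F(), and stores the
 * result through the 'result' pointer. */
#define DEMO_IMPLEMENT_2_R(R, F, T0, T1)            \
                                                    \
    void F##_thunk(void *result, va_list ap)        \
    {                                               \
        T0  arg0_   =   va_arg(ap, T0);             \
        T1  arg1_   =   va_arg(ap, T1);             \
                                                    \
        *(R *)result = F(arg0_, arg1_);             \
    }

typedef void (*thunk_t)(void *result, va_list ap);

/* Plays the role of the scripting side: packs the arguments and invokes
 * a thunk without knowing the signature of the function behind it. */
void call_thunk(thunk_t thunk, void *result, ...)
{
    va_list ap;

    assert(thunk != NULL);

    va_start(ap, result);
    thunk(result, ap);
    va_end(ap);
}

/* Two ordinary functions to be exposed through thunks. */
int add_ints(int a, int b)
{
    return a + b;
}

size_t bounded_length(const char *s, size_t max)
{
    size_t n = strlen(s);

    return n < max ? n : max;
}

/* One line of macro per exposed function. */
DEMO_IMPLEMENT_2_R(int, add_ints, int, int)
DEMO_IMPLEMENT_2_R(size_t, bounded_length, const char *, size_t)
```

A call such as call_thunk(add_ints_thunk, &sum, 2, 3) stores 5 in sum. As with any variadic interface, the caller must pass arguments of exactly the types the thunk extracts, for example (size_t)4 rather than a bare 4 for a size_t parameter. This is one reason Ch validates the arguments against the function's declaration in scriptspace.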

Ch comes along with a program c2chf, which can generate the chf and chdl files automatically for most libraries. I chose the manual approach in order to gain a deeper understanding of the mechanisms of Ch.

And that's it—really. Once I'd grasped the basic methodology of extending Ch, the only problems I had with developing the Ch mapping were in setting up the environment. Ch has an internal execution environment, and corresponding environment variables. You must remember to set up the appropriate variables, which is done inside the _chrc file that you obtain from your home directory (for example, F:\Documents and Settings\Matty\) by executing the ch -d command. Inside this file, you need to modify the _path and _ipath variables to see your compiler and your library header files. The _chrc file comes along with commented out examples for the Visual C++ 6.0 and 7.1 compilers, so if you're using either of these, it's a very simple matter to amend the _path variable. For the _ipath variable, you'll need to incorporate the Open-RJ (or recls) include directory. On my system this looks like:

_ipath = stradd(_ipath, 
  "H:/freelibs/openrj/include;");
_ipath = stradd(_ipath, 
  "H:/freelibs/recls/Dev/include;");

Once you've set this up, everything else flows pretty simply, and I encountered no gotchas. The only complaints I would make are that some of the error messages are less than 100-percent obvious and that Ch appears to have some rather serious problems with forward declared structures. If you look inside the latest Open-RJ header file openrj/openrj.h, you see the structures and enumerations defined in the following format:

enum ORJ_TAG_NAME(ORJ_FLAG)
{
    ORJ_FLAG_ORDERFIELDS        = 0x0001  /*!< Arrange fields in alphabetical order */
  , ORJ_FLAG_ELIDEBLANKRECORDS  = 0x0002  /*!< Causes blank records to be ignored */
};
typedef enum ORJ_TAG_NAME(ORJ_FLAG) ORJ_FLAG;

where the ORJ_TAG_NAME(X) macro resolves to X for Ch, rather than the tagX for other compilers. (Note: Since this work was conducted, the chaps at SoftIntegration have fixed this bug, so you shouldn't experience this problem once you've updated to the latest version.) That accounts for one of the changes I had to make to the Open-RJ header file. The other was the incorporation of the following lines into the header:

#ifdef _CH_
# include <chdl.h>
LOAD_CHDL_CODE(openrjch, OpenRJ)
#endif /* _CH_ */

Because Ch extensions work on a function-by-function basis, the dynamic libraries are notionally loaded and unloaded for each function call. That works fine for single functions, and in fact would work for Open-RJ because it allocates a single block of memory in the call to ORJ_ReadDatabaseA(), but in general it is a bad idea. If the client code accesses data or functions that would be destroyed when the dynamic library is unloaded, this would result in a crash. To avoid this, the Ch SDK recommends use of the LOAD_CHDL_CODE() macro, which expands to an inline call to the actual function _dlopen() (as opposed to the function dlopen(), defined in a Ch function file) to load the dynamic library libopenrjch.dl, along with a call to _atexit() to unload the library when the driving script exits. The dynamic library thus remains loaded for the whole of the script's execution, so there's no danger of disappearing code and/or data.
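Returning to the first of the two header changes, the tag-name workaround might be defined along the following lines. This is a guess at the mechanism based on the behavior described above, not a copy of openrj.h; in particular, keying the definition on the _CH_ symbol is an assumption, and elides_blank_records() is a hypothetical helper added only to exercise the enumeration.

```c
#include <assert.h>

/* Assumed definition: under Ch the tag is the plain name; elsewhere it
 * gets a "tag" prefix, keeping the tag and typedef names distinct. */
#ifdef _CH_
# define ORJ_TAG_NAME(x)    x
#else
# define ORJ_TAG_NAME(x)    tag ## x
#endif

enum ORJ_TAG_NAME(ORJ_FLAG)
{
    ORJ_FLAG_ORDERFIELDS        = 0x0001  /* Arrange fields in alphabetical order */
  , ORJ_FLAG_ELIDEBLANKRECORDS  = 0x0002  /* Causes blank records to be ignored */
};
typedef enum ORJ_TAG_NAME(ORJ_FLAG) ORJ_FLAG;

/* Returns non-zero if 'flags' asks for blank records to be elided. */
int elides_blank_records(unsigned flags)
{
    /* Only the two known flag bits are expected in this sketch. */
    assert((flags & ~(unsigned)(ORJ_FLAG_ORDERFIELDS | ORJ_FLAG_ELIDEBLANKRECORDS)) == 0);

    return (flags & ORJ_FLAG_ELIDEBLANKRECORDS) != 0;
}
```

With an ordinary compiler the enumeration is declared as enum tagORJ_FLAG, while client code continues to use the ORJ_FLAG typedef and never needs to know which form of the tag is in effect.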

So how does Open-RJ-Ch work? The answer is surprisingly well. Since the scripts are virtually 100-percent C code, you're in little danger of getting the syntax wrong. And because it's scripting, you tend to adopt the rapid prototyping methodology of development.

I was skeptical that there'd be any point to a scriptable form of C. Since scripting languages abstract a great deal of the low-level aspects of C, at the cost of lower performance, I saw a scriptable form of C as being the worst of both worlds. However, having now used it I can say that it is an enjoyable form of working, as it does, in effect, provide a fast way of prototyping C code. I was surprised by how natural Ch feels, since you are writing in C. One part I especially appreciated was having a preprocessor accessible within a scripting environment, though it's a guilty pleasure, to be sure.

Again, I've also worked on a Ch mapping for recls and will be releasing that soon, along with newer updates to the recls libraries. I expect to release Version 1.6 before the next installment of this column.

At the moment, Greg and I have released (on SourceForge) the main library, and mappings to C++, C++.NET, Ch, D, Ruby, and STL. I plan to discuss the C++.NET mapping in this forum soon, and will cover others when I think they'll be more useful in the exposition of new integration issues than recls.

Further Work

There are several things I want to achieve with these libraries. recls needs some structural reorganization, and I'd like to add Win32 registry and Visual SourceSafe enumeration. Open-RJ needs decent documentation, and there are several languages for which Open-RJ needs to catch up with recls. Both libraries are now well overdue for a Python mapping, and I'd also like to explore some languages that I've not yet played with, such as Tcl and Lisp.

As usual, if you have any specific requests, I'd love to hear from you. Also, I'm going to have to swallow my pride and ask for assistance on class-based Perl extensions: if anyone has some decent example projects that they can share, or point me to, I'm sure I'll be able to apply that work to recls without needing to bombard you with plaintive e-mails.

Acknowledgments

Thanks to Greg Peet and Walter Bright for reviewing this installment, and to the folks from SoftIntegration for helping me with Ch.
