Matthew Wilson is a software-development consultant for Synesis Software, creator of the STLSoft libraries, and author of Imperfect C++ (Addison-Wesley, 2004). He can be contacted at http://imperfectcplusplus.com/.
The last couple of installments of "Positive Integration" have focused on the new Open-RJ library, a simple structured-file reader, and on its mappings, in particular those for Ch and Python. In this installment, I take a look at a few enhancements to the Open-RJ base library that have come about in response to users' experiences, and I dive into the details of the Open-RJ/C++.NET mapping. I also cover a few changes to the recls library, which was featured in the first six installments of this column, and which I return to with some gusto in the next installment. The changes described in this installment are encapsulated in Version 1.2 of Open-RJ (http://openrj.org/) and Version 1.6 of recls (http://recls.org/).
Open-RJ/C++ & Open-RJ/STL Changes
Because this column is a learning experience, I like to reflect on lessons learned from time to time. In its relatively short life, the Open-RJ library has been given an exhaustive workout on various projects, including several of my own and some of my clients'. Most of these have been in C++, so the Open-RJ/C++ and Open-RJ/STL mappings have had a lot of critical examination. The result has been a number of enhancements to these mappings, without necessitating any changes to the base API.
Some of these changes are quite simple: The openrj::cpp::Database class was given a GetPath() method; openrj::cpp::DatabaseException was given a more meaningful what(), implemented in terms of the error=>string functions provided by the API; the STL mapping classes were reimplemented to provide out-of-class method definitions so that the class definitions are now succinct and accessible. Other changes are less trivial: openrj::stl::record has two new methods: count_fields(), which returns how many of its fields have a given name, and get_field_value(), which looks up a field and supplies a default if that field does not exist. Previously, you would have to catch the exception thrown by the subscript operator, or manually search the raw API structures. Other new usability features can be found in two new headers. <openrj/stl/functional.hpp> defines the record_has_field predicate, useful for searching out records:
std::find_if(db.begin(), db.end(), record_has_field("Common"));
or for generating record subsets for a given database, based on the presence of named fields:
    vector<record> channels;
    stlsoft::copy_if( db.begin(), db.end()
                    , std::back_inserter(channels)
                    , record_has_field("Channel"));
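To make the mechanics of such a predicate concrete, here is a minimal, self-contained sketch. It does not use the Open-RJ headers: record_sketch, has_field, and select_records are hypothetical stand-ins invented for illustration, a record is modeled as a plain vector of name/value pairs, and std::copy_if (C++11) takes the place of stlsoft::copy_if.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a record: an ordered list of name/value fields.
typedef std::vector<std::pair<std::string, std::string> > record_sketch;

// Predicate in the spirit of record_has_field: evaluates to true if the
// record contains at least one field with the given name.
class has_field
{
public:
    explicit has_field(std::string const &name)
        : m_name(name)
    {
        assert(!name.empty()); // assumption of this sketch: names are non-empty
    }

    bool operator ()(record_sketch const &rec) const
    {
        for(record_sketch::const_iterator it = rec.begin(); it != rec.end(); ++it)
        {
            if(it->first == m_name)
            {
                return true;
            }
        }
        return false;
    }

private:
    std::string m_name;
};

// Builds the subset of records that have a field called 'name'.
std::vector<record_sketch> select_records(std::vector<record_sketch> const &db,
                                          std::string const &name)
{
    std::vector<record_sketch> subset;
    std::copy_if(db.begin(), db.end(), std::back_inserter(subset), has_field(name));
    return subset;
}
```

Because the predicate carries its field name as state, the same class serves both std::find_if (first match) and the copy_if form (all matches).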
<openrj/stl/utility.hpp> defines an overloaded lookup() function that takes a field name and two records, so that one record can act as a common/global source of fields, whose contents may be "overridden" by the other. One overload throws an exception if neither record contains the field; the second also takes a default value that is used in that case. These have proven especially useful in the more sophisticated uses to which we've been putting Open-RJ in a commercial context.
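The override behavior can be sketched as follows. This is an illustration of the idea only, not the Open-RJ code: fields_sketch (a std::map standing in for a record), lookup_field, and the argument order are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a record: field name -> field value.
typedef std::map<std::string, std::string> fields_sketch;

// Returns a pointer to the named value in 'specific' if present,
// otherwise in 'common', otherwise NULL.
static std::string const *find_in_either(fields_sketch const &specific,
                                         fields_sketch const &common,
                                         std::string const &name)
{
    fields_sketch::const_iterator it = specific.find(name);
    if(it != specific.end())
    {
        return &it->second;
    }
    it = common.find(name);
    return (it != common.end()) ? &it->second : NULL;
}

// Throwing form: fails if neither record has the field.
std::string lookup_field(fields_sketch const &specific, fields_sketch const &common,
                         std::string const &name)
{
    std::string const *value = find_in_either(specific, common, name);
    if(NULL == value)
    {
        throw std::out_of_range("field not found: " + name);
    }
    return *value;
}

// Defaulting form: falls back to 'defaultValue' if neither record has the field.
std::string lookup_field(fields_sketch const &specific, fields_sketch const &common,
                         std::string const &name, std::string const &defaultValue)
{
    std::string const *value = find_in_either(specific, common, name);
    return (NULL != value) ? *value : defaultValue;
}

// Usage: one daemon's settings override the common ones.
void lookup_example()
{
    fields_sketch common, daemon;
    common["Host"] = "localhost";
    common["Port"] = "8080";
    daemon["Port"] = "9090";

    assert(lookup_field(daemon, common, "Port") == "9090");      // overridden
    assert(lookup_field(daemon, common, "Host") == "localhost"); // inherited
    assert(lookup_field(daemon, common, "Mode", "auto") == "auto");
}
```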
One more usability feature worth noting is the provision of forward declaration header files for both the C++ and STL mappings. These two files, <openrj/cpp/openrjfwd.hpp> and <openrj/stl/openrjfwd.hpp>, include forward declarations of their classes within their respective namespaces; for example:
    namespace openrj
    {
    namespace stl
    {
        class field;
        class record;
        class database;
        class database_exception;
    } // namespace stl
    } // namespace openrj
These follow the <iosfwd> example, and have proved a great blessing in reducing the coupling on a large networking infrastructure project in which we've been using Open-RJ/STL.
The last change I want to mention is less a refinement and more a fix to a design blunder. For several mappings, the record type provided subscript access indexed by integer and also by name, as in the C++ mapping Record class:
    class Record
    {
      ...
      Field operator [](size_t index) const;
      Field operator [](char const *name) const;
      ...
In hindsight, this seemingly small issue has proven a real bugbear to usability, in that it has led to some really tedious client code and restricted use in templates, because the value of the returned field had to be explicitly accessed, as in:
    using ::openrj::stl::string_t;
    string_t Id       = record["Id"].value();
    string_t Name     = record["Name"].value();
    string_t Type     = record["Type"].value();
    string_t PropName = record["PropName"].value();
    string_t Width    = record["Width"].value();
Of course, when using integer access, you need to have the whole field returned, because you know neither its name nor its value. But when accessing by name, one already has the name, so the only unknown information is the value. Furthermore, when accessing by name, one is commonly dealing with an expected structure of the data and attempting to express the access of this data as succinctly as possible. Although it's only 40 fewer characters, by omitting the calls to value() in the aforementioned code, we facilitate clarity of expression, which is always a good thing. (It's also a bit more efficient.) With the exception of a single case, having the subscript operator that is indexed by name return the value rather than the whole field has been both straightforward and beneficial in all mapped languages.
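A minimal sketch of the revised design follows. The types field_sketch and record_sketch are hypothetical and are not the Open-RJ classes. One liberty is taken: the name overload accepts std::string rather than char const*, which in this sketch sidesteps the overload ambiguity that a literal 0 index would otherwise cause.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a field: a name/value pair.
struct field_sketch
{
    std::string name;
    std::string value;
};

class record_sketch
{
public:
    void add(std::string const &name, std::string const &value)
    {
        field_sketch f;
        f.name  = name;
        f.value = value;
        m_fields.push_back(f);
    }

    std::size_t size() const
    {
        return m_fields.size();
    }

    // Access by position: the caller knows neither name nor value,
    // so the whole field is returned.
    field_sketch const &operator [](std::size_t index) const
    {
        assert(index < m_fields.size());
        return m_fields[index];
    }

    // Access by name: the caller already has the name, so only the value
    // is returned. Throws std::out_of_range if no such field exists.
    std::string const &operator [](std::string const &name) const
    {
        for(std::size_t i = 0; i != m_fields.size(); ++i)
        {
            if(m_fields[i].name == name)
            {
                return m_fields[i].value;
            }
        }
        throw std::out_of_range("no such field: " + name);
    }

private:
    std::vector<field_sketch> m_fields;
};
```

With this shape, client code reads std::string id = rec["Id"]; with no trailing accessor call, while positional access still yields both halves of the field.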
Open-RJ 1.2: Memory Databases
A few users have suggested that the dependency of the base API on stdio (fopen(), fread(), and so on) is a bit limiting. Indeed, one user suggested that requiring an Open-RJ database to be file based is unnecessarily inflexible. Having recently come across a requirement for an in-memory database (in an auditing GUI for a networking infrastructure project in which all the daemons use Open-RJ configuration files), I had sufficient impetus to address this issue. So the base API has now expanded with the addition of the ORJ_CreateDatabaseFromMemoryA() function, which is identical in arguments to ORJ_ReadDatabaseA(), save that the latter's single database path name parameter is replaced by two parameters: a pointer (char const*) to a character buffer that contains the in-memory database, and the length (size_t) of that buffer. The database format is identical to that of the file-based form, including the use of the line feed ('\n') as a line terminator. I was able to factor out a large part of the functionality of ORJ_ReadDatabaseA() into a common private worker function, which is also called by ORJ_CreateDatabaseFromMemoryA(). Since the two database creation functions produce the same end result, a pointer to an instantiated ORJDatabaseA structure, all other API functions remain the same, and ORJ_FreeDatabaseA() remains the single point of release. (Keen code historians will be able to see from the source that the common factoring is perhaps not all it could be at this point, in that it still leaves a little duplication between the two API functions. The reason I followed the given approach was that it let me do an almost perfectly clean chop of ORJ_ReadDatabaseA(), which meant I had far fewer qualms about introducing bugs. It also let me keep the code blocks, now separated into two functions, in the same order such that my source-code control system can render a readily comprehensible view of what was and what is now, which is a boon in and of itself.)
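The shape of that factoring can be sketched as below. This is not the Open-RJ source. Every name here (database_sketch, parse_buffer, read_file_database, create_memory_database) is hypothetical, and the parser handles only a deliberately simplified format assumed for the example: one "Name: Value" field per line, with a line beginning "%%" closing a record.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::string> field_sketch;
typedef std::vector<field_sketch>           record_sketch;
typedef std::vector<record_sketch>          database_sketch;

static std::string trim(std::string const &s)
{
    std::string::size_type first = s.find_first_not_of(" \t\r");
    if(first == std::string::npos)
    {
        return std::string();
    }
    return s.substr(first, s.find_last_not_of(" \t\r") - first + 1);
}

// Common worker: everything that does not care where the bytes came from.
static database_sketch parse_buffer(char const *buffer, std::size_t len)
{
    assert(NULL != buffer || 0 == len);

    database_sketch db;
    record_sketch   current;
    std::size_t     pos = 0;

    while(pos < len)
    {
        std::size_t eol = pos;
        while(eol < len && buffer[eol] != '\n')
        {
            ++eol;
        }
        std::string line(buffer + pos, eol - pos);
        pos = eol + 1;

        if(line.compare(0, 2, "%%") == 0) // record separator
        {
            if(!current.empty())
            {
                db.push_back(current);
                current.clear();
            }
            continue;
        }
        std::string::size_type colon = line.find(':');
        if(colon != std::string::npos) // blank or malformed lines are skipped
        {
            current.push_back(field_sketch(trim(line.substr(0, colon)),
                                           trim(line.substr(colon + 1))));
        }
    }
    if(!current.empty())
    {
        db.push_back(current);
    }
    return db;
}

// File-based entry point: the only stdio-dependent part.
bool read_file_database(char const *path, database_sketch &db)
{
    std::FILE *f = std::fopen(path, "rb");
    if(NULL == f)
    {
        return false;
    }
    std::string contents;
    char        chunk[4096];
    std::size_t n;
    while((n = std::fread(chunk, 1, sizeof(chunk), f)) > 0)
    {
        contents.append(chunk, n);
    }
    std::fclose(f);
    db = parse_buffer(contents.data(), contents.size());
    return true;
}

// Memory-based entry point: pointer plus length, no stdio involved.
database_sketch create_memory_database(char const *contents, std::size_t len)
{
    return parse_buffer(contents, len);
}
```

Both entry points hand back the same structure, so anything written against the result is indifferent to how the database was created.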
So the base API handled the issues okay, but how about the various supported languages? For plain-C clients, there was no change to existing code, and the same can be said for the Ch mapping, for which I needed only to provide a new function file and add the corresponding export to the shared library source. Object-oriented language mappings were a bit more involved, but they all followed the same general format: Abstract the database class and introduce two new derived concrete classes for file and memory databases.
This only leaves the issue of how to handle backwards compatibility. I chose to name the abstract base DatabaseBase/database_base, from which FileDatabase/file_database and MemoryDatabase/memory_database were derived. In C++ (the C++ and STL mappings), one can declare a protected constructor and have the child classes define static worker functions to translate the child constructor arguments into a single parameter to be passed to the base class constructor, as in Listing 1. In this way, you observe good practice in the use of the member initializer list (see Chapter 2 of Imperfect C++), while ensuring that the maximum amount of functionality for manipulating the database structure pointer is kept in the base class (all of it, in fact). Thank heavens for static worker methods!
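For readers without the listing to hand, the pattern can be sketched in miniature as follows. This is an illustration under stated assumptions, not the Open-RJ code: raw_database_sketch stands in for the C API's database structure, the workers perform no real I/O, and only the class names and the create_database_() worker name are taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the structure returned by the C API.
struct raw_database_sketch
{
    std::string origin; // path, or "<memory>"
    std::string text;   // contents, if created from memory
};

class database_base
{
public:
    virtual ~database_base()
    {
        delete m_database; // single point of release
    }

    std::string const &origin() const { return m_database->origin; }
    std::string const &text() const   { return m_database->text; }

protected:
    // Only derived classes can construct; they pass a ready-made pointer.
    explicit database_base(raw_database_sketch *database)
        : m_database(database)
    {
        assert(NULL != database);
    }

private:
    raw_database_sketch *const m_database;

    database_base(database_base const &);            // not copyable
    database_base &operator =(database_base const &);
};

class file_database
    : public database_base
{
public:
    explicit file_database(char const *path)
        : database_base(create_database_(path))
    {}

private:
    // Static worker: turns the constructor argument into the base's parameter.
    static raw_database_sketch *create_database_(char const *path)
    {
        if(NULL == path || '\0' == *path)
        {
            throw std::invalid_argument("empty path");
        }
        raw_database_sketch *db = new raw_database_sketch();
        db->origin = path; // a real worker would read the file here
        return db;
    }
};

class memory_database
    : public database_base
{
public:
    memory_database(char const *contents, std::size_t len)
        : database_base(create_database_(contents, len))
    {}

private:
    static raw_database_sketch *create_database_(char const *contents, std::size_t len)
    {
        if(NULL == contents)
        {
            throw std::invalid_argument("null contents");
        }
        raw_database_sketch *db = new raw_database_sketch();
        db->origin = "<memory>";
        db->text.assign(contents, len);
        return db;
    }
};
```

Because the workers are static, they can run before any part of the object exists, which is what lets the base be fully initialized from the member initializer list.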
In the case of Ruby, migration was a dream, as most things are in Ruby. It took less than 10 minutes to code, compile, and test, and it worked the first time. Effectively, all that was required was a function to translate from the MemoryDatabase constructor into a call to ORJ_CreateDatabaseFromMemoryA(), and just changing all of the previous Database methods' owner from cDatabase to the new cDatabaseBase. (Have I mentioned before that Ruby is a great language?)
To support extant client code, which needs to use the file database class under its previous name(s) (Database/database), I introduced typedefs/aliases. In Ruby, this was done via rb_define_alias(mOpenRJ, "Database", "FileDatabase"); in C++, this was via a typedef (also included in the forward declaration headers).
In updating the Python mapping for Version 1.2, I confess that I wimped out a little. The existing openrj.open() method was renamed to openrj.open_file(), which was then aliased back to its original, but now obsolete, form. The new openrj.open_memory() method was added. Under the covers, this latter function calls ORJ_CreateDatabaseFromMemoryA(), but then creates the same kind of Python object as openrj.open_file(). Because in Python you create objects by creator methods, rather than any kind of "constructor," this doesn't matter from an OO perspective, but it is slightly ugly in that a database object created via openrj.open_memory() will have a path property, albeit an empty one. I may try to elide it in a future version by intercepting the call request in the __getattr__ method.
Open-RJ/C++.NET
So, on to the C++.NET mapping. As you'd expect from the previous mappings, the object model falls out pretty obviously: Database base class, FileDatabase, MemoryDatabase, Record, Field, and DatabaseException. I've followed my previous strategy for having each of the mapped-language object instances maintain a pointer to the appropriate element within the ORJDatabase structure, as well as any necessary relational links between themselves. For example, a Database instance maintains an ArrayList of Record instances, so that it can provide indexed access to the records. The .netSTL (the STLSoft subproject for C++.NET) helper class ArrayListEnumerator handles all the Enumerator boilerplate, as in:
    IEnumerator *Record::GetEnumerator()
    {
        // Forget about implementing get_Current(), MoveNext() and
        // Reset() yourself
        return new ::dotnetstl::ArrayListEnumerator(m_fields);
    }
Because .NET is garbage collected, rather than reference-counted, there's no impediment to having the "child" classes Record and Field hold back pointers to their "parent" classes Database and Record, respectively, and there's no need for special measures (such as those taken in the Python mapping) to protect against cyclic references. Like D, Python, and Ruby, .NET has the nice feature of properties, and they're used to good effect. The Field class has Name, Value, and Record properties; Record has NumItems and Database; Database has NumLines, NumFields, and NumRecords; see Listing 2. (One of the things I like about .NET is that one can name a property after its type; for instance, Record has a Database property whose type is Database. I'm still debating whether or not this is a guilty pleasure.)
As with the other C++ mappings of Open-RJ, the C++.NET mapping handles the commonality between the file and memory database types by placing the bulk of the behavior in an (abstract) base class, and passing the ORJDatabase pointer to the base class from the derived class constructor, which uses a worker function to keep everything in the initializer list. The create_database_() methods of the FileDatabase and MemoryDatabase classes closely follow those in Listing 1, but there is the added complication of handling the conversion from a .NET string (System::String*) into a C-style string (char const*), which is achieved via use of the .netSTL class c_string_accessor (see my article "Accessing C-String Representations of Strings in Managed C++," Dr. Dobb's Journal, April 2004), as in Listing 3.
Abstract Musings
Since the Database class is abstract, my first instinct was to make it abstract in a C++ sense by applying a = 0 to the destructor. However, this gives the interesting error "error C3634: 'void OpenRJ::Database::Finalize(void)' : cannot define a pure virtual method of a managed class". This is oddly misleading, since a little experimentation shows that you can declare another pure virtual function (for example, virtual void f() = 0) and implement it in the derived classes to get precisely the required behavior of all three database classes. Though this is moot for our purposes, because C++.NET provides the __abstract class qualifier to enforce the abstract nature of the base class, it's worth bearing in mind should you be brave enough to have classes that you want to be conditionally compiled in managed and unmanaged form.
Before I remembered to enforce abstractedness, the code was still safe because the Database constructor is defined protected. The Record and Field classes' constructors are declared private public (see Listing 2), which gives them .NET "Assembly or Family" accessibility, meaning that derived types or types within the same assembly can access them. Thus, the Database class is able to instantiate Records, and the Record class Fields, while preventing any abuse of these types by other code, which could easily pass null pointers and cause nasty access violations.
Heterogeneous Indexers in C++.NET
Earlier, I said that all but one language had responded to the challenge of heterogeneous return types from their subscript operators (also known as Indexers in .NET). Well, C++.NET is the recalcitrant. Note from Listing 2 the presence of the #ifdef INDEXER_RETURNS_STRING conditional compilation around the string argument overload of the get_Item indexer. When defined, this causes the string indexer to return the value of the field rather than a Field reference. (In either case, an exception is thrown if the field does not exist.) Unfortunately, with the C++.NET mapping, this causes problems in client code. Specifically, when compiling C# clients, the compiler complains: OpenRJTest.cs(39,24): error CS1546: Property, indexer, or event '$Item$' is not supported by the language; try directly calling accessor method 'OpenRJ.Record.get_Item(string)'.
Yet again, the error message is somewhat dissembling. C# supports indexers. C++.NET supports indexers. And when the return type is the same, overloaded indexers are also supported. Clearly, the difference in the return types between the string and integer indexed subscript operators causes the C# compiler to see ambiguities between the overloads of the indexer in the C++.NET component. This is so even with a Whidbey Technology Preview release of Visual Studio (csc 8.00.30730.4, .NET Framework 1.2.30703). (I didn't try it with VB.NET, so please forgive me my morbid shunning of my least favorite language, but I did ascertain that a C# class with heterogeneous indexers does not precipitate this compilation error.)
After some poking around, I discovered that it wasn't necessarily that the int-indexed form was more acceptable than the string-indexed form. Rather, this was how it appeared simply because I'd declared the Field *get_Item(int index) overload before String *get_Item(String *fieldName) (see Listing 2). If their declaration order is reversed, then the int-indexing form in C# client code is rejected, but again, only when the return types are heterogeneous. It's as if the application of the DefaultMemberAttribute("Item") attribute (which is what informs the compiler that the get_Item() method(s) should be treated as an indexer for the class) binds to the first get_Item it finds, and that only overloads with identical return types are included in this attribute's effects.
The upshot of all this confoundedness is that I've had to accept that both overloads should return Field references, and the consequent inconsistency with other mappings. (It's nothing but speculation, but it occurred to me that one might be able to write a custom indexer attribute to handle this. I ran out of time before I could find out, so I'd be happy to hear from any C++.NET gurus on the issue.)
Open-RJ/D
One or two readers have asked about a D mapping for Open-RJ. In fact, there has been such a mapping included since Version 1.0. (Indeed, the structures of the Open-RJ API were specifically designed to be directly compatible with D.) However, I have not updated the D mapping for Open-RJ 1.2. In part, this is because Walter Bright, the author of D, suggested that I should write a 100 percent D implementation of Open-RJ for the D Standard Library, and proved the worth of his suggestion by writing a largely feature-complete implementation in about an hour, and in a single page of D code. Naturally, such panache cannot be pooh-poohed, so I shall be giving serious weight to a 100 percent D Open-RJ module in the near future. Watch this space.
recls Changes
Before I end this installment, I want to point out a few changes to the recls project that have happened in the last few months. These include some minor directory restructuring and changes to the way libraries are named; for example, recls_lib_vc6.lib => recls.vc6.lib / recls.vc6.mt.lib. (I'm still looking for the perfect library naming scheme, so if anyone's got any deep wisdom on this issue, I'd love to hear it.)
Changes to mappings include proper handling of UTF-8 in recls/D, better to_s attribute methods and constants in recls/Ruby, and use of string Access Shims (see Imperfect C++) in recls/STL so that searches can be instantiated from any convertible type, not just char/wchar_t const *.
Also note that the recls license has changed, along with STLSoft, to be the popular BSD license, in common with Open-RJ. Finally, the size issues that have held up its later versions going into the D Standard Library will be addressed shortly, so expect a lot of reduction in code and object size as a result of the imminent refactoring. All of these changes will be available in Version 1.5.3 or 1.6.1 by the time you read this.
Next Time
The changes to recls mentioned here are a mere wisp compared to what's due for recls over the next few months. I'm intending to do a major rewrite, to incorporate all current functionality, but also to include the following:
- Optional breadth-first search.
- Date/time filtering, before and/or after.
- Attributes filtering, must-have and/or must-not-have.
- Size filtering, larger and/or smaller.
- Type filtering.
- Currently searched directory progress callback.
- D-compatible string structure; that is, len+ptr (in that order).
- Customizable/plug-in regular expression path matching.
- Optional architecture-independent/agnostic interface.
- Additional search arenas: SourceSafe/CVS repositories, Windows registry, and so on.
These changes, along with new mappings, such as Ch and Python, will keep me busy for the next few installments. Once that's done, I hope to look at embedding interpreters.
Acknowledgments
Thanks to Bjorn Karlsson, Garth Lancaster, Greg Peet, and Walter Bright for reviewing this installment.