
Open-RJ and Python

January, 2005: Open-RJ and Python

Matthew Wilson is a software development consultant for Synesis Software, creator of the STLSoft libraries, and author of Imperfect C++ (Addison-Wesley, 2004). He can be contacted at

In the previous installment of "Positive Integration," I introduced the Open-RJ library and presented a mapping to the Ch scriptable C interpreter. Open-RJ is an open-source library that implements readers of the Record-Jar structured text file format, in which the contents of a database file are interpreted as a sequence of records—each of which consists of zero or more fields, each of which is a name-value pair. This month, I look at mapping Open-RJ to Python. Yes, I previously said I'd be talking about Ch and C++.NET. Alas, I had to defer C++.NET mapping for now. The developments described here are encapsulated in Open-RJ 1.1.


Python is one of the preeminent scripting languages of our time [2]. The Python mapping of Open-RJ generally follows the Ruby mapping (see "Positive Integration," CUJ, July 2004), in having an object model composed of Database, Record, and Field classes. However, mapping C libraries to Python is a somewhat more involved and verbose task than it is in Ruby, although it's still eminently manageable.

Entry Point and Module Definition

Before diving into the object model and the complexities of class definitions, it's worth examining how Python extension modules are defined. To use a module from within Python, you must import it:

import openrj

which makes the openrj module available to the program, so that its names are used in qualified form (through the module name), or:

from openrj import open

which imports the open() function from the openrj module into the program. Whichever mechanism you elect to use for importing symbols, the Python interpreter translates the import into a search for the module and access of its symbols. There are two ways in which this can be done for Python extensions. One way is to write extensions as dynamic libraries (UNIX shared objects or Windows DLLs). That's the approach I take in this article. The other way is to compile your libraries into the Python runtime.

When you bundle extension code into a dynamic library, how does Python know which library to load and what entry point to use? In fact it's straightforward. If the module to import is openrj, Python looks for a library called "" (on UNIX) or "openrj.dll" (on Windows). It then attempts to access the entry point initopenrj(), which it assumes is declared to take no arguments, and have void return type. For Windows, you'll want to have a .DEF file, which includes initopenrj in its EXPORTS section.
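The naming rules just described are those of the Python 2 interpreters of the time. As a hedged aside for readers on a current interpreter: CPython 3 looks for an entry point named PyInit_openrj rather than initopenrj, and it accepts a platform-specific list of filename suffixes. The sketch below (the helper name candidate_filenames is hypothetical) lists the filenames the running interpreter would consider for an extension module:

```python
import importlib.machinery


def candidate_filenames(module_name):
    """Return the extension-module filenames this interpreter would accept."""
    return [module_name + suffix
            for suffix in importlib.machinery.EXTENSION_SUFFIXES]


# Print the candidates for a module called "openrj" on this platform.
for filename in candidate_filenames("openrj"):
    print(filename)
```

The exact list varies by platform and interpreter build, so no particular output is shown here.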

For the Open-RJ/Python mapping, the entry point function is defined as in Listing 1. The first thing to do is create the module object, via the Py_InitModule() function, passing in the name of the module and the openrj_methods variable. If the call to Py_InitModule() succeeds, then an exception object, called "openrj.error," is created and added to the module's dictionary under the name "error." (DatabaseExceptionObject is declared as a nonlocal static object so that functions within the Open-RJ/Python mapping can access it directly.) The two constants, ORDER_FIELDS and ELIDE_BLANK_RECORDS, are also created (as Python integers) and added to the module's dictionary. If any of these operations has failed, then you register a fatal error; otherwise, the function returns and the module is ready for action.
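To make the shape of that entry point concrete, here is a rough pure-Python analogue of the same steps: create a module object, then add an exception class and two integer constants to its dictionary. It is only a sketch. The module name openrj_sketch and the helper build_module are hypothetical, and the flag values are assumed for illustration rather than taken from the Open-RJ headers.

```python
import types


def build_module():
    """Build a module object by hand, mirroring the steps of the entry point."""
    module = types.ModuleType("openrj_sketch")
    # The exception class, stored in the module's dictionary as "error".
    module.error = type("error", (Exception,),
                        {"__module__": "openrj_sketch"})
    # Flag values are assumptions made for this sketch only.
    module.ORDER_FIELDS = 0x0001
    module.ELIDE_BLANK_RECORDS = 0x0002
    return module


openrj_sketch = build_module()
```

Client code can then catch openrj_sketch.error and combine the two constants with the | operator, which is how the real module's names are meant to be used.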

openrj_methods is a method table containing only a single module function, openrj_open(). Thus, openrj_open() is the effective entry point of the library. The blank record terminates the table; this is a common theme in Python's extensions architecture.

static PyMethodDef openrj_methods[] =
{
    { "open", openrj_open, METH_VARARGS
      , "Opens an Open-RJ database file, "
        "and returns a corresponding "
        "Database instance." }
  , { NULL, NULL, 0, NULL } /* terminating record */
};

The four parts of the PyMethodDef structure are the method name, the C function pointer, flags, and a documentation string. The name is "open," which means that Python client code refers to it as open() (or, when the module itself is imported, through the module name). The flags are METH_VARARGS, which indicates that the C function takes two PyObject pointers, corresponding to self (the instance for class methods, usually NULL for modules) and to the arguments in the form of a tuple; tuples in Python are immutable sequences of objects.

Opening the Database

That takes care of the module/entry-point loading infrastructure, and exposing the module functions—actually just one function for Open-RJ—to Python. Now I look at openrj_open(); see Listing 2. This is standard fare for followers of the Open-RJ or recls libraries and their mappings. The Python-related aspects are the use of PyArg_ParseTuple() to parse the database path and the flags arguments passed to the method, and the call to PyErr_SetString() used to set an error indicator to the Python runtime in the event that the call to ORJ_ReadDatabaseA() has failed. Client Python code might look something like this:

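# The flags argument shown is illustrative; either module constant may be passed.
db = openrj.open("../../samples/pets/pets.orj", openrj.ELIDE_BLANK_RECORDS)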
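The division of labor inside openrj_open() (unpack the argument tuple, attempt the read, report failure through the module's exception) can be mimicked in pure Python. This is a sketch under stated assumptions, not the code of Listing 2: OpenRJError, parse_open_args, and open_sketch are hypothetical names, both arguments are assumed to be required, and reading a text file stands in for the call into the Open-RJ C API.

```python
class OpenRJError(Exception):
    """Stand-in for the module's "error" exception (hypothetical name)."""


def parse_open_args(args):
    """Unpack (path, flags) from an argument tuple, as the C function must."""
    if not isinstance(args, tuple) or len(args) != 2:
        raise TypeError("open() takes exactly two arguments: path and flags")
    path, flags = args
    if not isinstance(path, str) or not isinstance(flags, int):
        raise TypeError("open() expects a string path and integer flags")
    return path, flags


def open_sketch(*args):
    """Parse the arguments, then read the file or raise the module's error."""
    path, flags = parse_open_args(args)
    try:
        with open(path, "r") as source:
            return source.read(), flags
    except OSError as exc:
        # Plays the role of setting the error indicator and returning NULL.
        raise OpenRJError("failed to open %s: %s" % (path, exc)) from exc
```

Malformed calls fail with TypeError before any file access is attempted, while a well-formed call on an unreadable file fails with the module-specific exception.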

In the event that the database is opened successfully, the openrj_Database_alloc() function (see Listing 3) is called to create the Database instance. The function first creates a Database object by calling PyObject_New() and passing the type of the created instance, openrj_Database, and the type object, openrj_Database_Type. If that succeeds, then it allocates the array of records.

The Database type, openrj_Database, is defined as shown in Listing 4. In common with all Python types, it begins with PyObject_HEAD, which gives it a binary layout compatible with the generic PyObject type. If you were implementing the extension in C++, you might instead derive your types from PyObject. The remaining fields in the structure are used to represent the members of the Database type. The database member provides access to the underlying database structure, and the path member keeps a record of the name of the Record-Jar file. dbh is a pointer to an ORJDatabase_holder structure (Listing 5) that wraps a pointer to the underlying database structure.

The holder is used because Python reference counts its objects, which means that the Database instance could be destroyed while some of its Record or Field instances are still alive. Because the underlying structure is shared in this way, rather than always being destroyed by the Database instance (via ORJ_FreeDatabaseA()), there are no problems with vanishing pointers. The remaining member, records, is used to hold an array of Python instances (in the form of the openrj_Record type) so that the Python sequence protocol, whereby access is provided by index, can be supported.
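The holder idea can be illustrated with a small Python model. This is a sketch only: the class DatabaseHolder and its add_ref() and release() methods are hypothetical stand-ins for the ORJDatabase_holder structure of Listing 5, and a plain callback stands in for ORJ_FreeDatabaseA().

```python
class DatabaseHolder:
    """Shares one underlying resource among several owners (hypothetical)."""

    def __init__(self, resource, free):
        self.resource = resource
        self._free = free
        self._count = 1              # the creator holds the first reference

    def add_ref(self):
        self._count += 1
        return self

    def release(self):
        self._count -= 1
        if self._count == 0:         # last owner gone: free the resource now
            self._free(self.resource)
            self.resource = None


freed = []
database_ref = DatabaseHolder("pets-database", freed.append)  # the Database
record_ref = database_ref.add_ref()                           # a Record

database_ref.release()               # the Database is destroyed first...
still_alive = (freed == [])          # ...but the resource survives
record_ref.release()                 # the last Record goes; now it is freed
```

The resource is released exactly once, by whichever owner happens to let go last, which is the property the mapping relies on.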

I chose to implement the path member as a C string (created by strdup() and released by free()) rather than as a Python string because I find it more straightforward to have a C string to play with in the openrj_Database_print() method. This could be wasteful if several accesses were made to the database's path() attribute. Still, I opted for convenience. You may choose to do it differently.

To understand the use of the Database type, look at Python's object definition mechanism. Listing 4 shows how the Database type is composed. As well as the openrj_Database type you've already met, it shows declarations for the 12 functions. Unlike Ruby, in which all custom functions are defined and built into types in the same way, Python splits up a type's functions into those that correspond to the standard operations of objects, those that correspond to one or more protocols, and type-specific/custom methods. For the Database type, we can see this delineation in the composition of the three structures/tables in Listing 4. Standard object operations go into the PyTypeObject structure openrj_Database_Type, including openrj_Database_dealloc() (the "destructor"), openrj_Database_print() (used to provide a human-readable representation of the Database instance), and openrj_Database_getattr().

Since the Database type supports the Python Sequence Protocol, I also provide the sequence methods openrj_Database_length() ("len(db)"), openrj_Database_item() ("db[0]" or "x in db"), and openrj_Database_slice() ("db[1:2]"). As you can see from Listing 4, these are referenced in the definition of the openrj_Database_as_sequence table, which is itself then referenced in the initializer of openrj_Database_Type. In this way, you declare the Database type to support the sequence protocol, and simultaneously specify which of the sequence protocol operations it supports. How this looks to client code is that you can ask the length of the sequence:

l = len(db)

or step through all its records:

for r in db:
  print r

or take a slice of some of the records:

someRecords = db[1:len(db)]
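For comparison, the same three operations can be supported by a pure-Python class through __len__() and __getitem__(). The following is a minimal sketch, written for a current Python 3 interpreter rather than the Python 2 of the snippets above; the class name RecordSequence and its contents are hypothetical.

```python
class RecordSequence:
    """A read-only sequence supporting len(), indexing, and slicing."""

    def __init__(self, records):
        self._records = list(records)

    def __len__(self):
        return len(self._records)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return self._records[index]          # db[1:2]
        if not isinstance(index, int):
            raise TypeError("index must be an integer or a slice")
        if index < 0:
            index += len(self._records)          # db[-1] counts from the end
        if not 0 <= index < len(self._records):
            raise IndexError("index out-of-bounds")
        return self._records[index]              # db[0]


db = RecordSequence(["rec0", "rec1", "rec2"])
length = len(db)
all_records = [r for r in db]      # iteration uses __getitem__ until IndexError
some_records = db[1:len(db)]
found = "rec1" in db               # membership also falls back to iteration
```

Because no __iter__() is defined, the for loop and the in operator both work by calling __getitem__() with 0, 1, 2, and so on until IndexError is raised, which is why the out-of-bounds check matters.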

The final set of functions are the custom functions, which correspond to the Database class's path(), records(), numRecords(), numFields(), and numLines() methods. You may be wondering how the Python runtime knows how to look up these custom functions. If you are familiar with implementing classes in Python, you know that the special method __getattr__() is called when an attribute has not been found in the normal lookup scheme. The openrj_Database_getattr() function plays that role for the Database type. It has a pretty standard implementation:

static PyObject *
  openrj_Database_getattr(openrj_Database *self
            , char const      *name)
{
  /* Look the name up in the table of custom methods */
  return Py_FindMethod( openrj_Database_methods
             , (PyObject*)self, (char*)name);
}

In other words, you call the Python extension API function Py_FindMethod(), which looks up a method from the given method table. This is where the openrj_Database_methods table comes into play.
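A pure-Python analogue of this table-driven lookup may help. In the sketch below, a dictionary plays the part of the method table and __getattr__() plays the part of openrj_Database_getattr(). The names DATABASE_METHODS and DatabaseSketch are hypothetical, and only two of the custom methods are modeled.

```python
import types


def _database_path(self):
    return self._path


def _database_num_records(self):
    return len(self._records)


# Plays the role of the method table: names mapped to functions.
DATABASE_METHODS = {
    "path": _database_path,
    "numRecords": _database_num_records,
}


class DatabaseSketch:
    def __init__(self, path, records):
        self._path = path
        self._records = records

    def __getattr__(self, name):
        # Reached only when normal attribute lookup has failed.
        try:
            function = DATABASE_METHODS[name]
        except KeyError:
            raise AttributeError(name) from None
        return types.MethodType(function, self)   # bind the function to self


sketch = DatabaseSketch("pets.orj", ["rec0", "rec1"])
```

Calling sketch.path() or sketch.numRecords() goes through the table, while any name missing from it produces the usual AttributeError.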

Record and Field Classes

The Record type structure contains a database holder, a pointer to the Open-RJ API record to which it corresponds, and an array of field instances. It has the same set of standard and sequence methods as the Database type, and does not provide any custom methods.

typedef struct
{
  /* Header */
  PyObject_HEAD
  /* Record specific information */
  ORJDatabase_holder  *dbh;
  ORJRecord const     *record;
  openrj_Field        **fields;
} openrj_Record;

The Field type structure contains a pointer to the Open-RJ API field to which it corresponds, and a database holder pointer. The Field class does not support the sequence protocol, so all it defines are the same standard functions as the Database and Record types (for example, _alloc(), _dealloc(), _print(), _getattr(), and _compare()). It also defines the two custom methods, name() and value(), as implemented by the openrj_Field_name() and openrj_Field_value() functions:

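static PyObject 
  *openrj_Field_name(openrj_Field *self)
{
  /* name.ptr is assumed to be the pointer that pairs with name.len */
  return Py_BuildValue("s#", self->field->name.ptr
                   , self->field->name.len);
}
static PyObject 
  *openrj_Field_value(openrj_Field *self)
{
  return Py_BuildValue("s#", self->field->value.ptr
                  , self->field->value.len);
}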

Reference Counting

Again, Python objects are reference counted. As is the case with any reference-counting mechanism, when it works it's great—but to make it work is a nontrivial effort. As you'd expect, each time you construct a Python object via PyObject_New(), the returned object has an initial reference count of 1. Further, many functions will increase the reference count on a Python object that is passed to them. However, this is not universal, so you need to check the documentation of the functions you use; PyList_SetItem() and PyTuple_SetItem() do not increase the reference count, and are said to "steal" a reference. Furthermore, when you return pointers to Python objects that you are holding, rather than ones you've just created (as with openrj_Field_value(), for example), you need to ensure that you increase the reference count before you return them; otherwise, you'll find out the hard way sometime later that the object you thought you owned no longer exists. Manual reference counting is effected by the Python functions (well, macros, actually) Py_INCREF() and Py_DECREF(), as in the implementation of openrj_Record_item() (called by the subscript and in operations):

static PyObject *openrj_Record_item(openrj_Record *self, int index)
{
  if( index < 0 ||
      index >= (int)self->record->numFields)
  {
    PyErr_SetString(PyExc_IndexError, "index out-of-bounds");
    return NULL;
  }
  else
  {
    openrj_Field *field = self->fields[index];
    Py_INCREF(field); /* the caller receives its own reference */
    return (PyObject*)field;
  }
}

Also worth noting are the standard Py_XINCREF() and Py_XDECREF(), which do the same thing as their X-less brethren, but are benignly inert when passed NULL pointers. I've not used them in the implementation, as the only two places where they'd be appropriate are better served by a more (maintainer-resistant) explicit if() statement, but they're a well-used facility in general in Python extensions.
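The effect of taking and releasing references can be observed from Python itself through sys.getrefcount(). The following sketch assumes a standard CPython interpreter, since reference counts are an implementation detail and other implementations may report different values (or none that are meaningful).

```python
import sys

field = object()                      # a fresh object with one owner: "field"

before = sys.getrefcount(field)       # includes the temporary argument reference
container = [field]                   # the list takes a reference of its own
during = sys.getrefcount(field)
del container                         # destroying the list releases it again
after = sys.getrefcount(field)

print(during - before, after - before)
```

Only the differences are meaningful here; the absolute numbers include the temporary reference held by the call to sys.getrefcount() itself.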

A Better Approach to the Object Model

The current implementation of the Open-RJ/Python mapping is such that each instance of the Database, Record, and Field types holds onto the underlying structures in the Open-RJ library. This is in common with other mappings of the Open-RJ library that I've done so far. However, Python requires that sequence objects be subscriptable by integer, as opposed to, say, Ruby's unindexed each{} construct, or STL's Forward Iterator concept. Hence, the Database and Record types in the Open-RJ/Python mapping also maintain arrays of Record and Field types. What this means is that the Database holds onto the underlying database structure for its numFields, numRecords, and similar attributes, but it doesn't use it for the actual records, and there's an analogous situation with Record instances and their fields.

Naturally, this is not exactly optimal (it's not even good design!), but I left it in the current implementation because I believe that the reference-counted database-holder mechanism and the arrays of Python types are worthy of discussion and study, even though in this case it's not proven to be suitable. This is, after all, a learning exercise (for you and for me). So, although it appears in Version 1.1 in this form, expect it to be trimmed down markedly in a subsequent release.
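One possible trimmed-down shape, offered purely as an illustration and not as the design of any later release, is to drop the stored arrays and create the wrapper objects on demand from the underlying structure. In the Python sketch below, LazyDatabase and LazyRecord are hypothetical names, and a list of lists of (name, value) pairs stands in for the Open-RJ database structure.

```python
class LazyRecord:
    """Wraps one record of the underlying data without copying it."""

    def __init__(self, raw_records, index):
        self._raw_records = raw_records   # shared, not duplicated
        self._index = index

    def fields(self):
        return self._raw_records[self._index]


class LazyDatabase:
    """Holds only the underlying data; wrappers are created when asked for."""

    def __init__(self, raw_records):
        self._raw_records = raw_records

    def __len__(self):
        return len(self._raw_records)

    def __getitem__(self, index):
        if not 0 <= index < len(self._raw_records):
            raise IndexError("index out-of-bounds")
        return LazyRecord(self._raw_records, index)


raw = [[("Name", "Barney")], [("Name", "Elsa"), ("Species", "Dog")]]
lazy_db = LazyDatabase(raw)
field_counts = [len(record.fields()) for record in lazy_db]
```

The trade-off is that two lookups of the same index yield two distinct wrapper objects, so any code relying on object identity would need to compare the underlying record instead.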

Further Work

Naturally, with two libraries and lots of interesting languages to choose from, there's a lot of scope for further coverage of language mappings. Furthermore, with several extant language mappings for each, there are many things to keep up to date; each time I update one of the libraries, the advances have to percolate out to several language mappings. Although Open-RJ is simple in scope and implementation, the recls library has a long way to go. I still want to incorporate recursive searching of the Win32 registry and Visual SourceSafe (and maybe other source-control systems), and I also need to provide FTP searching on UNIX.

Over the next two or three installments, I hope to cover the Open-RJ/C++.NET mapping, the recls/Ch and recls/Python mappings, and the enhancement of recls to other types of searching. Furthermore, Walter Bright (author of the D language) pointed out that in the recent update to recls, to include FTP searching, the code and object size doubled. Hence, there's a need for some serious refactoring.

Once I've tidied things up, and updated the mappings with the various wisdoms gained over recent times, I hope to get into some new languages. Dylan, Heron, Objective-C, Perl, Sather, and Tk all seem worth some investigation, but please feel free to write in if you've other interesting languages that you'd like to see discussed. And I'm still looking for some Perl extension gurus to contact me.

If you're wondering about the other side of language integration—embedding—I can tell you I certainly plan to cover it, but it'll likely not be until later in the year.


Thanks to Bjorn Karlsson, Garth Lancaster, Greg Peet, and Walter Bright for reviewing this installment. Thanks also to Walter for his patient efforts in integrating new recls releases into the D Standard Library. It's not always quite as easy as I'd like.


[1] Open-RJ ( and

[2] Python, or Python Essential Reference, Second Edition, by David Beazley, New Riders, 2001. For information on writing Python extensions, see
