Extending Python


Simplifying project maintenance

By Greg Smith

Greg is a senior DSP engineer at Silicon Optix. He uses Python for prototyping and verification of ASIC circuitry. He can be contacted at gsmith@siliconoptix.com.


One of Python's greatest strengths is that you can extend it with modules written in C or C++. By creating extensions, you can make existing C facilities available to Python programs; for example, the data compression facilities of the standard zlib library are available to Python programmers by means of the zlib module, which is simply a Python extension acting as a wrapper around the zlib API.

The process of creating such a module is well documented in the "Extending and Embedding" section of the Python manual (http://www.python.org/doc/current/ext/ext.html). It is also possible to implement new Python data types in C/C++ extensions. In this article, I only address C, but the concepts are equally applicable to C++.

There are three main reasons to write Python extension modules (as opposed to, say, solving the problem entirely in Python):

  • You need the higher execution speed of C.
  • You have an existing C module that you want to make available for Python applications.
  • The C module may be your primary project, and you want to give it a Python API, allowing the testing and verification framework to be in Python. I've found myself in this situation fairly often. For example, when debugging and verifying a C module, I don't want to spend a lot of time writing and debugging a test framework. By quickly implementing a Python wrapper, I can move on to writing complex test frameworks in Python with much less overall effort than would be required for an "all-C" approach.

When writing a Python-to-C adapter, most of the work usually involves converting Python data (the parameters to functions) into the basic data types required by the C function, then converting the return value, if any, to a Python object. The Python/C API provides many handy functions to help with these conversions. For instance, the PyArg_ParseTuple function (which is rather like scanf, but reading Python data types instead of strings) is useful for converting Python parameters into C data, while simultaneously checking the types.

However, there are many cases where the data conversion you'd really like requires a fair bit of code.

A Simple Example

Listing One (stats.c) is a simple C module that implements a single function, find_variance, which computes the mean and variance of an array of floats (the stats.h file contains just a function prototype and is not shown). Listing Two is statsmodule.c, a Python extension module based on stats.c.
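To experiment with the arithmetic without building anything, the computation in Listing One can be modeled in a few lines of Python. This is only a sketch, written in Python 3 syntax; the name find_variance_model is hypothetical, and it assumes that storing the inputs in an array.array with typecode 'f' rounds them the same way a C float does.

```python
from array import array

def find_variance_model(values):
    """Pure-Python model of find_variance in stats.c (a sketch, not the C code)."""
    data = array("f", values)      # round each input to single precision
    npts = len(data)
    if npts < 1:
        raise ValueError("need at least one value")
    total = 0.0
    total_sq = 0.0
    for x in data:                 # accumulate in double precision, as the C loop does
        total += x
        total_sq += x * x
    mean = total / npts
    return (total_sq - mean * total) / npts, mean

var, mean = find_variance_model([1.0, 1.5, 1.8, 2.12])
```

For this input, the mean comes out slightly below 1.605 rather than exactly 1.605, because 1.8 and 2.12 are rounded to single precision on the way in.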

Since the find_variance function accepts a variable-sized input array, its Python parameter list cannot be directly handled by PyArg_ParseTuple. The interface function has been written to accept a list of floats; in fact, to be more "Pythonic," it accepts any sequence of objects that can be converted to floats.

Refer to the function stats_find_variance in Listing Two. To perform the conversion, the C code must:

  • Verify that there is only one parameter, and that it is a sequence.
  • Obtain the length of the sequence and verify that it is nonzero (since the C function to be called requires that).
  • Allocate an array of floats to hold the data.
  • Retrieve each element of the sequence, and if it is not a Python float, convert it to a float while checking for an error in the conversion; then, store the result in the array.

If all of these operations succeed, the code calls the find_variance function in the stats.c module, and then converts the results (two doubles) back into Python floats. In addition, the code must carefully maintain the reference counts on Python objects to avoid memory leaks or similar problems.
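The conversion steps listed above can be sketched in Python, which shows how much the C code in Listing Two has to do by hand. This is an approximation in Python 3 syntax, not a translation of the listing: the helper name to_float_array is hypothetical, and calling len() is a looser test than PySequence_Check.

```python
from array import array

def to_float_array(arg):
    """Mimic the checks made by stats_find_variance (a sketch only)."""
    # Steps 1 and 2: the argument must have a length, and it must be nonzero.
    try:
        n = len(arg)
    except TypeError:
        raise TypeError("find_variance accepts a list of floats")
    if n < 1:
        raise ValueError("find_variance: empty list")
    # Step 3: allocate room for n machine floats.
    buf = array("f", [0.0]) * n
    # Step 4: fetch each element, converting anything that is not a float.
    for i in range(n):
        item = arg[i]
        if not isinstance(item, float):
            item = float(item)     # may raise, just as PyNumber_Float may fail
        buf[i] = item
    return buf
```

Reference counting has no counterpart here; in the C version it accounts for a good share of the code.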

As you can see, the code to perform this conversion amounts to more than half of the program. In fact, I wrote most of the rest of this module by simply pasting the examples from the Python documentation, then changing "spam" into "stats." In addition to the find_variance function, the module contains a table of pointers to functions, which is used by the Python internals to find the entry points, and an initialization routine, initstats, which makes this table available to Python.

Usually, I want to finish writing the C/Python interface quickly and move on to something else. This means avoiding the kind of code that can be seen in Listing Two, which is fairly involved. It needs to check for several error conditions, even for a simple case such as this.

Using a Python Wrapper Layer

The solution is to do the work in Python. Since the underlying C function needs a pointer to an array of floats, why not require the Python code to supply a contiguous chunk of memory containing that array? The data could be contained in a Python string, for instance. There are several ways to do this, but it is awkward to make callers convert a set of floating-point numbers to a binary string format just to obtain statistics on them.

To fix that problem, add a layer of Python to do the work. Listing Three is statsimodule.c, a new extension module that is the same as statsmodule.c, except that the find_variance function accepts a string instead of a sequence of floats. The function is simpler—it can now use PyArg_ParseTuple to obtain the address and length of the supplied string data (while checking that the parameter is of suitable type). It then performs a simple length check and calls the find_variance function, passing it a pointer to the data supplied by the caller.
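The kind of data the simplified function expects can be inspected from Python. The sketch below (Python 3 syntax) packs four values into an array.array with typecode 'f', a type that exposes its memory through the buffer protocol, and applies the same length check that Listing Three performs in C. The variable names are illustrative only.

```python
import struct
from array import array

data = array("f", [1.0, 1.5, 1.8, 2.12])
view = memoryview(data)            # possible because array supports the buffer protocol

float_size = struct.calcsize("f")  # sizeof(float) on this machine
n_bytes = view.nbytes
n_floats = n_bytes // float_size

# The same test as in statsi_find_variance: at least one float,
# and a byte length that is an exact multiple of sizeof(float).
length_ok = n_floats >= 1 and n_floats * float_size == n_bytes

# Unpacking the raw bytes recovers the stored single-precision values.
unpacked = struct.unpack("%df" % n_floats, view.tobytes())
```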

The Python wrapper code, stats2.py (Listing Four), contains a function find_variance that does the conversion and passes the result to the built-in function in the statsi module (a Python function implemented in C is referred to as "built-in," even when it is in an extension module). To put the parameter into the required form, it creates an instance of the Python type array.array with typecode 'f'; that operation packs the data internally into machine floats, exactly as desired. It would then be possible to convert the array to a string, but there is no need: I pass the array directly to the built-in find_variance function. This works because array.array implements the buffer interface; it is based on a memory buffer, and can identify the location and length of that buffer via the standard buffer protocol.

So, both of these code snippets give the same result:

>>> import stats
>>> print stats.find_variance( [ 1.0, 1.5, 1.8, 2.12 ] )
(0.17007496588230353, 1.6049999594688416)

>>> import stats2
>>> print stats2.find_variance( [ 1.0, 1.5, 1.8, 2.12 ] )
(0.17007496588230353, 1.6049999594688416)

However, the stats module is all C, including a fairly complex conversion routine, while the stats2 module is split into two layers. Also, I've removed the docstring from find_variance in statsimodule, instead placing it in stats2.py where it is visible at the upper layer.

Splitting code up like this is a popular approach; for instance, the socket module in the standard Python library is written in Python and uses a lower level C module, _socket. However, the socket module is not just a wrapper. It provides additional higher level functions.

I find that splitting the code up like this can create a maintenance problem; there is now an interface between stats2.py and statsimodule.c, and it is rather an odd interface. In fact, this is the whole point of the exercise: to use an unusual (some might just say crude) interface to make the C programming easier, but to supply a conversion layer so that Python programmers don't need to use that unusual interface. If I have to maintain that interface (which could mean documenting it somewhere other than in the comments of the source, or taking steps to make sure that both files are always updated together) then I've made extra work, which is what I've been trying to avoid in the first place. Code used as part of a test framework is frequently modified and run during development. Changes to the Python code generally take effect immediately, while changes to the C module do not take effect until it is rebuilt. So the potential risk of problems arising from incompatible versions is higher than usual.

Combining the Wrapper Layer with the Extension Module

Version incompatibility problems can be avoided by putting the C code and the Python code in the same file. Fans of the Obfuscated C contest are, no doubt, now scribbling down #if 0 and figuring out how to make the same file acceptable to both C and Python. This is pretty easy, actually, but it's not what I meant.

Built-in Python modules (including those based on C extensions) are of exactly the same basic Python type as modules created from pure Python code. Both types have a dictionary attached, which contains the global namespace of the module; the difference lies only in the kinds of function objects that are placed in this namespace when the module is constructed. So rather than putting the wrapper function in a separate Python file, you can include it in the C module.

The program stats3module.c (available electronically; see "Resource Center," page 5) is identical to statsimodule.c, except that the entry point is called _find_variance instead of find_variance. This lets you introduce find_variance later. Also, there are a few lines of Python code in a char[] variable called py_code_string and some extra code in the module initialization function.

The extra code simply obtains the namespace dictionary for the module and calls PyRun_String to execute the Python code. The module dictionary is supplied as the context for the execution, which means that all global values defined in the Python code wind up in the module's namespace. It also means that any Python functions defined by this operation will have, as their global namespace, the module's namespace.

So, referring to the contents of py_code_string, it looks a lot like stats2.py, except that it doesn't need to import a lower level module, and it refers to the lower level function directly as _find_variance, rather than as statsi.find_variance.

There is another step performed during the module initialization: Before calling PyRun_String, I obtain a reference to the __builtins__ module and place this reference into the stats3 module namespace. This module is part of the namespace of any pure-Python module; without it, all the built-in objects and functions (None, abs, len, range, open, to name a few) are unavailable. In fact, without this entry, you can't even execute an import statement because that is hooked through the __import__ function in __builtins__.
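The effect of running the embedded code against the module dictionary can be imitated from Python itself. The following sketch (Python 3 syntax) uses exec() in place of PyRun_String and a stub in place of the C entry point; the names stats3_demo, PY_CODE, and _find_variance_stub are made up for the illustration and do not come from stats3module.c. One difference to keep in mind: exec() inserts __builtins__ into the dictionary on its own, which is the step the C code performs explicitly.

```python
import types

# Stand-in for the contents of py_code_string.
PY_CODE = """
from array import array as _array

def find_variance(fltlist):
    data = _array('f', fltlist)
    return _find_variance(data)
"""

def _find_variance_stub(data):
    # Stand-in for the C function; returns (variance, mean).
    n = len(data)
    mean = sum(data) / n
    return sum(x * x for x in data) / n - mean * mean, mean

mod = types.ModuleType("stats3_demo")
mod._find_variance = _find_variance_stub   # the "C" entry point goes in first
exec(PY_CODE, mod.__dict__)                # then the Python layer is run in the same namespace

# The new function's globals are the module's own namespace,
# which is how it finds _find_variance and _array.
same_namespace = mod.find_variance.__globals__ is mod.__dict__
result = mod.find_variance([1.0, 2.0, 3.0, 4.0])
```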

Another little detail: The Python code in py_code_string imports array as _array, to keep it from being included when "from stats3 import *" is done. This statement includes all names not starting with an underscore; in the case of stats3, that means just find_variance.
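The underscore rule is easy to confirm with a throwaway module. The sketch below (Python 3 syntax) builds a module in memory, registers it under the made-up name stats3_names_demo, and then performs a star import into an empty namespace.

```python
import sys
import types

mod = types.ModuleType("stats3_names_demo")
exec("import array as _array\n"
     "def find_variance(fltlist):\n"
     "    return _array.array('f', fltlist)\n", mod.__dict__)
sys.modules[mod.__name__] = mod    # make the module importable by name

namespace = {}
exec("from stats3_names_demo import *", namespace)

# Ignore the __builtins__ entry that exec() adds to any globals dictionary.
exported = sorted(name for name in namespace if name != "__builtins__")
```

Only find_variance is copied; _array stays behind because its name starts with an underscore.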

So, stats3 looks just like stats and stats2, from the Python side; like stats, it's implemented as a standalone C extension, but, like stats2, it includes a Python wrapper layer.

When initially creating and debugging such a module, I sometimes replace the entire py_code_string contents temporarily with the single statement

execfile('mycode.py')

This lets me compile the extension just once, and debug the contained Python code by changing the contents of the file mycode.py. When it's working, I can just paste it back into the C string.
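Note that execfile is a Python 2 built-in and does not exist in Python 3. A rough equivalent, should you try the same trick there, is to read the file and exec() it in the target namespace. The helper name run_file is hypothetical.

```python
def run_file(path, namespace):
    """Execute the Python source in 'path' with 'namespace' as its globals."""
    with open(path) as src:
        # Passing the path to compile() keeps the file name in tracebacks.
        code = compile(src.read(), path, "exec")
    exec(code, namespace)
```

Inside the embedded string, a call such as run_file('mycode.py', globals()) would then play the role of the execfile statement.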

A Note on Type Conversions

There is an interesting difference between stats (implemented in statsmodule.c) and the other two implementations:

>>> import stats
>>> stats.find_variance( [ 10.0, 11, "12"] )
(0.66666666666666663, 11.0)

If you try this with the other modules, it raises an exception. statsmodule.c simply looks at each of the elements in the supplied sequence, and if the element is not a float, it tries to convert it to one. The string "12" can be converted to a float, 12.0. This is rather more implicit type conversion than most Python programmers are used to. Here is a more dangerous example:

>>> stats.find_variance( ("1234",) )
(0.0, 1234.0)
>>> stats.find_variance( ("1234") )
(1.25, 2.5)

The first example shows a tuple of strings—just one string—being passed to the function, so the mean is 1234.0 and the variance is zero. The second example shows a string being passed to the function—the parentheses without the comma have no effect. The string is, according to Python type conventions, a sequence of one-character strings, each being a single digit. So the second example finds the mean and variance of the four values 1.0, 2.0, 3.0, 4.0. This is a hazard of having too many automatic type conversions—a simple programming error may result in behavior different from what was expected, where you would rather get an exception.
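The difference between the two calls can be checked directly in the interpreter. This short sketch (Python 3 syntax) converts element by element with float(), which is roughly what statsmodule.c does for elements that are not already floats.

```python
a = ("1234",)    # a tuple holding one string
b = ("1234")     # just the string "1234"; the parentheses do nothing

kinds = (type(a).__name__, type(b).__name__)

# Element-by-element conversion of each argument:
values_a = [float(x) for x in a]   # one value
values_b = [float(x) for x in b]   # one value per character
```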

If you actually want the other modules to have this behavior, it's easy to add, since you can do it in the Python layer. Just change this:

if type(fltlist) is not list:
    fltlist = list(fltlist)

to:

fltlist = map(float, fltlist)

This breaks the supplied sequence into its components, explicitly converts each one to a Python float, and builds a list of the results. Making a similar kind of change to C code could require substantially more work.
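The two variants of the wrapper's conversion step can be compared side by side. The sketch below uses Python 3 syntax, where map() returns an iterator, so it is wrapped in list(). The function names pack_strict and pack_permissive are hypothetical.

```python
from array import array

def pack_strict(fltlist):
    """Conversion as in Listing Four: elements must already be numbers."""
    try:
        return array("f", list(fltlist))
    except (TypeError, ValueError):
        raise ValueError("bad parameter to find_variance")

def pack_permissive(fltlist):
    """Variant using map(float, ...): strings such as "12" are accepted."""
    try:
        return array("f", list(map(float, fltlist)))
    except (TypeError, ValueError):
        raise ValueError("bad parameter to find_variance")
```

With the permissive version, the string "1234" is again treated as four one-digit values, so the hazard described above comes back along with the convenience.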

Calling Python Code from C

stats4module.c (available electronically) implements the find_variance function directly in C again; but this time it takes its parameter and passes it to a pure Python function attached to the module, and lets that do all the work of the conversion. That function, _flts_to_array, converts whatever is passed to find_variance into an array.array of C floats, and returns that. It enforces the minimum length of one element and returns the result inside a tuple, so that the C code can pass it directly to PyArg_ParseTuple to get its address and length.

The args parameter, passed to the C function stats4_find_variance, is a reference to a Python tuple containing all of the parameters passed to the function. The same protocol is used when calling an arbitrary Python object from C, using PyObject_CallObject; the first parameter is the object to call, while the second is a tuple of the parameters. If someone improperly calls find_variance with zero parameters, or more than one, then an error will occur when PyObject_CallObject attempts to call _flts_to_array with the same parameter set. It is possible to catch this error in the Python code, and to provide more detailed error descriptions for other conditions, if desired.

If your C module uses callback functions, your wrapper needs to implement these C functions; the callback function generally needs to manipulate Python data. Calling Python code from these C functions is often the easiest way to handle this.

Reporting Errors

It is fairly easy to check the parameters in such a way that an exception is raised when there is a problem—many of the possible error conditions cause the normal conversions to throw exceptions.

It is generally more work to make sure that the exception messages are detailed and directly meaningful to API programmers. In the previous example, if the parameter count to find_variance is incorrect, the exception indicates that the number of parameters passed to the internal function _flts_to_array is incorrect, which may certainly be confusing. This issue is fairly common in Python; it is convenient to allow lower levels of code to check for problems and raise exceptions. The drawback is that you end up with possibly confusing error messages; you need to look at each line on the stack traceback to find where the problem really originated. With extra work, you can either check these conditions at a level closer to the API, or catch exceptions thrown by lower level code and restate them in better terms.
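Both remedies can be sketched in a few lines of Python 3. The function checked_floats is a hypothetical API-level wrapper, and _flts_to_list is a made-up stand-in for lower-level code; neither comes from the article's source files.

```python
def _flts_to_list(fltlist):
    # Stand-in for a lower-level helper that expects exactly one argument.
    return [float(x) for x in fltlist]

def checked_floats(*args):
    """API-level wrapper that reports problems in its own terms."""
    # Remedy 1: check the condition close to the API.
    if len(args) != 1:
        raise TypeError(
            "find_variance() takes exactly one argument (%d given)" % len(args))
    # Remedy 2: catch the low-level exception and restate it.
    try:
        return _flts_to_list(*args)
    except (TypeError, ValueError) as exc:
        # 'from exc' keeps the original exception attached for debugging.
        raise ValueError("find_variance: expected a sequence of numbers") from exc
```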

As I mentioned, I often write Python-C interfaces in order to create test frameworks in Python. I don't need to be too concerned about API users being confused by error messages, because this code does not wander very far from my desk. It is, however, important that the type and range checking is thorough. If, during testing, the application throws a segmentation fault, I want to be sure that this did not occur simply because the test framework actually passed bad data through the Python API. In other words, it should not be possible to cause memory corruption simply by writing incorrect Python code. This principle should be applied to all C/Python interfaces; much of the convenience of using Python is lost if you need to use C-language debugging techniques to find problems in your Python code.

User-Defined Python Types

If your C extension implements a Python data type, the interface between the Python wrapper and the C code can become more complex, and these techniques can be even more useful. For instance, the Python wrapper can define a Python class, which is based on the type defined in the C module, and supply implementations for methods that are more easily coded in Python. The methods of the base class, being written in C, could use the Python/C API calls to ensure that when member functions written in C are overridden in a derived class, the correct function is called even from C code. It is useful to be able to split such an implementation into a C layer and a Python layer, but it may be cumbersome to have these in separate source files.
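As a small illustration of the idea, the sketch below (Python 3 syntax) derives a Python class from a type that is implemented in C and adds a method written in Python. The standard array.array type stands in for the type your own extension would define, and the class name FloatVector is hypothetical.

```python
from array import array

class FloatVector(array):
    """Python-level class layered on a type implemented in C."""

    def __new__(cls, values=()):
        # Always use typecode 'f', that is, C floats.
        return super().__new__(cls, "f", values)

    def mean(self):
        # A method that is easier to write in Python than in C.
        if len(self) == 0:
            raise ValueError("mean of an empty FloatVector")
        return sum(self) / len(self)

v = FloatVector([1.0, 2.0, 3.0])
```

Here v still behaves like an array (indexing, append, the buffer interface), while v.mean() comes from the Python layer.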

The Downside

There are only two real drawbacks to this approach: The Python source code must appear in quoted strings in the C file, making it inconvenient to maintain, and the Python layer is "compiled" to Python byte-code every time the module is loaded. In a conventional approach, with a separate .py file, Python compiles the code once, and saves the results in a .pyc file for later use. So, the technique is mainly applicable when the Python code is quite short. This is not unusual because significant data conversions can be performed in a few lines of Python.

DDJ



Listing One

/* ---------- stats.c --------------
 * simple example of a C module
 */
#include <math.h>
#include "stats.h"
/* given a pointer to an array of 'n' floats (n >= 1), find the mean and
 * variance. The variance is returned directly; the mean via a pointer.
 */
double
find_variance( float const *data, int npts, double *mean_p)
{
    int i;
    double sum = 0;
    double sumsq = 0;
    double mean;
    /*
     * there are more numerically stable ways to do this, but
     * that's a different article...
     */
    for( i = 0; i < npts; ++ i ){
        double x= data[i];
        sum += x;
        sumsq += x*x;
    }
    mean = sum/npts;
    *mean_p = mean;
    return ( sumsq - mean*sum )/npts ;
}


Listing Two
/* ---------- statsmodule.c -------------- */
/* python extension based on stats.c       */

#include <Python.h>
#include "stats.h"

static char stats_find_variance_doc[] = 
"Given a sequence of floats, find the variance and mean";

static PyObject *
stats_find_variance(PyObject *self, PyObject *args)
{
    double mean, var;
    PyObject * flt_list;
    PyObject * lst_element;
    int n,i;
    float *tmp_arr;
    /* must be a single parameter which is a sequence of floats. */
    if( PySequence_Size(args)!=1 ){
        flt_list = 0;
    }else{
        flt_list = PySequence_GetItem(args,0);
    }
    /* at this point either flt_list is NULL, or
     * it's the param and we own a reference to it
     */
    if( flt_list ==0  || !PySequence_Check(flt_list)){
        PyErr_SetString( PyExc_TypeError, 
                "find_variance accepts a list of floats");
        Py_XDECREF(flt_list);
        return 0;
    }
    /* we know it's a sequence. Get its length, and allocate an
     * array of floats. Length of 0 is an error.
     */
    n = PySequence_Size( flt_list );
    if( n < 1 ){
        PyErr_SetString( PyExc_ValueError,"find_variance: empty list");
        Py_XDECREF(flt_list);
        return 0;
    }
    tmp_arr= PyMem_Malloc( n * sizeof(float));

    if( !tmp_arr ){
        PyErr_SetString( PyExc_MemoryError, "out of memory");
        Py_DECREF(flt_list);
        return 0;
    }
    /* now fill in all the floats. If any operation fails, we have to free 
     * everything. The code is structured so that if something returns NULL, 
     * it will drop down to the single error test. We don't need to
     * set the exception, since the failed operation will do it.
     */
    for( i = 0; i < n ; i++ ){
        lst_element = PySequence_GetItem( flt_list, i );

        if( lst_element != 0 && !PyFloat_Check(lst_element)){
            /* succeeded, but not a float - convert it */
            PyObject *new_obj= PyNumber_Float( lst_element );
            Py_DECREF(lst_element); 
            lst_element = new_obj;  /* may be 0 */
        }
        if( lst_element != 0 ){
            tmp_arr[i] = (float)PyFloat_AsDouble(lst_element);
            Py_DECREF(lst_element);
        }else{
            Py_DECREF(flt_list);
            PyMem_Free(tmp_arr);
            return 0;
        }
    }
    Py_DECREF(flt_list);    /* don't need that any more */
    /* call the actual C routine with all the floats */
    var = find_variance( tmp_arr, n , & mean );
    /* free our tmp array and return results */
    PyMem_Free(tmp_arr);
    return Py_BuildValue("dd", var, mean);
}
/* table of module methods. If the module has multiple
 * Python-callable functions, they are all listed here.
 */
static PyMethodDef StatsMethods[] = {
    {"find_variance",  stats_find_variance,
     METH_VARARGS,  stats_find_variance_doc},
    {NULL, NULL, 0, NULL}        /* Sentinel */
};
/* The module initialization function */
PyMODINIT_FUNC
initstats(void)
{
    Py_InitModule("stats", StatsMethods);
}


Listing Three
/* ---------- statsimodule.c ----------------- */

#include <Python.h>
#include "stats.h"
/*
 * This is an extension implementing the 'stats'
 * module which is intended to be wrapped by a
 * python module; it provides a simple internal
 * interface.
 *
 * the 'find_variance' entry point accepts
 * a single parameter, which is expected to
 * be a read-only buffer of floats (at least one)
 */
static PyObject *
statsi_find_variance(PyObject *self, PyObject *args)
{
     double mean, var;
     char *data_ptr;
     int nbytes,n;

     /*
      * expect a single parameter, which is a string
      * or buffer, containing value in native order.
      * The API call below gets us the address and len
      * of the string, or fails.
      */
     if (!PyArg_ParseTuple(args, "s#", &data_ptr, &nbytes))
          return NULL;
     /*
      * check the len - it must be a multiple of sizeof(float)
      */
     n = nbytes/sizeof(float);
     if( n <= 0 || n*sizeof(float) != nbytes ){
          PyErr_SetString( PyExc_ValueError,
                     "find_variance: bad length");
          return 0;
     }
          
     /*
      * call the actual C routine with all the floats.
      * Note: there is a risk here that the data will not
      * be aligned on a suitable boundary for floats. This
      * is not an issue on X86 hardware; also, the Python
      * wrapper can guarantee this, if needed, by using 
      * array.array type instead of a string.
      */
     var = find_variance( (float const*) data_ptr, n , & mean );
     /*
      * return results
      */
     return Py_BuildValue("dd", var, mean);
}
/*
 * table of module methods. If the module has multiple
 * Python-callable functions, they are all listed here.
 */
static PyMethodDef StatsiMethods[] = {
    {"find_variance",  statsi_find_variance,
     METH_VARARGS,  ""},
    {NULL, NULL, 0, NULL}        /* Sentinel */
};
/*
 * The module initialization function
 */
PyMODINIT_FUNC
initstatsi(void)
{
    Py_InitModule("statsi", StatsiMethods);
}


Listing Four
#---- stats2.py ----
# same as 'stats', but done as a wrapper around a C
# extension (statsi)
#
import statsi
import array

def find_variance( fltlist ):
    "Given a sequence of floats, find the "\
        "variance and mean"
    try:
        if type(fltlist) is not list:
            fltlist = list(fltlist)
        data = array.array( 'f', fltlist)
    except (TypeError,ValueError):
        raise ValueError('bad parameter to find_variance')
    return statsi.find_variance(data)

#
# don't export array and statsi
#
__all__ = ['find_variance']