Ruby: The Next Facet

By Matthew Wilson, July 01, 2004

This month Matthew maps his recls library to the Ruby scripting language.


Matthew Wilson is a software development consultant for Synesis Software and creator of the STLSoft libraries. He is the author of the forthcoming Imperfect C++ (to be published this fall by Addison-Wesley, 2004), and is currently working on his next two books, one of which is not about C++. He can be contacted at http://stlsoft.org/.

In previous columns, I've introduced recls, a platform-independent library that provides recursive filesystem searching, and have demonstrated techniques for integrating such a C/C++ library with C++ ("normal" classes and STL sequences), C#, COM, D, and Java by implementing recls mappings for those languages. The source for all the versions of the libraries and the mappings is available from http://www.cuj.com/code/ and http://recls.org/downloads.html, respectively.

This month, I map recls to its first scripting language: Ruby, which was invented by Yukihiro Matsumoto (Matz). I'd done little with Ruby before implementing this mapping, but I'm happy to report that not only is it an easy language to learn and use, it is also pretty easy to write C extensions for, as you'll see.

recls Improvements

I've made a few changes to the basic API for this version. The main reason for this is to incorporate, within the recls API, easy access to the filesystem roots so that client code can be made platform independent. For example, on UNIX there is only one filesystem root: "/". On Win32, however, there can be a number of drives; for example, "C:\", "D:\", "H:\", "P:\". To encapsulate this in a platform-independent form, I've added the Recls_GetRoots() function, which has the following signature:

size_t Recls_GetRoots(recls_root_t *roots, size_t cRoots);

where the structure recls_root_t is defined as:

#if defined(RECLS_PLATFORM_IS_WIN32)
# define RECLS_ROOT_NAME_LEN   (3)
#elif defined(RECLS_PLATFORM_IS_UNIX)
# define RECLS_ROOT_NAME_LEN   (1)
#else
# error Platform not recognized
#endif /* platform */
typedef struct recls_root
{
  recls_char_t  name[1 + RECLS_ROOT_NAME_LEN];
} recls_root_t;

You call Recls_GetRoots() with cRoots == 0 to elicit the number of available filesystem roots, then allocate an array of recls_root_t of that length and call it again, passing the address of the array and the number of elements.

I've added three more functions to aid the writing of platform-independent client code, each of which takes no parameters and returns recls_char_t const* pointers to internally managed string constants.

Recls_GetPathNameSeparator() returns the symbol used to separate distinct path names in path-name lists, which is ":" on UNIX and ";" on Win32.
Recls_GetPathSeparator() returns the symbol used to separate the directory parts within paths, which is "/" on UNIX and "\" on Win32.
Recls_GetWildcardsAll() returns the wildcard symbol used to represent "all files" for the current operating system. For now, this symbol is "*" on UNIX and "*.*" on Win32 because recls currently uses the operating system's wildcard matching. I hope to plug in more sophisticated wildcard matching whose "all files" symbol will be seamlessly updated by changing the implementation of this function internally.
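
The kinds of values these three functions return can be illustrated from the Ruby side. The following is a minimal sketch, not part of recls: the PlatformSketch module and its method names are hypothetical, and it relies on Ruby's own File constants rather than on the recls API.

```ruby
# Hypothetical stand-ins, not the recls API: they mirror the kind of
# values the three recls functions are described as returning.
module PlatformSketch
  # True when Ruby reports a backslash as the alternate separator (Windows).
  def self.windows?
    File::ALT_SEPARATOR == "\\"
  end

  # Separator between entries of a path list: ":" on UNIX, ";" on Windows.
  def self.path_name_separator
    File::PATH_SEPARATOR
  end

  # Separator between directory parts: "/" on UNIX, "\" on Windows.
  def self.path_separator
    File::ALT_SEPARATOR || File::SEPARATOR
  end

  # "All files" wildcard, following the convention described above.
  def self.wildcards_all
    windows? ? "*.*" : "*"
  end
end

# Build and split a path list without hard-coding either separator.
dirs  = ["usr", "local"].join(PlatformSketch.path_separator)
list  = [dirs, "opt"].join(PlatformSketch.path_name_separator)
parts = list.split(PlatformSketch.path_name_separator)
```

Client code written this way never mentions "/" or ";" directly, which is the point of exposing the separators through functions.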

All of the current mappings—C++, C#, COM, D, Java, and STL—have been updated to incorporate this new functionality, and now provide the appropriate wrapping for the roots information. The latest version, 1.4, is available at http://www.cuj.com/code/ and http://recls.org/downloads.html.

Ruby

In the various documentary materials that come with the free Ruby distribution (http://ruby-lang.org/), Ruby is described as an object-oriented scripting language. This is indeed the case, although that does not mean that you have to bother making everything a method of an overt class, as in Java. Any free functions you write join the global object as methods, but you are free and able to define classes and methods in a simple and straightforward way.

Attempting a full description of Ruby here would be impossible, but thankfully, I don't need to. Andy Hunt and Dave Thomas, The Pragmatic Programmers (http://pragmaticprogrammer.com/), have written a book on the subject, Programming Ruby: The Pragmatic Programmer's Guide (Addison-Wesley, 2000), and have kindly included a free CHM version within the Ruby distribution.

Ruby is easy to learn for anyone who has mastered any of the C-family languages, providing no surprises. As far as I've been able to determine, it seems to take the best of Perl and Python, mixed with a few other nice features from some other languages and some novel ideas of its own.

The Implementation

Probably the best way to learn about extending Ruby is to look at the code. Listing 1 shows the main entry function for the recls mapping, which must be called Init_recls(). The first thing you see is that there are three VALUE global variables specified: mRecls, cFileSearch, and cEntry. These represent the Recls module object and the class objects for the Recls::FileSearch and Recls::FileSearch::Entry classes. Strictly speaking, the first two do not need to be defined in the global scope because they are not used outside the Init_recls() function, but it's usual to need them throughout extension implementations, so it's customary to make them module global. You'll see where you need cEntry later on.

The first set of major tasks performed in the initialization function revolves around the Recls module itself. The first thing is to define a module, which you do by passing the "Recls" string to rb_define_module() (Ruby module names begin with an uppercase letter). Now that you have a module, held in the mRecls variable, you can define constants, functions, and classes within it. Several constants are defined. The first two provide a version for the Ruby extension (another convention) and the RECLS_VERSION constant. Both of these are derived from the RECLS_VER_MAJOR, RECLS_VER_MINOR, and RECLS_VER_REVISION constants provided in recls.h, whose current values are 1, 4, and 1. They are converted to a string form, then created as Recls module constants via the rb_define_const() function. This function takes the opaque VALUE type (used for just about everything in Ruby), which you obtain from a C string via the rb_str_new2() function. Since (almost) everything in Ruby is garbage collected, you don't have to be concerned about memory-management issues when creating and using VALUEs.

Next, I define the recls flag constants RECLS_F_FILES, RECLS_F_DIRECTORIES, and so on. Since these are enum values, hence integers, the corresponding VALUEs are created via the rb_uint2inum() function.

The last part of the module definition is to create entries for the four module-level functions. Like D, Ruby lets you call no-parameter methods (of modules or of classes) without parentheses, meaning that a method call looks syntactically like a member variable access. D calls them "properties"; Ruby calls them "attributes." What this means is that our four functions roots(), pathNameSeparator(), pathSeparator(), and wildcardsAll() are each accessible as if they were member variables of the module, as in:

Recls::roots.each { |root| . . . do stuff with root . . . }

Each of the four attributes is defined in the following fashion:

rb_define_module_function(mRecls, "roots", recls_roots_get, 0);

This says that the module mRecls should have a function (attribute) backed by the C function recls_roots_get(), which takes 0 arguments. Listing 2 shows the definitions of the four C functions. recls_roots_get() creates an array via rb_ary_new(), then calls Recls_GetRoots() with a fixed array of 26 elements (enough for both UNIX and Win32). It then iterates over the retrieved roots and pushes each one in VALUE form (via rb_str_new2()) onto the array. The other three functions are trivially simple. None of these four functions uses its self argument because they are "free" functions.
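
For comparison, here is roughly what the same shape looks like when written in Ruby itself. This is a sketch only: ReclsSketch and the values it returns are placeholders, and the real module is defined in C as described above.

```ruby
# Pure-Ruby sketch of a module with "free" functions; ReclsSketch and the
# values it returns are placeholders, not the real extension.
module ReclsSketch
  module_function

  # Stand-in for the roots attribute: returns an array of root names.
  def roots
    ["/"]
  end

  # Stand-in for the pathSeparator attribute.
  def pathSeparator
    "/"
  end
end

# No parentheses are needed, so the calls read like attribute access.
names = []
ReclsSketch::roots.each { |root| names << root }
sep = ReclsSketch.pathSeparator
```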

The next main section of the Init_recls() function is concerned with defining the FileSearch class. The first statement calls rb_define_class_under() to define the class; it passes three arguments. The first, mRecls, denotes that it will be defined within the Recls module. This means it will be referenced via Recls::FileSearch and thereby resides within the namespace defined by the Recls module. The second argument, "FileSearch", specifies its name. The third, rb_cObject, is the built-in instance for the Object class. Hence, FileSearch does not inherit from another specific class, only the root Object class; all Ruby classes must inherit from something, and Object is the default if you do not specify anything within Ruby code. When defining a class from within an extension, you must specify rb_cObject to achieve the same thing.

Now that we have a class, we can provide its characteristics. The next line calls rb_include_module(cFileSearch, rb_mEnumerable) to "include" the built-in Enumerable module into the FileSearch class. This can seem a little strange at first, but Ruby has a great way of providing reusable code via the module mixins. Essentially, if you include a module into a class, you inherit all the methods of the module as methods of the including class. If the module methods have been written appropriately, the methods can extend the behavior of the class by relying on expected features of the class. It's rather like a bolt-in template in C++, whereby the outer template methods and/or types are implemented in terms of the parameterizing type's fields/methods/types.

This is easier to demonstrate than to explain: The Enumerable module provides a host of enumerating methods, such as find() (for finding an item matching given criteria), grep() (returning items matching a given regexp), include?() (for detecting the presence of a given item), sort() (for sorting the items according to given criteria), and many more. All that Enumerable requires of its mixing class is that it provides an each() method, which FileSearch does, as you can see in Listing 1. As well as each(), FileSearch also provides the initialize() method, which acts as its constructor, and is marked as taking three parameters, for the search root, pattern, and flags parameters that we see in all recls mappings. Listing 3 shows the definitions of the FileSearch_initialize() and FileSearch_each() functions.
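
The mixin mechanism can be tried in a few lines of plain Ruby. The sketch below uses a hypothetical WordList class that has nothing to do with recls; it supplies only each() and gets the rest from Enumerable.

```ruby
# A minimal class that supplies each() and mixes in Enumerable.
# WordList is a hypothetical example class, unrelated to recls.
class WordList
  include Enumerable

  def initialize(*words)
    @words = words
  end

  # The only method Enumerable requires: yield every item in turn.
  def each
    @words.each { |w| yield w }
  end
end

list = WordList.new("pear", "apple", "fig")

first_long = list.find { |w| w.length > 3 }   # first item with more than 3 letters
with_p     = list.grep(/p/)                   # items matching a regexp
has_fig    = list.include?("fig")             # membership test
sorted     = list.sort                        # items in sorted order
```

None of find(), grep(), include?(), or sort() is written in WordList; each is implemented by Enumerable in terms of the class's each().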

FileSearch_initialize() is straightforward. It takes four parameters: the obligatory self (this) parameter, and the three search parameters. Since it is the only point in the recls-Ruby mapping at which information comes in from the outside world, and Ruby is not a statically typed language, these three parameters must be validated as being of the correct types. This is done via the Ruby extension API function Check_Type(), which throws an exception if a type mismatch is detected. Once the types are validated, the three parameters are simply stored in member variables, ready for use later. (In Ruby, member variables are prefixed with the @ symbol.)
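
A pure-Ruby analogue of that constructor might look like the following. It is a sketch under stated assumptions: SearchSketch is a hypothetical class, and it assumes the root and pattern are strings and the flags an integer. The real extension performs these checks in C.

```ruby
# Pure-Ruby analogue of the checks described above; SearchSketch is a
# hypothetical class, and the real extension does this in C via Check_Type().
class SearchSketch
  def initialize(search_root, pattern, flags)
    # Reject arguments of the wrong type before storing them.
    raise TypeError, "search_root must be a String" unless search_root.is_a?(String)
    raise TypeError, "pattern must be a String" unless pattern.is_a?(String)
    raise TypeError, "flags must be an Integer" unless flags.is_a?(Integer)

    # Instance (member) variables carry the @ prefix.
    @search_root = search_root
    @pattern     = pattern
    @flags       = flags
  end
end

ok = SearchSketch.new(".", "*.rb", 0)

# A mismatched argument raises TypeError instead of being stored.
rejected = begin
  SearchSketch.new(".", "*.rb", "not a number")
  false
rescue TypeError
  true
end
```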

The member variables are used in the FileSearch_each() function, which is where pretty much all the action happens. The first three lines elicit the values (as VALUE) of the three member variables, using the rb_iv_get() function. The rest of this function is then pretty much recls boilerplate, with a couple of notable differences. If the call to Recls_Search() fails for a reason other than RECLS_RC_NO_MORE_DATA, then we elicit the corresponding error string from the recls API, and call rb_throw(), passing our message and Qnil. Qnil is the Ruby null value and, in this case, indicates that we do not have any additional error information to send.

If the search is successful, we need to process the elements. This is done by a wonderful Ruby facility known as yield. The yield statement has the effect of executing the currently associated block, which may have been passed into a method, for example. Where it really comes into its own is when dealing with iteration. If you create a method that performs an iterative function, say the calculation of a mathematical series, then calling yield with the current value means that a block passed in from client code will be given the current value and executed, after which the iteration continues. You can see how this looks in Ruby in the test program shown in Listing 6 (available at http://www.cuj.com/code/). You can see how it works from within an extension by looking at the contents of the for loop in the FileSearch_each() function. All that we do is call rb_yield(), passing in the current Entry instance (which is returned by Entry_create()). Whatever code block is passed to the each method on a FileSearch instance will be executed here, for each entry.
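
The mathematical-series case mentioned above can be sketched directly. The fib_series helper below is hypothetical, written only for this illustration; it yields each term of the Fibonacci series to whatever block the caller supplies.

```ruby
# Yields the first n terms of the Fibonacci series to the caller's block.
# fib_series is a hypothetical helper written for this illustration.
def fib_series(n)
  a, b = 0, 1
  n.times do
    yield a          # run the caller's block with the current term
    a, b = b, a + b  # then continue the iteration
  end
end

terms = []
fib_series(8) { |value| terms << value }
```

The method knows nothing about what the block does with each term, just as FileSearch_each() knows nothing about what client code does with each Entry.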

Before looking at the implementation of the Entry class, I'll finish with the FileSearch code in Listing 1. By default, Ruby member variables are private, which is a nice thing. To make them accessible to client code, we need to define attributes for them. This is done in the extension by the rb_define_attr() function. For each of the searchRoot, pattern, and flags members, an attribute is defined. The last two arguments are 1 and 0. This denotes that the attributes may be read (1) but not written (0). Naturally, you can define writable attributes by making the fourth argument 1.
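
In Ruby source, the equivalent of a readable but not writable attribute is attr_reader. The sketch below uses a hypothetical SearchParams class; only the attribute names are taken from the mapping.

```ruby
# Read-only attributes in pure Ruby; SearchParams is a hypothetical class.
class SearchParams
  # Readers only, comparable to passing read = 1, write = 0 in the extension.
  attr_reader :searchRoot, :pattern, :flags

  def initialize(search_root, pattern, flags)
    @searchRoot = search_root
    @pattern    = pattern
    @flags      = flags
  end
end

params = SearchParams.new("/tmp", "*.c", 0)
root = params.searchRoot

# No writer was defined, so the object does not respond to assignment.
writable = params.respond_to?(:searchRoot=)
```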

Now it's time to look at the Entry class. Most of the class definition in Listing 1 should be familiar now. We define the class with rb_define_class_under(), this time making it a member class of FileSearch. It has some attributes—path, drive, file—and a large number of accessor methods; for example, directory(), directoryPath(), and size(). This class also includes a module; this time it's the built-in Comparable module that provides a variety of comparison mixin functions, and requires only that the mixing class provides the <=>() method. There's one new feature with Entry in that we've defined a method alias, from path() to to_s(). to_s() is a stock method, used to convert an object to a string form. If you don't provide this mapping, then when you print a string form of an Entry, you'd get something like "#<Recls::FileSearch::Entry:0x2774420>".
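
The Comparable mixin and the to_s alias can both be seen in a small pure-Ruby class. EntrySketch is hypothetical and holds nothing but a path; it is a sketch of the idea, not of the extension's Entry class.

```ruby
# Hypothetical pure-Ruby entry class showing Comparable and a to_s alias.
class EntrySketch
  include Comparable

  attr_reader :path

  def initialize(path)
    @path = path
  end

  # The one method Comparable needs; orders entries by path.
  def <=>(other)
    path <=> other.path
  end

  # Make string conversion return the path instead of the default form.
  alias_method :to_s, :path
end

a = EntrySketch.new("/etc/hosts")
b = EntrySketch.new("/etc/passwd")

smaller = a < b               # supplied by Comparable
between = b.between?(a, b)    # also supplied by Comparable
text    = "#{a}"              # string interpolation calls to_s
```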

At this point, I've examined the main entry point in Listing 1, and looked at the implementation of the Recls module functions (Listing 2) and the Recls::FileSearch class (Listing 3). A sample that contains several functions that provide the implementation of the Recls::FileSearch::Entry class is available at http://www.cuj.com/code/ (see Listing 4).

The Entry_create() function was used in the rb_yield() call in FileSearch_each() (Listing 3). Because an Entry manages a recls entry (that is, a recls_info_t handle), you need to associate the handle with the instance. This is done by the Ruby API Data_Wrap_Struct(). It takes the class variable, cEntry, two function pointers, and the data handle. The first of the two function pointers is used for garbage-collection marking, which we don't care about in this case. The second is used to release the resource associated with the instance. We specify the Entry_free() function, which simply casts the void* to recls_info_t, and then calls Recls_CloseDetails(). The remainder of Entry_create() creates the member variables path, drive, and file, whose attributes are declared in Init_recls() (see Listing 1). As Matz pointed out when I sent him an early version, this is not really necessary, since it is unlikely to be the optimization here that it probably is in the Java mapping. However, I left it in for pedagogical purposes. The remaining attributes of Entry are provided via methods.

The four time methods are implemented in terms of Entry_time_get(), which takes a pointer to one of the recls time parameter accessor functions, such as Recls_GetCreationTime(). This function uses the Entry_entry_handle_get() helper to get the recls entry handle from the Ruby self parameter, and calls the given recls time function. The Win32 version converts this to a time_t, and this is then converted to a VALUE via rb_time_new().

The isReadOnly, isDirectory, and isLink attribute functions are implemented in terms of the corresponding recls API functions, simply translating the Boolean values to the Ruby equivalents Qtrue and Qfalse. Don't forget to do this translation because although Qfalse is 0, Qtrue is actually 2. (Ruby reserves the lowest bit of all VALUEs, which is why the fixed integer type is a 31- or 63-bit type.)
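
There is a related point on the Ruby side: only false and nil count as untrue, so a flag handed back as a Ruby integer rather than as Qtrue or Qfalse would always test as true in client code. The truthy? helper below is hypothetical and exists only to make that visible.

```ruby
# In Ruby only false and nil are falsy; the integer 0 is truthy.
def truthy?(value)
  value ? true : false
end

zero_is_true  = truthy?(0)
false_is_true = truthy?(false)
nil_is_true   = truthy?(nil)
```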

Several of the string methods—directory, shortFile, filename, and fileExt—are implemented in terms of the Entry_strptr_get() function, which takes the offset of the particular string range structure (recls_strptrs_t) within the recls entry info, and then returns a Ruby string (in a VALUE) corresponding to the given structure. directoryPath is retrieved via the recls API function Recls_GetDirectoryPathProperty(), since it uses different parts of the entry structure depending on the operating system.

The last three functions are slightly less straightforward. Entry_directoryParts_get() creates an array and enumerates through the directory parts range, pushing each one onto the array in Ruby string form.

Entry_size_get() should be straightforward except for the fact that the recls file size can be different depending on platform, hence the preprocessor free-for-all. Basically, if the Ruby C extension header files define HAVE_LONG_LONG, then you convert the file size to a Ruby fixed number (a 63-bit number); otherwise, you convert it to a Ruby big number (which is an arbitrary-sized integer, but which performs less efficiently than the fixed number).

The final function, Entry_cmp(), provides the comparison function for the <=>() method. The implementation of this is reasonably simple, but note that on Win32, the comparison is carried out in a case-insensitive fashion, in line with the case-insensitive Win32 filesystem.
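
The difference between the two comparison modes can be sketched in Ruby. The compare_paths helper is hypothetical; the actual Entry_cmp() is written in C and selects its behavior at compile time rather than through a parameter.

```ruby
# Hypothetical helper: compare two paths, ignoring case when asked to,
# as the Win32 build of Entry_cmp() is described as doing.
def compare_paths(lhs, rhs, ignore_case)
  ignore_case ? lhs.casecmp(rhs) : (lhs <=> rhs)
end

same_on_win32 = compare_paths("C:\\Temp\\A.TXT", "c:\\temp\\a.txt", true)
diff_on_unix  = compare_paths("/tmp/A.TXT", "/tmp/a.txt", false)
```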

Building the Mapping

Another way in which Ruby makes things easy is by the provision of the mkmf module, which helps in the creation of extensions. Basically, all that is needed is to create the script extconf.rb (available at http://www.cuj.com/code/; see Listing 5), then issue the command:

ruby extconf.rb --with-recls-include=$(RECLS_DEV)/API/include --with-recls-lib=$(RECLS_DEV)/lib

This scans the project and creates a .DEF file (on Win32) and a makefile. All you then do is "make," and your project is built. It assumed Visual C++'s cl.exe, which was fine with me at the time. I expect there's a way to make it recognize your compiler of choice, but I didn't have time to delve into that.

There's a slight weirdness in that Ruby shows its UNIX roots, because even on Win32 it creates recls.so. But the Ruby runtime seems perfectly happy with this, and obviously expects it, so all is well. I've actually installed this into the Ruby distribution extension directories—in $(RUBY_INSTALL)/lib/ruby/site_ruby—so that I always have recls-Ruby whenever I need it.

Using recls-Ruby

So once recls-Ruby has been implemented, how do you actually use it? Well, this is the simple part. The full test program is available at http://www.cuj.com/code/ (see Listing 6). The main work is done within the doSearch() method. A FileSearch instance is created via Recls::FileSearch::new() (which calls FileSearch_initialize(); see Listing 3). This is then used via the each construct, whose block prints out the path (via the String(fe) expression, which calls to_s()), and also prints out the various aspects of the entry if a succinct processing has not been requested. All the string attributes are simply printed out. The size is converted via String(fe.size), and the times are converted to string form via the strftime() method.
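
For readers without the extension to hand, the shape of such a search loop can be approximated in plain Ruby. The sketch below is not Listing 6 and does not use recls: FileSearchSketch and do_search are hypothetical names, and the search itself is built on Find and File.fnmatch? from Ruby's standard library.

```ruby
require "find"

# Hypothetical pure-Ruby stand-in for a recursive search; it uses Find and
# File.fnmatch? from the standard library rather than the recls extension.
class FileSearchSketch
  include Enumerable

  def initialize(search_root, pattern)
    @search_root = search_root
    @pattern     = pattern
  end

  # Yield the path of every file under the root whose name matches.
  def each
    Find.find(@search_root) do |path|
      next unless File.file?(path)
      yield path if File.fnmatch?(@pattern, File.basename(path))
    end
  end
end

# Collect one line of path, size, and modification time per match.
def do_search(search_root, pattern)
  lines = []
  FileSearchSketch.new(search_root, pattern).each do |path|
    stat = File.stat(path)
    lines << "#{path} #{String(stat.size)} #{stat.mtime.strftime('%Y-%m-%d %H:%M:%S')}"
  end
  lines
end
```

Calling do_search(".", "*.rb") and printing each returned line gives output in the same spirit as the test program: a path, followed by its size and a formatted time.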

Like other scripting languages, code that is not encapsulated within a class or a method is part of the mainline, and that's what the second half of Listing 6 contains. It's all pretty standard stuff to Ruby programmers, and should be reasonably obvious to those who are not. The other noteworthy part is the use of the Recls module roots attribute, which is used if the options request search over all filesystem roots.

As well as being a useful exercise for the recls-Ruby mapping, I've actually started using this script for real. It's really nice to be able to combine the power of recls with the convenience of a scripting language, and make little temporary mods to the script to suit a particular need.

Although it's not shown in the test program, we can also take advantage of the Enumerable mixin facilities to carry out other kinds of processing on the entries of a given search. For example, you could collect just the filename + extension via:

nameExts = fs.collect { |fe| fe.file }

Alternatively, you could select only the read-only items:

ro_items = fs.select { |fe| fe.isReadOnly }

You can sort the items:

sorted_items = fs.sort

and even find the largest (max) file:

largest = fs.max { |a,b| a.size <=> b.size }
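
These four snippets can be tried without the extension by substituting stand-in objects. In the sketch below, EntryStub and SearchStub are hypothetical; only the method names file, size, and isReadOnly follow the mapping described in this column, and the entries are made-up sample data.

```ruby
# Stand-in entry exposing the attribute names used in the snippets above.
class EntryStub
  attr_reader :path, :file, :size, :isReadOnly

  def initialize(path, file, size, read_only)
    @path, @file, @size, @isReadOnly = path, file, size, read_only
  end

  # sort needs only <=>; order entries by path.
  def <=>(other)
    path <=> other.path
  end
end

# Stand-in search: an each() over a fixed list, plus Enumerable.
class SearchStub
  include Enumerable

  def initialize(entries)
    @entries = entries
  end

  def each
    @entries.each { |e| yield e }
  end
end

fs = SearchStub.new([
  EntryStub.new("/src/main.c",  "main.c",  1200, false),
  EntryStub.new("/src/conf.h",  "conf.h",   300, true),
  EntryStub.new("/src/big.dat", "big.dat", 9000, true)
])

nameExts     = fs.collect { |fe| fe.file }
ro_items     = fs.select  { |fe| fe.isReadOnly }
sorted_items = fs.sort
largest      = fs.max { |a, b| a.size <=> b.size }
```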

Next Steps

As for the next installment of the recls library, I'd like to get a Python mapping written, as I'm crying out for recls in my general housekeeping scripts, most of which are Python. However, having learned—and become a fan of—Ruby for this column, I'm considering making the jump from Python to Ruby. Being slightly versed in a wide variety of scripting languages is nice, but being master of none means that one is forever reaching for textbooks or previously written scripts. Maybe I should pick one and stick to it.

In terms of recls itself, I'm still hoping to reorganize some of the badly named functions sometime soon. Of more significance, however, is my desire to build proper, platform-independent wildcard processing into the pattern matching. I also want to be able to specify composite patterns; for example, "*.cpp:*.java:*.rb:". At the moment, one of my favorite and most useful tools, whereis (http://synesis.com.au/r_systools.html), is still in its original STLSoft sample code form (which you can download from http://stlsoft.org/), and uses WinSTL/UNIXSTL filesystem enumeration sequences to manually descend directories. I've been wanting to plug in recls for a long time, and consider that my ultimate test of its genuine usefulness. Once the composite patterns and wildcard processing are done, I can "update" whereis, and truly feel like recls is my one-stop shop for filesystem enumeration. Hopefully, it can be the same for you.
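
To make the composite-pattern idea concrete, here is one way it could behave, sketched in Ruby. This is an assumption about the intended semantics, not a recls feature: matches_composite? is a hypothetical helper, and the matching is done by File.fnmatch? from Ruby's core library.

```ruby
# Hypothetical helper sketching composite-pattern matching; recls does not
# provide this yet. It relies on File.fnmatch? from Ruby's core library.
def matches_composite?(composite, file_name, separator = ":")
  # A file matches if any one of the separated patterns matches it.
  composite.split(separator).any? do |pattern|
    File.fnmatch?(pattern, file_name)
  end
end

cpp_hit = matches_composite?("*.cpp:*.java:*.rb:", "search.cpp")
rb_hit  = matches_composite?("*.cpp:*.java:*.rb:", "test.rb")
no_hit  = matches_composite?("*.cpp:*.java:*.rb:", "notes.txt")
```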

Feel free to write to me (or post a FAQ at http://recls.org/faq.html) and suggest other languages/technologies for which you'd like to see a recls mapping. I've had a few requests for some of the less well-known languages, and I plan to feature some of these later in the series, so keep your requests coming.

Acknowledgments

Thanks to Yukihiro Matsumoto for sharing Ruby with the world, and for providing some useful feedback on an early implementation of the recls-Ruby mapping.
