Matthew Wilson is a software development consultant for Synesis Software, creator of the STLSoft libraries, and author of Imperfect C++ (Addison-Wesley, 2004). He can be contacted at http://stlsoft.org/.
Q: How do you know there's an elephant in your refrigerator?
A: Footprints in the butter.
Because "Positive Integration" focuses on real projects whose evolutions involve dealing with general problems and issues, I seem to be spending a fair amount of time discussing these evolutions. Although this is somewhat outside the official charter of the column, it's worthwhile examining the issues because they inform the experience of any language-integration project involving evolving libraries. As suggested by the classic "Footprints in the Butter" joke, this two-part column is all about code bloat (source size, object size, and executable size), and about finding its sources and rooting them out. This month, I examine the source changes. In the next installment, I look at issues involving binary size.
In addition to adding a recls/Python mapping and other features as part of the evolution of the recls library from 1.5 to 1.6, I've had occasion to have a good go at refactoring the code to redress a fair amount of redundancy between the UNIX and Win32 implementations. And I've also had a concerted, albeit only the first, effort to try and get the elephant out of the fridge.
Prior to Version 1.5.3, recls did not support Universal Naming Convention (UNC) paths (\\HostName\ShareName\SomeDir\SomeFile.exe, for instance) and they broke it in unpleasant ways. This has been completely addressed in 1.6 in terms of correctly parsing UNC paths, and in the addition of the new Recls_IsFileUNC() function to the API.
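To make the shape of a UNC path concrete, here is a minimal sketch of how one might be recognized. The helper looks_like_unc() is hypothetical and is not part of the recls API; it only checks for two leading backslashes followed by a host name and a share name, and makes no claim about how recls itself parses such paths.

```cpp
#include <string>

// Hypothetical helper (not part of the recls API): returns true when the
// path has the shape \\HostName\ShareName[\...].
bool looks_like_unc(const std::string& path)
{
    // Shortest possible form is \\h\s, which is five characters. The path
    // must start with exactly two backslashes, followed by a host name.
    if (path.size() < 5 || path[0] != '\\' || path[1] != '\\' || path[2] == '\\')
    {
        return false;
    }

    // The host name ends at the next backslash, which must be followed by
    // at least one character of share name.
    const std::string::size_type hostEnd = path.find('\\', 2);
    if (hostEnd == std::string::npos || hostEnd + 1 >= path.size())
    {
        return false;
    }

    return path[hostEnd + 1] != '\\';
}
```

With this sketch, the example path \\HostName\ShareName\SomeDir\SomeFile.exe is accepted, whereas a drive-based path such as C:\SomeDir\SomeFile.exe is not.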
Another problem, or at best an inconvenience, was the difficulty in getting information about a single file or directory. The various language mappings had to have specific code to catch such a case, and split the path into "directory + file" to be passed to the Recls_Search() function. So 1.6 introduces the Recls_Stat() function, which takes the path of a file or directory and returns a file entry (recls_info_t) that represents that path, from which all the normal attributes (size, full path, directory parts, file extension, and the like) are available. Because it's now handled in a "proper" way, the special-case code throughout the rest of the implementation could be removed.
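The kind of special-case code that the language mappings previously needed can be pictured with a small sketch. The split_path() helper below is hypothetical (it is not taken from any recls mapping) and simply separates a path into the "directory + file" pair that would then have been handed to a search.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: splits a path at its last separator, giving the
// directory part (including the trailing separator) and the file part.
// A path without any separator is treated as a file in ".".
std::pair<std::string, std::string> split_path(const std::string& path)
{
    const std::string::size_type sep = path.find_last_of("/\\");

    if (sep == std::string::npos)
    {
        return std::make_pair(std::string("."), path);
    }

    return std::make_pair(path.substr(0, sep + 1), path.substr(sep + 1));
}
```

Having a single function that accepts the whole path removes the need for each mapping to carry its own variant of this split.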
Although neither of these additions represents a huge amount of code, it nonetheless adds to the amount in 1.6, and that should be borne in mind as I compare and contrast the sizes of the 1.5 and 1.6 versions of the library.
A major problem in refactoring C++ libraries is the likelihood of breaking client code. One of the good things about libraries that present a C API (like recls) is that you can completely change the internals, including the implementing class hierarchies, without impacting client code. It is indeed the case that between 1.5 and 1.6 the internal classes were subject to significant change. In 1.5, the class hierarchy looks like Listing 1. Note the separate definitions and implementations of the ReclsFileSearchDirectoryNode class, and the separate implementations of the ReclsFileSearch class. In 1.6, this was tidied up significantly by having the directory node classes share a common parent, and by the movement of more functionality of the search classes into the base class ReclsSearch; see Listing 2. This has the added benefit (at least to those who get a kick out of such things) that you can now define the ReclsSearchDirectoryNode overridden methods private in the ReclsFileSearchDirectoryNode and ReclsFtpSearchDirectoryNode classes.
These changes opened the door to a serious refactoring of all the search and directory node classes, such that the separate UNIX and Win32 implementations of the classes could be merged. Table 1 lists the names and sizes of all the files in the src directory in the recls distribution for Versions 1.5.3 and 1.6.2. Some files have morphed into others; some have been added; some removed. The differences in size are represented for each file, and for the directories as a whole.
As you can see, although there have been significant changes in individual files, the various new files (recls_string.h and the like) have resulted in the overall saving being only just over 24 KB (9.6 percent). Even taking into account the additions to the API, at best you might hope to have saved about 30 KB (~12 percent). I confess to being initially disappointed by this. But there are a large number of files, each of which has about 2 KB of license header and boilerplate file structure comments, namespace, and so on. Given that, it's going to be hard to pare down what is, redundant UNIX/Win32 classes aside, a reasonably tight C++ implementation. We must hope for better with the binary code sizes.
Rationalizing the Code
The effort spent rationalizing the code covered a full gamut of techniques, from the prosaic right through to the baroque. The first simplification was the dumping of the char_type member types from the implementation classes. I have a tendency to put these in automatically, but since the recls sources explicitly support only one character encoding within a given build, it is unnecessary and misleading: Changing ReclsFileSearchDirectoryNode::char_type to recls_char_t in several places saved a few bytes and also simplified the code in terms of readability.
Similar Class Definitions With Similar Components
The most major changes were to the ReclsFileSearchDirectoryNode class (which formerly had separate definitions and implementations for UNIX and Win32). The first files to undergo this coalescence were ReclsFileSearchDirectoryNode_unix.h/ReclsFileSearchDirectoryNode_win32.h. The only differences in these were their respective inclusions in Listings 3(a) and 3(b), and the member type definitions in Listings 3(c) and 3(d).
In refactoring terms, this is pretty much a gift. Just conditionally include the headers in Listing 4(a), use namespace aliasing [NS-ALIAS] like Listing 4(b) to define an implementation namespace for the structurally similar components, and finally, a little pre-processor platform discrimination to select UNIXSTL's glob_sequence [GLS] or WinSTL's find_file_sequence [FFS], as in Listing 4(c).
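The three ingredients just described (platform discrimination, namespace aliasing, and selection of the sequence type) can be sketched in a self-contained form. Everything here other than the RECLS_PLATFORM_IS_ symbol names and the glob_sequence/find_file_sequence class names is an assumption: the unix_impl and win32_impl namespaces are hypothetical stand-ins for the conditionally included UNIXSTL and WinSTL headers, and deriving the platform symbols from _WIN32 is just a convenience for this sketch.

```cpp
#include <string>

// Platform discrimination. Deriving the symbols from _WIN32 is an
// assumption of this sketch, not how recls defines them.
#if defined(_WIN32)
# define RECLS_PLATFORM_IS_WIN32
#else
# define RECLS_PLATFORM_IS_UNIX
#endif

// Hypothetical stand-ins for the platform libraries. In real code these
// would come from conditionally included headers.
namespace unix_impl
{
    struct glob_sequence
    {
        static const char* platform() { return "unix"; }
    };
}
namespace win32_impl
{
    struct find_file_sequence
    {
        static const char* platform() { return "win32"; }
    };
}

// Namespace aliasing: the single name platform_impl refers to whichever
// implementation namespace matches the build, and one typedef selects the
// structurally similar sequence class.
#if defined(RECLS_PLATFORM_IS_UNIX)
namespace platform_impl = unix_impl;
typedef platform_impl::glob_sequence        entry_sequence_type;
#elif defined(RECLS_PLATFORM_IS_WIN32)
namespace platform_impl = win32_impl;
typedef platform_impl::find_file_sequence   entry_sequence_type;
#endif

// Code from here on is written once, against entry_sequence_type.
std::string active_platform()
{
    return entry_sequence_type::platform();
}
```

The point of the arrangement is that the preprocessor is confined to a few lines near the top of the header; the class definition that follows never mentions a platform by name.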
A more significant task was coalescing the implementation files ReclsFileSearchDirectoryNode_unix.cpp and ReclsFileSearchDirectoryNode_win32.cpp. Although most of the contents were identical, there were some important differences. I used a process of evolutionary transmogrification, wherein the files are converged a step at a time, with compile/build/test at each stage. First, the lines are made as similar to each other as possible. For example, some code accessed the C-string value of path buffers by using the subscript operator, as in &buffer[0], whereas other code used string access shims, as in c_str_ptr(buffer). (The use of the subscript operator is thus clearly reserved for when it's appropriate: in the access of a pointer to nonconst for writing into the buffer.)
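The distinction between the two access styles can be illustrated with a small sketch. The path_buffer class and the c_str_ptr() overloads below are simplified, hypothetical stand-ins written for this example; they are not the STLSoft definitions, and the buffer capacity is an arbitrary choice.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical fixed-size path buffer, standing in for the library's own.
class path_buffer
{
public:
    enum { capacity = 260 };    // arbitrary size chosen for the sketch

    path_buffer() { m_data[0] = '\0'; }

    char&       operator [](std::size_t i)       { return m_data[i]; }
    const char& operator [](std::size_t i) const { return m_data[i]; }
    const char* c_str() const                    { return m_data; }

private:
    char m_data[capacity];
};

// Simplified string access shims: one name, overloaded per string type.
inline const char* c_str_ptr(const char* s)        { return s; }
inline const char* c_str_ptr(const std::string& s) { return s.c_str(); }
inline const char* c_str_ptr(const path_buffer& b) { return b.c_str(); }

// Reading: the shim reads the same way for every supported type.
template <typename S>
std::size_t length_of(const S& s)
{
    return std::strlen(c_str_ptr(s));
}

// Writing: the subscript operator yields a pointer to non-const storage.
inline void assign(path_buffer& buffer, const char* src)
{
    std::strncpy(&buffer[0], src, path_buffer::capacity - 1);
    buffer[path_buffer::capacity - 1] = '\0';
}
```

Once every read goes through the shim, lines that read from a buffer look the same regardless of the buffer's type, which is what makes a line-by-line comparison of two implementation files practical.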
The next step was, for each important difference, to have each file include the other's lines inside #if/#else/#endif conditional blocks, and rebuild and test; see Listing 5. Then the last step for the similar lines was to encapsulate them, still within the two files, in identical conditional blocks based on the presence of RECLS_PLATFORM_IS_ symbols, presenting identical blocks in both files, as in Listing 6. In this way, a great proportion of the outstanding differences between the two files could be resolved, verifying the source visually with a difference tool, and validating the runtime behavior by continually testing at each stage. This, then, left only a handful of meatier differences, which had to be inspected on a case-by-case basis, and surrounded in conditional blocks.
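The two conditional-block stages might look something like the following sketch. The function names and the path-separator example are hypothetical, chosen only because the difference is easy to see; the RECLS_PLATFORM_IS_ symbols are named in the text, but defining them from _WIN32 is an assumption made so the sketch compiles on its own.

```cpp
#if defined(_WIN32)
# define RECLS_PLATFORM_IS_WIN32
#else
# define RECLS_PLATFORM_IS_UNIX
#endif

// Intermediate stage, as it might appear in the UNIX file: the other
// file's line is copied in, but kept inactive by a constant condition.
// The Win32 file would hold the same two lines with the condition flipped.
inline char path_name_separator_intermediate()
{
#if 1   /* UNIX line, active in this file */
    return '/';
#else   /* Win32 line, copied in from the other file */
    return '\\';
#endif
}

// Final stage: the block is textually identical in both files, so a
// difference tool reports nothing here, and the platform symbol decides.
inline char path_name_separator()
{
#if defined(RECLS_PLATFORM_IS_UNIX)
    return '/';
#elif defined(RECLS_PLATFORM_IS_WIN32)
    return '\\';
#else
# error Platform not recognized
#endif
}
```

Because each stage changes only the conditional scaffolding and never the active line, a rebuild and test after every stage should show identical behavior.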
Remember that the coalescence of these files is driven by two factors. First, the size of recls (both source and object) is larger than you would expect or desire for a library of its scope. Second, they were originally written entirely separately in order to avoid the preprocessor spaghetti that so often occurs in cross-platform code bases. But because of the close, but not perfect, structural conformance between the (STLSoft) components used for enumerating filesystem entities on UNIX and Win32, and the consequent high degree of similarity in the code bases, this second concern is largely moot. Only in a few places in the coalesced code base are the hairy fingers of the preprocessor felt.
Search Class Implementation
A similar transformation, albeit smaller in scope, occurred with the search class(es). The formerly fully abstract ReclsSearch class was expanded and given the implementation of the (identical) GetNext(), GetNextDetails(), GetDetails(), and GetLastError() from the ReclsFileSearch and ReclsFtpSearch classes. Note that, contrary to received wisdom, ReclsSearch was given two protected data members, shared by ReclsFileSearch and ReclsFtpSearch, so that the four functions could be pushed into the base class. That's valid to do in this case because these classes are insulated from the outside world by the C API, and this tactic enabled us to save code and data space. In normal circumstances, you should avoid protected data in classes, as it introduces coupling between base and derived classes.
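A heavily simplified sketch of this arrangement follows. The class names, the choice of protected members (m_entries and m_lastError), the return types, and the error value are all hypothetical; the text does not say what the two protected members of ReclsSearch actually are, so this shows only the general shape of pushing identical methods up into a base class that owns the shared data.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical base class: names and members are illustrative only.
class SearchBase
{
public:
    virtual ~SearchBase() {}

    // Implemented once here instead of identically in each derived class.
    bool GetNext()
    {
        if (m_position + 1 < m_entries.size())
        {
            ++m_position;
            m_lastError = 0;
            return true;
        }
        m_lastError = 1;    // illustrative "no more data" code
        return false;
    }

    // Precondition for this sketch: the derived class added an entry.
    const std::string& GetDetails() const { return m_entries[m_position]; }
    int GetLastError() const              { return m_lastError; }

protected:
    SearchBase() : m_lastError(0), m_position(0) {}

    // Two protected data members, written by the derived classes and read
    // by the shared methods above.
    std::vector<std::string> m_entries;
    int                      m_lastError;

private:
    std::size_t m_position;
};

// The derived classes now only need to supply their data.
class FileSearch : public SearchBase
{
public:
    FileSearch()
    {
        m_entries.push_back("a.txt");
        m_entries.push_back("b.txt");
    }
};

class FtpSearch : public SearchBase
{
public:
    FtpSearch() { m_entries.push_back("ftp://host/a.txt"); }
};
```

The coupling cost is visible even in this small example: any derived class can modify m_entries or m_lastError at any time, which is tolerable only because nothing outside the library can derive from these classes.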
File-scope type definitions (for example, file_path_buffer_t) were changed to be defined in terms of member types of ReclsFileSearchDirectoryNode. The Win32 implementation contained the find_directory_0() function, which finds the first directory in the path, whether UNC, or "drive + directory," or directory; this was resolved by adding a find_directory_0() for UNIX (in recls_util_unix.cpp) that simply returns the string passed to it. Finally, the formerly separate UNIX and Win32 implementation files were coalesced.
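The idea of giving UNIX a trivial counterpart can be sketched as follows. The signatures, the _unix and _win32 suffixes, and the body of the Win32 variant are assumptions made for this illustration (the real functions share the single name find_directory_0() and their exact behavior is not given here); the only detail taken from the text is that the UNIX version returns the string it was passed.

```cpp
#include <cctype>
#include <cstring>

// UNIX flavor: a path already starts at its first directory, so the
// argument is returned unchanged.
inline const char* find_directory_0_unix(const char* path)
{
    return path;
}

// Win32 flavor, simplified and hypothetical: skips a UNC
// \\HostName\ShareName prefix or a drive specifier, if either is present.
inline const char* find_directory_0_win32(const char* path)
{
    if (path[0] == '\\' && path[1] == '\\')
    {
        // UNC: step over the host name, then over the share name.
        const char* share = std::strchr(path + 2, '\\');
        if (share != 0)
        {
            const char* dir = std::strchr(share + 1, '\\');
            if (dir != 0)
            {
                return dir;
            }
        }
        return path + std::strlen(path);    // no directory part present
    }

    if (std::isalpha(static_cast<unsigned char>(path[0])) && path[1] == ':')
    {
        return path + 2;                    // step over "X:"
    }

    return path;
}
```

With both platforms offering a function of the same name and shape, the calling code no longer needs a conditional block at the call site.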
API refactoring was straightforward. First, there were some vestigial using declarations that could be immediately dispensed with. Other platform-specific includes and using declarations were taken care of by the changes in ReclsFileSearchDirectoryNode.hpp. Several API functions, such as Recls_IsFileReadOnly(), that were defined separately in recls_api_unix.cpp and recls_api_win32.cpp, were moved to recls_api.cpp and their definitions coalesced, with a single point of difference handled by preprocessor discrimination of RECLS_PLATFORM_IS_UNIX and RECLS_PLATFORM_IS_WIN32. The roots API functions were moved from the recls_roots_unix/win32.cpp files to recls_api_unix/win32.cpp.
One nontrivial change involved the removal of significant parts of the Recls_Search() implementations that were there to provide a foundation for enumeration of other recursive information trees, such as FTP, and Win32 Registry; these features are now going to be deferred until recls 2.0. This allows Recls_Search() itself to be made entirely platform independent and moved into recls_api.cpp.
The other significant change was to move the code for validating multipart patterns (they must not contain "." or "..") from recls_api.cpp and recls_ftp_api_win32.cpp into the function IsValidPattern() in recls_util.cpp.
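A validation of this kind could be sketched as below. The function name, its signature, and the use of '|' as the default separator between pattern parts are assumptions for this illustration and are not taken from the recls sources; the rule is also interpreted here as "no individual part may be exactly "." or ".."", since an ordinary pattern such as *.cpp necessarily contains a dot.

```cpp
#include <string>

// Hypothetical sketch, not the recls IsValidPattern(): returns false if
// any part of a multipart pattern is exactly "." or "..".
bool is_valid_multipart_pattern(const std::string& pattern, char separator = '|')
{
    std::string::size_type begin = 0;

    for (;;)
    {
        const std::string::size_type end = pattern.find(separator, begin);
        const std::string part =
            (end == std::string::npos)
                ? pattern.substr(begin)
                : pattern.substr(begin, end - begin);

        if (part == "." || part == "..")
        {
            return false;
        }
        if (end == std::string::npos)
        {
            return true;    // last part checked
        }

        begin = end + 1;    // move past the separator
    }
}
```

Centralizing the check in one utility function means the file and FTP entry points cannot drift apart in what they accept.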
The file-entry code was rationalized by separating the file-entry functions, which were moved into recls_fileinfo.cpp, from the atomic functions, which remained in recls_fileinfo_unix/win32.cpp. The header recls_atomic.h was introduced, representing an abstraction of the RC_Increment(), RC_PreDecrement(), and RC_ReadValue() functions for UNIX and Win32, and containing the common features extracted from recls_fileinfo_unix.cpp and recls_fileinfo_win32.cpp. Although this didn't save much code size, it did nicely clear up the delineation between the atomic operations and the file-entry operations.
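The shape of such an abstraction can be suggested with a short sketch. Only the three function names come from the text; the signatures, the rc_atomic_t type, and the return values are assumptions, and the sketch deliberately swaps in C++11's std::atomic for whatever platform-specific primitives the real header wraps.

```cpp
#include <atomic>

// Hypothetical counter type. The real header presumably wraps platform
// primitives; std::atomic is used here so the sketch is portable.
typedef std::atomic<long> rc_atomic_t;

// Atomically adds one to the reference count.
inline void RC_Increment(rc_atomic_t* rc)
{
    rc->fetch_add(1);
}

// Atomically subtracts one and returns the new value, so a caller can
// test for zero to decide whether to release the file entry.
inline long RC_PreDecrement(rc_atomic_t* rc)
{
    return rc->fetch_sub(1) - 1;
}

// Atomically reads the current reference count.
inline long RC_ReadValue(rc_atomic_t* rc)
{
    return rc->load();
}
```

Keeping these three operations behind one header means the file-entry code can be written once, with the platform difference confined to the implementation of the counter.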