Footprints in the Butter: Part I

Matthew kicks off a two-part series on code bloat (source size, object size, and executable size) and on finding its sources and rooting them out.


July 01, 2005
URL:http://www.drdobbs.com/footprints-in-the-butter-part-i/184401985

July, 2005: Footprints in the Butter: Part I

Matthew Wilson is a software development consultant for Synesis Software, creator of the STLSoft libraries, and author of Imperfect C++ (Addison-Wesley, 2004). He can be contacted at http://stlsoft.org/.


Q: How do you know there's an elephant in your refrigerator?
A: Footprints in the butter.

Because "Positive Integration" focuses on real projects whose evolution involves dealing with general problems and issues, I seem to be spending a fair amount of time discussing that evolution. Although this is somewhat outside the official charter of the column, the issues are worth examining because they bear on any language-integration project that involves evolving libraries. As suggested by the classic "Footprints in the Butter" joke, this two-part column is all about code bloat (source size, object size, and executable size), and about finding its sources and rooting them out. This month, I examine the source changes. In the next installment, I look at issues involving binary size.

In addition to adding a recls/Python mapping and other features as part of the evolution of the recls library from 1.5 to 1.6, I've had occasion to have a good go at refactoring the code to redress a fair amount of redundancy between the UNIX and Win32 implementations. I've also made a concerted effort, albeit only a first one, to get the elephant out of the fridge.

New Functionality

Prior to Version 1.5.3, recls did not support Universal Naming Convention (UNC) paths (\\HostName\ShareName\SomeDir\SomeFile.exe, for instance) and they broke it in unpleasant ways. This has been completely addressed in 1.6 in terms of correctly parsing UNC paths, and in the addition of the new Recls_IsFileUNC() function to the API.
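
To make the UNC case concrete, here is a minimal sketch of the kind of test that a function such as Recls_IsFileUNC() could rest on. The helper is_unc_path() and its exact rules are assumptions made for illustration, not the recls implementation: it accepts a path that begins with two backslashes followed by a non-empty host name, a backslash, and a non-empty share name.

```cpp
#include <cstring>

// Hypothetical helper (not part of recls): returns true if 'path' has the
// form \\HostName\ShareName[\...].
bool is_unc_path(char const* path)
{
    if (path == nullptr || path[0] != '\\' || path[1] != '\\')
    {
        return false; // does not start with two backslashes
    }

    char const* host = path + 2;
    char const* sep  = std::strchr(host, '\\');
    if (sep == nullptr || sep == host)
    {
        return false; // no share separator, or an empty host name
    }

    char const* share = sep + 1;
    return *share != '\0' && *share != '\\'; // share name must be non-empty
}
```

A drive-based path such as C:\SomeDir\SomeFile.exe fails the first test, and a bare \\HostName fails the second, so neither would be parsed as UNC.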

Another problem, or at best an inconvenience, was the difficulty in getting information about a single file or directory. The various language mappings needed specific code to catch this case and split the path into "directory + file" to be passed to the Recls_Search() function. So 1.6 introduces the Recls_Stat() function, which takes the path of a file or directory and returns a file entry (recls_info_t) that represents that path, from which all the normal attributes (size, full path, directory parts, file extension, and the like) are available. Because the case is now handled in a "proper" way, the special-case code throughout the rest of the implementation could be removed.
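
For a sense of what that special-case code involved, the following sketch shows a "directory + file" split of the sort a mapping would have needed before handing the two halves to a search. The function split_path() is hypothetical and is not taken from any of the recls mappings.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: splits "dir/file" into {"dir/", "file"}. A path with
// no directory part is treated as relative to the current directory.
std::pair<std::string, std::string> split_path(std::string const& path)
{
    std::string::size_type const pos = path.find_last_of("/\\");

    if (pos == std::string::npos)
    {
        return std::make_pair(std::string("."), path);
    }

    return std::make_pair(path.substr(0, pos + 1), path.substr(pos + 1));
}
```

With a single-path query function, a caller passes the path as-is and none of this logic has to be repeated in each language mapping.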

Although neither of these additions represents a huge amount of code, each nonetheless adds to the amount in 1.6, and that should be borne in mind as I compare and contrast the sizes of the 1.5 and 1.6 versions of the library.

Source Changes

A major problem in refactoring C++ libraries is the likelihood of breaking client code. One of the good things about libraries that present a C API (like recls) is that you can completely change the internals, including the implementing class hierarchies, without impacting client code. It is indeed the case that between 1.5 and 1.6 the internal classes were subject to significant change. In 1.5, the class hierarchy looks like Listing 1. Note the separate definitions and implementations of the ReclsFileSearchDirectoryNode class, and the separate implementations of the ReclsFileSearch class. In 1.6, this was tidied up significantly by having the directory node classes share a common parent, and by the movement of more functionality of the search classes into the base class ReclsSearch; see Listing 2. This has the added benefit (at least to those who get a kick out of such things) that you can now define the ReclsSearchDirectoryNode overridden methods private in the ReclsFileSearchDirectoryNode and ReclsFtpSearchDirectoryNode classes.
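
The insulation a C API provides, and the private-override trick, can be sketched in a few lines. Everything here (the Demo_ functions, the handle type, and the two classes) is hypothetical and much simpler than recls; the point is only that clients see an opaque handle, so the classes behind it can be reshaped freely.

```cpp
// ---- Public face: a C API over an opaque handle ----
extern "C"
{
    typedef struct demo_search_t_* demo_search_t;   // opaque to clients

    demo_search_t Demo_Open(int numEntries);
    int           Demo_GetNext(demo_search_t h);    // 1 if an entry was available, else 0
    void          Demo_Close(demo_search_t h);
}

// ---- Private implementation: free to change between versions ----
namespace
{
    struct DirectoryNode
    {
        virtual ~DirectoryNode() {}
        virtual bool GetNext() = 0;
    };

    class FileDirectoryNode : public DirectoryNode
    {
    public:
        explicit FileDirectoryNode(int n) : m_remaining(n) {}
    private:
        // The override can be private: callers only ever hold a DirectoryNode*.
        virtual bool GetNext()
        {
            if (m_remaining <= 0)
            {
                return false;
            }
            --m_remaining;
            return true;
        }
    private:
        int m_remaining;
    };
}

extern "C" demo_search_t Demo_Open(int numEntries)
{
    DirectoryNode* node = new FileDirectoryNode(numEntries);
    return reinterpret_cast<demo_search_t>(node);
}

extern "C" int Demo_GetNext(demo_search_t h)
{
    return reinterpret_cast<DirectoryNode*>(h)->GetNext() ? 1 : 0;
}

extern "C" void Demo_Close(demo_search_t h)
{
    delete reinterpret_cast<DirectoryNode*>(h);
}
```

Renaming FileDirectoryNode, or giving it a different parent, changes nothing that client code can observe through the three functions.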

These changes opened the door to a serious refactoring of all the search and directory node classes, such that the separate UNIX and Win32 implementations of the classes could be merged. Table 1 lists the names and sizes of all the files in the src directory in the recls distribution for Versions 1.5.3 and 1.6.2. Some files have morphed into others; some have been added; some removed. The differences in size are represented for each file, and for the directories as a whole.

As you can see, although there have been significant changes in individual files, the various new files (recls_string.h and the like) mean that the overall saving is only just over 24 KB (9.6 percent). Even taking into account the additions to the API, at best you might hope to have saved about 30 KB (~12 percent). I confess to being initially disappointed by this. But there are a large number of files, each of which has about 2 KB of license header and boilerplate: file structure comments, namespace, and so on. Given that, it's going to be hard to pare down what is, redundant UNIX/Win32 classes aside, a reasonably tight C++ implementation. We must hope for better with the binary code sizes.
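
The arithmetic behind these figures can be checked directly from the totals in Table 1. The sketch below also estimates the boilerplate share, under two assumptions that are not stated exactly in the text: that "about 2 KB" means 2048 bytes, and that it applies to each of the 28 files listed for 1.6.2.

```cpp
// Totals taken from the last row of Table 1, in bytes.
long const total_1_5_3 = 257822;
long const total_1_6_2 = 233004;

// Number of 1.6.2 files listed in Table 1.
int const files_1_6_2 = 28;

long bytes_saved()
{
    return total_1_5_3 - total_1_6_2;                 // 24818 bytes, just over 24 KB
}

double percent_saved()
{
    return 100.0 * bytes_saved() / total_1_5_3;       // about 9.6 percent
}

// Assumption: "about 2 KB" of header and boilerplate per file is 2048 bytes.
long boilerplate_estimate()
{
    return files_1_6_2 * 2048L;                       // 57344 bytes
}

double boilerplate_percent()
{
    return 100.0 * boilerplate_estimate() / total_1_6_2;
}
```

On those assumptions, roughly a quarter of the 1.6.2 source is per-file boilerplate that no amount of refactoring within a file will remove.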

Rationalizing the Code

The effort spent rationalizing the code covered a full gamut of techniques, from the prosaic right through to the baroque. The first simplification was the dumping of the char_type member types from the implementation classes. I have a tendency to put these in automatically, but since the recls sources explicitly support only one character encoding within a given build, they are unnecessary and misleading: Changing ReclsFileSearchDirectoryNode::char_type to recls_char_t in several places saved a few bytes and also made the code easier to read.

Similar Class Definitions With Similar Components

The biggest changes were to the ReclsFileSearchDirectoryNode class (which formerly had separate definitions and implementations for UNIX and Win32). The first files to undergo this coalescence were ReclsFileSearchDirectoryNode_unix.h/ReclsFileSearchDirectoryNode_win32.h. The only differences in these were their respective inclusions in Listings 3(a) and 3(b), and the member type definitions in Listings 3(c) and 3(d).

In refactoring terms, this is pretty much a gift. Just conditionally include the headers as in Listing 4(a), use namespace aliasing [NS-ALIAS] to define an implementation namespace for the structurally similar components as in Listing 4(b), and finally apply a little preprocessor platform discrimination to select UNIXSTL's glob_sequence [GLS] or WinSTL's basic_findfile_sequence [FFS], as in Listing 4(c).

A more significant task was coalescing the implementation files ReclsFileSearchDirectoryNode_unix.cpp and ReclsFileSearchDirectoryNode_win32.cpp. Although most of the contents were identical, there were some important differences. I used a process of evolutionary transmogrification, wherein the files are converged a step at a time, with compile/build/test at each stage. First, the lines are made as similar to each other as possible. For example, some code accessed the C-string value of path buffers by using the subscript operator, &buffer[0], whereas other code used string access shims, c_str_ptr(buffer). Such lines were made consistent by using the shim wherever the buffer is only read. (The use of the subscript operator is thus clearly reserved for when it's appropriate: in the access of a pointer to nonconst for writing into the buffer.)
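
The distinction between the two access styles can be illustrated with a cut-down shim. The demo_shims namespace and the helpers path_length() and set_path() are hypothetical; STLSoft's real c_str_ptr shims cover many more types and are not reproduced here.

```cpp
#include <cstring>
#include <string>

namespace demo_shims // hypothetical stand-in for a library's shim namespace
{
    // One overload per string-like type; each yields a read-only C string.
    inline char const* c_str_ptr(char const* s)        { return s; }
    inline char const* c_str_ptr(std::string const& s) { return s.c_str(); }
}

// Read-only access goes through the shim, so the same line of code works
// for any string type that has a c_str_ptr() overload.
template <typename S>
std::size_t path_length(S const& path)
{
    return std::strlen(demo_shims::c_str_ptr(path));
}

// Writing needs a pointer to nonconst, which is where &buffer[0] remains
// the appropriate form.
template <std::size_t N>
void set_path(char (&buffer)[N], char const* src)
{
    std::strncpy(&buffer[0], src, N - 1);
    buffer[N - 1] = '\0';
}
```

Once every read goes through the shim, lines that differed only in how they obtained the C string become textually identical in both files.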

The next step was, for each important difference, to have each file include the other's lines inside #if/#else/#endif conditional blocks, and rebuild and test; see Listing 5. Then the last step for the similar lines was to encapsulate them, still within the two files, in identical conditional blocks based on the presence of RECLS_PLATFORM_IS_ symbols, presenting identical blocks in both files, as in Listing 6. In this way, a great proportion of the outstanding differences between the two files could be resolved, verifying the source visually with a difference tool, and validating the runtime behavior by continually testing at each stage. This, then, left only a handful of meatier differences, which had to be inspected on a case-by-case basis, and surrounded in conditional blocks.

Remember that the coalescence of these files is driven by two factors. First, the size of recls, both source and object, is larger than you would expect or desire for a library of its scope. Second, the files were originally written entirely separately in order to avoid the preprocessor spaghetti that so often occurs in cross-platform code bases. But because of the close, but not perfect, structural conformance between the (STLSoft) components used for enumerating filesystem entities on UNIX and Win32, and the consequent high degree of similarity in the code bases, this second concern is largely moot. Only in a few places in the coalesced code base are the hairy fingers of the preprocessor felt.

Search Class Implementation

A similar transformation, albeit smaller in scope, occurred with the search class(es). The formerly fully abstract ReclsSearch class was expanded and given the implementation of the (identical) GetNext(), GetNextDetails(), GetDetails(), and GetLastError() from the ReclsFileSearch and ReclsFtpSearch classes. Note that, contrary to received wisdom, ReclsSearch was given two protected data members, shared by ReclsFileSearch and ReclsFtpSearch, so that the four functions could be pushed into the base class. That's valid to do in this case because these classes are insulated from the outside world by the C API, and this tactic enabled us to save code and data space. In normal circumstances, you should avoid protected data in classes, as it introduces coupling between base and derived classes.
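
The shape of that change can be sketched as follows. The class and member names are simplified stand-ins rather than the recls classes, and the node type is a toy that merely counts; what the sketch shows is a base class implementing the shared functions once, in terms of two protected members that the derived classes also use.

```cpp
enum { SEARCH_OK = 0, SEARCH_NO_MORE_DATA = 1 }; // hypothetical result codes

// Toy node that yields a fixed number of entries, numbered from 1.
class CountingNode
{
public:
    explicit CountingNode(int count) : m_count(count), m_current(0) {}
    bool GetNext()
    {
        if (m_current >= m_count)
        {
            return false;
        }
        ++m_current;
        return true;
    }
    int Current() const { return m_current; }
private:
    int m_count;
    int m_current;
};

class Search
{
public:
    virtual ~Search() { delete m_dnode; }

    // Identical for every kind of search, so implemented once, here.
    int GetNext()
    {
        m_lastError = (m_dnode != nullptr && m_dnode->GetNext())
                          ? SEARCH_OK : SEARCH_NO_MORE_DATA;
        return m_lastError;
    }
    int GetDetails() const   { return m_dnode->Current(); }
    int GetLastError() const { return m_lastError; }

    Search(Search const&) = delete;
    Search& operator =(Search const&) = delete;

protected:
    explicit Search(CountingNode* dnode) : m_dnode(dnode), m_lastError(SEARCH_OK) {}

    // The two protected members shared with the derived classes.
    CountingNode* m_dnode;
    int           m_lastError;
};

class FileSearch : public Search
{
public:
    explicit FileSearch(int numFiles) : Search(new CountingNode(numFiles)) {}

    // A derived class manipulating the shared state directly.
    void Restart(int numFiles)
    {
        delete m_dnode;
        m_dnode     = new CountingNode(numFiles);
        m_lastError = SEARCH_OK;
    }
};
```

The coupling is plain to see in Restart(), which is why this layout is only comfortable when, as here, nothing outside the library can derive from the base class.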

File-scope type definitions (for example, file_path_buffer_t) were changed to be defined in terms of member types of ReclsFileSearchDirectoryNode. The Win32 implementation contained the find_directory_0() function, which finds the first directory in the path, whether UNC, or "drive + directory," or directory; this was resolved by adding a find_directory_0() for UNIX (in recls_util_unix.cpp) that simply returns the string passed to it. Finally, the formerly separate UNIX and Win32 implementation files were coalesced.
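
A pair of functions with the behavior described might look like the sketch below. The names are suffixed to make clear that these are hypothetical stand-ins; the real find_directory_0() signature and edge-case handling are not shown in this article and may well differ.

```cpp
#include <cstring>

// UNIX stand-in: there is no drive or share prefix to skip.
char const* find_directory_0_unix(char const* path)
{
    return path;
}

// Win32 stand-in: skips a \\Host\Share prefix or a drive specifier.
char const* find_directory_0_win32(char const* path)
{
    if (path[0] == '\\' && path[1] == '\\')
    {
        // UNC: step over the host name, then over the share name.
        char const* p = std::strchr(path + 2, '\\');
        if (p != nullptr)
        {
            p = std::strchr(p + 1, '\\');
        }
        return (p != nullptr) ? p : path + std::strlen(path);
    }

    if (path[0] != '\0' && path[1] == ':')
    {
        return path + 2; // "drive + directory"
    }

    return path; // plain directory
}
```

Because both platforms then offer a function of the same name and shape, the calling code needs no conditional compilation at the point of use.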

API Refactoring

API refactoring was straightforward. First, there were some vestigial using declarations that could be immediately dispensed with. Other platform-specific includes and using declarations were taken care of by the changes in ReclsFileSearchDirectoryNode.hpp. Several API functions, such as Recls_IsFileReadOnly(), that were defined separately in recls_api_unix.cpp and recls_api_win32.cpp, were moved to recls_api.cpp and their definitions coalesced, with a single point of difference handled by preprocessor discrimination of RECLS_PLATFORM_IS_UNIX and RECLS_PLATFORM_IS_WIN32. The roots API functions were moved from the recls_roots_unix/win32.cpp files to recls_api_unix/win32.cpp.
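
A coalesced function with a single point of difference can be sketched like this. The DEMO_ symbols stand in for the RECLS_PLATFORM_IS_ ones, the demo_entry structure is hypothetical, and treating "owner write permission absent" as the UNIX meaning of read-only is an assumption made for the example, not a statement about recls.

```cpp
// Stand-ins for the RECLS_PLATFORM_IS_ symbols, derived here from the
// compiler's own predefined macro.
#if defined(_WIN32)
# define DEMO_PLATFORM_IS_WIN32
#else
# define DEMO_PLATFORM_IS_UNIX
#endif

// Hypothetical file entry carrying the platform's native attribute word.
struct demo_entry
{
    unsigned long attributes;
};

// One definition for both platforms; only the test itself differs.
bool demo_is_read_only(demo_entry const& e)
{
#if defined(DEMO_PLATFORM_IS_UNIX)
    return 0 == (e.attributes & 0200ul);   // S_IWUSR bit is clear
#elif defined(DEMO_PLATFORM_IS_WIN32)
    return 0 != (e.attributes & 0x0001ul); // FILE_ATTRIBUTE_READONLY is set
#else
# error Platform not recognized
#endif
}

// Test helper: builds an entry with native attribute values.
demo_entry make_entry(bool readOnly)
{
#if defined(DEMO_PLATFORM_IS_UNIX)
    demo_entry const e = { readOnly ? 0444ul : 0644ul };
#else
    demo_entry const e = { readOnly ? 0x0001ul : 0x0080ul };
#endif
    return e;
}
```

Everything outside the conditional block (the signature, any argument checking, the return path) exists once, which is where the source-size saving comes from.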

One nontrivial change involved the removal of significant parts of the Recls_Search() implementations that were there to provide a foundation for enumeration of other recursive information trees, such as FTP and the Win32 Registry; these features are now going to be deferred until recls 2.0. This allows Recls_Search() itself to be made entirely platform independent and moved into recls_api.cpp.

The other significant change was to move the code for validating multipart patterns—they must not contain "." or ".."—from recls_api.cpp and recls_ftp_api_win32.cpp into the function IsValidPattern() in recls_util.cpp.
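
A validator along these lines is short enough to sketch. The function name is_valid_pattern(), the '|' separator, and the reading of the rule as "no individual part may be exactly . or .." are all assumptions for illustration; the article does not show the body of IsValidPattern().

```cpp
#include <string>

// Hypothetical validator: rejects a multipart pattern if any of its parts
// is exactly "." or "..".
bool is_valid_pattern(std::string const& pattern, char sep = '|')
{
    std::string::size_type begin = 0;

    for (;;)
    {
        std::string::size_type const end = pattern.find(sep, begin);
        std::string const part = (end == std::string::npos)
                                     ? pattern.substr(begin)
                                     : pattern.substr(begin, end - begin);

        if (part == "." || part == "..")
        {
            return false;
        }
        if (end == std::string::npos)
        {
            return true; // every part has been checked
        }

        begin = end + 1;
    }
}
```

Keeping the check in one utility function means the file and FTP entry points cannot drift apart in what they accept.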

File-Entry Refactoring

The file-entry code was rationalized by separating the file-entry functions, which were moved into recls_fileinfo.cpp, from the atomic functions, which remained in recls_fileinfo_unix/win32.cpp. The header recls_atomic.h was introduced, representing an abstraction of the RC_Increment(), RC_PreDecrement(), and RC_ReadValue() functions for UNIX and Win32, and containing the common features extracted from recls_fileinfo_unix.cpp and recls_fileinfo_win32.cpp. Although this didn't save much code size, it did nicely clear up the delineation of the code between the atomic operations and the file-entry operations.
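
The three operations amount to a small reference-counting interface. The sketch below expresses the same idea with std::atomic, which postdates the code discussed here and is used only as a portable stand-in; the demo namespace, the function names, and their signatures are assumptions, not the contents of recls_atomic.h.

```cpp
#include <atomic>

namespace demo // hypothetical counterparts of the three operations
{
    typedef std::atomic<long> rc_atomic_t;

    // Adds one reference.
    inline void rc_increment(rc_atomic_t* rc)
    {
        rc->fetch_add(1);
    }

    // Drops one reference and returns the count after the decrement.
    inline long rc_pre_decrement(rc_atomic_t* rc)
    {
        return rc->fetch_sub(1) - 1;
    }

    // Reads the current count.
    inline long rc_read_value(rc_atomic_t const* rc)
    {
        return rc->load();
    }

    // Typical use: the caller that takes the count to zero frees the entry.
    inline bool is_last_release(rc_atomic_t* rc)
    {
        return 0 == rc_pre_decrement(rc);
    }
}
```

With the platform-specific primitive hidden behind three such functions, the file-entry code that copies and releases entries has no reason to differ between UNIX and Win32.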

Listing 1

struct ReclsSearch
  -> class ReclsFileSearch (single defn; UNIX and Win32 impls)
  -> class ReclsFtpSearch (Win32 only)
class ReclsFileSearchDirectoryNode (UNIX defn and impl)
class ReclsFileSearchDirectoryNode (Win32 defn and impl)
class ReclsFtpSearchDirectoryNode (Win32 only)

Listing 2

struct ReclsSearch
  -> class ReclsFileSearch
  -> class ReclsFtpSearch (Win32 only)
struct ReclsSearchDirectoryNode
  -> class ReclsFileSearchDirectoryNode
  -> class ReclsFtpSearchDirectoryNode (Win32 only)

Listing 3

(a)
#include <unixstl.h>
#include <unixstl_filesystem_traits.h>
#include <unixstl_glob_sequence.h>


(b)
#include <winstl.h>
#include <winstl_filesystem_traits.h>
#include <winstl_findfile_sequence.h>


(c)
class ReclsFileSearchDirectoryNode
{
public:
  typedef unixstl::filesystem_traits<recls_char_t>      traits_type;
  typedef unixstl::basic_file_path_buffer<recls_char_t> file_path_buffer_type;
private:
  typedef stlsoft::basic_simple_string<recls_char_t>    string_type;
  typedef unixstl::glob_sequence                        file_find_sequence_type;
  ...


(d)
class ReclsFileSearchDirectoryNode
{
public:
  typedef winstl::filesystem_traits<recls_char_t>       traits_type;
  typedef winstl::basic_file_path_buffer<recls_char_t>  file_path_buffer_type;
private:
  typedef stlsoft::basic_simple_string<recls_char_t>    string_type;
  typedef winstl::basic_findfile_sequence<recls_char_t, traits_type>
                                                        file_find_sequence_type;
  ...

Listing 4

(a)
#if defined(RECLS_PLATFORM_IS_UNIX)
# include <unixstl.h>
# include <unixstl_filesystem_traits.h>
# include <unixstl_glob_sequence.h>
namespace platform_stl = ::unixstl;
#elif defined(RECLS_PLATFORM_IS_WIN32)
# include <winstl.h>
# include <winstl_filesystem_traits.h>
# include <winstl_findfile_sequence.h>
namespace platform_stl = ::winstl;
#else /* unrecognized platform */
# error The platform is not recognized
#endif /* platform */


(b)
 ...
class ReclsFileSearchDirectoryNode
{
public:
  typedef platform_stl::filesystem_traits<recls_char_t>       traits_type;
  typedef platform_stl::basic_file_path_buffer<recls_char_t> 
                                                    file_path_buffer_type;
private:
  typedef stlsoft::basic_simple_string<recls_char_t>          string_type;
  ...


(c)
 ...
#if defined(RECLS_PLATFORM_IS_UNIX)
  typedef unixstl::glob_sequence  file_find_sequence_type;
#elif defined(RECLS_PLATFORM_IS_WIN32)
  typedef winstl::basic_findfile_sequence< recls_char_t, traits_type>
                                  file_find_sequence_type;
#else /* unrecognized platform */
# error The platform is not recognized
#endif /* platform */
  ...

Listing 5

// In ReclsFileSearchDirectoryNode_unix.cpp
#if 1
  m_dnode = . . . // UNIX form of statement
#else
  m_dnode = . . . // Win32 form of statement - introduced here
#endif

// In ReclsFileSearchDirectoryNode_win32.cpp
#if 0
  m_dnode = . . . // UNIX form of statement - introduced here
#else
  m_dnode = . . . // Win32 form of statement
#endif

Listing 6

#if defined(RECLS_PLATFORM_IS_UNIX)
  m_dnode = . . . // UNIX form
#elif defined(RECLS_PLATFORM_IS_WIN32)
  m_dnode = . . . // Win32 form
#else /* ? RECLS_PLATFORM_IS_??? */
# error Platform not recognized
#endif /* RECLS_PLATFORM_IS_??? */

Table 1: recls src directory contents.

recls 1.5.3 file                          Bytes  recls 1.6.2 file                         Bytes    Delta
EntryFunctions.h                           3695  EntryFunctions.h                          3623      -72
ReclsFileSearchDirectoryNode_unix.cpp     22625  -                                             -   -22625
ReclsFileSearchDirectoryNode_unix.h        7519  -                                             -    -7519
ReclsFileSearchDirectoryNode_win32.cpp    23161  ReclsFileSearchDirectoryNode.cpp         27611     4450
ReclsFileSearchDirectoryNode_win32.h       7522  ReclsFileSearchDirectoryNode.hpp          8633     1111
ReclsFileSearch_unix.cpp                  10027  -                                             -   -10027
ReclsFileSearch_win32.cpp                 10403  ReclsFileSearch.cpp                      10375      -28
ReclsFileSearch.h                          4939  ReclsFileSearch.hpp                       4837     -102
-                                             -  ReclsSearch.cpp                           4337     4337
-                                             -  ReclsSearch.hpp                           5620     5620
ReclsFtpSearch.h                           5322  ReclsFtpSearch.hpp                        5039     -283
ReclsFtpSearchDirectoryNode_win32.cpp     24781  ReclsFtpSearchDirectoryNode_win32.cpp    22791    -1990
ReclsFtpSearchDirectoryNode_win32.h        8048  ReclsFtpSearchDirectoryNode_win32.hpp     8740      692
ReclsFtpSearch_win32.cpp                  13725  ReclsFtpSearch_win32.cpp                 12175    -1550
recls_api.cpp                             15639  recls_api.cpp                            28968    13329
recls_api_unix.cpp                        13519  recls_api_unix.cpp                        3764    -9755
recls_api_win32.cpp                       14918  recls_api_win32.cpp                       6283    -8635
-                                             -  recls_atomic.h                            2733     2733
-                                             -  recls_debug.h                             6235     6235
recls_fileinfo.cpp                         2694  recls_fileinfo.cpp                        5899     3205
recls_fileinfo_unix.cpp                    9341  recls_fileinfo_unix.cpp                   5821    -3520
recls_fileinfo_win32.cpp                   6536  recls_fileinfo_win32.cpp                  4038    -2498
recls_internal.cpp                         4915  -                                             -    -4915
recls_ftp_api_win32.cpp                   10409  recls_ftp_api_win32.cpp                   5238    -5171
-                                             -  recls_impl.h                              4801     4801
recls_roots_unix.cpp                       3583  -                                             -    -3583
recls_roots_win32.cpp                      4925  -                                             -    -4925
-                                             -  recls_string.h                            1998     1998
recls_util.cpp                             4100  recls_util.cpp                           11116     7016
-                                             -  recls_util.h                              2939     2939
recls_util_unix.cpp                        2703  recls_util_unix.cpp                       5049     2346
recls_util_win32.cpp                       3265  recls_util_win32.cpp                      4784     1519
recls_wininet_dl.cpp                      13297  recls_wininet_dl.cpp                     13351       54
recls_wininet_dl.h                         6211  recls_wininet_dl.h                        6209       -2
Total                                    257822                                          233004   -24818
