Matthew Wilson is a software development consultant for Synesis Software, creator of the STLSoft libraries, and author of Imperfect C++ (Addison-Wesley, 2004). He can be contacted at http://stlsoft.org/.
Recall that I said in the previous installment of "Positive Integration" that I'd be focusing on the topic of code bloat in this two-part column. In the last installment, I examined source-code size. This month, I look at object-code size.
The first thing to do to trim the binary size was to establish a firm basis for comparison. The 1.5.3 and 1.6.2 libraries were built using the same settings. Tables 2 and 3 show the results. (Table 1, which lists the recls src directory contents appeared in the previous installment.)
The settings used for the five compilers were for minimal size and global/whole-program optimization: -O1 -Oz (Borland); -o+space (Digital Mars); -O1 -Og -GL (Intel); -O1 -Og (Visual C++ 6.0); and -O1 -Og -GL (Visual C++ 7.1).
Given that the 1.6.2 test files contain additional code for UNC paths and the new stat() functionality, the results demonstrate that there's no effective difference for the generated binaries. The library size has dropped, but the size of a statically linked library is not, in and of itself, important, rather the size of the symbols that are selected from it into a binary, hence the comparison of the C and C++ test files.
The values in Table 3 represent a base for all remaining figures, which is expressed relative to these. To make these meaningful, you need to find out how much of the sizes of the executables are a result of the client code and the runtime libraries, and how much to the recls library components linked into the executable. This was done by creating a single file containing empty implementations of all the recls API functions, and linking the compiled client C and C++ program object files to that stub. Table 4 shows the sizes of the executables when linked to the stub API file.
Judging from Scenario A (Table 4), it looks like that code rationalization hasn't yet paid off. The first thing is to consider the effects of linking to standard libraries. Specifically, the recls libraries by default compile using the STLSoft basic_simple_string, to minimize coupling to the Standard Library so that they can be used in small footprint exes and dynamic libraries. (For example, recls is incorporated as a small part of a standard DLL of my company, built with Visual C++ 5 and not linking to the Standard Library, and coming in at less than 70 KB.)
But because the C++ test program links to the Standard Library, the instantiations of the member functions of the std::basic_string template are already paid for, so the next test Scenario B (Table 5) is to determine the effect of using std::basic_string throughout. Table 5 shows the results. (For this and the remaining scenarios, the percentage figures in the fourth and sixth columns represent the relative sizes of each test client's code after subtracting the "stub-size" in Table 4. In other words, it's a good indicator of the actual size of code in the executable that is part of recls itself, rather than the test program as a whole.)
Except Borland, for which there is a significant saving in the use of std::string, the other compilers show an increase, rather than a reduction, in binary code size. (In several cases, tables averages with and without Borland are presented because it tends to have widely different results to the other compilers, and can skew the results.)
In playing around with applying #if 0 here and there, I discovered some surprising amounts of code bloat. Amazingly, in Listing 1, when #ifdef'd out, it reduces the object code size of the containing file by 24 KB (with VC++ 6). Listing 1 is used to ensure that a multipart pattern does not contain a dots directory. This got me to experimenting with the pattern validation. The first thing I tried was using STLSoft's basic_string_view, which is a string-like class template that presents a view onto a C-style string, rather than allocating and writing its own copy. It's perfectly suited to the code previously mentioned because the results of the tokenization are not preserved, just manipulated within the find() algorithm. Using basic_string_view instead costs a slight increase (Scenario C, Table 6).
So there's 20 KB or so of code to go at. Changing the implementation to one based purely on strstr() and strncmp() results in Table 7 (Scenario D).
That's a significant gain in the size of the librariesabout 7 percentbut a modest gain in the executablesabout 3 percent. But we're pretty much down to the maximum gain for that particular area of the code because removing pattern validation entirely (Scenario E, not shown) results in only an approximate additional 1 percent saving across the board.
Customization Versus Generality
Looking at the individual object sizes gives a clearer picture of where the bloat is, as can be seen in Figure 1. The left-most data point for each file represents the average object size for the base 1.6 build (Table 3). Clearly, ReclsFtpSearchDirectoryNode_win32 has the largest footprint.
In the September 2004 installment of this column, I described the use of the searchspec_sequence template:
[M]ultipattern filtering is handled via the...STLSoft searchspec_sequence class. It is parameterized with the underlying sequence type whose model it emulates...and uses the string_tokenizer template to split the pattern. The iterators that it presents...iterate over a number of ranges of the underlying search sequences, corresponding to the number of patterns. In other words, it linearizes N ranges into a single one...I wanted to be able to insert a sequence class into the recls implementation with minimal disruption to the existing, and working code. Because searchspec_sequence presents a functionally identical interface to its underlying sequence, this was almost a no brainer.
The searchspec_sequence is an adapter that presents a view spanning multiple traversals of an underlying filesystem sequence. Unfortunately, this generalization suffers from code bloat to a significant degree. Scenario F (Table 9) uses an updated implementation for inetstl::basic_findfile_sequence that tokenizes the search patterns internally, based on a delimiter passed to the constructor along with the other arguments. By skipping the adapter searchspec_sequence you save on object code size and on runtime cost, at the expense of having to write an enhancement to the basic_findfile_sequence class template. Table 8 shows the significant size savings to be had, roughly about 17.5 percent in object code size.
This translates to about 4 percent saving on the library size and around 7 percent for the C++ test program (see Table 9); the C test program does not exercise the recls FTP functionality.
Combining all these advances, you arrive at Scenario G (Table 10), which represents a saving of about 10 percent in library size and 3-9 percent in executable sizes.
Scenario H (Table 11) summarizes the relative gains for Scenarios B-F, in terms of the library, C, and C++ test programs, with and without Borland. I've managed to make savings of a total of about 10 percent on library and test programs.
There are other optimizations possible, such as the 195 bytes saved by collapsing the switch in lookup_error_string_() (in recls_api.cpp) into a struct table, but such things are grasping at straws, and are probably below the limit at which point it's worthwhile. When you've reached such measures, it's usually clear that a bigger answer is needed.
In broad terms, you can see that:
- Using stlsoft::basic_string_view costs about 1 percent in code size. Because it only comes into play during pattern validation at the start of a search, any memory/speed efficiencies are irrelevant, so it is not used. If it were code that was invoked repeatedly throughout the search process, then the increased speed efficiency would be worth the slight increase in code size, but as it is, it's best left out.
- Using std::basic_string instead of stlsoft::basic_simple_string costs 4-8 percent. The library will continue to use stlsoft::basic_simple_string by default and provide customizability for compilers, such as Borland, for which the opposite relation holds.
- Manual pattern validation saves 1.5 percent. Because it can be implemented with relative ease in terms of Standard C functions, it will be adopted.
- Using the enhanced InetSTL findfile_sequence saves up to 7 percent, so it will be used henceforth. Coupled with the fact that the new implementation also eliminates the memory allocation and the complexity of the searchspec_sequence adapter, I intend to fit this new functionality into the UNIXSTL and WinSTL components, for (local) filesystem searching. You can reasonably guestimate that this brings an additional reduction in code size of, perhaps, 3-6 percent.
In some senses this activity has succeeded, and in others, it has failed. Where it's failed is in bringing significant reductions in source code size. In part, this has been because the refactoring has naturally caused some parts of the software to be segregated into separate files. Each source file has nontrivial overhead, and rightly sotrimming this down to the bare minimum would mean that the source would become impenetrable. We must accept that a medium-sized library, such as recls consists of a large amount of source code, even when it compiles down to a small amount of object code.
It is the object-code area in which I think there's been a measured degree of success. Sure, the refactoring did not result in any significant savings, but the subsequent directed attempts at reduction have identified some areas, specifically, the string tokenization and the searchspec_sequence adaptation, that have resulted in significant amounts of bloat. This is a shame because the STLSoft string_tokeniser component is easy to use and highly speed efficient, and the searchspec_sequence class, well, was a lot of work! But it goes to show that, for many compilers, optimization of complex template code does always compare with manually coding equivalent algorithms. It also shows that generalized component adapters may introduce unexpected amounts of overhead compared with making enhancements to the class(es) being adapted. Whether you choose to eschew simple, tested, reliable, and already written library components depends on factors outside the scope of this article, but it's worthwhile thinking about before you do.
Writing open-source libraries is, more than anything else, an exercise in battling hindsight. Before I did the research that constituted this installment, I was pretty sure, in light of empirical responses from recls users, that recls 2 would have to be written in C to avoid code bloat. Interestingly, now I am less sure because I've seen that, for some compilers, the entire footprint of recls within a compiled executable is no more than around 20 KBa heartening statistic, to be sure. One thing I am certain of is that the implementation will have to take into consideration factors of code size on a component-by-component basis, to keep the sizes down.
I'd intended to cover the recls/Python mappingalso featured in recls 1.6.1but space and time have not been my friend. Although the Open-RJ/Python mapping was covered in the March 2005 installment, there are more lessons to be learned from the recls/Python.