Fast, Nonintrusive String Concatenation

By Matthew Wilson, June 01, 2004

String concatenation is one way you can improve the performance of your C++ programs.

June, 2004: Fast, Nonintrusive String Concatenation

Matthew Wilson is the creator of the STLSoft libraries and author of the forthcoming book Imperfect C++ (Addison-Wesley, 2004); this article is based on Chapter 25. He can be contacted via http://stlsoft.org/.

Awell-known inefficiency in Java involves string concatenation. The way around this problem is to ensure that the concatenation is done within a single statement. This facilitates a compiler optimization in which successive arguments to the "+" operator are silently translated into calls to a hidden StringBuffer instance, resulting in more efficient construction of the string from its constituent parts. Hence:

String s = s1 + " " + s2 + " " + s3;

is automatically converted to

StringBuffer  sb = new StringBuffer();
sb.append(s1);
sb.append(" ");
sb.append(s2);
sb.append(" ");
sb.append(s3);
String        s = sb.toString();

Most C++ libraries also overload the semantics of operator +() for string concatenation, resulting in similar inefficiencies due to the generation of intermediate objects required in the concatenation chains. The inefficiency of C++ string concatenation sequences stems from two factors.

The intermediate string associated with each intermediate concatenation (each + operator subexpression) involves at least one memory allocation to accommodate the total string contents of the concatenation's two parameters. (The exception to this is where the string class uses small-string optimization.)
Every intermediate copies the contents of its two arguments. For an expression with N concatenations, arguments 0 and 1 are copied N times, argument 1 is copied N-1 times, and so on.

Assuming that the following code is compiled with a compiler that supports Named Return Value Optimization (NRVO), there are likely to be between four and eight memory allocations, and 4, 3, and 2 copies taken of the contents of strings s3, s2, and s1, respectively.

String	   s1  = "Goodbye";
char const  *s2 = "cruel";
String	   s3  = "world!";
String	   s   = s1 + ' ' + s2 + ' ' + s3;

In principle, since none of the intermediate results are used (or useful) outside of the statement, all that is required is one memory allocation and one copy of the contents of each source string. Ideally, you would like to see each individual concatenation subexpression resulting only in a record being taken of the arguments, and being passed up the chain until the string needs to be generated. At this point, the allocation of memory and the copying into that memory of the individual pieces of the resultant string can be done.

In Chapter 22 of The C++ Programming Language Special Edition (Addison-Wesley, 1997), Bjarne Stroustrup describes how several operators can be used in a chain to represent a single logical operation for matrix manipulation. This had me wondering if I could apply these principles to string concatenation to achieve performance improvements. The result is the fast_string_concatenator<> template that I present here. fast_string_concatenator<> is part of the STLSoft libraries (http://stlsoft.org/) from Version 1.7.1 upwards.

The canonical way to implement concatenation is as a nonmember function that is implemented in terms of the member operator +=():

String operator +(String const &lhs, 
    char const *rhs)
{
  String result(lhs);
  result += rhs;
  return result;
}

There's an alternate form if operator +=() returns a reference to the instance:

String operator +(String const &lhs,
    char const *rhs)
{
  return String(lhs) += rhs;
}

It's clear from this implementation where all the costly intermediate instances, allocations, and copying come from. There would be similar implementations for the four other overloads:

String operator +(String const &lhs,
    char rhs);
String operator +(char const *lhs,
    String const &rhs);
String operator +(char lhs, String const &rhs);
String operator +(String const &lhs, String const &rhs);

When using fast_string_concatenator, the five operators share a seemingly identical, simple definition:

fast_string_concatenator<String>
  operator +(String const &lhs, char const *rhs)
{
  return fast_string_concatenator<String>(lhs, rhs);
}
fast_string_concatenator<String>
  operator +(String const &lhs, char rhs)
{
  return fast_string_concatenator<String>(lhs, rhs);
}
fast_string_concatenator<String> 
  operator +(String const &lhs, String const &rhs);
fast_string_concatenator<String>
  operator +(char lhs, String const &rhs);
fast_string_concatenator<String>
  operator +(String const &lhs, String const &rhs);

The most obvious difference is that they no longer return instances of String, instead returning instances of fast_string_concatenator<String>. The implementation of each operator is simply to pass the two arguments to the constructor of the anonymous concatenator instance, which is subject to Return Value Optimization (RVO).

Tying the Concatenators Together

Since each operator +() returns a concatenator, you need more if you're going to make multipart concatenation sequences. This is facilitated by a number of standard operator +() overloads that are defined along with the class. The first three each take a concatenator instance as their left-hand parameter:

template < . . . >
fast_string_concatenator<S,C,T>
  operator +(fast_...<S,C,T> const &lhs, S const &rhs);
 ...
  operator +(fast_...<S,C,T> const &lhs, C const *rhs);
 ...
  operator +(fast_...<S,C,T> const &lhs, C const rhs);

Since the + operator is left=>right associative, these three operators enable the result of the left-most concatenation to be bound with the next, and so on. Looking at the example, "Goodbye" and ' ' are concatenated to form a concatenator instance using one of the custom String operator +()s, which is then concatenated with "cruel" via the second of the standard ones previously shown.

In this way, any permutation of character, C-style string, and string class can be combined in a concatenation sequence, resulting in a final fast_string_concatenator instance. There's more to it, such as concatenation seeding and pathological bracing, but that's essentially it for the operators in normal circumstances.

The Concatenator Class

As for the concatenator class itself, you've already seen that it is a template and that it takes three template parameters. The three parameters S, C, and T represent the string type, the character type, and the traits type, respectively:

template< typename S
        , typename C = typename S::value_type
        , typename T = std::char_traits<C>
        >
class fast_string_concatenator;

In the definition of the class, C defaults to S::value_type, and T defaults to std::char_traits<C>. These are provided as default parameters rather than just assumed to allow wider application of the template.

In most cases, therefore, you only have to be concerned about the string type, and leave the rest to the defaults. Since most modern string classes define a value_type member type, it works swimmingly. For those that do not, it is not difficult to stipulate both first and second template parameters in the concatenator instantiation.

All the operator +() implementations follow precisely the same form by returning an anonymous instance of the concatenator constructed from the left and right parameters to the operator. This is reflected in the public interface of the class (Listing 1). The only other public method of the class is an implicit conversion operator to the string_type. Now you can see how to get back the resultant string from the concatenator. This method has a simple implementation:

template< . . . >
fast_string_concatenator<S, C, T>::operator S() const
{
  size_t      len = length();
  string_type result(len, ' ');
  write(&result[0]);
  assert(len == traits_type::length(result.c_str()));
  return result;
}

The concatenator's length() method is called, then a string instance result is created of the given length, packed with spaces. This is how you achieve the single memory allocation for the entire concatenation sequence. You could use any character here—in my implementation I used '~' for an eye-catcher—since they will all be overwritten.

Next the write() method is called, passing the address of the start of result's internal memory. This method walks the concatenation sequence and writes the contents into the given memory. When it returns, result contains the result of the concatenation and is returned. If your compiler supports NRVO, then result is actually the variable receiving the result of the concatenation sequence in the client code.

Internal Implementation

Before looking at the implementation of the length() and write() methods, I present the remainder of the class definition (Listing 2) to illustrate how the arguments to the individual concatenation operators are represented as a network.

The nested structure Data is a discriminated union of a single character ch, a character sequence cstring (of type struct CString), and a pointer to a concatenator concat. Each concatenator instance contains two instances of this structure, m_lhs and m_rhs, representing the two arguments to operator +(). Except for single characters, all the other types—C-strings, string objects, and concatenator instances—are referenced by pointers.

This is valid because C++ requires that temporary objects remain alive (that is, do not have their destructors called) until the end of the statement in which they are constructed. Hence, all the temporary objects in the concatenation sequence are still valid at the point that the final fast_string_concatenator<String> instance is used to create the final string instance.

Given that you know how the network is constructed, the implementations of length() and write() are straightforward.

Listing 3 is the implementation of the concatenator's length() method and Data::length(). The length of a given concatenator is simply the sum of the two arguments it represents. Hence, it defers to its m_lhs and m_rhs members. The implementation of Data::length() switches on the data type, recursing in the concat case. Note that the normal wisdom of having a default case is not followed here. This is because tests showed that it cost 1-5 percent performance on several compilers involved in the test. The code contains an assertion instead. The result of the outermost length() call is used in the implicit conversion operator to create a string of exactly the required size; hence, only one allocation is needed. But how do you achieve the single write operation? This starts out in the concatenator's write() method, presented in Listing 4 along with Data::write(). The left side writes out its contents into the buffer that is passed into the method via Data::write(), which returns the point at which it finished writing. This is then passed to the right side, and the result of that is passed back to the caller. Data::write() follows a similar pattern to the length() method in switching on the data type. Single characters are copied in and the write pointer advanced by one. The contents of string objects and C-style strings are copied into the buffer via memcpy(), and the pointer is advanced by the appropriate amount. For concatenators, the write() method is called and its return value is returned.

Hence, write() copies in the requisite number of characters for the data type, then passes back the address of the next location in which to write. That means that the reversed double call m_rhs.write(m_lhs.write(s)) in fast_string_concatenator<>::write() results in the string being written in the correct order and in a single pass.

This may seem like a lot of code, but you can be assured that good compilers make short work of such things. In optimized builds it is very efficient. Also, it's worth noting that by using only standard, expected public operations of string classes, use of the concatenator with arbitrary string types is nonintrusive.

Performance

I've tested the concatenator with a number of different string classes, and it has the same excellent performance characteristics with them all.

For instance, trivial_string is a custom class written specifically for this test. It has two members, m_len (size_t) and m_s (char_type *), which hold the length and a dynamically allocated character buffer. It uses new and delete to allocate the memory, and stores exactly the amount of memory required for the current contents and the terminating null character.

The second string used is the Standard Library's basic_string<>. Since different compiler vendors use different implementations, the performances between the different standard strings are representative as much of the library implementation as they are of the compiler. Despite this variation, I've included this string since it represents, for many developers, the string class; the other two strings here will clearly demonstrate compiler, rather than library, differences in the application of the concatenator.

The third string is STLSoft's basic_simple_string<>, which stores its capacity and length along with the string contents in a resizable, dynamically allocated buffer, and which performs optimistic allocation on a granularity of 32 characters.

The same test program was used for the three string types, measuring the performance of concatenation sequences of 1, 2, 3, 4, 8, 16, and 32 with and without fast_string_concatenator<>. (The test program, supporting STLSoft header files, and the full results are available at http://www.cuj.com/code/.) The results presented here represent the relative times, as a percentage, of the fast_string_concatenator<> with respect to the normal operator +() implementations provided with the given string class. Values lower than 100 percent indicate a superior performance for the concatenator. The program was compiled and tested with Borland 5.6, CodeWarrior 8, Digital Mars 8.38, GCC 3.2, Intel 7.0, and Visual C++ 6.0 and 7.1.

Before conducting the tests, my expectation was that using the concatenator for sequences of one or two concatenations would actually result in a performance hit, with only longer sequences feeling the benefit of the optimization. Thankfully, in most cases, it seems that I underestimated the effect of the optimization. Though the data points for 16 and 32 concatenations are merely academic—if you're writing code with 32 concatenations, you probably need to take a holiday—the savings to be had for up to, say, 8 concatenations are considerable, up to 80 percent.

With trivial_string (Table 1), every data point demonstrates a superior performance for the concatenator, so using it for this class represents an unconditional win. Roughly speaking, a concatenation sequence of length 2 is twice as fast, and one of length 3 is three times as fast. This is very encouraging, but the string implementation is rudimentary, so perhaps we shouldn't count our chickens just yet.

The results for std::basic_string<> (Table 2) are somewhat less conclusive. Some Standard Library implementations use reference counting and copy-on-write, and this may affect the performance advantage of the concatenator. For a single concatenation, Borland and Digital Mars both suffer a small cost, although at 1 and 33 percent, respectively, they're not particularly off-putting. Of more concern is that both Visual C++ versions exhibit a performance loss for concatenation sequences of length 1 and 2. Of course, you might simply surmise that the Visual C++ runtime library has such a good string implementation that the concatenator cannot keep up until sequences of three or more elements. However, because the Intel performance was obtained with the Visual C++ 7.0 library, whose string implementation is virtually identical to that for Version 7.1, that explanation doesn't really hold water. (Subsequent tests with Intel 7.1 using Visual C++ 7.1 libraries show virtually identical performance to the Intel 7.0 performance shown here.) This goes to show the effect that the template optimizing capabilities of the compiler can have on the use of such a scheme. As compilers continue to improve in their abilities to optimize templates, the exceptional levels of performance afforded by the concatenator with the Intel compiler will be more broadly applicable.

With STLSoft basic_simple_string<> (Table 3), the performance results are almost as encouraging as in Table 1, except that a single concatenation for CodeWarrior and Digital Mars both suffer a small performance penalty of 27 and 2 percent, respectively. Still, it's clear that using the concatenator represents a definite win.

Notwithstanding potential future improvements in compilers, I would suggest that use of the fast concatenator with string libraries represents a net win now, and a significant one at that, although it must be conceded that, for Visual C++ at least, performance profiling might be needed to prove this on a use-by-use basis.

Incorporation into Modifiable Extant Classes

The concatenation operators of extant string classes can be "upgraded" simply and safely by replacing the existing operators with equivalent versions incorporating the concatenator. If your organization has its own string classes, then you will be able to upgrade them, and the only effects to any client code, once it's recompiled, will be that it will run faster; that's not something you often get to say in software engineering.

Interoperation with Nonmodifiable Classes

If you're using a third-party library (such as MFC's CString), you should not alter the headers that accompany the library. Any future update of the library will overwrite your changes. Even worse, some libraries, such as MFC, come in part in binary format, so any changes to the headers will not be reflected in the binary part of the library. At best you're going to get a crash in testing. Don't do it!

If the string class you're interested in does not provide its own operators, then your task is comparatively easy. You can either define the new operators within the global namespace, which is appropriate on an application level, or within your own namespace if you'll be using these operators with your own libraries. Of course, if the string class comes in a third-party namespace, then you could define the operators in that namespace. To do that, however, you must have missed the "Namespaces Dos and Don'ts: 101," as it's foolish to add things into namespaces that you're not responsible for; conflicting things can be placed in that namespace at any time by the authors of the libraries.

If the string class comes with its own operators, you're in murkier waters. If it's a template class, then you can define your own operators for specific instantiations of the string template. In other words, if the string is tp_string<>, you can define your own operators for tp_string<char>, or tp_string<wchar_t>, or whatever, since a compiler can unambiguously select a nontemplate function over a template one. If it's not a template class, you're stuck. Even if you write your own concatenator-returning overloads in a "nearer" namespace, Koenig lookup ensures that the compiler can see multiple equivalent operators, and it rightly fails to compile. For this you need concatenation seeding.

Concatenation Seeding

Since the + operator is left=>right associative, the compiler takes its cue from the left argument prior to looking at the right one. You can take advantage of this to lead it down the path you want it to follow.

In the same namespace where fast_string_concatenator<> is defined, there another class fsc_seed:

class fsc_seed
{};

You saw references to this in the constructor of the Data nested class, and the DataType enum. This class extends the nonintrusive nature of the technique by letting you seed a concatenation sequence involving any string type to use fast concatenation:

String s = fsc_seed() + s1 + " " + s2 + " " + s3;

In this case, all you need to make it work is define a single operator +() overload:

fast_string_concatenator<String>
  operator +(fsc_seed const &lhs, String const &rhs)
{
  return fast_concat_t(lhs, rhs);
}

and the compiler does the rest. The result of the first concatenation is fast_string_concatenator<String>, so the next operator is deduced (via Koenig lookup) to be one of the standard ones in the concatenator's namespace. This is ugly, but it's better than messing around in the headers of third-party libraries, and it's a solution where otherwise none is available. It also lets you explicitly use fast concatenation in some parts of your code, and not others, should you so wish.

When writing templates, you must follow the constraint that the first element in the sequence after the seed is of string class type, rather than a character or a C-style string because there would be no way to deduce the string type from the character type in the corresponding operator +():

template< . . . >
fast_string_concatenator<S, C, T>
  operator +(fsc_seed const &lhs, C const *rhs) // What is S??
{
  return fast_string_concatenator<S, C, T>(lhs, rhs);
}

Pathological Bracing

You might think you've spotted a flaw in the technique whereby unnecessary, but perfectly legal, bracing could break the technique, or at least render it ineffective. In this code, the application of the parentheses has reversed the order of evaluation.

String s = s1 + (" " + (s2 + (" " + s3)));

Although such things would likely only occur as a result of some helpful soul trying to demonstrate that fast concatenation could be broken, it's already been dealt with. You have more overloads to cope with all cases and the corresponding constructors in the class (Listing 5). The interesting thing about the use of this bracing is that not only can it not break the legality of the code but, just as with seeding, it scarcely affects performance. (The results for the pathological tests are also available at http://www.cuj.com/code/.)

Standardization

Since this technique brings only benefits, you may wonder whether it should be adopted as a standard mechanism. Not being a member of the Standards committee, I cannot answer that, and it may be that there are ramifications precluding its adoption that I've not considered. (The only thing I have been able to come up with is that you would not be able to declare pointers to functions such as S (*)(S const &, S const &) and assign to them the address of std::operator +(). I can't imagine any reason why anyone would do that, so it doesn't seem particularly dissuasive.) In any event, there's no reason why you can't use it in your own string libraries or apply it nonintrusively in your own client code, and take advantage of the performance improvements it confers.

1 2 3 4 5 6 7 8 9 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Fast, Nonintrusive String Concatenation

Tying the Concatenators Together

The Concatenator Class

Internal Implementation

Performance

Incorporation into Modifiable Extant Classes

Interoperation with Nonmodifiable Classes

Concatenation Seeding

Pathological Bracing

Standardization

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Fast, Nonintrusive String Concatenation

Tying the Concatenators Together

The Concatenator Class

Internal Implementation

Performance

Incorporation into Modifiable Extant Classes

Interoperation with Nonmodifiable Classes

Concatenation Seeding

Pathological Bracing

Standardization

Related Reading

News

Commentary

Slideshow

Video

Most Popular

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content