C++ String Performance

Lev analyzes C++ string implementations, then presents a string class with the speed and efficiency of the automatic C strings, plus the convenience of the C++ string objects that are STL friendly.


October 01, 2003
URL:http://www.drdobbs.com/cpp/c-string-performance/184405453

It's easy to imagine the reaction of seasoned C++ programmers upon seeing yet another string class implementation: "So why do we need another string class? We had too many in the pre-standard C++ Library world, and now that we finally have std::string, why not just use it?"

I agree, but only if the standard implementation doesn't introduce significant performance problems. In this article, I show how the standard string implementation isn't always good enough, particularly for multithreaded applications that create lots of temporary strings. That said, it is worth noting that there's no such thing as a standard string implementation—they all differ. (For examples of the different implementations, see item 15 in Scott Meyers's Effective STL: 50 Specific Ways to Improve Your Use of the Standard Template Library, Addison-Wesley, 2001.) Here, I analyze one of the more popular string implementations—the one that comes with g++ 2.96 on Red Hat Linux 7.x. To this end, I create a number of multithreaded tests showing the Standard C++ string implementation (which comes with g++ 2.96 on Red Hat Linux 7.3) can have real performance problems for multithreaded applications that need to create lots of unrelated temporary string objects. To address these performance problems, I present a string class that gives you the speed and efficiency of automatic C strings, plus the convenience of C++ string objects that are STL friendly.

std::string Class Design

All string implementations—including g++ 2.96—usually share the same design principles:

The first item is usually a bad idea when you have lots of temporary strings as automatic variables in a multithreaded application. The heap is shared between threads, so some locking is inevitable; otherwise, the heap's internal data is corrupted. This approach defeats the idea of the fast non-blocking creation/access of the stack variables in multithreaded applications (every thread has its own nonshared stack). It looks deceptive to programmers, too. For instance, in Example 1, mystring is an automatic (stack) variable with no indication that memory for "Some C value" is allocated on the heap (unfortunately, that is exactly what happens).


int myfunc()
{
  std::string mystring ( "Some C value" );
  ...
}

Example 1: This defeats the notion of fast, nonblocking creation/access of stack variables in multithreaded applications.

Likewise, the second and third items may sound good until you realize they're a bad fit for multithreaded applications. Value and reference counters are shared between multiple string variables and potentially between multiple threads that absolutely require access serialization (locking) to the reference counter. Atomic increment/decrement/check can be used if these operations are supported by the CPU's command set; see "Optimizations that Aren't (In a Multithreaded World)," by Herb Sutter, C/C++ Users Journal, June 1999.

Listing One (teststring1.cpp) creates 100,000 temporary std::string objects on the stack (most of the memory is allocated on the heap). It has only one thread running, but was built as a multithreaded application, so it does all the required locking (refer to the Makefile, available at the top of this article). I ran this and subsequent tests on a 2-GHz Dell desktop with 1 GB of RAM under Red Hat Linux 7.3. Example 2(a) shows the results.

Listing One

#include 
#include "util.h"
unsigned int createStrings ( int n )
{
 static char *samples[]={"something1","something2","something3","something4"};
 printf ("Thread <%d>: Creating/destroying %d strings...\n",pthread_self(),n);
 unsigned int start = utilGetTickCount ();
 for ( int i=0; i: total time=%d millisecs, average=%f\n",
           pthread_self(), end-start, (float)(end-start)/n );
 return ( end - start );
}
int main ( int argc, char **argv )
{
  if ( argc > 2 )
  {
    printf ( "teststring1  \n" );
    return 0;
  }
  int n = 100000;
  if ( argc > 1 )
    n = atoi ( argv[1] );
  createStrings ( n );
  exit(0);
}



(a)
Thread <1024>: Creating/destroying 100000 strings...
Thread <1024>: total time=108 millisecs, average=0.001080

(b)
Thread <1026>: Creating/destroying 100000 strings...
Thread <2051>: Creating/destroying 100000 strings...
Thread <3076>: Creating/destroying 100000 strings...
Thread <4101>: Creating/destroying 100000 strings...
Thread <5126>: Creating/destroying 100000 strings...
Thread <6151>: Creating/destroying 100000 strings...
Thread <7176>: Creating/destroying 100000 strings...
Thread <8201>: Creating/destroying 100000 strings...
Thread <9226>: Creating/destroying 100000 strings...
Thread <10251>: Creating/destroying 100000 strings...
Thread <1026>: total time=7804 millisecs, average=0.078040
Thread <2051>: total time=8749 millisecs, average=0.087490
Thread <10251>: total time=10455 millisecs, average=0.104550
Thread <6151>: total time=10643 millisecs, average=0.106430
Thread <8201>: total time=10718 millisecs, average=0.107180
Thread <9226>: total time=10724 millisecs, average=0.107240
Thread <7176>: total time=10755 millisecs, average=0.107550
Thread <5126>: total time=10760 millisecs, average=0.107600
Thread <4101>: total time=10765 millisecs, average=0.107650
Thread <3076>: total time=10867 millisecs, average=0.108670


Example 2: (a) The results of running teststring1.cpp; (b) results of running teststring2.cpp.

Listing Two (teststring2.cpp) is the same test, but with 10 threads running—each creating/destroying 100,000 automatic string objects. (Again, every one of them allocates memory on the heap and employs some locking and copying.) Example 2(b) presents the less-than-encouraging results of the teststring2.cpp test. With just 10 threads, average automatic string object handling (creation/destruction) increased approximately 100 times (10000/108). Since these results seem unacceptable for high-load multithreaded applications, I decided to find out where time is being spent, then create a new string class that doesn't have those deficiencies. Possible performance culprits include locking introduced by the reference counting optimization, heap allocations (and it's locking) and copying of the value.

Listing Two

#include 
#include "util.h"
void *createStrings ( void *n_void )
{
 int n = (int) n_void;
 static char *samples[]={"something1","something2","something3","something4"};
 printf ("Thread <%d>: Creating/destroying %d strings...\n",pthread_self(),n);
 unsigned int start = utilGetTickCount ();
 for ( int i=0; i: total time=%d millisecs, average=%f\n", 
           pthread_self(), end-start, (float)(end-start)/n );
 return NULL;
}
int main ( int argc, char **argv )
{
  if ( argc > 3 )
  {
    printf ( 
       "teststring1 [] []\n" );
    return 0;
  }
  int n = 100000;
  int n_thr = 10;
  if ( argc > 1 )
    n = atoi ( argv[1] );
  if ( argc > 2 )
    n_thr = atoi ( argv[2] );
  pthread_attr_t attr;
  pthread_attr_init ( &attr );
  pthread_t *tid = new pthread_t[n_thr];
  int i;
  // create threads
  for  ( i=0; i\n", i );
      exit(-1);
    }
  }
  // wait for all threads to terminate
  for  ( i=0; i


A New String Class

In creating a new string class, I wanted to be able to:

Since it supported the last two of these three requirements, I used SString (see Thinking In C++, by Bruce Eckel; Prentice-Hall, 1995, ISBN 0-13-917709-4) as the basis for my new string class. I then implemented the fastest and most important mode, which doesn't do any allocation or copying. It also occurred to me that the string class features should be selectable by programmers who instantiate the string objects, not the string implementation. Consequently, it's up to users of this class (see Listing Three) to instantiate particular string objects in particular modes. The complete implementation of this new CGenStringT<> class (again, Eckel's SString was used as a basis) is in util.h (also available electronically in the source code file listed at the top of this article).

Listing Three

template
class CGenStringT
{
public:
  CGenStringT ( const char *S = "", int bAlloc = 1 ) : m_s(m_buf)
  {
    if ( !bufsize )
 ...
private:
  char *m_s;
   int   m_bAlloc;
  char  m_buf[bufsize+1];
};

The mode of the string being created in Listing Three is set by the combination of two parameters: template parameter bufsize, and constructor parameter bAlloc. CGenStringT is a template with one template parameter, bufsize, that governs the size of the internal buffer. In addition to specifying the size of the internal buffer, this parameter defines the mode of the string in conjunction with a bAlloc constructor's parameter. The possible combinations are:

Once you understand the idea, the implementation is straightforward: Constructor, copy constructor, assignment operator, and destructor are doing the right thing for the three different modes (like allocation, copying, deleting), or none of them. For example, the copy constructor in Listing Four is a good example of handling all three modes (only ASCII null-terminated strings are currently supported).

Listing Four

 ...
 ...
  CGenStringT ( const CGenStringT& rv ): m_s(m_buf)
  {
    if ( !bufsize )
    {
      if ( m_bAlloc = rv.m_bAlloc )
      {
        // make on heap
        m_s = new char [ strlen(rv.m_s)+1 ];
        strcpy ( m_s, rv.m_s );
      }
      else
        // just set a pointer
        m_s = rv.m_s;
    }
    else
    {
      // make on stack
      m_buf[bufsize] = 0;
      strncpy ( m_s, rv.m_s, bufsize );
    }
#ifdef MEM_ALLOC_DEBUG
    printf ("CGenString copy constructor called: this=%x allocation=%s"
            " ptr=%x\n", this, (!bufsize && m_bAlloc ) ? "yes":"no", m_s );
#endif
  }
 ...

Right after the CGenStringT class definition there's a bunch of useful typedefs and a define:

typedef CGenStringT<0> CGenString;

typedef CGenStringT<100> CStackString100;

#define FAST_STRING(id,str) CGenString id(str,0);

CGenString is a Mode 3 string (allocation and copying), CStackString100 is a Mode 2 string (copying) with an internal buffer of 100 bytes, and FAST_STRING defines a Mode 1 string variable ID (no alloc, no copying) and sets its value to str.

The "stack" word is being used in comments throughout the CGenStringT implementation and in CStackString100 to denounce Mode 2 strings (with an internal buffer). In general, this term is incorrect, because the internal buffer is on the stack only when the string variable itself is a stack variable. If this variable is created on the heap (using new()), then the buffer is on the heap as well. So, in general, the buffer as a part of the string object is always allocated in the same class of memory as the string object that owns it.

Performance Testing

To gauge performance, I start by testing the slowest mode (forced heap allocation and copying) to see if the removal of reference counting helps. Teststring3.cpp (available electronically) tests one thread, while teststring4.cpp (also available electronically) tests 10 threads. They are the same, respectively, as teststring1.cpp and teststring2.cpp, except that std::string is replaced with CGenString.

Example 3(a) shows the result of running teststring3, and Example 3(b) the results for teststring4. Compared to std::string, this isn't bad at all for the slowest version—it's 45 percent faster with one thread (59 versus 108) and it's more or less 100(!) times faster with 10 threads running (~100 versus ~10000). And since the heap is still being hit, it looks like std::string reference counting hurts big time!


(a)
Thread <1024>: Creating/destroying 100000 strings...
Thread <1024>: total time=59 millisecs, average=0.000590

(b)
Thread <1026>: Creating/destroying 100000 strings...
Thread <9226>: Creating/destroying 100000 strings...
Thread <9226>: total time=60 millisecs, average=0.000600
Thread <4101>: Creating/destroying 100000 strings...
Thread <5126>: Creating/destroying 100000 strings...
Thread <5126>: total time=65 millisecs, average=0.000650
Thread <4101>: total time=125 millisecs, average=0.001250
Thread <6151>: Creating/destroying 100000 strings...
Thread <7176>: Creating/destroying 100000 strings...
Thread <8201>: Creating/destroying 100000 strings...
Thread <10251>: Creating/destroying 100000 strings...
Thread <6151>: total time=130 millisecs, average=0.001300
Thread <1026>: total time=81 millisecs, average=0.000810
Thread <10251>: total time=94 millisecs, average=0.000940
Thread <7176>: total time=166 millisecs, average=0.001660
Thread <8201>: total time=195 millisecs, average=0.001950
Thread <3076>: Creating/destroying 100000 strings...
Thread <3076>: total time=60 millisecs, average=0.000600
Thread <2051>: Creating/destroying 100000 strings...
Thread <2051>: total time=60 millisecs, average=0.000600

Example 3: (a) The results of running teststring3.cpp; (b) results of running teststring4.cpp.

Moving to the faster Mode 2 strings (using automatic string variables so a buffer for every string is on the stack: no heap allocations, just copying), teststring5.cpp and teststring6.cpp (both available electronically) are the same tests as before—but using CStackString100.

Example 4(a) shows the results of running teststring5, while Example 4(b) shows the results of running teststring6. By eliminating the need to allocate on the heap, you get significantly faster performance: for one thread, 23 versus 59 (Mode 3) versus 108 (std::string), and for 10 threads, ~40 versus 100 (Mode 3) versus 10,000 (std::string).


(a)
Thread <1024>: Creating/destroying 100000 strings...
Thread <1024>: total time=23 millisecs, average=0.000230

(b)
Thread <1026>: Creating/destroying 100000 strings...
Thread <1026>: total time=23 millisecs, average=0.000230
Thread <3076>: Creating/destroying 100000 strings...
Thread <4101>: Creating/destroying 100000 strings...
Thread <3076>: total time=35 millisecs, average=0.000350
Thread <5126>: Creating/destroying 100000 strings...
Thread <5126>: total time=23 millisecs, average=0.000230
Thread <4101>: total time=71 millisecs, average=0.000710
Thread <6151>: Creating/destroying 100000 strings...
Thread <6151>: total time=24 millisecs, average=0.000240
Thread <7176>: Creating/destroying 100000 strings...
Thread <8201>: Creating/destroying 100000 strings...
Thread <9226>: Creating/destroying 100000 strings...
Thread <9226>: total time=23 millisecs, average=0.000230
Thread <10251>: Creating/destroying 100000 strings...
Thread <10251>: total time=23 millisecs, average=0.000230
Thread <7176>: total time=80 millisecs, average=0.000800
Thread <8201>: total time=80 millisecs, average=0.000800
Thread <2051>: Creating/destroying 100000 strings...
Thread <2051>: total time=23 millisecs, average=0.000230

Example 4: (a) The results of running teststring5.cpp; (b) results of teststring6.cpp.

Teststring7.cpp and teststring8.cpp, both available electronically, test the fastest mode (Mode 1: no heap allocation or copying, just setting a pointer). Example 5(a) is the result for teststring7, and 5(b) for teststring8. As expected, it's blazingly fast. Table 1 summarizes the performance test results.


(a)
Thread <1024>: Creating/destroying 100000 strings...
Thread <1024>: total time=3 millisecs, average=0.000030

(b)
Thread <1026>: Creating/destroying 100000 strings...
Thread <1026>: total time=3 millisecs, average=0.000030
Thread <3076>: Creating/destroying 100000 strings...
Thread <3076>: total time=4 millisecs, average=0.000040
Thread <4101>: Creating/destroying 100000 strings...
Thread <4101>: total time=3 millisecs, average=0.000030
Thread <5126>: Creating/destroying 100000 strings...
Thread <5126>: total time=3 millisecs, average=0.000030
Thread <6151>: Creating/destroying 100000 strings...
Thread <6151>: total time=3 millisecs, average=0.000030
Thread <7176>: Creating/destroying 100000 strings...
Thread <7176>: total time=3 millisecs, average=0.000030
Thread <8201>: Creating/destroying 100000 strings...
Thread <8201>: total time=4 millisecs, average=0.000040
Thread <9226>: Creating/destroying 100000 strings...
Thread <9226>: total time=3 millisecs, average=0.000030
Thread <10251>: Creating/destroying 100000 strings...
Thread <10251>: total time=4 millisecs, average=0.000040
Thread <2051>: Creating/destroying 100000 strings...
Thread <2051>: total time=3 millisecs, average=0.000030

Example 5: (a) The results of running teststring7.cpp; (b) results of running teststring8.cpp.

Table 1: Performance test results.

Mode 1 (fastest) is 35 times faster than std::string in a one-thread application and 2500(!) times faster for the application with just 10 threads. A no-frills Mode 3 string with heap allocation and copying is still 100 times faster than std::string for a 10-thread application thanks to the absence of the reference counting/COW optimization.

Real-World Example

In the real world, many C++ applications get their input from C network protocols or database drivers. In the case of strings, one popular use is as keys in the STL map. Listing Five does exactly that—it accepts parameters as C strings, converts them to C++ strings, and uses them as keys for the STL map. Now take a closer look at the function calls ci=m_mymap.find(str); and m_mymap.erase (str);.

Listing Five looks benign until you realize that every time you do a lookup or erase, you create a temporary string object (with heap allocation and value copying). This temporary string object is created through the std::string(char*) constructor. This object's lifespan is just the duration of the map::find/erase call: It's destroyed right before return (with another heap hit, of course). It sounds expensive—and it is. So what can you do? The first solution may be to use pointers to the strings instead of the strings themselves. It would solve a creation-of-objects problem, but burdens you with memory management and comparison issues (remember a map is a balanced tree that always keeps its contents sorted).

Listing Five

class CMyMap
{
private:
  map < string, int > m_mymap;
public:
  void add ( const char *str, int i )
  {
    m_mymap[str] = i;
  }
  void remove ( const char *str )
  {
    m_mymap.erase ( str );
  }
  bool find ( const char *str, int *res )
  {
    map < string, int >::const_iterator ci;
    ci = m_mymap.find(str);
    bool found = ( ci != m_mymap.end() );
    if ( found )
       *res = ci->second;
    return found;
  }
};

The preferred solution would be to have all the beauty of using C++ string objects as keys, but with the C speed of conversion between C strings and C++ strings for the temporary strings. That's where my new Mode 1 (FAST_STRING) shines. To achieve the speed of temporary C strings and still be able to use C++ strings for the lookup, take a look at Listing Six and testmap.cpp (available electronically). Mode 3 (value allocated on the heap) is used only when the new element is added. For the lookup and erase, the fastest Mode 1 strings are used (just setting the pointer).

Listing Six

class CMyMap
{
private:
  map < CGenString, int > m_mymap;
  typedef map < CGenString, int > GenStringIntMap;
public:
  void add ( const char *str, int i )
  {
    m_mymap[str] = i;
  }
  void remove ( const char *str )
  {
    FAST_STRING(s,str);
    m_mymap.erase ( s );
  }
  bool find ( const char *str, int *res )
  {
    map < CGenString, int >::const_iterator ci;
    FAST_STRING(s,str);
    ci = m_mymap.find(s);
    bool found = ( ci != m_mymap.end() );
    if ( found )
      *res = ci->second;
    return found;
  }
};



There is no allocation and no copying, which means that you have a real automatic (stack) string variable with all the performance of C strings and all power/convenience of the STL map. The constructor and destructor are extremely simple for the fastest (Mode 1) strings—just setting a pointer in the constructor, and no-op in the destructor.

Possible Improvements

One area of improvement would be adding support for multibyte characters. A good way to do this would be to follow STL string design (where string is typedefed to basic_string <char>) and add another template parameter for the actual character type.


Lev is a principal engineer at Netscape/AOL and can be contacted at [email protected] or [email protected].

Oct03: C++ String Performance


(a)
Thread <1024>: Creating/destroying 100000 strings...
Thread <1024>: total time=59 millisecs, average=0.000590

(b)
Thread <1026>: Creating/destroying 100000 strings...
Thread <9226>: Creating/destroying 100000 strings...
Thread <9226>: total time=60 millisecs, average=0.000600
Thread <4101>: Creating/destroying 100000 strings...
Thread <5126>: Creating/destroying 100000 strings...
Thread <5126>: total time=65 millisecs, average=0.000650
Thread <4101>: total time=125 millisecs, average=0.001250
Thread <6151>: Creating/destroying 100000 strings...
Thread <7176>: Creating/destroying 100000 strings...
Thread <8201>: Creating/destroying 100000 strings...
Thread <10251>: Creating/destroying 100000 strings...
Thread <6151>: total time=130 millisecs, average=0.001300
Thread <1026>: total time=81 millisecs, average=0.000810
Thread <10251>: total time=94 millisecs, average=0.000940
Thread <7176>: total time=166 millisecs, average=0.001660
Thread <8201>: total time=195 millisecs, average=0.001950
Thread <3076>: Creating/destroying 100000 strings...
Thread <3076>: total time=60 millisecs, average=0.000600
Thread <2051>: Creating/destroying 100000 strings...
Thread <2051>: total time=60 millisecs, average=0.000600

Example 3: (a) The results of running teststring3.cpp; (b) results of running teststring4.cpp.

Oct03: C++ String Performance


(a)
Thread <1024>: Creating/destroying 100000 strings...
Thread <1024>: total time=23 millisecs, average=0.000230

(b)
Thread <1026>: Creating/destroying 100000 strings...
Thread <1026>: total time=23 millisecs, average=0.000230
Thread <3076>: Creating/destroying 100000 strings...
Thread <4101>: Creating/destroying 100000 strings...
Thread <3076>: total time=35 millisecs, average=0.000350
Thread <5126>: Creating/destroying 100000 strings...
Thread <5126>: total time=23 millisecs, average=0.000230
Thread <4101>: total time=71 millisecs, average=0.000710
Thread <6151>: Creating/destroying 100000 strings...
Thread <6151>: total time=24 millisecs, average=0.000240
Thread <7176>: Creating/destroying 100000 strings...
Thread <8201>: Creating/destroying 100000 strings...
Thread <9226>: Creating/destroying 100000 strings...
Thread <9226>: total time=23 millisecs, average=0.000230
Thread <10251>: Creating/destroying 100000 strings...
Thread <10251>: total time=23 millisecs, average=0.000230
Thread <7176>: total time=80 millisecs, average=0.000800
Thread <8201>: total time=80 millisecs, average=0.000800
Thread <2051>: Creating/destroying 100000 strings...
Thread <2051>: total time=23 millisecs, average=0.000230

Example 4: (a) The results of running teststring5.cpp; (b) results of teststring6.cpp.

Oct03: C++ String Performance


(a)
Thread <1024>: Creating/destroying 100000 strings...
Thread <1024>: total time=3 millisecs, average=0.000030

(b)
Thread <1026>: Creating/destroying 100000 strings...
Thread <1026>: total time=3 millisecs, average=0.000030
Thread <3076>: Creating/destroying 100000 strings...
Thread <3076>: total time=4 millisecs, average=0.000040
Thread <4101>: Creating/destroying 100000 strings...
Thread <4101>: total time=3 millisecs, average=0.000030
Thread <5126>: Creating/destroying 100000 strings...
Thread <5126>: total time=3 millisecs, average=0.000030
Thread <6151>: Creating/destroying 100000 strings...
Thread <6151>: total time=3 millisecs, average=0.000030
Thread <7176>: Creating/destroying 100000 strings...
Thread <7176>: total time=3 millisecs, average=0.000030
Thread <8201>: Creating/destroying 100000 strings...
Thread <8201>: total time=4 millisecs, average=0.000040
Thread <9226>: Creating/destroying 100000 strings...
Thread <9226>: total time=3 millisecs, average=0.000030
Thread <10251>: Creating/destroying 100000 strings...
Thread <10251>: total time=4 millisecs, average=0.000040
Thread <2051>: Creating/destroying 100000 strings...
Thread <2051>: total time=3 millisecs, average=0.000030

Example 5: (a) The results of running teststring7.cpp; (b) results of running teststring8.cpp.

Oct03: C++ String Performance

Table 1: Performance test results.

Terms of Service | Privacy Statement | Copyright © 2024 UBM Tech, All rights reserved.