Channels ▼
RSS

Generic: Typed Buffers (II)


October 2001 C++ Experts Forum/Generic<Programming>


In a classic soap-opera style, let's start with highlights from the previous episode. We sketched a class template buffer that is much like std::vector, except that buffer does not have the notion of capacity and that it adds a couple of primitive functions, such as grow_noinit and shrink_nodestroy. In addition, the previous installment talked about type traits as an enabling technology for optimization. Finally, the villain threatened to talk about copying objects and memory allocation.

This article does not treat buffers directly, but rather two operations that you commonly perform with buffers: filling a buffer with a value and copying between buffers and various containers. We will compare with numbers several filling and copying methods.

Filling

You know — copy the same value to all objects in a range. The C++ Standard library provides two filling functions: std::fill and std::uninitialized_fill. The second one assumes the values to fill have not been initialized at all.

The simplest implementation of a generic memory filling function might look like this:

// Example 1: A simple generic filling routine
template <typename T>
void Fill(T* beg, T* end, const T& value)
{
  for (; beg != end; ++beg) *beg = value;
}

The question is, is this the best implementation money can buy? The stock answer is "it's the compiler's responsibility to generate optimal code" — so I set out to test this.

First, I looked at the code for Fill generated by Microsoft Visual C++ 6.0 (MSVC) and Metrowerks CodeWarrior 6.9 (MWCW). They both generate a straight loop in assembler. However, x86, like many modern processors, offers special assembler instructions for quickly filling chunks of memory. The C library function memset is likely to make use of them. Unfortunately, memset is able to set memory only to the same byte; as long as you have to set the memory to any pattern that's longer than one byte, you can't use memset anymore, so memset doesn't "scale" to generic code.

(Let me make an aside here. There is something sacrosanct about C's memory manipulation functions memset, memcpy, and memcmp. They are likely to be highly optimized by the compiler vendor, to the extent that the compiler might detect calls to these functions and replace them with inline assembler instructions — this is the case with MSVC. Consequently, using mem* functions is considered cool style by speed-hunting programmers.)

One idea that comes to mind is to implement filling in terms of copying. That is, you can exploit the fact that, once you partly filled the target range, you can copy the already filled part to the unfilled part using a fast copy routine — and voilà! The cool part is that you can double the size of the chunk that's copied every pass. For example, say you want to fill a range of 1,000 numbers of type double with 1.0. First step, you assign 1.0 to the first slot. Second step, you copy the just assigned slot to the slot next to it. Third step, you copy the two newly assigned slots to the next two adjacent slots. Fourth step, you copy four values and you get eight — and so on. In 10 steps, you filled the entire range of 1,000 doubles. The bulk of the job is actually done in the last step, when 512 slots have already been filled, and 488 of them are copied into the 488 remaining slots. Assuming you benefit from a fast copying routine with the signature:

template <typename T>
void QuickCopy(const T* sourceBegin, 
  const T* sourceEnd, T* destination);

then the FillByCopy algorithm looks as like:

template <typename T>
void FillByCopy(T* beg, T* end, const T& value)
{
  if (beg == end) return;
  *beg = value;
  T* dest = beg + 1;
  for (;;)
  {
    const size_t alreadyFilled = dest — beg;
    if (alreadyFilled >= end — dest)
    {
      QuickCopy(beg, beg + (end — dest), dest);
      break;
    }
    else
    {
      QuickCopy(beg, dest, dest);
      dest += alreadyFilled;
    }
  }
}

FillByCopy is a cool algorithm if QuickCopy is indeed a fast copy routine. FillByCopy is somewhat similar to the Russian Peasant algorithm for computing the integral power of a number with minimal multiplication [1]. Many people invented fill-by-copy in a variety of situations — one that comes to mind is filling a whole hard disk starting with a one-byte file. Were it an original idea, I would have hastened to call filling by copy "The Romanian Peasant" algorithm.

So I excitedly put in place a test, and I got interesting results. But first, let me introduce another contender to you.

Duff's Device

Duff's Device [2] is a C coding trick for speeding up loops. The basic idea is that if, in a for loop, the operation performed is cheap enough (such as, um, an assignment) — then testing the loop condition (beg != end in Example 1 above) occupies a large percentage of the time spent in the loop. Then, the loop ought to be partially unrolled so that several operations are done in a shot, and the test is done less frequently. For example, if you fill a range of objects, you may want to assign to two or more consecutive objects in a shot inside the loop.

You must pay attention to the details of limit conditions, etc. Here's where Duff's Device is an original, creative solution. Let's take a quick look at a Duff's Device-based implementation of a generic filling algorithm:

template <class T> inline void FillDuff
  (T* begin, T* end, const T& obj)
{
  switch ((end - begin) & 7)
  {
  case 0: 
    while (begin != end)
    {
      *begin = obj; ++begin;
  case 7: *begin = obj; ++begin;
  case 6: *begin = obj; ++begin;
  case 5: *begin = obj; ++begin;
  case 4: *begin = obj; ++begin;
  case 3: *begin = obj; ++begin;
  case 2: *begin = obj; ++begin;
  case 1: *begin = obj; ++begin;
    }
  }
}

Now, if you haven't seen it before, go back and look again because it's possible you missed something. The function consists of a switch statement, whose labels land the execution path in a precise spot inside of a while loop (or, in one case, outside). The expression inside the switch computes the remainder of dividing the number of objects by eight. Depending on the remainder, execution starts upper or lower in the while loop and falls through from there. (There's no break.) This way Duff's Device solves the margin condition neatly and concisely. By the way, why is the "case 0:" label outside the loop, breaking a nice symmetry? The only reason is handling empty sequences. When the remainder is zero, an extra test is performed to weed out empty sequences. It all falls in place quite nicely.

The net result is that the test begin != end is performed eight times less frequently. Therefore, the contribution of the test itself to the duration of the loop was slashed by a factor of eight. Of course, you can experiment with other factors as well.

The possible inefficiencies of Duff's Device stem from code bloating (some processors tend to execute tight loops better than larger loops) and its unusual structure. Optimizers might be built to say "aha!" when the familiar straight loop structure comes in sight, but they might be bewildered and generate conservative code when they see something trickier.

Numbers

If there is one thing to know about optimization (after you're past the "don't do it" and the "don't do it yet" phases), that thing must be: optimizations must be validated by measurements. The filling algorithms described above might sound ingenious, but only testing can validate them.

I wrote a simple test program that fills an array of COUNT variables of type double using one of the three algorithms described above: the for loop, fill-by-copy, and Duff's Device. The test is repeated REPEAT times. I tested each algorithm several times on a P III/750 MHz with three compilers: MSVC, MWCW, and GNU's g++ 2.95.2. By varying COUNT and REPEAT, I got the following results:

  • When filling a large buffer (COUNT 100,000 and beyond), the straight for loop and Duff's Device behave practically the same. The fill-by-copy algorithm actually performs worse by 1-5% [3].
  • When filling ranges of 10,000 doubles, FillByCopy on MSVC and MWCW was faster by 23%, but g++ was most comfortable with the for loop, yielding the best results by far (20%-80%) across all compilers and methods.
  • When filling ranges of 1,000 doubles, MSVC and MWCW generated similar results: Duff's Device is 20% faster, and fill-by-copy is 75% faster, compared to the straight for loop. Again, g++ had a separate opinion and yielded amazingly faster results for all methods (100% faster than the competition's best).
  • For 100 doubles, the results stayed about the same for MSVC and MWCW. Again, g++ jovially performed the same task in roughly half the time (Duff's device and fill-by-copy being 12% faster than the for loop).

We find the explanation of this behavior by looking at the architecture of current PCs. The processor is 5-10 times faster than the main memory bus. There are two levels of cache to enhance speed of memory access: one is right inside the processor (Level I), and the other is next to the processor (and comes in the same package in Pentium III).

In the best case, all memory manipulated in one operation fits in the Level I cache. In the worst case, you make random, scattered accesses, so you keep missing the cache and end up hitting on the main memory.

FillByCopy is cache-unfriendly, because in each pass it hits two memory areas at a time — the source and the destination area. If you, for example, fill 1 MB worth of data, the last step will copy 512 KB from one location to another in memory. This will make the cache manager unhappy, or at least less happy than in the case of a straight filling loop. That's why FillByCopy is slightly slower than a for loop for filling large memory chunks.

(Exercise: There is a simple change you can make to one line of code of FillByCopy to improve its cache friendliness. Hint: think of locality of access.)

When filling large amounts of data, you can't benefit too much from caching, so the filling speed will saturate at the main memory bandwidth. The optimization you bring to the loop itself does not bring much improvement because the bottleneck is dictated by the memory, not by processor activity. No matter how smart you write the loop, use registers, or unroll, the processor will wait for the main memory anyway. That's why Duff's Device and the for loop perform the same for large memory chunks.

When filling smaller amounts of data, the landscape changes. The data is much more likely to fit in the cache, which is as fast as the processor. Now, the code executed by the processor dictates the speed of the loop.

memcpy (the routine ultimately used by FillByCopy in my test case) uses a special looping assembler statement (rep stos in x86 jargon). For cache-to-cache copying, the assembler statement is faster than a loop built out of jumps, assignments, and comparisons. That's why FillByCopy is fastest for medium amounts of data. Likewise, Duff's Device also has an edge over the for loop because it performs less comparisons.

Quick Copying

Another common buffer operation is to copy data to and from buffers. In a quest for a fast copy routine, I tested three variants with data of type double: a straight for loop, a Duff's Device-based implementation, and memcpy. (Oddly enough, although there is a fill-by-copy algorithm, a copy-by-fill algorithm does not exist.)

There's not a lot of competition to expect — memcpy ought to leave everything else in the dust. After all, memcpy is the highly optimized, finely tuned copy method provided by your standard library implementer. Too bad it doesn't apply to all types! Then Duff's Device should compare to the for loop pretty much the same as for the filling case. Of course, as it often happens, reality turns out to be a bit different. The results were as follows:

  • When copying large amounts of data (megabytes), the speed is about the same for all methods (and all compilers). This is, again, because for large amounts of data the main memory bandwidth dictates the operating speed.
  • When copying a smaller quantity of data, the compilers start disagreeing. For example, when copying 100,000 objects of type double:
    • MWCW's for loop was pretty slow; Duff's Device and memcpy both bypassed it by 20%.
    • MSVC and g++ were equal-opportunity code generators: all methods performed about the same — and about as good as MWCW's best.
  • When decreasing the quantity of data copied, divergence becomes deeper. Below are the results for copying 10,000 doubles:
    • MSVC generated 25% faster code for Duff's Device and 67% faster code for memcpy.
    • MWCW (be prepared) generated 9% faster code for Duff's Device and 20% slower code for memcpy, compared to a for loop. The for loop speed was about the same as MSVC's.
    • g++ was really cool. For starters, g++'s for loop was 42% faster than MSVC's and MWCW's. Then, memcpy performed as good as MSVC's memcpy, and 10% faster than g++'s for loop. But g++ implementers must love Duff's Device because the Duff-based copy blew away all competition, performing 11% better than memcpy, which is supposed to be la crème de la crème!
  • When decreasing the amount of data to 1,000 doubles:
    • MSVC's for loop was really slow, Duff was 90% faster, and memcpy was 200% faster.
    • MWCW generated a better for loop, and then Duff performed 5% faster and memcpy performed 130% faster.
    • g++ generated a for loop as good as MWCW's. Duff was 75% better, and memcpy was 130% better.
    • All compilers yielded similar, and best, results for memcpy.
  • And, finally, when copying 100 doubles, I got similar results as with 1,000 doubles, except that g++ preferred Duff's Device again, which was speedier by 75% than the for loop, and 40% than memcpy.

If you are confused, so am I. It seems hard to decide which method is best across all compilers and all data amounts. It also came as a surprise to me that a freely available, open-source compiler not only bypassed, but crushed, two respected commercial compilers.

All these measurements make me envious of compiler-targeted library writers — they have the luxury of optimizing their code for one compiler and one computer architecture. However, this column's title is "Generic <Programming>," not "Specific<Programming>" — which makes it all the more challenging and interesting. By the way, let's now talk about generic in "generic programming" — how generic are the routines we just talked about?

Considerations on Genericity

So far, we have assumed that the type being tested can be copied with memcpy. In reality, only a subset of C++ types has this property. That subset is called POD ("Plain Old Data"). The POD set includes all primitive types and C-like structures with only public member variables, no constructors and destructors, and no virtual functions. For all other types, the effect of copying them with memcpy is undefined.

This gives Duff's Device an important advantage over FillByCopy and memcpy: Duff's Device is 100% "portable" across types. Better yet, Duff's Device works not only with in-memory ranges iterated with pointers, it works with any random iterator type.

It is important, however, to figure out if you can copy a type with memcpy This is very interesting not only for speed purposes, but also, paradoxically, for allocation purposes, as you'll see in the next installment. Defining a MemCopyable flag is a matter of opening the TypeTraits namespace (defined in the previous installment) and planting a definition like this:

namespace TypeTraits
{
  template <typename T> struct IsMemCopyable
  {
    enum { value = IsPrimitive<T>::value != 0 };
  };
}

Then you can easily write a NativeCopy<T> generic routine that dispatches at compile time to memcpy or to a more conservative method, depending on TypeTraits::IsMemCopyable<T>::value. This task is left as an exercise to the reader.

Duff's Device works well only when the type to copy/fill is cheap enough to copy/fill. Otherwise, the comparison's cost becomes negligible in (here's a pun for you) comparison to the assignment itself. So we need to define a type trait CheapToCopy as shown below:

namespace TypeTraits
{
  template <typename T> struct CheapToCopy
  {
    enum { value =
      CopyThrows<T>::value == 0 && sizeof(T) <=
      4 * sizeof(T*) };
  };
}

Interestingly, CheapToCopy defaults to a little Sherlock Holmesism: a type is considered cheap to copy if its copy operation does not throw (which often implies that no costly resources are allocated) and its size is below a threshold. If a type is more than four times bigger than a pointer, that means that the loop is kind of unrolled already.

Using TypeTraits::CheapToCopy<T>::value, you can decide at compile time between using Duff's Device or something else.

Conclusion

Things are clearly hazy, aren't they? First off, maybe it came as a surprise to you that there's more than one way to fill and copy objects. Then, there's no single variant of fill and copy that works best on all compilers, data sets, and machines. (I guess if I tested the same code on a Celeron, which has less cache, I would have gotten very different results. To say nothing about other architectures.)

As a rule of thumb, it's generally good to use memcpy (and consequently fill-by-copy) if you can — for large data sets, memcpy doesn't make much difference, and for smaller data sets, it might be much faster. For cheap-to-copy objects, Duff's Device might perform faster than a simple for loop. Ultimately, all this is subject to your compiler's and machine's whims and quirks.

There is a very deep, and sad, realization underlying all this. We are in 2001, the year of the Spatial Odyssey. We've done electronic computing for more than 50 years now, and we strive to design more and more complex systems, with unsatisfactory results. Software development is messy. Could it be because the fundamental tools and means we use are low-level, inefficient, and not standardized? Just step out of the box and look at us — after 50 years, we're still not terribly good at filling and copying memory.

I'm approaching the word limit for this article without exhausting these buggers, sorry, buffers. Moving objects is another important topic, and we haven't even started talking about memory allocation (the Caudine Forks, remember?). But, as a movie script says, the mind can only enjoy what the butt can endure, so I'll have to stop here for now. See you soon.

Bibliography and Notes

[1] Matt Austern. "The Russian Peasant Algorithm," C++ Report, June 2000. Matt didn't invent the algorithm, but the article gives a very good description of it.

[2] See <www.lysator.liu.se/c/duffs-device.html>.

[3] It must be said here that g++ must be using some quantum effect for filling data, an effect that works only sometimes. In repeated tests with filling 10 million doubles, Duff's Device compiled by g++ sometimes unexplainably yielded 40%-50% faster results than anything else on Earth. Frailty, thy name is speed measuring...

Andrei Alexandrescu is a Development Manager at RealNetworks Inc. (<www.realnetworks.com>), based in Seattle, WA, and author of the acclaimed book Modern C++ Design. He may be contacted at www.moderncppdesign.com. Andrei is also one of the featured instructors of The C++ Seminar (<www.gotw.ca/cpp_seminar).


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video