Channels ▼
RSS

C/C++

Parallel Counting Sort (Part 2)


Another Processor

Tables 4 and 5, along with Figures 4 and 5, show performance measurements of algorithms on Intel dual-core E8300 processor. Table 3 compares the E8300 processor architecture with i7 860 processor.

[Click image to view at full size]
Table 3

In other words, the E8300 CPU is dual-core without hyperthreading, with 32 KB L1 cache for data and a separate one for instructions for each core, and a shared 6 MB L2 cache. The i7 860 CPU is a hyperthreaded quad-core with the same L1 cache per core, with 256 KB L2 cache per core, and an 8 MB shared L3 cache. Also, E8300 uses a 1.3 GHz 64-bit wide front side bus to access system memory, with a theoretical peak bandwidth of 10.4 GBytes/sec, whereas i7 860 uses a dual-channel 1.3 GHz DDR3 memory directly connected to the processor with 21 GBytes/sec of peak bandwidth.

[Click image to view at full size]
Table 4: unsigned 8-bit, E8300.

[Click image to view at full size]
Figure 4: unsigned 8-bit, E8300.

[Click image to view at full size]
Table 5: unsigned 16-bit, E8300.

[Click image to view at full size]
Figure 5: unsigned 16-bit, E8300.

8-bit algorithm on the dual-core E8300 processor, the non-parallel Counting Sort is faster than Intel IPP Radix Sort for arrays 10K and larger. This implementation is also between 20X and 55X faster than STL sort for arrays of 10K and larger. The parallel implementation shows speedup between 10K and 10M array size, but is slower for other sizes. Interestingly, the parallel version that parallelized both for loops is the fastest version on the E8300 processor, whereas the parallelized inner or outer for loop version are slower than the non-parallel implementation.

16-bit algorithm on the dual-core E8300 processor, the non-parallel implementation is fastest for all but 1M and 10M array sizes, where two parallel implementations beat it slightly. The fastest parallel implementation is "parallel outer", for 100K elements and larger. Two parallel implementations, "parallel inner" and "parallel both" are significantly slower than the non-parallel implementation, especially for small and medium array sizes. The "parallel both" implementation on the E8300 is slower than non-parallel implementation by up to 22X, which is not as much as on the i7 860 CPU, where the difference is up to 110X. This is most likely due to the differences in the cache architecture of the two processors, with E8300 having a 6 MB shared L2 cache, versus i7 860 having a 256 KB L2 cache per core (and the count array being 256 KB for the 16-bit algorithm).

Interestingly, for 16-bit algorithm, E8300 beats i7 860 performance for all array sizes that fit into the cache (by 2-4X), but is slower 2-3X once the array does not fit into the cache, probably due 2X higher memory bandwidth on the i7 860 processor. However, for 8-bit algorithm the i7 860 beats the E8300 between 2-6X for all but the smallest array sizes.

Multicore Speedup

Speedup is a useful measure of the benefit that multi-core processing brings. Speedup is defined as:

S = T1 / Tp

where P = number of processors or cores, and T1 is the time to execute the non-parallel algorithm, and Tp is the time to execute on P cores. Tables 6 and 7, along with Figures 6 and 7 show measurements of Parallel Counting Sort when limiting the number of cores being used (by using TBB task_scheduler_init constructor to specify the number of cores).

[Click image to view at full size]
Table 6: unsigned 16-bit, i7 860 "no parallel loops".

[Click image to view at full size]
Figure 6: unsigned 16-bit, i7 860 "no parallel loops", varied number of cores.

[Click image to view at full size]
Table 7: unsigned 8-bit, i7 860 "no parallel loops".

[Click image to view at full size]
Figure 7: unsigned 16-bit, i7 860 "no parallel loops", varied number of cores.

8-bit algorithm shows speedup of up to 3.3X, and shows speedup for all number of cores with or without hyperthreading for arrays of 10K elements of larger. Smaller arrays show as much as 10X slow down. 16-bit algorithm shows speedup of up to 2.4X for arrays of 10M elements and larger. Smaller arrays show as much as 50X slow down. All parallel versions are slower than non-parallel versions for small array sizes.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video