Another Processor
Tables 4 and 5, along with Figures 4 and 5, show performance measurements of algorithms on Intel dual-core E8300 processor. Table 3 compares the E8300 processor architecture with i7 860 processor.
In other words, the E8300 CPU is dual-core without hyperthreading, with 32 KB L1 cache for data and a separate one for instructions for each core, and a shared 6 MB L2 cache. The i7 860 CPU is a hyperthreaded quad-core with the same L1 cache per core, with 256 KB L2 cache per core, and an 8 MB shared L3 cache. Also, E8300 uses a 1.3 GHz 64-bit wide front side bus to access system memory, with a theoretical peak bandwidth of 10.4 GBytes/sec, whereas i7 860 uses a dual-channel 1.3 GHz DDR3 memory directly connected to the processor with 21 GBytes/sec of peak bandwidth.
8-bit algorithm on the dual-core E8300 processor, the non-parallel Counting Sort is faster than Intel IPP Radix Sort for arrays 10K and larger. This implementation is also between 20X and 55X faster than STL sort for arrays of 10K and larger. The parallel implementation shows speedup between 10K and 10M array size, but is slower for other sizes. Interestingly, the parallel version that parallelized both for loops is the fastest version on the E8300 processor, whereas the parallelized inner or outer for loop version are slower than the non-parallel implementation.
16-bit algorithm on the dual-core E8300 processor, the non-parallel implementation is fastest for all but 1M and 10M array sizes, where two parallel implementations beat it slightly. The fastest parallel implementation is "parallel outer", for 100K elements and larger. Two parallel implementations, "parallel inner" and "parallel both" are significantly slower than the non-parallel implementation, especially for small and medium array sizes. The "parallel both" implementation on the E8300 is slower than non-parallel implementation by up to 22X, which is not as much as on the i7 860 CPU, where the difference is up to 110X. This is most likely due to the differences in the cache architecture of the two processors, with E8300 having a 6 MB shared L2 cache, versus i7 860 having a 256 KB L2 cache per core (and the count array being 256 KB for the 16-bit algorithm).
Interestingly, for 16-bit algorithm, E8300 beats i7 860 performance for all array sizes that fit into the cache (by 2-4X), but is slower 2-3X once the array does not fit into the cache, probably due 2X higher memory bandwidth on the i7 860 processor. However, for 8-bit algorithm the i7 860 beats the E8300 between 2-6X for all but the smallest array sizes.
Multicore Speedup
Speedup is a useful measure of the benefit that multi-core processing brings. Speedup is defined as:
S = T1 / Tp
where P = number of processors or cores, and T1 is the time to execute the non-parallel algorithm, and Tp is the time to execute on P cores. Tables 6 and 7, along with Figures 6 and 7 show measurements of Parallel Counting Sort when limiting the number of cores being used (by using TBB task_scheduler_init constructor to specify the number of cores).
8-bit algorithm shows speedup of up to 3.3X, and shows speedup for all number of cores with or without hyperthreading for arrays of 10K elements of larger. Smaller arrays show as much as 10X slow down. 16-bit algorithm shows speedup of up to 2.4X for arrays of 10M elements and larger. Smaller arrays show as much as 50X slow down. All parallel versions are slower than non-parallel versions for small array sizes.


