## Exploiting the power of 64-bit platforms

*Anatoliy is currently working on projects with the National Center for Biotechnology Information and National Institutes of Health. He can be contacted at anatoliy_kuznetsov@yahoo.com.*

64-bit addressing and operations are useful in applications that deal with large amounts of data, such as scientific and engineering applications, large databases, and the like. A number of CPUs and operating systems natively support 64-bit computing. Probably their biggest advantage is the huge available address space, in which applications can allocate more than 4GB of memory, easily maintain large files, and more. But to fully utilize the power of 64-bit CPUs, applications need to exploit the wider machine word. In this article, I focus on performance-optimization techniques that take advantage of that power.

### 64-Bit Safety

Unfortunately, much of today's software doesn't take advantage of 64-bit microprocessors and often can't even be compiled and operated in 64-bit mode. Consequently, software runs in 32-bit compatibility mode—a clear waste of silicon. Moreover, there are a number of common C coding "malpractices": habits formed on 32-bit systems that break when the code meets a real 64-bit CPU:

- Reliance on the fact that the size of a *pointer* equals the size of *int*. On 64-bit systems, *sizeof(void\*) == 8* while *sizeof(int)* usually remains 4. Ignoring this can result in an incorrect assignment and a crash.
- Reliance on a particular byte order in the machine word.
- Using type *long* and presuming that it always has the same size as *int*. Direct assignment between these types causes value truncation and leads to rare, difficult-to-detect problems.
- Alignment of stack variables. In some cases, stack variables can have addresses not aligned on 8-byte boundaries. If you typecast these variables to 64-bit types, you can get into trouble on some systems. But if you place a 64-bit variable (*long* or *double*) on the stack, it is guaranteed to be aligned. Heap-allocated memory is aligned, too.
- Different alignment rules in structures and classes. On 64-bit architectures, structure members are often aligned on 64-bit boundaries, which leads to problems when sharing binary data through IPC, the network, or disk. Packing data structures to save resources can cause problems if alignment is not taken into consideration.
- Pointer arithmetic. When a 64-bit pointer is incremented as a 32-bit pointer, and vice versa: the 64-bit pointer is incremented by 8 bytes and the 32-bit pointer by 4 bytes.
- In the absence of function prototypes, the return value is assumed to be *int*, which can cause value truncation.
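To make the first pitfall concrete, here is a minimal sketch (the helper names are my own, not from the article) showing how to test the pointer/*int* size assumption and how to hold a pointer in an integer portably:

```cpp
#include <cstdint>

// True only when a pointer can round-trip through int without truncation.
// On LP64 systems this is false: sizeof(void*) == 8, sizeof(int) == 4.
bool pointer_fits_in_int()
{
    return sizeof(void*) <= sizeof(int);
}

// intptr_t is the portable integer type for holding a pointer value;
// using int here would silently drop the upper 32 bits on 64-bit systems.
std::intptr_t pointer_to_integer(void* p)
{
    return reinterpret_cast<std::intptr_t>(p);
}
```

Code that must carry pointers through integer variables should use *intptr_t*/*uintptr_t* rather than *int* or *long*.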

### Parallel Programming: Getting the Most From Each Cycle

The key to high-performance 64-bit C programming is wide integer and FPU registers. CPU registers are at the top of the food chain—the most expensive type of computer memory there is. In 64-bit CPUs, registers are 8 bytes wide; a 128- or 256-bit-wide memory bus is also common.

Figure 1 illustrates typical operation on a 32-bit system. The CPU crunches data coming from memory 4 bytes at a time. Figure 2 shows that a 64-bit system having wide registers can process 8 bytes at a time.

Listing One performs a bitwise XOR operation on a block of memory representing an integer-based bit set. You can optimize this code for 64-bit mode. Listing Two, for instance, relies on the *long long* C type, which is not supported by some compilers. As you can see, I did not change the total size of the bit-set block, although it now takes half as many operations to combine the vectors. Listing Two reduces the loop overhead and is equivalent to loop unrolling with a factor of 2. The disadvantage of this code, of course, is its pure 64-bitness: compiled for a 32-bit target, it loses its advantage, because each 64-bit operation must be emulated with several 32-bit instructions.

You can make further modifications, as in Listing Three, which uses wide registers to do the job on both 32-bit and 64-bit CPUs. When typecasting like this, remember pointer alignment. If you blindly typecast *int* pointers to 64-bit *long* pointers, the address might not be 8-byte aligned. On some architectures, this causes a bus error and crash; on others, it leads to performance penalties. Listing Three is not safe because the 32-bit *int* arrays placed on the stack may be only 4-byte aligned, in which case the program can crash. Heap allocation (*malloc*) is a guarantee against this occurring.
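One defensive pattern, sketched below under the assumption that a runtime check is acceptable (the function name *bit_xor_safe* is hypothetical, not one of the article's listings), is to take the wide-word fast path only when both blocks are 8-byte aligned, and fall back to the plain 32-bit loop otherwise:

```cpp
#include <cstdint>
#include <cstddef>

// XOR src into dst (size counted in 32-bit words). The 64-bit fast
// path runs only when both pointers are 8-byte aligned and the word
// count is even; otherwise the safe 32-bit loop is used.
void bit_xor_safe(int* dst, const int* src, std::size_t size)
{
    bool aligned = ((reinterpret_cast<std::uintptr_t>(dst) |
                     reinterpret_cast<std::uintptr_t>(src)) & 7) == 0;
    if (aligned && (size % 2 == 0))
    {
        long long*       d = reinterpret_cast<long long*>(dst);
        const long long* s = reinterpret_cast<const long long*>(src);
        for (std::size_t i = 0; i < size / 2; ++i)
            d[i] ^= s[i];   // 8 bytes per iteration
    }
    else
    {
        for (std::size_t i = 0; i < size; ++i)
            dst[i] ^= src[i];   // 4 bytes per iteration
    }
}
```

Both branches compute the same result, so the check costs only a few instructions per call, not per element.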

### Bit Counting

One of the most important operations in bit-set arithmetic is counting the number of 1-bits in bit strings. The default method splits each integer into four characters and looks up a table containing precalculated bit counts. This linear approach can be improved by using a 16-bit-wide table, but at the cost of a much larger table. Moreover, a larger table will likely introduce additional memory-fetch operations, interfere with the CPU cache, and won't deliver a significant performance boost.
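The table-driven method just described can be sketched as follows (*popcount_table* and its lookup table are illustrative, not one of the article's listings):

```cpp
// 256-entry table: bit_count8[v] = number of 1-bits in the byte v.
static unsigned char bit_count8[256];

static void init_table()
{
    for (int v = 0; v < 256; ++v)
    {
        int cnt = 0;
        for (int b = v; b; b >>= 1)
            cnt += b & 1;
        bit_count8[v] = (unsigned char) cnt;
    }
}

// Split a 32-bit word into four bytes and sum the precalculated counts.
int popcount_table(unsigned w)
{
    return bit_count8[w & 0xFF]
         + bit_count8[(w >> 8)  & 0xFF]
         + bit_count8[(w >> 16) & 0xFF]
         + bit_count8[(w >> 24) & 0xFF];
}
```

Four dependent table loads per word is the cost that the parallel, table-free method in Listing Four avoids.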

As an alternative, I present code inspired by "Exploiting 64-Bit Parallelism" by Ron Gutman (*DDJ*, September 2000). Listing Four does not use lookup tables, but instead computes the bit counts of two 32-bit *int*s in parallel, packed into one 64-bit word.

### Bit String Lexicographical Comparison

Another application for 64-bit optimization is lexicographical comparison of bit sets. The straightforward implementation takes two words out of the bit sequence and performs bit-over-bit shifting with comparison. This is an iterative algorithm with *O(N/2)* complexity, where *N* is the total number of bits. Listing Five illustrates iterative comparison of two words. This algorithm cannot be significantly improved by 64-bit parallelization. However, Listing Six, an alternative numerical algorithm with complexity proportional to half the number of machine words (not bits), has good 64-bit potential.

### The Challenge

The $64,000 question here is whether 64-bit is worth the trouble. Contemporary 32-bit CPUs are superscalar, speculative execution machines that often provide several execution blocks that can execute several commands in parallel and out-of-order, without intervention from programmers. The truth is that 64-bit processors exhibit the same properties and can run code in parallel—but only in 64 bits. Plus, some architectures, such as Intel Itanium, specifically emphasize parallel programming and concentrate efforts on explicit optimization on the compiler level. Making code 64-bit ready and optimized is a necessity in this case.

Another objection is that performance is often limited not by raw MHz-based CPU performance, but by CPU-memory bandwidth, which is bus limited; our algorithms are not going to show top performance anyway. This is a fact of life, and hardware designers know it. We are already seeing high-performance dual-channel memory controllers and steady increases in memory speed. This effort certainly makes bus bottlenecks less critical, and optimized 64-bit algorithms are going to be better prepared for modern hardware.

### Algorithmic Optimization, Binary Distances

One candidate for 64-bit optimization is the computing of binary distances between bit strings. Binary distances are used in data mining and AI applications doing clustering and finding similarities between objects, which are described by binary descriptors (bit strings). The optimization hotspot here is a distance algorithm, which can be repeated for every pair of objects in the system.

The best-known distance metric is the Hamming distance, which is the minimum number of bits that must be changed to convert one bit string into another. In other words, you combine the bit strings using bitwise XOR and count the number of bits ON in the result.

The starting point for the analysis is code like Listing Seven. The obvious optimization here is to get rid of the temporary bitset and compute both XOR and population count in parallel. The creation of temporaries is a "favorite in-house sport" of C++ compilers and wastes performance on reallocations and memory copying; see Listing Eight.

This optimization immediately achieves several goals: reduction of memory traffic, better register reuse, and, of course, 64-bit parallelism (see Figure 3). The essential goal here is to improve the balance between CPU operations and memory loads. The objective has been achieved by combining the algorithms in Listings Three and Four.

This optimization technique can be further extended to any distance metric that can be described in terms of logical operations and bit counting. What's interesting is that the effect of optimization is even more pronounced for more complex metrics like the Tversky Index, Tanimoto, Dice, cosine, and others.

To understand why this is happening, consider the Tversky Index:

TI = BITCOUNT(A & B) / [a*BITCOUNT(A - B) + b*BITCOUNT(B - A) + BITCOUNT(A & B)]

The formula includes three operations: *BITCOUNT_AND(A, B)*, *BITCOUNT_SUB(A, B)*, and *BITCOUNT_SUB(B, A)*. All three can be combined into one pipeline; see Figure 4. This technique improves data locality and better reuses CPU caches. It also means fewer CPU stalls and better performance; see Listing Nine.

### Is There Life After 64-Bits?

Many of the algorithms I've described can be coded using vector-based, single-instruction multiple-data (SIMD) instructions. SIMD-enabled CPUs include special, extended (64- or 128-bit) registers and execution units capable of loading several machine words and performing operations on all of them in parallel. The most popular SIMD engines are SSE by Intel, 3DNow! by AMD, and AltiVec by Motorola, Apple, and IBM. SIMD registers are different from general-purpose registers; they do not let you execute flow-control operations such as *IF*. This makes SIMD programming rather difficult, and, needless to say, portability of SIMD-based code is limited. However, a parallel 64-bit optimized algorithm can conceptually be converted to a 128-bit SIMD-based algorithm with little effort. For instance, in Listing Ten, an XOR algorithm is implemented using the SSE2 instruction set; I used compiler intrinsics compatible with the Intel C++ compiler.

### For More Information

Ron Gutman. "Exploiting 64-Bit Parallelism." *DDJ*, September 2000.

Brian T. Luke. "Clustering Binary Objects" (http://fconnyx.ncifcrf.gov/~lukeb/binclus.html).

Ian Witten, Alistair Moffat, and Timothy Bell. *Managing Gigabytes: Compressing and Indexing Documents and Images*. Morgan Kaufmann, 1999. ISBN 1558605703.

Wi-Fen Lin and Steven K. Reinhardt. "Reducing DRAM Latencies with an Integrated Memory Hierarchy Design." *Seventh International Symposium on High-Performance Computer Architecture (HPCA'01).*

Intel Corp. "Intel Pentium 4 and Intel Xeon Processor Optimization."

Henry S. Warren, Jr. *Hacker's Delight*. Addison-Wesley Professional, 2002. ISBN 0201914654.

**DDJ**

**Listing One**

```cpp
{
    int a1[2048];
    int a2[2048];
    int a3[2048];
    for (int i = 0; i < 2048; ++i)
    {
        a3[i] = a1[i] ^ a2[i];
    }
}
```

**Listing Two**

```cpp
{
    long long a1[1024];
    long long a2[1024];
    long long a3[1024];
    for (int i = 0; i < 1024; ++i)
    {
        a3[i] = a1[i] ^ a2[i];
    }
}
```

**Listing Three**

```cpp
{
    int a1[2048];
    int a2[2048];
    int a3[2048];
    long long* pa1 = (long long*) a1;
    long long* pa2 = (long long*) a2;
    long long* pa3 = (long long*) a3;
    for (int i = 0; i < sizeof(a1) / sizeof(long long); ++i)
    {
        pa3[i] = pa1[i] ^ pa2[i];
    }
}
```

**Listing Four**

```cpp
int popcount(long long b)
{
    b = (b & 0x5555555555555555ULL) + (b >> 1 & 0x5555555555555555ULL);
    b = (b & 0x3333333333333333ULL) + (b >> 2 & 0x3333333333333333ULL);
    b = b + (b >> 4) & 0x0F0F0F0F0F0F0F0FULL;
    b = b + (b >> 8);
    b = b + (b >> 16);
    b = b + (b >> 32) & 0x0000007F;
    return (int) b;
}
```

**Listing Five**

```cpp
int bitcmp(int w1, int w2)
{
    while (w1 != w2)
    {
        int res = (w1 & 1) - (w2 & 1);
        if (res != 0)
            return res;
        w1 >>= 1;
        w2 >>= 1;
    }
    return 0;
}
```

**Listing Six**

```cpp
int compare_bit_string(int a1[2048], int a2[2048])
{
    // a1 and a2 decay to pointers here, so sizeof(a1) would yield the
    // pointer size; compute the word count from the known array length.
    const int word_count = 2048 * sizeof(int) / sizeof(long long);
    long long* pa1 = (long long*) a1;
    long long* pa2 = (long long*) a2;
    for (int i = 0; i < word_count; ++i)
    {
        long long w1, w2, diff;
        w1 = pa1[i];
        w2 = pa2[i];
        diff = w1 ^ w2;
        if (diff)
        {
            return (w1 & diff & -diff) ? 1 : -1;
        }
    }
    return 0;
}
```

**Listing Seven**

```cpp
#include <bitset>
using namespace std;

const unsigned BSIZE = 1000;
typedef bitset<BSIZE> bset;

unsigned int hamming_distance(const bset& set1, const bset& set2)
{
    bset xor_result = set1 ^ set2;
    return xor_result.count();
}
```

**Listing Eight**

```cpp
{
    unsigned int hamming;
    int a1[2048];
    int a2[2048];
    long long* pa1 = (long long*) a1;
    long long* pa2 = (long long*) a2;
    hamming = 0;
    for (int i = 0; i < sizeof(a1) / sizeof(long long); ++i)
    {
        long long b;
        b = pa1[i] ^ pa2[i];
        b = (b & 0x5555555555555555ULL) + (b >> 1 & 0x5555555555555555ULL);
        b = (b & 0x3333333333333333ULL) + (b >> 2 & 0x3333333333333333ULL);
        b = b + (b >> 4) & 0x0F0F0F0F0F0F0F0FULL;
        b = b + (b >> 8);
        b = b + (b >> 16);
        b = b + (b >> 32) & 0x0000007F;
        hamming += b;
    }
}
```

**Listing Nine**

```cpp
{
    double ti;
    int a1[2048];
    int a2[2048];
    long long* pa1 = (long long*) a1;
    long long* pa2 = (long long*) a2;
    ti = 0;
    for (int i = 0; i < sizeof(a1) / sizeof(long long); ++i)
    {
        long long b1, b2, b3;
        b1 = pa1[i] & pa2[i];
        b2 = pa1[i] & ~pa2[i];
        b3 = pa2[i] & ~pa1[i];
        b1 = popcount(b1);
        b2 = popcount(b2);
        b3 = popcount(b3);
        ti += double(b1) / double(0.4 * b2 + 0.5 * b3 + b1);
    }
}
```

**Listing Ten**

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

void bit_xor(unsigned* dst, const unsigned* src, unsigned block_size)
{
    const __m128i* wrd_ptr = (__m128i*) src;
    const __m128i* wrd_end = (__m128i*) (src + block_size);
    __m128i* dst_ptr = (__m128i*) dst;
    do
    {
        __m128i xmm1 = _mm_load_si128(wrd_ptr);
        __m128i xmm2 = _mm_load_si128(dst_ptr);
        xmm1 = _mm_xor_si128(xmm1, xmm2);
        _mm_store_si128(dst_ptr, xmm1);
        ++dst_ptr;
        ++wrd_ptr;
    } while (wrd_ptr < wrd_end);
}
```