 Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

# Fast, High-Quality, Parallel Random-Number Generators: Comparing Implementations

In the first part of this two-part series, Fast, High-Quality, Parallel Random Number Generators , I described a new family of fast random number generators based on the subcycles that can be implemented using simple arithmetic. That article provides a detailed description of the concepts behind these generators as well as some examples in C. In this article, the final in this two-part series, I examine the kinds of subcycle generators that exist and describe their respective benefits.

To review, a simple C statement such as `x-=rotl(x,21);` will produce a surprisingly long cycle of quasi-random numbers with a fixed, known period. rotl is a simple macro that performs a left rotation. The previous expression actually can produce several cycles of different periods, and you usually want to use the longest one. Many other simple expressions also produce such cycles. If the expression is reversible, meaning you can calculate it backward and thus undo it, you can get periods of nearly 232 on 32-bit processors. If an expression is irreversible, you get periods of around 220 or 240 on 32-bit and 64-bit processors respectively, if you select constants carefully. I've named these subcycle-based generators using the order of their operations when done on a calculator, so the example above is RS (for rotate, subtract).

A simple subcycle generator like RS is not random enough for most uses, but several (typically three) such generators can be computed and their results combined using adds or XORs. Such a combination generator, including all those shown in this article, is random enough to pass the stringent tests in the BigCrush test-suite . Listing One is an example of a good 32-bit combination generator:

Listing One: A good 32-bit combination generator.

```void Seed3Lsr (uint32_t seed) {
uint32_t n;
x = 1; y = 1; z = 1;
for (n=((seed>>22)&0x3ff)+20; n>0; n--) { x -= (x<<3); x = rotl(x,17); }
for (n=((seed>>11)&0x7ff)+20; n>0; n--) { y -= (y<<7); y = rotl(y,21); }
for (n=((seed    )&0x7ff)+20; n>0; n--) { z -= (z<<5); z = rotl(z, 9); }
}

uint32_t Rand3Lsr() {  // Combined period = 2^94.68
x -= (x<<3); x = rotl(x,17); // LSR: Period=4077769180=2*2*5*203888459
y -= (y<<7); y = rotl(y,21); // LSR: Period=3996418898=2*1998209449
z -= (z<<5); z = rotl(z, 9); // LSR: Period=3905814513=3*67*97*200329
return x + y + z;
}
```

All three subcycle generators in Listing One are of type LSR (Left-shift, Subtract, Rotate-left), which is reversible, and the constants chosen above for the shifts and rotations yield periods near 232. When you multiply these individual periods together (and divide out a common factor of 2), you get the combined period of 294.68, which is close to 296, the maximum possible. Based on my tests, periods of at least 248 are adequate.

Subcycle generators cannot be seeded directly because they might not end up in the desired (longest) cycle. So the seeding function above, `Seed3Lsr,` sets each generator to a value known to be in the correct cycle (which is 1 for all of them), and steps it forward (within the cycle) based on bit-fields in the seed.

### Scaling a Random Number to N Choices

Before diving into the subcycle generators, it's worth explaining ways to solve the common problem of creating a uniform random variable in the range of 0 to `N-1`, which is `N` choices. Define `B = ceiling(log2(N))`, which is simply the number of bits needed to hold the number `N-1`. If `N` is a power of two, it's easy: Simply use the `B` least significant bits (LSBs) of the random number generator. In C, this would be `n=r&0x1fu;` where `r `is the output of the random number generator, `n` is the number you want, and the mask `0x1fu` isolates `B=5` bits in this example. This method works well for subcycle generators, but do not use the LSBs of other types of generators, as they often have poor randomness.

In general, `N` is not a power of two, so some math is needed. If` B` is less than half the processor's word-width of 32 (i.e.,` B < 16`), you can do this:

`n = ((r>>B) * N) >> (32-B);`

The right shift by `B` prevents the multiplication from overflowing. For a 64-bit processor, change the 32 above to 64, and that calculation will work for `B` up to 32. On a 32-bit processor, if `B` is 16 or more, the calculation above would lose precision, which can be solved by using the upper half of a 64-bit product as shown in Listing Two:

Listing Two: Improving precision on 32-bit processors.

```uint32_t r, n;
uint64_t product;
r = RandomNumberFunction();
product = (uint64_t)r * (uint64_t)N;
n = product >> 32;
```

The gcc compiler is smart enough to do a 32x32 multiply here instead of a 64x64, but it's prudent to check the output of other compilers.

Returning to the topic of generators, Part I of this article only gave us a glance at a few of the subcycle generators available, so we'll cover them thoroughly here. Until theory has been developed that can calculate the periods of reversible generators for 64 bits, they'll be restricted to 32 bits because their cycles can be traversed using exhaustive search. We'll cover the 32-bit reversible generators first, followed by the 32-bit and 64-bit irreversible generators.

### Reversible 32-bit Subcycle Generators

Reversible generators have periods of nearly 232, but are in many cases a little less random than the irreversibles (CMR and CMFR are exceptions). In Table 1, `Const` is (you guessed it) a constant.` L` is a left shift count, and `R` is a rotate-left count. In this article, rotations are always to the left. All of these subcycle generators have a period of nearly 232. Table 1: Reversible Subcycle generators.

CMR: x = Multiplier*x; x = rotl(x,RL);

Based on tests, CMR is the most random of all the reversible subcycle generators, so I recommend that you use it if your processor has a fast multiply, as most modern processors do nowadays. To have reversibility, Const must be odd. For best bit-mixing, it's best if RL is in the neighborhood of 16, perhaps 13 to 19. Table 2 lists CMR generators that have large seed-ranges and periods near 232: Table 2: Selected 32-bit CMR subcycle generators.

CMR is important, so I supplied a long table. A table with over 1000 entries is available for download here. Each row is a generator using a multiplier of `Multiplier`, a left-rotation count of `RL`, and having a period of `Period`, which has the given prime factors. The generator can be directly seeded with up to `SB` bits of seed, and the seed-value must be in the range `SeedStart to SeedStart+SeedCnt-1`, which merely means that you must add `SeedStart `to your `SB`-bit-wide seed. Note that `2SB <= SeedCnt`.

Several computers ran searches for several weeks to find these multipliers. I'm especially proud of the second one — 1422968075 — as it accepts 24 bits of seed, and its cycle covers all but 1377 of the 232 possible values. That means its cycle covers 99.999968% of the 32-bit number-range. Finally, its prime factors are large numbers, making it easy to avoid duplicating factors in other generators. Table 3 provides the list of cycles produced by this multiplier: Table 3: Cycles for CMR multiplier 1422968075 with 16-bit left-rotate.

All subcycle generators have several cycles, as does this one. Clearly, it's crucial that a generator be seeded correctly so that the desired cycle will be used. A bad seed of 210935030, for example, would enter cycle 8, which has a period of only one value, which would make this generator worthless. Listing Three shows a combination generator that uses three CMR generators.

Listing Three: A combination generator that uses three CMR generators.

```void Seed3Cmr (uint32_t seed) {
x =  735593496u + (seed & 0x00ffffffu); // 24 bits of seed
y = 1640766258u + (seed & 0x000fffffu); // 20 bits of seed
z =  481793190u + (seed>>13);           // 19 bits of seed
}

uint32_t Rand3Cmr() {  // Combined period = 2^95.999955
x = 2648253259u*x; x = rotl(x,18); // period=4294965140=2^2*5*214748257
y =  773663125u*y; y = rotl(y,16); // period=4294937531=379*1187*9547
z = 1834882833u*z; z = rotl(z,15); // period=4294865569 (prime)
return x + y + z;
}
```

The combined period of 295.999955 is infinitesimally less than the maximum possible period of 296, so the loss of period is negligible. But the improved randomness compared to an LCG (linear congruential generator) makes this subcycle generator worthwhile. If a period of 264 will suffice, you can drop the third subcycle generator (`z`), and the resulting combination of just `x+y` will still pass BigCrush. It is directly seeded and bits in the seed were overlapped to improve dispersion of sequences. Modern processors pipeline (overlap) the multiplies and rotations, so this generator will be fast.

The multipliers with the largest periods have the fewest omitted numbers out of the 232 possible numbers. For example, the multiplier of 3563976171 has a period of 4294966876, which omits only 420 numbers. Seeding for such multipliers can be done directly with all 32 seed-bits if you check an auxiliary table of omitted numbers, and change an omitted number to something else (such as adding one to it). Such omission-tables would only consume a few kilobytes for these multipliers.

CMFR is a variation of CMR that does a bitwise complement to the product. The "F" in the name stands for "flip bits," which allows this generator to include zero in its cycle. Otherwise, it behaves like CMR. It is useful in combination with CMR, which you've seen already in` RandCmfrCmrCers`. Its multiply will be overlapped with the CMR, yielding maximum performance, and the complement improves mixing of bits in the combination. A table of good multipliers is available for download.

Skipping ahead to 64 bits, a 64-bit CMR will have periods in the ballpark of 263. This is too large to traverse, so I did a partial traversal for the following multipliers. With a rotation of 32 bits, these multipliers are guaranteed to have a period of at least 245: 3141592653589793239, 3279502884197169399, 3751058209749445923, 7816406286208998629, 3482534211706798215, 2718281828459045235. These constants are digits from pi and e, and they will probably make fine 64-bit CMR generators with periods around 263. But you won't know their exact periods. Also, CMRES is a variation of CMR which is irreversible and suitable for 64-bit processors, and will be described shortly with the other irreversibles. In fact, CMRES is the most random of all the subcycle generators I've developed and can even be used alone.

RCA: x = Const + rotl(x,RL);
CERS: x = Const – rotl(x,RL);

CERS may provide more randomness than RCA because the subtraction inverts the bits, so most of my effort has been on CERS. But a special case of RCA occurs when` RL` is zero (no rotation): The constant (which must be odd) is merely added to `x`. Marsaglia refers to this generator as a "Weyl sequence" . It's a full-cycle generator with a period of 2`w`, and is not a subcycle generator. I mention it because it's a fast way of boosting the period in a combination generator. To maximize the period, no other generator should have a factor of 2 in its period. I prefer CERS because it adds more randomness, has choices of periods and their factors, and is almost as fast.

Because CERS uses only one position of its value (the parameter `x` in `rotl`), it is less random than the other subcycle generators. Like RCA, its main use is as a fast (two clock cycle) method of boosting the period of the combination generator that's using it. Table 4 some CERS generators with large seed-ranges and periods. The meanings of the columns are the same as in CMR: Table 4: Selected 32-bit CERS Subcycle generators.

LAR: x = x + (x<<SL); x = rotl(x,RL);
LSR: x = x - (x<<SL); x = rotl(x,RL);
LESR: x = (x<<SL) - x; x = rotl(x,RL);

If you must avoid the multiplication required by CMR, then this group of fast generators should work well. All of them add or subtract a left-shift and the value, which is similar to CMR's multiply, so these generators behave like CMR. In fact, LAR is a special case of CMR. Table 5 is a short list of what I consider the best choices of shift- and rotate-counts, based primarily on maximizing the period, and secondarily on avoiding small numbers in the prime factors in order to make it easier to assemble a combination generator without duplicate factors. Table 5: Selected LAR, LSR, and LESR Subcycle generators.

Listing Four shows a combination generator that uses one each of these three kinds of subcycle generators, each of which is directly seeded:

Listing Four: A generator with three subcycle generators with low multiplication.

```void SeedLarLsrLesr (uint32_t seed) {
x = 2191221356u + ((seed>>20)&0x0fffu); // 12 bits of seed
y = 2569780889u + ((seed>>8 )&0x0fffu); // 12 bits of seed
z =  186447614u + ((seed    )&0x00ffu); // 8  bits of seed
}

uint32_t RandLarLsrLesr () {  // Combined period = 2^95.87
x = x + (x<<6); x = rotl(x, 6); // LAR: period=4282054541=11941*358601
y = y - (y<<2); y = rotl(y,23); // LSR: period=4277166515=5*37*53*179*2437
z = (z<<5) - z; z = rotl(z,17); // LESR:period=3949227389=353*1181*9473
return x + y + z;
}
```

One problem with these subcycle generators is their few good shift-rotate combinations limit your choices of periods and period-factors, and you're restricted to small seed-ranges of 7 to 12 bits, making it difficult to use all 32 bits of seed. If none of the available choices suit you, or if you need a larger seed-range, there's nothing you can do about it. The generators below solve this problem.

LARCA: x = x + (x<<SL); x = Const + rotl(x,RL);
LSRCA: x = x - (x<<SL); x = Const + rotl(x,RL);
LESRCA: x = (x<<SL) - x; x = Const + rotl(x,RL);

These generators are the same as LAR/LSR/LESR, except that a constant is added to the rotation, which allows a search for constants to be done to find larger periods and seed-ranges. The constant also improves randomness slightly. I ran a multi-day search, and the resulting cycle table is available for download here. Listing Five shows a combination generator that uses one each of these three kinds of subcycle generators:

Listing Five: A combination generator using multiple reversible subcycle generators that use constants.

```void SeedLarcaLsrcaLesrca (uint32_t seed) {
x = 1411095840u + (seed >> 16);      // 16 bits of seed
y = 3295935573u + (seed & 0x1ffffu); // 17 bits of seed
z = 1927078987u + (seed & 0x1ffffu); // 17 bits of seed
}

uint32_t RandLarcaLsrcaLesrca () {  // Combined period = 2^95.99960
x = x + (x<<10); x = 3483234673u + rotl(x,14); //LARCA: period=4294437379
y = y - (y<< 9); y = 2456424491u + rotl(y,13); //LSRCA: period=4294703122
z = (z<<5) - z;  z =   36615259u + rotl(z,18); //LESRCA:period=4294565593
return x + y + z;
}
```

At 295.99960, the combined period is only a hair below 296. Each subcycle generator uses 4 clock cycles of arithmetic operations, so the combination generator uses 14 clock cycles, which is faster than needed for most applications, but is slower than most other combinations in this article.

### More Insights To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.

## Featured Reports ## Featured Whitepapers 