Fast, High-Quality, Parallel Random Number Generators

The random number generators presented in this article are simple, fast, and suitable for 32-bit and 64-bit software and hardware. They are also parallelizable, improving performance significantly in existing CPUs using existing compilers. They consist of a combination of primitive generators, each of which outputs a cycle of numbers with a period of less than 2^32 or 2^64. Such below-maximum-period cycles are called "subcycles," and they provide good randomness. When used in combination, the resulting random number generator passes the most stringent tests. Portable examples are provided in ANSI C. Testing was done under Linux and Microsoft Windows.

Primitive Subcycle Generators

Computers don't have a true source of randomness in them, so randomness must be synthesized with arithmetic. Only a few methods exist to do this. Simple and fast combinations of a few arithmetic operations produce quasi-random sequences of numbers. For example, the following rotation and subtraction is a primitive random number generator:

#define rotl(r,n) (((r)<<(n)) | ((r)>>((8*sizeof(r))-(n))))
x -= rotl(x,21);  // RS: rotate-subtract generator

If you set x to 6247 initially (called the "seed"), and repeatedly execute the line of code above, the value of x will transition through a cycle of 615,434 values. That cycle-length is called the "period." Most generators like this one have many unique cycles, and every possible 32-bit or 64-bit seed-value will enter one of these cycles. For best randomness, you want to select a long cycle, and the only way to select a cycle is by using a seed that's in (or leads to) the desired cycle. The aforementioned seed of 6247 is the minimum value in the longest cycle, but any value in that cycle can be used as a seed.
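To verify these figures yourself, here is a minimal stand-alone sketch (assuming only a C99 compiler and the rotl macro above) that starts at the seed 6247 and counts the steps until the value repeats:

#include <stdint.h>
#include <stdio.h>

#define rotl(r,n) (((r)<<(n)) | ((r)>>((8*sizeof(r))-(n))))

int main(void) {
   uint32_t x = 6247, start = 6247, period = 0;
   do {                     // step the RS generator until the cycle closes
      x -= rotl(x,21);
      period++;
   } while (x != start);
   printf("RS period starting at 6247: %u\n", (unsigned)period);  // expect 615434
   return 0;
}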

Because a primitive generator has multiple cycles, I call each such cycle a "subcycle." The literature about random number generators refers to 2^w (where w is the number of bits in the CPU's word) as the "full cycle" (where all possible numbers are generated). The term "subcycle" makes it clear that not all possible numbers are output by that cycle. Lest you think that these shorter cycles are a problem, reversible subcycle generators (described below) have periods very close to 2^w.

Note the previous rotl macro, which performs a left rotation. Because of the sizeof operator, it works with either 32-bit or 64-bit operands. Modern compilers such as gcc and Visual Studio are smart enough to convert this expression into a single rotate instruction that consumes one clock cycle in the CPU, so rotations are as fast as shifts, adds, subtracts, and exclusive-ors (XORs). Most subcycle generators rely on rotations because they do a good job of mixing bits.

The period of a subcycle generator mostly depends on whether it's reversible or irreversible. Reversible means that the calculation can be undone starting with only the result. An example is x = C - rotl(x,12), where C is a constant. You can reverse this calculation by doing an add followed by the opposite rotation (a sketch of this inversion follows the list below). Based on theory beyond the scope of this article (see Bob Jenkins [1] and Knuth [2]), the consequences are:

  • Reversible generators have periods around 2^(w-1). In fact, through careful selection of constants, most of the reversible generators given in this article have periods within one percent of 2^w, and thus the loss of period from using such a subcycle generator (versus a full cycle one) is negligible. Reversible generators are well suited for 32-bit processors because they maximize the period of the combination generator (described below).
  • Irreversible generators have periods around 2^(w/2), and somewhat higher for selected values of constants. For 64-bit generators, periods of 2^35 to 2^42 are common. A greater variety of irreversible generators are available, and they generally provide more randomness (as measured by tests) for a given number of clock cycles than reversible generators. An example is the rotate-subtract generator introduced above. Subtracting rotl(x) from x cannot be reversed, but it does a better job of mixing bits than C-rotl(x). Both consume two clock cycles of arithmetic. Irreversible generators can be used in combination with reversible generators in 32-bit applications, and they are superb for 64-bit generators where their shorter periods are not an issue because the combined period is over 2^100 anyway.
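To make that reversibility concrete, here is a sketch of my own (reusing the example step C - rotl(x,12) from above) showing a CERS-style step and its inverse; a right-rotation macro is needed to undo the left rotation:

#define rotr(r,n) (((r)>>(n)) | ((r)<<((8*sizeof(r))-(n))))

uint32_t cers_step  (uint32_t x, uint32_t C) { return C - rotl(x,12); }  // forward step
uint32_t cers_unstep(uint32_t x, uint32_t C) { return rotr(C - x, 12); } // recovers the previous x
// For every 32-bit x, cers_unstep(cers_step(x,C), C) == x, so the step is
// reversible. By contrast, x -= rotl(x,21) cannot be undone from its result alone.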

Naming Subcycle Generators

There are so many kinds of subcycle generators that a naming scheme is needed for them. I chose to name them after the order of operations you'd do on a calculator. For example, the aforementioned primitive generator is named "RS" for "rotate" (assumed to be left) and "subtract." The "R" is first because you must do the rotate first. More precisely, I mimicked a traditional HP calculator, which is stack based, so there are also two stack operators: "E" is "exchange top two elements of stack," and "D" is "duplicate top of stack." The whole stack is initially assumed to be set to the generator's value. An example is C-rotl(x), which is named "CERS," where "C" means "enter a constant." The "E" makes the operand order of the subtraction correct, because the top of stack is subtracted from the second-to-top. The only non-obvious letter is "I" for "right shift." This naming convention has the advantage of exactly encoding the formula, so you can write down the C code knowing only the name.
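As a sketch of how a name decodes into code, here are one-step formulas for the generators used in this article (the constants and rotation amounts are the ones that appear in the code later on; each line is an independent generator, shown on the same variable for brevity):

// R = rotate left, S = subtract (second stack entry minus top), C = enter a
// constant, E = exchange the top two stack entries, M = multiply.
x -= rotl(x,21);                   // RS:   x - rotl(x)
x  = rotl(x,11) - x;               // RES:  rotl(x) - x
x  = 3286325185u - rotl(x,19);     // CERS: C - rotl(x)
x *= 255519323u;  x = rotl(x,13);  // CMR:  rotl(C * x)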

The randomness of a subcycle generator is largely a function of how many positions of the input bits it uses. Each shift or rotate moves the bits to a new position. For example, CERS (C-rotl(x)) uses one position of its input bits, whereas RS (x-rotl(x)) uses two positions — the original position and a rotation. As a result, RS mixes bits better and thus produces more randomness. For generators using a multiply, each "1" bit in the multiplier yields a different position, so multiplication-based generators have around w/2 positions, making them superior to the others.

You'll be meeting more subcycle generators later, such as RERS, LESR, and CMRES. They all have interesting personalities.

Combination Generators

The problem with a primitive generator such as RS is that it's not random enough for most uses. Even a generator with a multiply is often not random enough. Such a one-line generator could be used for hardware tests or games, especially if extremely high speed were needed; but applications such as simulations need better randomness. The solution is to combine two to four such primitive subcycle-based generators together. The resulting combination generator will pass stringent tests. Here's a complete example in ANSI C:

typedef unsigned uint32_t;  // or use <stdint.h> if available
uint32_t x, y, z;

void Seed2CmrRsr(uint32_t seed) {
   x = (seed >> 16)    + 4125832013u; // upper 16 bits + offset
   y = (seed & 0xffff) +  814584116u; // lower 16 bits + offset
   z = 542;
}

uint32_t Rand2CmrRsr() {  // Combined period = 2^81.95
   x *=  255519323u; x = rotl(x,13); // CMR, period = 4294785923 (prime)
   y *= 3166389663u; y = rotl(y,17); // CMR, period = 4294315741 (prime)
   z -= rotl(z,11);  z = rotl(z,27); // RSR, period = 253691 = 17*14923 (both prime)
   return x ^ y ^ z;
}

This combination generator executes only nine arithmetic operations per number output. Multiplications are fast in modern CPUs (typically 3-4 clock cycles), and they are pipelined (overlapped), so the second and following multiplies only consume one extra clock cycle each. The remaining operations (rotates, subtracts, and XORs) take one clock cycle each, and they'll also be pipelined some, making this generator very fast.

This generator has a period of about 2^82, which is far more numbers than anyone can use. The period of any combination generator is the product of the periods of its primitive generators. But this is true only if those primitive periods have no prime factors in common; if they do, those common factors must be divided out of the product.
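As a sketch of that bookkeeping (the helper below is my own, not part of the generator code), the combined period is the least common multiple of the subcycle periods, which equals their product exactly when no pair of periods shares a prime factor:

#include <math.h>
#include <stdint.h>
#include <stdio.h>

// Greatest common divisor of two periods.
uint64_t gcd64(uint64_t a, uint64_t b) { while (b) { uint64_t t = a % b; a = b; b = t; } return a; }

int main(void) {
   uint64_t p[3] = { 4294785923u, 4294315741u, 253691u };  // periods from Rand2CmrRsr
   // Check that no pair shares a prime factor (all pairwise gcds should print as 1,
   // since the two CMR periods are prime), so the combined period is the full product.
   printf("pairwise gcds: %llu %llu %llu\n",
          (unsigned long long)gcd64(p[0],p[1]),
          (unsigned long long)gcd64(p[0],p[2]),
          (unsigned long long)gcd64(p[1],p[2]));
   // The product itself overflows 64 bits, so report its size as a power of two.
   double bits = log2((double)p[0]) + log2((double)p[1]) + log2((double)p[2]);
   printf("combined period ~= 2^%.2f\n", bits);            // about 2^81.95
   return 0;
}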

Where did all the magic constants used in those subcycle generators come from? They were carefully selected after weeks of exhaustive computer search on multiple computers for numbers that gave large periods. Each type of subcycle generator has a table of its cycles (for 32-bit and 64-bit), which you can download here.

Note that the standard type uint32_t used in the code above is not supported by all compilers, so you might need to use the typedef. The same is true for uint64_t (which is unsigned long long) if you need a 64-bit random number generator (which will be described shortly).
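One portable arrangement is to prefer <stdint.h> when the compiler is C99 or later and fall back to typedefs otherwise; a minimal sketch (the fallback assumes the common model of a 32-bit int and a 64-bit long long):

#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#include <stdint.h>                        /* C99 and later provide exact-width types */
#else
typedef unsigned int       uint32_t;       /* assumes a 32-bit int */
typedef unsigned long long uint64_t;       /* assumes a 64-bit long long */
#endif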

By way of convention, "subcycle generator" refers to a primitive generator. "Combination generator" always refers to two or more subcycle generators that are combined (usually with adds or XORs) to form the returned random number.

Testing Random Number Generators

It's important that random number generators be tested thoroughly, and several suites of tests have been published. Marsaglia published his "Diehard Battery" of tests in 1995 [3]. While fine in its day, today's faster computers allow applications to consume more random numbers, thus better tests are needed. Pierre L'Ecuyer and Richard Simard implemented the TestU01 software library [4], and their excellent work has become the standard test suite for random number generators. Users may define their own batteries of tests, but the three predefined batteries contained in this library (SmallCrush, Crush, and BigCrush) provide successively increasing levels of thoroughness for testing. SmallCrush is roughly equivalent to the Diehard Battery, and provides only a basic level of testing by today's standards. Crush is more severe, and BigCrush is the most thorough of all, consuming several hours of CPU time. It is so severe that no well-known linear congruential generator (LCG) passes it. In fact, few popular LCGs pass all the tests even in SmallCrush, much less the stricter batteries.
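For readers who want to reproduce this kind of testing, here is a minimal sketch of hooking the earlier combination generator into TestU01 (it assumes the library is installed; unif01_CreateExternGenBits and bbattery_BigCrush are the library's documented entry points, and the seed value here is an arbitrary choice):

#include <stdint.h>
#include "unif01.h"
#include "bbattery.h"

extern void     Seed2CmrRsr(uint32_t seed);   // seeding function shown earlier
extern uint32_t Rand2CmrRsr(void);            // combination generator shown earlier

int main(void) {
   Seed2CmrRsr(12345);   // any seed will do; 12345 is arbitrary
   // Wrap the generator as an "external bits" generator and run BigCrush on it.
   unif01_Gen *gen = unif01_CreateExternGenBits("Rand2CmrRsr", Rand2CmrRsr);
   bbattery_BigCrush(gen);
   unif01_DeleteExternGenBits(gen);
   return 0;
}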

Only the most robust generators can pass BigCrush, yet all the simple subcycle-based combination generators discussed in this article pass it. The reason that subcycle generators do so well is that they can mix bits better because they are not restricted to outputting the "full cycle" of 2^w. This freedom from the full cycle also allows us to use rotations, which excel at mixing bits. Yet combination generators, which contain several subcycle generators, generally can output all 2^w numbers, and their periods are large, so nothing has been lost by using subcycle generators in combination.

Seeding

As mentioned previously, you cannot seed a primitive generator with arbitrary values because it's not possible to determine beforehand which of its many cycles will be entered. It must be seeded with a value that's known to be in (or lead to) the desired cycle. In our ANSI C code sample Rand2CmrRsr, ranges of numbers encompassing 16 bits are known to be in the desired cycles of its two CMR generators. In the following code, the seeding function uses bits from the seed and maps them into the appropriate number ranges:

x = (seed >> 16)    + 4125832013u; // upper 16 bits + offset
y = (seed & 0xffff) +  814584116u; // lower 16 bits + offset

Reversible generators have large seed ranges, which are usually 10-20 bits wide. Thus, if you're using three subcycle generators (the usual number), a 32-bit seed can be split into three pieces, each used to seed one generator. The example above shows a seed being split into two pieces.

Also, the subcycle generators together will generally total more than 32 bits of seed range. For example, the two CMR generators in our code sample have seed ranges of 18 and 16 bits, totaling 34. In such cases, it's helpful to overlap the pieces of seed, so that some bits will seed two generators. Doing so will cause seeds to create more divergent number sequences.
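For example, a variant of Seed2CmrRsr that overlaps two seed bits between the two CMR generators might look like the sketch below. The 18-bit width for x follows the seed-range figures just mentioned; whether that full 18-bit range begins at the same offset used earlier is an assumption on my part, since the exact ranges come from the cycle tables:

void Seed2CmrRsrOverlap(uint32_t seed) {
   x = (seed >> 14)    + 4125832013u;  // upper 18 bits of the seed (assumed in-cycle range)
   y = (seed & 0xffff) +  814584116u;  // lower 16 bits; bits 14-15 help seed both x and y
   z = 542;
}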

Unfortunately, irreversible generators don't have large seed ranges, so this method won't work for them. But because multiple primitive generators are used in combination, a simple method of seeding them is:

  1. Set each generator to a known value in its desired cycle.
  2. Split the seed into pieces (as was done in our code sample).
  3. Treating its piece of seed as an integer, advance each generator by that number of steps.

Here's a complete combination generator that shows this method of seeding:

void SeedRsResCers (uint32_t seed) {
   uint32_t n;
   x = 6247;  y = 3848;  z = 0;
   for (n=((seed>>22)&0x3ff)+20; n>0; n--) x -= rotl(x,21);
   for (n=((seed>>11)&0x7ff)+20; n>0; n--) y = rotl(y,11) - y;
   for (n=((seed    )&0x7ff)+20; n>0; n--) z = 3286325185u - rotl(z,19);
}

uint32_t RandRsResCers() { // Combined period = 2^71.93
   x -= rotl(x,21);              // RS,   period = 615434
   y = rotl(y,11) - y;           // RES,  period = 1703271
   z = 3286325185u - rotl(z,19); // CERS, period = 4294921861
   return x ^ y ^ z;
}

Note that x, y, and z are initialized to 6247, 3848, and 0. These are the minimum values in the selected cycles of the corresponding subcycle generators. Any other values in those cycles could have been used equally well. Each subcycle generator is advanced by a number of steps obtained from some bits in the seed. 10 bits are used for x (the 0x3ff mask), and 11 for y and z (the 0x7ff mask), consuming all 32 bits in the seed. In the worst case, the for loops above will execute about 5000 times, taking approximately 30 microseconds on a 1 GHz processor. Because this initialization is only done once per job, this time is insignificant. Notice the +20 in each seeding loop: It ensures that the loop causes the generator to be at least that many steps away from its minimum value. This is prudent, but not essential.

If only two subcycle generators are used, they can be initialized with the following for loops:

for (n=((seed>>16)&0xffff)+20; n>0; n--) ...
for (n=((seed    )&0xffff)+20; n>0; n--) ...

These loops iterate about 130,000 times in the worst case, consuming around 1 millisecond on a 1 GHz processor. Like the case above, this once-per-job time is typically not significant.

If for some reason these times are a problem, one solution is to use a look-up table of known in-cycle values for each subcycle generator. In the first example, a table of 2048 (2^11) values can be used instead of the for loops for y and z, and a 1024-value table for x. These tables would total 20 Kbytes, and would result in the following fast initialization function:

void SeedRsResCers (uint32_t seed) {
   x = x_table[(seed>>22) & 0x3ff];
   y = y_table[(seed>>11) & 0x7ff];
   z = z_table[(seed    ) & 0x7ff];
}
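One way to build such tables (a sketch of my own; the spacing between recorded values is arbitrary, chosen only so the entries stay within each cycle) is to advance each subcycle generator from a known in-cycle value and record widely spaced states, either offline or once at program start:

uint32_t x_table[1024], y_table[2048], z_table[2048];

void BuildSeedTables(void) {
   uint32_t i, n, v;
   v = 6247;                                        // known value in the RS cycle
   for (i = 0; i < 1024; i++) {
      for (n = 0; n < 600; n++) v -= rotl(v,21);    // ~600 steps between entries
      x_table[i] = v;
   }
   v = 3848;                                        // known value in the RES cycle
   for (i = 0; i < 2048; i++) {
      for (n = 0; n < 800; n++) v = rotl(v,11) - v;
      y_table[i] = v;
   }
   v = 0;                                           // known value in the CERS cycle
   for (i = 0; i < 2048; i++) {
      for (n = 0; n < 1000; n++) v = 3286325185u - rotl(v,19);
      z_table[i] = v;
   }
}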

A combination of these two approaches could also be used, with some bits selecting a table entry and some more bits selecting the loop count for advancing that generator, as in the sketch below.
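For instance, a minimal hybrid sketch (the bit widths are arbitrary and only x is shown):

void SeedRsHybrid(uint32_t seed) {
   uint32_t n;
   x = x_table[(seed >> 22) & 0x3ff];           // 10 bits pick a starting point in the table
   for (n = (seed & 0x3f) + 20; n > 0; n--)     // 6 more bits advance a short distance from it
      x -= rotl(x,21);
   // y and z would be handled the same way with their own seed bits
}

In conclusion, seeding irreversible subcycle generators should not be an issue in any system, including those with tight time requirements.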

Take another look at RandRsResCers. It performs only eight fast arithmetic operations for each number it outputs. Despite its simplicity, it passes all the tests in the severe BigCrush test battery. This example is a combination of three subcycle generators: RS, RES, and CERS. Each of them consists of one rotation and one subtraction, which are combined using XOR operations. The periods of the primitive generators x, y, and z are 615434, 1703271, and 4294921861, respectively. These periods have no common factors, so the period of the resulting combination generator is the product of these periods, which is 2^71.93. A mixture of reversible (CERS) and irreversible (RS, RES) generators is used. RS and RES provide most of the randomness, and the main purpose of CERS is to boost the period (while adding a little randomness).

Finally, the reversible generator CERS in RandRsResCers can accept 19 bits of seed directly, so the code above could have given 19 bits to CERS, and used loops (or arrays) to seed RS and RES with the other 13 bits, split as 6 and 7 bits each. Using a reversible generator in a combination makes seeding the irreversible generators easier.
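A sketch of that alternative split (CERS_BASE is a placeholder of mine for the start of a known 19-bit-wide in-cycle range, which would come from the cycle tables; the bit split follows the 19/7/6 division just described):

void SeedRsResCersAlt(uint32_t seed) {
   uint32_t n;
   /* CERS_BASE stands for the start of a known 19-bit in-cycle range for CERS;
      the actual value would be taken from the downloadable cycle tables. */
   z = CERS_BASE + (seed & 0x7ffff);                                    // 19 bits seed CERS directly
   x = 6247;  y = 3848;
   for (n = ((seed >> 26) & 0x3f) + 20; n > 0; n--) x -= rotl(x,21);     // 6 bits advance RS
   for (n = ((seed >> 19) & 0x7f) + 20; n > 0; n--) y = rotl(y,11) - y;  // 7 bits advance RES
}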

Large Periods?

How large a period is needed in a combination generator? It appears that a common rule of thumb is to not consume more than sqrt(p) values from a generator having period p, thus requiring that p be at least the square of the number of values consumed. I call this the "square-root rule." Long-running Monte Carlo simulations might consume on the order of 2^35 random numbers, requiring that the period of the generator be at least 2^70. Is such a large period actually necessary? Knuth mentioned that LCGs can experience poor "accuracy" in two dimensions when exceeding the square root of the period [5]. But what about subcycle generators?

To determine adequate period length for 32-bit generators, I created combination generators having periods in the range 2^31 to 2^48, each composed of three subcycle generators, and ran them through the BigCrush test suite. The results are summarized in the table below.

Table 1: Results from the BigCrush test suite.

Some generators with a period of 2^35 passed BigCrush. Knowing that BigCrush consumes approximately 2^35 random numbers, I believe that these results indicate that the best of the subcycle-based combination generators can deliver good results over their entire period. Unfortunately, it's not practical to test periods much over 2^40, so it's impossible to determine by testing whether such a generator is among the best (I believe it's prudent to have a period of at least 2^48). I also conclude that the square-root rule does not apply to subcycle generators; but the combination generators presented in this article have periods of at least 2^70, so you should be in good shape even if you follow the square-root rule.

Parallelizable in Software and Hardware

Manufacturers can no longer boost CPU clock speeds because doing so generates too much heat, which is why clock rates have stayed flat at around 3 GHz since 2002. So the only way to improve performance is by parallelizing calculations, which is why "parallelizability" of algorithms is important nowadays.

A combination of subcycle generators is easily parallelizable because the subcycle generators are independent of each other. In fact, modern compilers for general-purpose processors place instructions in an order that causes the processor to pipeline the computation of the generators. As a result of this parallelizing, these "fast" generators execute even faster than expected. As mentioned for our Rand2CmrRsr example, the second multiply only adds one more clock cycle. In RandRsResCers, the rotations and adds/subtracts can (and typically will) be pipelined so that several of those calculations are performed in parallel. In the future, such parallelizing in software and CPUs will allow all the subcycle generators to be computed at the same time, yielding the maximum possible speed.

In a Field Programmable Gate Array (FPGA) or Application-Specific Integrated Circuit (ASIC), all three primitive generators in RandRsResCers can be computed in parallel, and their outputs directed to the XOR gates that combine them. The result is a flat combinational circuit that can operate at a high clock rate, yet consumes little die space. Given that rotations and shifts are virtually free in hardware, the only substantial die space consumed by RandRsResCers is three 32-bit adders and 64 XOR gates.

Comparison with Existing Generators

Why not simply call rand() and be done with it? Because you don't know what you're getting. You don't know how random it is, nor how fast it runs. It's fine for casual use, but for important work, you need to know what's in there. That's why most experts advise using a generator that you understand and have tested. Most random number generators these days are one of the following:

  • LCGs: An LCG consists of a multiply followed by an add. The theory behind LCGs is well developed, and they were adequate when developed in the 1960s, but the results are poor by today's standards. They are still the most popular generator in use, and your library's rand() function is probably an LCG. As mentioned previously, they don't even pass SmallCrush. A multiply is, by definition, repeated additions, and the carries from those additions propagate only to the left. In an LCG, this results in uneven mixing of bits, so the bits on the right always have poor randomness. Interestingly, the CMR's multiplies in Rand2CmrRsr are similar to an LCG, but their products are rotated, which causes the carries to affect bits on the right (a short code contrast of the two steps follows this list). As a result, a primitive CMR used by itself is superior to an LCG, which is verified by my test results.
  • Xorshift: This is a clever sequence of shifts and XORs invented by Marsaglia [6]. My tests show that it is about as random as one CMR generator. Unfortunately, it's not parallelizable, because its six arithmetic steps rely on prior values, and thus are not independent. It is slower than the various subcycle generators that perform only 2 or 3 arithmetic steps.
  • Lagged Fibonacci: Often abbreviated "LagFib," this generator adds two prior values fetched from an array. If the array is large (hundreds or thousands of values), LagFib has good randomness and is at least as fast as the fastest combinations of subcycle generators. But it has some drawbacks. It needs to be seeded with random numbers, so you'll need another random number generator. Although some theory can tell you its minimum period, you don't know the exact period you're getting. With a large array, that theoretical minimum is large enough, but the large array causes an increase of cache misses, which in turn, reduces the performance of the system. These issues are minor compared to the Achilles' heel of LagFib, which is its two lags. If an application happens to have some kind of internal rate or cycle matching either of those lags, the results will be poor. And given that the lags are on the order of hundreds, there's too high a chance of a destructive rate match. So LagFib is not suitable for general use. A combination of subcycle generators has a similar weakness: If an application has a rate matching the period of one of the subcycle generators, then randomness will be reduced. This will not be an issue if subcycle generators with large periods are selected because the probability of a rate match will be infinitesimal and the consequences will be less severe.
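To make the LCG comparison in the first bullet concrete, here is a short sketch contrasting one LCG step with one CMR step. The LCG constants are the ones from the C standard's sample rand() implementation, used here only for illustration; the CMR constants are the ones from Rand2CmrRsr:

// One step of a conventional LCG: carries from the repeated additions propagate
// only leftward, so the low-order bits mix poorly.
x = x * 1103515245u + 12345u;

// One step of a CMR subcycle generator: the rotation after the multiply moves
// high-order carries back into the low-order bits, so every bit position gets mixed.
x = rotl(x * 255519323u, 13);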
