## Dividing data into meaningful groups

*Joe is a database consultant and contributing editor to DBMS magazine. He can be contacted on CompuServe at 71062,1056.*

Formally, a partition of a set of objects is a collection of subsets such that: 1. The union of all the subsets in the collection is equal to the original set; 2. the intersection of all the subsets in the collection is empty; and 3. no subset in the collection is empty. This means that every element of the original belongs to one and only one subset in the partition. Informally, a partition is a way of dividing up the loot, cutting up the cake, or grouping data into classes for a report.

If you are familiar with SQL, the GROUP BY clause is a partitioning table. The GROUP BY clause results in the minimum number of groups where, for each grouping column of each group of more than one row, all values of that grouping column are either the NULL value, or equal to each other. For example, if I used *GROUP BY states* on a geographic database, I would expect to get (at most) 51 rows in the grouped table (one grouped sub-table per state and one for all NULL values).

In computer science, there is a whole class of NP-complete problems called the "Knapsack" and "Bin-Packing" problems which depend on partitioning. Informally, NP-complete problems are ones which get so big, so fast that it is not practical to work them out. They often involve a factorial calculation hidden in them.

The Knapsack problem takes its name from the way it is usually presented. Imagine that you are packing a knapsack for a hike. The items which you can stuff into the knapsack have a weight (or cost) and a value to you. A compass is extremely light, but very valuable. A bag of wet sand is extremely heavy and totally worthless on a hike. A canteen of water is somewhere between these two extremes in both weight and value.

The problem is to get the most value in the knapsack under a certain weight limit. This means that you want to partition the items into all possible subsets, and then rate them for weight and value. Of those whose weight is no greater than the limit of your back, you wish to find the ones which have the greatest value. The packing may or may not be unique, if it exists at all.

The Bin-Packing problem is usually explained with a warehouse full of identical empty bins and a pile of different-sized boxes which must be put into the bins. The goal is to put all the boxes into the smallest number of bins. The size of the bins and boxes are usually given as integer units to make the problem easier. The partition we want to find has the smallest number of subsets, each with a total size no greater than the bin size.

Partitions have been an area of study for mathematicians for over 300 years, ever since Leibniz asked Bernoulli if he had investigated P(*n*), the number of partitions of a set of (*n*) objects.

The mathematician Eric Temple Bell came up with the "Bell numbers," a formula for finding the number of partitions of a set. There are two formulas to compute the Bell number; one uses Sterling numbers, and the other uses Binomial coefficients. The Sterling number, S(*m*,*n*), is defined as the number of ways of picking a subset of n elements from a set of *m* elements. Binomial coefficients, C(*n*,*r*), represent the number of combinations from a set of *n* elements taken *r* at a time.

The function for C(*n*,*r*) can be coded iteratively and will run very fast, but the procedure which calls it will be recursive. The function for S(*m*,*n*) can be coded in a simple, recursive fashion, but the procedure which calls it will be iterative. You simply cannot escape recursion in partitioning problems. I've given the Sterling version here because it is short and uses simpler arithmetic.

If you wish to try coding the Binomial coefficient version, the function for C(*n*,*r*) is also given. I have not compared the run times. Figure 1(a) shows the recursive formula.

The Bell number simply tells you how many partitions to expect, but does not show you what the partitions will look like. For example, Figure 1(b) lists all 15 possible partitions of the set *{1, 2, 3, 4}*. The vector beside each partitioning is the signature of the collection. The first partition has all four elements in one set; the second partition has the first three elements in one set and the fourth element in another; and so forth until we have four partitions of one element each. This signature can be used for generating partitions. The procedure Part(*n*) is an algorithm for finding the signatures. The procedure *OutSignature* can be replaced by one that will print the actual subsets.

Inspect the signatures in the example. There are really only five different signature patterns in this partition, namely:

- One set of four elements.
- One set of three elements and one set of one element.
- Two sets of two elements each.
- One set of two elements and two sets of one element.
- Four sets of one element each.

*n*), and there is no known formula to calculate it; you just have to work it. As you have likely already figured out, p(

*n*) is the number of ways that you can write the integer (

*n*) as the sum of smaller integers. An approximation formula does exist for p(

*n*), but it is a little high for smaller values of (

*n*). The formula is shown in Figure 2.

The two tricks to generating the signature patterns are relatively simple. The first is that no signature pattern can be longer than (*n*) elements; the second is to realize that if you know the sum of all but one of the elements, you can compute it by subtracting it from the total of the others. The total of a signature pattern is always equal to (*n*).

This means that you can start with a single element of (*n*) and expand it in an orderly fashion, building each series of partitions of a particular length. See *GenPattern1* in Listing One .

This algorithm can be improved upon by a change in notation for a multiset. Let the *x @ y* sign mean "*x* sets of *y* members each" so that you can condense the notation as in Figure 3. This algorithm is given in *Genpattern2* in Listing One. The program could be made more readable by introducing a record data type with *size* and *count* fields to use as an array, but I have not done so.

Lotteries usually involve picking a subset of numbers from some range. As an exercise, you're invited to take your state lottery rules and use these algorithms to make yourself a millionaire. Who says computer science isn't practical?

### References

Djokic, B., et al. "A Fast Iterative Algorithm for Generating Set Partitions." *The Computer Journal*, 1989.

Er, M.C. "A Fast Algorithm for Generating Set Partitions." *The Computer Journal*, 1988.

Nijenhaus, A. and H.S. Wilf. *Combinatorial Algorithms*. San Diego, CA: Academic Press, 1978.

Reingold, E.M., J. Nievergelt and N. Deo. *Combinatorial Algorithms: Theory and Practice*. Englewood Cliffs, NJ: Prentice-Hall, 1977.

Semba, I. "An Efficient Algorithm for Generating All Partitions of the Set {1,_n}." *Journal of Information Processing,* 1984.

**Figure 1:** (a) Computing the Bell number via Binomial coefficients; (b) 15 possible partitions of the set {1, 2, 3, 4}.

(a) Bell(n+1) = Sum(C(n, k) * Bell(n)) from k :=0 to n; (b) {1, 2, 3, 4}} = (1, 1, 1, 1) {{1, 2, 3}, {4}} = (1, 1, 1, 2) {{1, 2, 4}, {3}} = (1, 1, 1, 2) {{1, 2}, {3, 4}} = (1, 1, 2, 2) {{1, 2}, {3}, {4}} = (1, 1, 2, 3) {{1, 3, 4}, {2}} = (1, 1, 1, 2) {{1, 3}, {2, 4}} = (1, 1, 2, 2) {{1, 3}, {2}, {4}} = (1, 1, 2, 3) {{1, 4}, {2, 3}} = (1, 1, 2, 2) {{1, 4}, {2}, {3}} = (1, 1, 2, 3) {{1}, {2, 3, 4}} = (1, 2, 2, 2) {{1}, {2, 3}, {4}} = (1, 2, 2, 3) {{1}, {2, 4}, {3}} = (1, 2, 2, 3) {{1}, {2}, {3, 4}} = (1, 2, 3, 3) {{1}, {2}, {3}, {4}} = (1, 2, 3, 4)

**Figure 2:** Function to approximate the partition count.

FUNCTION PartitionCount(n : INTEGER): INTEGER; { approximate number of partition patterns for a set of (n) elements. Formula tends to guess high. } CONST pi = 3.141592653; BEGIN { math can be optimized because of constants } PartitionCount :=Round ((1.0 / (4.0 * n * Sqrt(3.0))) * Exp (pi * Sqrt((2.0 * n)/3.0))) END;

**Figure 3:** Condensing notation for multisets.

{{1, 2, 3, 4}} becomes (1 @ 4) {{1, 2, 3}, {4}} becomes (1 @ 3, 1 @ 1) {{1}, {2}, {3}, {4}} becomes (4 @ 1)

#### Listing One

PROGRAM TestBellNumbers; VAR n : INTEGER; FUNCTION Bell(m : INTEGER) : INTEGER; { total number of partitions for a set of n elements } VAR i, sum : INTEGER; FUNCTION Sterling (m, n : INTEGER) : INTEGER; { partition of m things in sets of size n } BEGIN IF ((m = n) OR (n = 1)) { subset equal to original } THEN Sterling := 1 ELSE IF (m < n) { subset greater than original } THEN Sterling := 0 ELSE IF (n = 0) { empty set } THEN Sterling := 0 ELSE Sterling := Sterling((m-1), (n-1)) + (n * Sterling((m-1), n)); END; BEGIN sum := 0; FOR i := 0 TO m DO sum := sum + Sterling (m, i); Bell := sum; END; BEGIN Write('Give an n: '); ReadLn (n); n := Bell(n); WriteLn('Bell Number is ', n); ReadLn; END. PROGRAM TestCombinations; { version using combination operator } VAR n, k : INTEGER; FUNCTION Comb (n, k : INTEGER) : INTEGER; { Binomial coefficient or number of k element subsets in set of n } Var i, top, bottom : INTEGER; BEGIN top := 1; FOR i := n DOWNTO (n - k + 1) DO top := top * i; bottom := 1; FOR i := k DOWNTO 2 DO bottom := bottom * i; Comb := top DIV bottom; END; BEGIN Write('Give an (n): '); ReadLn (n); Write('Give an (k): '); ReadLn (k); n := Comb(n, k); WriteLn('Comb Number is ', n); ReadLn; END. PROGRAM GenPattern1; CONST big = 100; VAR p : ARRAY [0..big] OF INTEGER; i, j, n, PatternSize : INTEGER; PROCEDURE OutPattern; { Display the current partition pattern } VAR i : INTEGER; BEGIN Write('(', p[1]); FOR i := 2 TO PatternSize DO Write(', ', p[i]); WriteLn(')'); END; FUNCTION Sum(a, b :INTEGER) : INTEGER; { compute total of subarray p[a:b] } VAR total, i : INTEGER; BEGIN total := 0; FOR i := a TO b DO total := total + p[i]; Sum := total; END; BEGIN PatternSize := 1; p[0] := -1; { sentinel value } WriteLn('Give me n: '); ReadLn(n); p[1] := n; { load starting value into array } WHILE (PatternSize <= n) DO BEGIN OutPattattern; i := PatternSize - 1; WHILE ((p[PatternSize] - p[i]) < 2) DO i := i - 1; IF (i <> 0) THEN FOR j := (PatternSize - 1) DOWNTO i DO p[i] := p[i] + 1 ELSE BEGIN FOR j := 1 TO PatternSize DO p[j] := 1; PatternSize := PatternSize + 1; END; p[PatternSize] := n - Sum(1, (PatternSize - 1)) END; END. PROGRAM GenPatterns2; { generate partitions in dictionary order } CONST BIG = 100; VAR m, p :ARRAY [-1..BIG] OF INTEGER; i, sum, n, left : INTEGER; PROCEDURE OutList; { this procedure uses the new notation, but can be easily modified to print out full multisets } VAR i : INTEGER; BEGIN Write('(', m[1] ,' @ ', p[1]); FOR i := 2 TO left DO Write(', ', m[i] ,' @ ', p[i]); WriteLn(')'); END; BEGIN left := 1; p[-1] := 0; m[-1] := 0; WriteLn('Give me n: '); ReadLn(n); p[0] := n + 1; m[0] := 0; p[1] := 1; m[1] := n; WHILE (left <> 0) DO BEGIN OutList; sum := m[left] * p[left]; IF (m[left] = 1) THEN BEGIN left := left - 1; sum := sum + (m[left] * p[left]); END; IF (p[left - 1] = p[left] + 1) THEN BEGIN left := left - 1; m[left] := m[left] + 1; END ELSE BEGIN p[left] := p[left] + 1; m[left] := 1; END; IF (sum > p[left]) THEN BEGIN p[left + 1] := 1; m[left + 1] := sum - p[left]; left := left + 1; END; END; END. PROGRAM Partitions; CONST big = 100; VAR p, q: ARRAY [0..big] OF INTEGER; n : INTEGER; PROCEDURE OutSignature; { display the signature of a partitioning } VAR i : INTEGER; BEGIN Write('(', q[1]); FOR i := 2 TO n DO Write (', ', q[i]); WriteLn(')'); END; { OutSignature} PROCEDURE AllSubsets(n : INTEGER); LABEL 10; VAR i, last, m, ClassCount : INTEGER; BEGIN ClassCount := 1; FOR i := 1 TO n DO q[i] := 1; p[1] := n; OutSignature; { display single set } REPEAT m := n; WHILE (TRUE) DO BEGIN last := q[m]; IF (p[last] <> 1)THEN GOTO 10; q[m] := 1; m := m -1 END; 10: ClassCount := ClassCount + m - n; p[1] := p[1] + n - m; IF (last = ClassCount) THEN BEGIN ClassCount := ClassCount + 1; p[ClassCount] := 0; END; q[m]:= last + 1; p[last] := p[last] - 1; p[last + 1] := p[last + 1] + 1; OutSignature; UNTIL (ClassCount = n); END; { AllSubsets } BEGIN WriteLn ('Give me an n: '); ReadLn(n); AllSubsets(n); ReadLn END. PROCEDURE OutSubsets; { Build actual subsets from signature array } VAR i, j, ThisSet : INTEGER; BEGIN ThisSet := 1; { current set id # } BEGIN FOR i := 1 TO n { scan array for elements } DO IF (p[i] = ThisSet) THEN BEGIN { format display } Write('{ ', i); { first element } FOR j := (i+1) TO n { scan rest of set } DO IF (p[j] = ThisSet) THEN Write (', ', j); Write(' } '); { close up brackets } ThisSet := ThisSet + 1; { set up for next set } END; END; WriteLn; END;

Copyright © 1994, *Dr. Dobb's Journal*