Ted Johnson and Dennis Shasha
Ted is an assistant professor at the University of Florida. Dennis is a professor at New York University and author of a popular puzzle book, entitled The Puzzling Adventures of Dr. Ecco and an upcoming book on database performance tuning. They can be reached via Internet at [email protected] or [email protected]
The versatility of B-trees is the reason they are ubiquitous in database programs from mainframe packages (such as IBM's VSAM) to PC products (such as dBase and its competitors, or the database facility in OS/2 Extended Edition).
Since their invention by Rudolph Bayer in 1972, there have been any number of variations in both data structure and algorithms. Nevertheless, there is always room for improvement. This article first briefly reviews the B-tree concept, and then summarizes our investigation into a simpler, more efficient method for managing B-trees that grow on the average.
A B-tree is a data structure that allows you to store and retrieve a set of key-value pairs. For example, the keys may be social security numbers and the values may be names or addresses. Normally, the values take up more space than the keys.
Unlike a binary tree, a B-tree is a balanced tree structure. Every leaf is at the same distance from the root. The classic B-tree structure stores some key-value pairs in the interior nodes of the structure; a B+ tree is a variation that places all key-value pairs at the leaves. B-trees are now almost always used as B+ trees. The main findings in this article apply to both kinds of structures.
B-trees are useful for two kinds of searches: exact searches (for example, find all information associated with employee 10) and range searches (for example, find all information associated with employees whose IDs are between 35 and 43). The reason is that the key-value pairs are always kept in sorted order. By contrast, if you use hashing as an access method rather than B-trees, you get faster exact searches--but cannot easily conduct range searches.
The interior (nonleaf) nodes of a B+ tree do not contain key-value data, but rather consist of navigational information: pairs of so-called separators and pointers to nodes. Specifically, an interior node consists of the sequence S1 P1 S2 P2 ... Sn Pn, Where the sis are separators sorted in ascending order and the pis are pointers to child nodes.
The maximum number of children that an interior node may have is called its "fanout." On big mainframe B-trees, where a node may span an entire track, its fanout can be over a thousand. A leaf node usually has fewer elements than an interior node because values are usually bigger than pointers, and the keys are at least as big as the separators. Normally, the key-value pairs at the leaf nodes are kept in sorted order, but some researchers advocate implementing a hash table at the leaves.
The principal advantage of B-trees is that they offer consistently fast access times for both exact and range searches. The consistent access times are due to the balanced tree. All searches require traversing an equival number of blocks from the root down to the leaves. Even for databases of hundreds of megabytes, the search path is limited to a small number of blocks. Combine the B-tree structure with a disk cache for frequently accessed blocks and you get an access method that's hard to beat.
A principal disadvantage of B-trees is the storage overhead consumed by the interior nodes of the tree. Another is the programming complexity involved in managing this balanced data structure efficiently.
Given a key k, searching a B+ tree for the associated value is easy. First, start at the root node of the tree. At each node, find the first separator "s" such that s<=k, and follow the pointer to the right of s. At the leaf, simply see whether there is a record associated with key k. (With a classical B-tree, you may find the data earlier, at the interior node rather than the leaf, though this will rarely happen if the fanout is large.)
Modifying a B+ tree works as follows: To insert a key-value pair, search for the key you want to insert. If already present, then replace the existing value with the new value. Otherwise, add the key-value pair into the leaf node (assuming a B+ tree)--a successful insert.
Sometimes, an insert may encounter a leaf that is already full. In that case, the insert must "split" the node into two nodes. The left node takes the lower half of the elements, and the right node takes the upper half. Then the algorithm must insert a separator-pointer pair into the parent. This may cause the parent to split. In the worst case, the split will propagate recursively through the interior nodes all the way to the root of the tree. This would then add a new level to the balanced tree structure.
There is not much one can do about splits--if a node is full, one cannot put more data in it without destroying old data. However, the analogous operation for deletes, the "merge," allows more options.
One of the trickier tasks in programming a B-tree is what to do with a node when it falls below half full. Rudolph Bayer decreed that nodes should be merged with their neighbors when this occurs; this approach is called "merge-at-half." Other researchers have proposed many other possibilities, but the chief competitor is "free-at-empty"--a much simpler approach in which you wait until the node is empty and then just free it and adjust its parent.
Intuitively, the choice seems to be a programming effort vs. space utilization trade-off. Virtue--one might think--would consist of merging way before the node becomes empty. But virtue here is considered harmful.
As our research has discovered, it turns out that the merge-at-half strategy is harder to implement, detrimental to performance, and gives only a negligible gain in space utilization compared with the free-at-empty strategy--provided the B-tree is growing on the average.
Space utilization holds many surprises for the unwary. To test your wariness, see how you would answer the following questions:
- Suppose you are doing only inserts into a B-tree (or B+ tree). Do you think the distribution of your data will affect the average space utilization?
- Because (normal) splitting results in transforming a full node into two half-full nodes, all nodes (except the root) are at least half full in a pure insert B-tree. Does this mean that the average utilization is 75 percent in a pure insert B-tree?
Why will it suffer? Because every node will be half full as a result of the normal splitting algorithm and will not receive any more values after it is split. The heuristic that is often used is to detect whether the split occurred because of an insert of the highest key-value pair or the lowest. In either case, change the splitting algorithm by distributing the key-value pairs, as follows: three-fourths to the existing node and one-fourth to the new node.
The answer to the second question is also negative. Andrew Yao showed that the average utilization is about 69 percent. The "intuitive" reason is that a full node is replaced by two half-full nodes, reducing the utilization to below 75 percent.
We found another trap for the unwary. Suppose successful inserts and deletes are the only modifications. Almost all applications insert more than delete, so we are most interested in that case.
At 50-percent deletes, free-at-empty gives utilization below 40 percent, whereas merge-at-half gives 60 percent. One point for virtue! However, when deletes comprise at most 47 percent of all modifications, the space utilization difference between free-at-empty and merge-at-half nearly disappears. In fact, the utilization is close to the pure insert value for both methods. The bigger the nodes are, the sharper the "knee" of this curve becomes. See Figure 1.
What's more, merge-at-half requires significantly more restructuring activity than free-at-empty, thus hurting performance. See Figure 2.
There is actually another benefit to free-at-empty in those high-performance online transaction applications that allow concurrent accesses to B-trees. By concurrent accesses, we mean that many inserts/deletes/searches may occur during the same time interval.
Many locking schemes are used in such cases, but all require exclusive locks on all nodes that are changed. Because merge-at-empty requires locks on more nodes than free-at-empty, and because the locks are held for longer times, merge-at-empty reduces the possible amount of concurrency and can therefore hurt performance.
The moral of this story is simple and sweet: When programming B-trees, do it the easy way!
Bayer, R. and E. McCreight. "Organization and Maintenance of Large Ordered Indexes." Acta Informatica 1, 1972.
Johnson, T. and D. Shasha. "Utilization of B-trees with Inserts, Deletes, and Modifies." ACM Symposium on Principles of Database Systems Conference, 1989.
Shasha, D. and N. Goodman. "Concurrent Search Structure Algorithms." ACM Transactions on Database Systems, (March, 1988).
Yao, Andrew. "On Random 2-3 Trees." Acta Informatica 9, 1978.
Copyright © 1992, Dr. Dobb's Journal