Garbage Collection on the Run


Dr. Dobb's Journal, April 2000

Joshua is a software engineer at Geodesic Systems. He can be reached at jwb@geodesic.com.


Despite its many advantages over manual memory management, garbage collection suffers from a lingering perception of performance problems. Every programmer who has ever used GNU Emacs knows in her fingertips that "Garbage collecting..." takes too long and can interrupt interactive work. What can a modern collector offer in the face of decades of negative subliminal advertising?

Using techniques like those described by Mike Spertus in "C++ and Garbage Collection" (DDJ, December 1997), a fast scanning algorithm married to a speedy allocator can achieve sustained raw collection speeds of over 50 Mbits/sec. -- even on the cheapest current-generation processors. Yet user-perceived performance problems due to collector latency can be disastrous, even in situations where the average throughput is excellent. Some programs simply cannot wait for a full atomic collection to run to completion, and yet doing automatic memory management while the memory is being modified has been compared to performing brain surgery on a jogger. My company has been down this road, and we've learned to keep our running shoes and scalpels handy.

In this article, I'll focus on the latency problem. In the process, I'll examine several incremental memory-management algorithms, including simple user-defined reference counts. The shortcomings of this approach teach us what to demand from a true garbage collector, and lead us to algorithms that analyze the global connectedness of pointer structures.

Do You Want it Fast, or Soon?

Latency concerns do not always justify a so-called incremental collector, which performs part of a garbage collection scan, then either picks up where it left off (true-incremental collection) or at least safely abandons its work and starts over later (pseudoincremental collection). Such algorithms are sometimes appropriate or even indispensable, but there are many latency problems that are best solved within the scope of traditional atomic (nonstop) collection.

Incrementality comes at a price. No fully interruptible collection scheme will ever be as fast as an atomic scan. Furthermore, a speedy collection algorithm impacts not only wall time, but also latency time: Making an atomic garbage collection complete faster is always the simplest, and often the best, way to meet a latency constraint. Many applications that cannot tolerate a pause of half a second, for instance, can live comfortably with 0.1 second interruptions, which are comparable to what they already see in context switches or net delays. So optimizations of the collector's pointer-scanning algorithm remain the first line of attack, even for intrinsically latency-bound problems.

Furthermore, most latency constraints are not uniform. If an application cannot live with being stopped for an atomic collection at a particular time, it rarely follows that such a pause is intolerable at all times. Many outside events other than garbage collection may compete for cycles or resources and threaten the responsiveness of an application; therefore, applications with latency concerns are usually designed with some awareness of which critical sections may not be interrupted. Any high-speed garbage collector should be prepared to support this design by giving the programmer a simple API to forbid collections at some times and encourage them at others. Because this involves semantic knowledge of the program, the decision about when to permit collection must be offered to the programmer.

An obvious extension of this idea is to hand off the decision to the thread scheduler by putting collection in a low-priority background thread. This seductive idea has often been tried with Java collectors, where the philosophy of the language discourages direct user control of collection policy. The serious (and for the C/C++ programmer, often fatal) difficulty with background collection is that the garbage can pile up if the CPU is too busy. Garbage collection is a performance optimization, not merely a performance drain: The cost of not collecting far outweighs any collection costs the moment garbage is paged out to disk. In short, you can throw garbage collection into the background or you can forget it, but you cannot afford to do both at once. This pitfall appears to be the primary reason why languages relying on background collection have such disappointing collector performance.

In some situations, tight latency requirements arise at unexpected times. For example, when a user suddenly hits a key in the middle of a background calculation, the tolerable latency drops to whatever the user is likely to perceive as annoying. If this happens while an atomic collection is underway, the thread waiting to process the user's input will be frozen and the program will fail to meet the user's latency expectation.

Even this situation, however, can be handled through programmer control of an intrinsically atomic collector. All that is needed is a way for you to hand the garbage collector a function pointer to an idle test function, like that in Example 1. The collector then performs a collection looking over its shoulder, as it were, calling the idle test after each small unit of work; the Great Circle (a garbage collector from Geodesic Systems, the company I work for) deployment collector does this about once a millisecond. If the test ever shows that things have suddenly become busy, control is returned to the program at once, and the results of the partial collection are simply discarded.
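The shape of such an interface can be sketched as follows. This is an illustration, not Great Circle's actual API; the names `gc_collect_while_idle` and the work-unit granularity are assumptions made for the example.

```cpp
#include <functional>

// Hypothetical interface: the collector polls an idle test between small
// units of work (roughly a millisecond each) and abandons the partial
// scan the moment the program becomes busy again.
using IdleTest = std::function<bool()>;

// Returns true if the collection ran to completion, false if it was
// abandoned and its partial results discarded.
bool gc_collect_while_idle(int total_units, const IdleTest& still_idle) {
    for (int done = 0; done < total_units; ++done) {
        // ... scan one small unit of the heap here ...
        if (!still_idle()) {
            return false;  // busy again: hand control back at once
        }
    }
    // Scan finished: white objects may now be reclaimed.
    return true;
}
```

The collector gives up everything on interruption, which is exactly what makes this pseudoincremental rather than true-incremental: nothing needs to remain consistent across the abandonment.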

This is about as far as we can go without getting into the realm of true incremental collection, but many apparently latency-bound situations can be gracefully handled with the aforementioned techniques. The only remaining realm for a true incremental collector (one that can actually pick up where it left off, at some cost, and continue a collection that has been interrupted) is in code that will be busy all the time, not just any time. For example, a video-streaming application that must process an incoming frame 60 times per second cannot abandon a collection in midstream and come back later, because "later" will be equally bad. In this extreme case, new algorithms are required because the collector must safely support true incrementality.

Incrementality and Homebrew Collectors

The simplest automatic memory-management algorithm is reference counting: If a counter is incremented every time a pointer to a heap object is copied, and decremented every time a copy is destroyed, then objects with reference counts that go to zero can be safely reclaimed, eliminating the need to know at compile time which pointer will be the last to go away. For the independent developer, reference counting has the advantage of simplicity. Instead of reimplementing the malloc/new interface from scratch, you allocate a header field for each object, with both the current count and the object's extent. (The latter is needed so that the object can itself be scanned for pointers when its reference count goes to zero; it may point to other objects that can then also be reclaimed.) Objects are accessed through wrapped pointers that update reference counts when they are modified.
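A minimal version of such a wrapped pointer can be sketched in a few lines of C++. The header here carries only the count (the extent field described above is omitted for brevity), and the class names are illustrative:

```cpp
#include <cstddef>

// Intrusive reference-counting sketch: each heap object carries a header
// with its current count; wrapped pointers bump the count on copy and
// drop it on destruction. Reclamation happens the instant a count
// reaches zero -- there is no separate collection pause.
struct RefCounted {
    std::size_t refs = 0;
    virtual ~RefCounted() = default;
};

template <typename T>  // T must derive from RefCounted
class CountedPtr {
    T* p_;
    void acquire() { if (p_) ++p_->refs; }
    void release() { if (p_ && --p_->refs == 0) delete p_; }
public:
    explicit CountedPtr(T* p = nullptr) : p_(p) { acquire(); }
    CountedPtr(const CountedPtr& o) : p_(o.p_) { acquire(); }
    CountedPtr& operator=(const CountedPtr& o) {
        if (this != &o) { release(); p_ = o.p_; acquire(); }
        return *this;
    }
    ~CountedPtr() { release(); }
    T* operator->() const { return p_; }
    T& operator*() const { return *p_; }
};
```

Every copy and every destruction of a `CountedPtr` touches the pointed-to object's header, which is precisely the cache and paging cost discussed below.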

This algorithm looks as if it should have excellent latency; reference count updates are interleaved throughout the code and reclamation occurs one object at a time (or, if the reclaimed object contains pointers, a few at a time). However, destructor cascades and the overhead involved in locking and unlocking pages, all over memory, all the time, make this potential practically unrealizable in C/C++. In any case, good latency seldom compensates for bad throughput. Every time a pointer is modified -- as often as one machine instruction out of three in typical C/C++ code -- one count must be incremented and another decremented, sometimes on different pages, thus incurring an enormous cache fault penalty. The chief charm of C is its ability to write through pointers at machine-native speed. For this performance benefit, we willingly pay many inconvenience costs (the lack of bounds checking, for example). Giving up the naked-pointer invariant will grossly affect wall-time performance, making latency a secondary concern for most real-world users.

Virtual memory systems further multiply the cost of touching pages, and in the case of a reference count, this aspect of the problem is especially acute. If you are routinely touching dead objects to decrement their reference counts and dismiss them, you must consider that programmers tend to be done using objects with references that they discard. In fact, programmers have often been done with them for long enough that the dead objects have been paged out to disk. You might avoid some cache and paging overhead by directing the reference counts through a table kept in fast memory, but this would not solve the touching-the-dead problem. When an object is declared dead, it must be scanned for pointers to other objects that it may have been keeping alive. Garbage collectors do not make good undertakers.

A more serious problem with reference counting is correctness: Not all garbage is unreferenced. Any data structure that is not treelike (doubly linked lists, trees with backtraces, multiply inherited objects) will have pointer cycles, and the reference count for any object in such a structure will never go to zero. This is a symptom of the fundamental fact that "liveness" is not a local property of heap objects, but a global fact about the way they are connected to the "root set" of automatic and static variables that are always accessible to the program. Reference counting achieves short latencies only because it is answering a global question ("Is this object still pointed to by the program?") with a local approximation to the answer ("This object is still pointed to"). For some purposes, this is good enough. But for long-running applications, a safety net that misses a third or more of the leaked objects has very limited utility.
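The cycle failure is easy to reproduce with the standard library's own reference-counted pointer, `std::shared_ptr` (the `DNode` struct and its liveness counter are mine, added for the demonstration):

```cpp
#include <memory>

// A doubly linked pair built from standard reference-counted pointers.
// Once next/prev form a cycle, neither count can reach zero, so both
// nodes leak when the last outside reference is dropped: the local
// "count is zero" test never fires for a cycle.
struct DNode {
    static int alive;  // how many DNodes currently exist
    std::shared_ptr<DNode> next;
    std::shared_ptr<DNode> prev;
    DNode()  { ++alive; }
    ~DNode() { --alive; }
};
int DNode::alive = 0;

void leak_a_cycle() {
    auto a = std::make_shared<DNode>();
    auto b = std::make_shared<DNode>();
    a->next = b;
    b->prev = a;   // cycle: a -> b -> a
}   // a and b go out of scope here, but each count only drops to 1
```

After `leak_a_cycle()` returns, both nodes are unreachable from the program yet still allocated. A tracing collector, which asks the global question instead of the local one, reclaims them.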

GC Meets the Mutator

So the problem of true-incremental collection comes down to identifying the large set of heap objects connected to the root set by pointer chains, without interrupting the program for as long as a full scan of this set will take. Scanning faster, scanning at judicious times, and counting references to approximate the results of a scan have already been discussed. What options remain when none of these will do?

An academic jargon already exists to discuss this problem. The "collector" is tracing out the "liveness graph" of memory, while the "mutator" (user program) obstructs it by modifying the graph. The term "mutator" conveys an ironic nuance of impatience with users, whose unwanted attempts to mutate the liveness graph are getting in the way of the collector's serious work. At any given moment, all objects can be classified as "black" (live and fully scanned), "gray" (known live, but not yet fully scanned; may contain more pointers), or "white" (the objects whose liveness is in doubt). A collection begins with the root set gray and everything else white -- this can be taken as a definition of the root set; namely, those objects that are known from the start to be alive. It ends when there is no more gray in the liveness graph. Any objects that remain white at the end of a collection are garbage and may be returned to the free list promptly or lazily, depending on implementation. The mutator must be prevented from doing anything that would result in a black object holding the sole pointer to a white one, because at reclamation time this will result in a dangling pointer from the black object to nowhere.
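The tricolor invariant drops straight into code. Here is a sketch of the marking loop over a toy object graph (the `Obj` representation and worklist order are illustrative; real collectors work over raw memory, not neat field vectors):

```cpp
#include <vector>

// Tricolor marking over a toy object graph. White = liveness in doubt,
// gray = known live but not yet fully scanned, black = live and fully
// scanned. The scan ends when no gray remains; whatever is still white
// is garbage.
enum class Color { White, Gray, Black };

struct Obj {
    Color color = Color::White;
    std::vector<Obj*> fields;  // outgoing pointers
};

void mark_from_roots(const std::vector<Obj*>& roots) {
    std::vector<Obj*> gray;        // the gray worklist
    for (Obj* r : roots) {         // the root set starts gray, by definition
        if (r->color == Color::White) {
            r->color = Color::Gray;
            gray.push_back(r);
        }
    }
    while (!gray.empty()) {
        Obj* o = gray.back();
        gray.pop_back();
        for (Obj* f : o->fields) { // gray every white referent
            if (f && f->color == Color::White) {
                f->color = Color::Gray;
                gray.push_back(f);
            }
        }
        o->color = Color::Black;   // o is now fully scanned
    }
}
```

Run atomically, this is just mark-and-sweep. The incremental problem is that between two turns of the `while` loop, the mutator may move a white pointer into a black object, which is exactly what the barriers below exist to catch.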

As described in the accompanying text box entitled "It's Not All Black and White," a pointer to a white object must be read from a nonblack object and written to a black object for chaos to result. This suggests a natural division between "read-barrier" and "write-barrier" algorithms. With a read barrier, collectors must block the mutator whenever a pointer to a white object is read, usually graying the white object, then scanning and blackening the object that pointed to it (to prevent expensive multiple recurrences). Some fascinating recent work (see "Barrier Techniques For Incremental Tracing," by Pekka P. Pirinen, ISMM98, October 1998) emphasizes the basic architectural fact that the mutator must load the pointer into one of a small number of registers when it reads it, and suggests that instead of scanning, we might backtrack and gray the mutator register. This is a novel example of a strategy that does not make monotonic forward progress and, as with other such strategies, careful thought must be given to the danger that collection will not complete at all. There are other practical concerns for a performance-oriented C/C++ collector: How can the barrier's impact on mutator performance be minimized when dirty-bit support for read and write barriers is available in most operating systems on a per-page, not per-object, level? Both scanning order (which the tricolor scheme records but does not dictate) and mutator blocking must work efficiently together in the context of the actual memory-management architecture on the target platform.

Geodesic's Great Circle (GC) uses dirty bits to implement a write barrier, instead of a read barrier, for incremental collection. The most obvious reason for this is efficiency. Many more words are read out of the heap than are ever written to it, and for a noncopying collector, the mere reading of a pointer to white does not incur a future obligation to update that pointer when the object is scanned (as it would if scanned objects were being moved). A write-barrier collector needs to protect exactly those pages that it has scanned, which seems tailor-made for efficient implementation. If the mutator should ask us for more memory while an incremental collection is in progress, we can allocate it white rather than black, thriftily discarding it if it leaks before we get around to scanning it. Even though GC is a conservative collector with regard to pointer identification (as it must be in a weakly typed language), its incremental response to changes in the liveness graph is as aggressive as possible, short of deliberately going back and revisiting black objects.
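The logic of a page-granularity write barrier can be shown in software. Great Circle uses the operating system's page-protection and dirty-bit machinery; the sketch below keeps the dirty bits by hand instead, and the page size and table bounds are assumptions of the example, not Great Circle's internals:

```cpp
#include <bitset>
#include <cstdint>
#include <cstddef>

// Software sketch of a page-granularity write barrier: every pointer
// store marks the containing page dirty, and the collector must rescan
// any already-scanned (black) page that was dirtied before it can
// declare the collection finished.
constexpr std::size_t kPageShift = 12;       // assume 4-KB pages
constexpr std::size_t kMaxPages  = 1 << 16;
std::bitset<kMaxPages> g_dirty;

inline std::size_t page_of(const void* addr) {
    // Modulo keeps the toy table in bounds; a real table covers the heap.
    return (reinterpret_cast<std::uintptr_t>(addr) >> kPageShift) % kMaxPages;
}

// The mutator routes pointer stores through the barrier.
inline void write_ptr(void** slot, void* value) {
    g_dirty.set(page_of(slot));   // record the write at page granularity
    *slot = value;
}

// The collector asks: was this page written since I scanned it?
inline bool needs_rescan(std::size_t page) { return g_dirty.test(page); }
```

With hardware protection, `write_ptr` disappears entirely: stores to protected pages trap, the fault handler sets the bit, and ordinary writes run at full machine speed, which is why the per-page granularity is accepted despite its coarseness.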

Stop the World!

Multithreaded systems pose challenges for any allocator, not just a garbage collecting one. Single heaps must be locked during allocation to ensure thread safety, and multiple heaps should be assigned to threads on different processors if thread efficiency is to be achieved on parallel machines. Fast locking implementation on diverse hardware architectures is challenging and far outside the scope of this article. But the same thread-switching overheads have specific implications for incremental collection.

Just as it makes little sense to scan memory in smaller units than the memory-management hardware can write-protect, it is wasteful to spend less time under any thread lock than the time it cost to obtain it. Allocations of like size or like use can be buffered through a function such as Great Circle's gcMallocMany(), which obtains a page of objects under a single lock, or this optimization can be simulated at the user level by pooling. Similarly, the collector should define a minimum unit of scanning, at least as large as the page size, which will take long enough to amortize the time cost of obtaining and releasing the needed locks, and yet less time than expected for a full atomic scan.
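User-level pooling in the spirit of gcMallocMany() can be sketched as follows. This is not Great Circle's implementation; the batch size and the `BatchAllocator` interface are illustrative, and a production version would keep the cache in thread-local storage:

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Amortizing lock cost: take the heap lock once, carve out a whole
// batch of same-sized objects, then hand them out from a cache without
// touching the lock again until the batch is exhausted.
class BatchAllocator {
    std::mutex heap_lock_;
    std::vector<void*> cache_;     // per-thread in real code
    std::size_t obj_size_;
    static constexpr std::size_t kBatch = 64;
public:
    explicit BatchAllocator(std::size_t obj_size) : obj_size_(obj_size) {}

    void* allocate() {
        if (cache_.empty()) {
            std::lock_guard<std::mutex> g(heap_lock_);  // one lock per batch
            for (std::size_t i = 0; i < kBatch; ++i)
                cache_.push_back(::operator new(obj_size_));
        }
        void* p = cache_.back();
        cache_.pop_back();
        return p;
    }
};
```

The same amortization argument sets the scanning granule: a unit much smaller than a page wastes the lock it holds, while one approaching a full heap scan forfeits the latency benefit.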

An atomic collection with no concept of tricolor marking must stop all other threads (or at least, all threads that try to touch its heap) and this, too, is a complex, platform-dependent procedure. Comparing the likely cost of incremental and straight collection is made more difficult by the fact that either or both could suddenly become I/O bound thanks to virtual memory paging. In practice, our results are so variable that we shy away from true-incremental collection on certain platforms, convinced that further improvements in our straight collection, along with pseudoincremental tricks already discussed, will bring our customers more value than a working (but less than optimal) incremental collector. Who wants half a dozen 5 to 50 millisecond pauses, if we can offer a full collection in 50 milliseconds?

One Step at a Time

Much more academic effort has been devoted to clever incremental garbage collection than to the speed optimizations of a simple mark-and-lazy-sweep that drive Great Circle. Inevitably, our founders experimented with what is perhaps the prettiest incremental algorithm of all -- the Baker treadmill. This is a write-barrier algorithm based on a circular doubly linked list. New objects are allocated black, so there is little resemblance to the incremental mode of our main product (although GCPointers was designed with a similar API to "GCTransparent," as Great Circle was then called). Figure 1 explains the algorithm almost completely. White objects in "from-space" are unsnapped from the linked list like toy boxcars, and reinserted at the front of "to-space," when pointers to them are found in the gray objects at the current leading edge of to-space. The gray edge advances counterclockwise while new objects are allocated clockwise on the back. When there is no more gray, what's left of from-space is added implicitly to the free list, black to-space and new-space are whited out and defined as the next to-space, and the scan can begin again at once. Incrementality is as small as the boxcar size (a single allocated object), and latencies can be as short as a few microseconds.
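The "boxcar" operations of the treadmill are just circular doubly linked list splices, a handful of pointer writes with no copying. A minimal sketch (the `Cell` layout is illustrative; real objects carry their payload alongside the links):

```cpp
// Core treadmill moves: unsnap a cell from wherever it sits, and resnap
// it elsewhere -- e.g., pulling a white object out of from-space and
// inserting it at the gray edge of to-space.
struct Cell {
    Cell* next;
    Cell* prev;
    Cell() : next(this), prev(this) {}   // starts as a singleton ring
};

// Remove c from its current ring, leaving it a singleton.
void unsnap(Cell* c) {
    c->prev->next = c->next;
    c->next->prev = c->prev;
    c->next = c->prev = c;
}

// Insert c immediately after pos.
void snap_after(Cell* pos, Cell* c) {
    c->next = pos->next;
    c->prev = pos;
    pos->next->prev = c;
    pos->next = c;
}
```

Because each move is constant-time and touches only four cells, the collector can be interrupted between any two moves, which is what makes microsecond-scale latencies possible.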

In practice, the prohibitive cost of GCPointers was programmer intrusiveness. Its typical wall-time performance was somewhat poorer than GCTransparent's, and there was contention on the locks guarding to-space and from-space when multiple threads hit the write barrier at once. But its fatal flaw was that it required programmers to wrap all their pointers, because the write barrier was achieved through class inheritance rather than page protection. Smart pointers are often used for much less radical benefits than GCPointers offered, as in user-built reference counting schemes. But deployed garbage collection, at least in enterprise C/C++ development, was mostly needed to ensure stability in large heterogeneous applications where wholesale code modification was not an option. The tool has now dropped out of our product line, except as a consulting solution.

The Future?

Fast uniprocessor C/C++ garbage collection is already fairly mature, and the pragmatic approaches I have laid out here give some sense of where true incremental collection brings added value. As long as physical memory with tiered speed shares one user-level address space transparently, hard real-time guarantees for collection will be elusive, though automatic memory management is at least as likely to boost average cache and VM performance as to hurt it. As address spaces begin to routinely span many processors as well as many threads, multiple-heap allocators will outscale conventional thread-safe ones. Performance of garbage collectors (and of most programs as well) will depend heavily on how well allocators can avoid cross-heap pointers.

When memory is shared between address spaces or across the net through distributed objects (many of which are persistent by design), memory leak recovery gives way to memory management of a more general sort. No one technique can do the whole job, simply because there isn't one job -- there are many. As distributed systems become ubiquitous, you may find yourself performing collective brain surgery not on just one jogger, but on whole triathlons as they run, swim, and roll at widely varying speeds.

Acknowledgment

Thanks to Michael Spertus, Charles Fiterman, and Richard Brooksby for numerous helpful remarks. Errors are my own, of course.

DDJ

