Low-latency garbage collector merged for GHC 8.10

Ben Gamari – Monday, 28 October 2019

Over the past two years my colleague Ömer Sinan Ağacan and I have been working with our client, Standard Chartered, on a new low-latency garbage collector for GHC. Happily, last week this project reached a major milestone: being merged for inclusion in GHC 8.10.

In this post would like to discuss the motivation for this new collector, as well as its current state, and directions for future work.

GHC’s current collector

Garbage collection schemes are generally forced to trade-off between latency and throughput. Today, GHC’s primary garbage collector is a moving collector design [Marlow2008; Ungar1984]. Under this scheme heap reclamation involves copying all live data out of its current location (which we will call from-space) to a new segment of memory (to-space). The collector can then free from-space and consequently reclaim the memory occupied by the now-dead allocations ¹. This scheme has a number of advantages:

allocation can be simply and efficiently serviced via a bump-pointer scheme (see Figure 1): the allocator need only reserve a large block of memory and sequentially allocate into it until it is full. This avoids most heap fragmentation troubles as well as the need for expensive data structures to track free memory regions.
collection improves memory locality: since the collector tends to traverse the heap in topological order objects which refer to one another tend to end up near one another in to-space.
implementation is reasonably straightforward
parallel collection is possible: multiple cores can work on collection in parallel

Figure 1. Allocation in GHC’s moving collector is provided via a bump-pointer allocator. Here we reserve a contiguous allocation area and maintain a pointer to the first unallocated byte. Allocation simply involves returning free and advancing it by the allocation size. — **Figure 1.** Allocation in GHC’s moving collector is provided via a *bump-pointer allocator*. Here we reserve a contiguous allocation area and maintain a pointer to the first unallocated byte. Allocation simply involves returning `free` and advancing it by the allocation size.

However, moving collection also has two serious shortcomings:

collection cannot be performed while the user’s program is running ² since the program may race with the collector as it moves an object
the amount of work necessary for collection is proportional to the amount of live data on the program’s heap

Figure 2. In the case of a moving garbage collector, concurrent GC is complicated by the potential for races with the program. Here we see that the collector is copying an array into to-space but has lost a concurrent mutation made by the program. — **Figure 2.** In the case of a moving garbage collector, concurrent GC is complicated by the potential for races with the program. Here we see that the collector is copying an array into *to-space* but has lost a concurrent mutation made by the program.

These two properties imply that the collector must stop the user’s program as it collects and these pauses will grow in length as the live heap grows. Such pauses can be extremely problematic for programs that have even moderate latency requirements.

GHC’s new non-moving collector

In order to address the latency issues laid out above we designed a garbage collector with the following goals:

use of the new collector must not require recompilation
the average pause time on a “typical” server application should improve by a factor of three
the runtime of a “typical” server application may regress by at most 10%

After considering a wide range of possible designs, we settled on a concurrent non-moving, mark-and-sweep collector similar to Ueno [Ueno2016]. Let’s examine each of these qualifiers in turn:

concurrent: we want to allow garbage collection to proceed concurrently with the execution of the user’s program. This is in contrast to incremental collection, another approach to bounding pause times where the collection pauses the program but can be suspended on demand.
non-moving: as mentioned above, moving heap objects during concurrent collection requires great care and generally comes at the expense of performance. We side-step this by simply not moving closures.
mark-and-sweep: collection will use a mark-and-sweep strategy where first the live objects are marked and in a second phase un-marked objects are freed.

This concurrent mark-and-sweep approach is used in both the Go and Java runtimes. However, we employ a hybrid collection strategy where the existing bump-pointer allocator is used to service mutator allocation requests and the non-moving allocator services requests from minor garbage collector, which promotes long-lived objects to the non-moving heap. In this way the mutator enjoys the beneficial locality and efficiency characteristics of the moving collector. Moreover, we avoid the need for recompilation while allowing long-lived objects to live in the non-moving heap where they have a much smaller contribution to GC pause times.

Those seeking more detail on the design itself are referred to my talk given at MuniHac 2018 and the design document.

Usage and future plans

Our new collector has been merged to GHC’s master branch and will be present in GHC 8.10.1. For a program to use the concurrent non-moving collector it must be compiled with GHC’s -threaded flag and invoked with the +RTS -xn runtime system flag. The user-facing behavior of the collector is no different than that of the old moving collector.

There are a number of improvements to the collector that we will be investigating in the coming months:

Improved diagnostics output to help better understand the behavior of the collector.
More robust scheduling heuristics to bring heap residency of heavily-allocating programs into line with that maintained by the moving collector.
Further reduce the pause associated with pre-mark synchronization by moving more preparatory work into the concurrent phase of collection
Implement parallel marking, allowing the compiler to better keep pace with highly parallel programs.
Investigate the feasibility of using the non-moving collector to service mutator allocations directly; with a more sophisticated synchronization scheme this would eliminate the need to stop-the-world during the initiation of a major GC and potentially allow better scaling across high core counts.

In closing, we would like to thank Standard Chartered for their support and in particular Atze Djikstra and Pepe Iborra for their testing and feedback.

We hope that users will try running their program under this new collector during the upcoming 8.10 alpha period. If you have any questions or comments don’t hesitate to be in touch.

References

[Kermany2006]

Kermany, Haim; Petrank, Erez: The Compressor: Concurrent, incremental, and parallel compaction. In: PLDI : ACM

[Marlow2008]

Marlow, Simon; Harris, Tim; James, Roshan P.; Peyton Jones, Simon: Parallel generational-copying garbage collection with a block-structured heap. In: ISMM

[Ossia2002]

Osssia, Y.; Ben-Yitzhak, O.; Goft, I.; Kolodner, E.K.; Leikehman, V.; Owshanko, A.: A parallel, incremental and concurrent GC for servers. In: SIGPLAN Concference on Programming Languages Design and Implementation : ACM

[Ueno2016]

Ueno, Katsuhiro; Ohori, Atsushi: A fully concurrent garbage collector for functional programs on multicore processors. In: ICFP : ACM, 2016

[Ungar1984]

Ungar, D.: Generation scavenging: A non-disruptive high performance storage management reclamation algorithm. In: SIGPLAN Software Engineering Symposium on Practical Software Development Environments : ACM

An introductory discussion of garbage collection strategies is beyond the scope of this post. If you would like to know more see my previous talk.↩︎
It isn’t quite true that moving collection cannot be done concurrently with program execution. There are a few examples of designs [Kermany2006; Ossia2002] which can perform such collection, although these schemes tend to either require hardware support or expensive software barriers.↩︎