This is a guest post by Domen Kožar.
In this post I’ll dive into how low-latency garbage collection (GC) has improved developer experience for Cachix users.
The need for low latency
Cachix serves the binary cache protocol for the Nix package manager.
Before Nix builds a package, it asks the binary cache whether it contains the binary for the package it wants to build. For a typical invocation of Nix there can be hundreds or even thousands of packages that need to be checked. If a binary is present in the cache, building the package is no longer required, potentially saving a lot of time and CPU.
It is crucial for the backend of such a binary cache service to respond in a timely manner so that the optimisation of skipping builds actually pays off.
Monitoring GC pauses
The easiest way to monitor and graph how long GC pauses last is via ekg-statsd, by exposing the rts.gc.gc_wall_ms metric.
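To get a feel for what such a metric measures, here is a minimal sketch that reads the cumulative GC wall-clock time directly from the RTS using GHC.Stats from base. It is not how Cachix is instrumented; it only assumes that the ekg metric is derived from the same RTS counter (gc_elapsed_ns). The helper nsToMs is a hypothetical name introduced for illustration. RTS statistics are only collected when the program runs with +RTS -T, which in turn requires the executable to accept RTS options.

```haskell
import Data.Int (Int64)
import GHC.Stats (RTSStats (gc_elapsed_ns), getRTSStats, getRTSStatsEnabled)

-- | Convert nanoseconds to whole milliseconds, rounding down.
-- (Hypothetical helper, named here only for this sketch.)
nsToMs :: Int64 -> Int64
nsToMs ns = ns `div` 1000000

main :: IO ()
main = do
  -- getRTSStats throws if statistics were not enabled with +RTS -T,
  -- so check first.
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "RTS statistics are disabled; run with +RTS -T"
    else do
      stats <- getRTSStats
      -- gc_elapsed_ns is the total wall-clock time spent in GC so far.
      putStrLn ("cumulative GC wall time (ms): " ++ show (nsToMs (gc_elapsed_ns stats)))
```

Because the counter is cumulative, a graph of pauses over time comes from the difference between successive samples rather than from the raw value.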
For Cachix, a typical plot of this metric used to look like this:
From the picture, we can see that under load, we experience GC pause times of up to nearly 800 ms. Having 800 ms pauses stopping the world is far from ideal (I’ve even observed some pauses that last over a second under really heavy load), since the endpoint for checking if a certain binary exists normally takes only about 2–4 ms.
Switching to the low-latency GC
If you want to try the low-latency GC in your own code, make sure to use GHC 8.10.3 or later, since it fixes a few crashing bugs that you don’t want to encounter. Then, to enable the low-latency GC, append the following flags when invoking an executable built by GHC (the executable must be linked with -rtsopts so that the RTS accepts them on the command line):
myexecutable +RTS --nonmoving-gc
For Cachix, a typical picture of GC pauses plotted over time then looks as follows:
While this is comparing apples to oranges, since the load is not exactly the same between the two pictures, you can see that the distribution is now significantly different. The vast majority of pauses are now in the range of just a few milliseconds.
Unfortunately, non-moving GC can occasionally still cause relatively long pauses in the worst case (measured at 150–200 ms). We believe that this is due to the workload spawning many threads. There is still work to be done to further reduce pause times of the low-latency collector under such circumstances.
The throughput impact of the low-latency GC hasn’t been measured for this case. The response time of a “does this binary exist?” request is still within the 2–4 ms range most of the time.
Conclusion
Monitor GC pauses to understand how they impact your application response times.
Non-moving GC has been running in production for over a month, reducing worst-case response time for a performance-sensitive endpoint without any issues.
Being able to monitor the total number of threads in the RTS would improve production insights, but that is yet to be implemented.