What can we do when it takes GHC a lot of time to compile a given module? Where is it spending its time? Where can we start to get a 10,000 feet view of what GHC is doing? This blog post covers one possible answer, using the eventlog mechanism.
(This post has also been published on the GHC blog.)
GHC has a mechanism that allows us to record all sorts of “events” to an eventlog (a simple binary file), during a program’s execution, attaching a timestamp to it to later allow tools to reconstruct as much of a program’s execution as the events allow. This includes RTS events (garbage collection, HEC/thread activity) but also user defined events, where “user” designates the author of a Haskell library or program;
base provides functions that anyone can use to emit events, in addition to the ones that the RTS itself will emit.
Those functions are implemented in terms of primitive operations that are backed by GHC’s runtime system. Its design and implementation is discussed, along with other topics, in Parallel Performance Tuning for Haskell by Don Jones Jr., Simon Marlow and Satnam Singh.
After the program’s execution, one can then use libraries like ghc-events or tools like ghc-events-analyze, or threadscope to consume the eventlog in order to gain some insights into where time was spent.
While profiling lets us gather more detailed information about where time is spent and where allocations are made, it requires rebuilding our program’s code and using a dedicated RTS flavour. The code generated by GHC for a profiled program is quite different, adding lots of instrumentation to support profiling the execution of Haskell code. The extra code generated to support profiling can get in the way of some optimisations and can therefore drastically affect the performance of a program. On the other hand, generating the eventlog for a program only requires re-linking it (against a flavour of the RTS that supports tracing) with
-eventlog and running it with
+RTS -l. The eventlog mechanism also has much lower impact on runtime performance, since emitting events is merely about putting a few values in a buffer, that the RTS will then regularly flush to the eventlog file or whatever the destination is. The aforementionned paper has some precise numbers on the overhead of the eventlog mechanism, but it’s of course quite low.
This can therefore be an interesting solution when you want to get a big picture without having to rebuild your whole program and some of its dependencies. All you have to do is set up some events that cover the fragments of the program’s execution that you’re particularly interested in.
Events are in general emitted using one of the following 4 functions from
traceEvent :: String -> a -> a
traceEventIO :: String -> IO ()
traceMarker :: String -> a -> a
traceMarkerIO :: String -> IO ()
traceEvent[IO] functions should be used for all sorts of events, particularly the ones that are likely to happen a lot during the execution of your program, while
traceMarker[IO] are generally used to mark certain points or phases in the execution and see that visually in the profile. This is particularly helpful with tools like eventlog2html (see the last section of this post) that allow you to visualize heap profiles, drawing those “marker events” on top so that users can get a sense of when some particularly allocations or deallocations take place, with user-supplied labels instead of trying to guess from timestamps.
Starting with this commit, by Ben Gamari about 3 months ago, GHC started emitting eventlog entries for all calls to
withTiming, which is a function we use to measure how long various parts of GHC take. We then started adding a few more
- 688a1b89 added tracing around calls to various external tools (C compiler, linker, assembler, etc);
- 0c5cd771 added tracing around all the individual passes that make up the codegen pipeline (we previously only tracked codegen as a whole, single event);
- e3cbe319 added tracing around the package database initialization.
And other similar patches later on. As a result, we can trace most of the execution of GHC when compiling a trivial hello world module:
First, you need to get a somewhat recent checkout of GHC’s
master branch - e3cbe319 or newer. Then you need to build that source tree and make sure that the stage 2 GHC is linked with
$ ./boot && ./configure # Build with Hadrian: $ hadrian/build.sh -j "stage1.ghc-bin.ghc.link.opts += -eventlog" # stage 2 ghc at: _build/stage1/bin/ghc # Build with Make: $ make -j GhcStage2HcOpts+=-eventlog # stage 2 ghc at: inplace/bin/ghc-stage2 # If you have a GHC build around already, both of those commands should not # cause a lot of work to be done: just linking the GHC executable against a # slightly different RTS flavour! No recompilation needed.
You can then build any module, library or program with the resulting stage 2 executable as you would normally do. Except that if you pass
+RTS -l with one of
-ddump-timings, GHC will produce an eventlog at
ghc.eventlog with all the standard RTS events, but also events for each pass in GHC’s pipeline. Let’s see this in action by compiling `hello.hs’ from earlier.
We now have an eventlog. But what can we do with it? You’ve got several options:
- The ghc-events library and program, used by all the other options, provide primitives for decoding the eventlog and extracting data from it. The accompanying program provides various commands for querying eventlog files.
- The threadscope program lets you visualize eventlogs with a GTK+ frontend.
- ghc-events-analyze produces per-label totals, SVG visualizations and per-label timing data given an eventlog file as input.
- … and maybe other options that I’m not remembering or aware of at all.
I personally use
ghc-events-analyze a lot, since it can produce pretty SVG pictures that I can then embed in GitLab comments or… blog posts. :-)
To get the timings and SVG picture, you need to install
ghc-events-analyze, and then ask it to process
ghc.eventlog with the right “markers”:
$ ghc-events-analyze --timed --timed-txt --totals \ --start "GHC:started:" --stop "GHC:finished:" \ ghc.eventlog # --timed: produce SVG visualization of the eventlog, in ghc.timed.svg # --timed-txt: produce per-label groupings of timings that report when the events were emitted, in ghc.timed.txt # --totals: produce per-label totals, reporting how much time was spent in a given label, in total # --start: all events that we're interested in are wrapped in # --stop GHC:started:... / GHC:finished:... events, so we just tell ghc-events-analyze that it should be looking for those markers and report about those events.
Here are the totals that I get for our hello world program compilation.
GC 389461503ns 0.389s USER EVENTS (user events are corrected for GC) systool:linker 2386891920ns 2.387s systool:cc 801347228ns 0.801s systool:as 145128851ns 0.145s Renamer/typechecker [Main] 45709853ns 0.046s initializing package database 20877412ns 0.021s CodeGen [Main] 20754058ns 0.021s CoreTidy [Main] 8262122ns 0.008s NCG 7566252ns 0.008s Chasing dependencies 2441212ns 0.002s Cmm pipeline 2040174ns 0.002s Desugar [Main] 1657607ns 0.002s STG -> Cmm 1103737ns 0.001s Simplifier [Main] 1045768ns 0.001s Cmm -> Raw Cmm 319442ns 0.000s Parser [Main] 286350ns 0.000s CorePrep [Main] 88795ns 0.000s TOTAL 3445520781ns 3.446s THREAD EVENTS 1 460138ns 0.000s IOManager on cap 0:2 670044ns 0.001s TimerManager:3 143699ns 0.000s 4 153050783ns 0.153s weak finalizer thread:5 43518ns 0.000s weak finalizer thread:6 10677ns 0.000s weak finalizer thread:7 33126ns 0.000s weak finalizer thread:8 23787ns 0.000s weak finalizer thread:9 1534ns 0.000s weak finalizer thread:10 8142ns 0.000s weak finalizer thread:11 1352ns 0.000s weak finalizer thread:12 10080ns 0.000s weak finalizer thread:13 10603ns 0.000s weak finalizer thread:14 8767ns 0.000s weak finalizer thread:15 10849ns 0.000s 16 69745ns 0.000s 17 307420ns 0.000s 18 83017ns 0.000s 19 76866ns 0.000s weak finalizer thread:20 10083ns 0.000s 21 96582ns 0.000s 22 103373ns 0.000s 23 62562ns 0.000s 24 97655ns 0.000s weak finalizer thread:25 11676ns 0.000s 26 238116ns 0.000s 27 245821ns 0.000s weak finalizer thread:28 8268ns 0.000s 29 80235ns 0.000s 30 86759ns 0.000s 31 21571ns 0.000s weak finalizer thread:32 396ns 0.000s TOTAL 156087244ns 0.156s
And the SVG image:
We can see two time windows where we don’t have an “active” label:
- towards the beginning, where GHC likely parses settings, sets everything up and reads the input file;
- after code generation and before calling the assembler, where it turns out we spend most of the time getting some information about the C compiler being used.
We otherwise have a pretty good picture of what GHC is doing the rest of the time. It might be nice in the future to add a few more events, e.g to track when we’re running Template Haskell code.
If you want GHC to only emit so-called “user events” (= non-RTS events) to the eventlog, pass
+RTS -l-au instead of
+RTS -l. More options can be found in the relevant section of the GHC user manual.
Of course, if you’re looking into a performance problem, once you narrow down the part of the execution that is taking too long, you will quite likely start using slightly fancier profiling methods, nothing new there. You can however emit your own events when working on GHC, so as to collect some simple timing data about those codepaths that you are working on, without affecting GHC’s performance much. You simply have to use
where the forcing function is used to force the evaluation of the action’s result just as much as the user wants; the time it takes to perform this evaluation is included in the timings reported by
<label> would then show up in the eventlog reports, if the corresponding code path is entered. See Note [withTiming] for more explanations about the
withTiming function and how to use it.
All the eventlog markers that
withTiming emits can then be integrated in heap profiles using eventlog2html, written by Matthew Pickering; see here for an example. If you put your mouse on one of the vertical lines, you will see the label of the phase that begins or ends at that moment.
Before ending this post, I would like to mention that we have some infrastructure to collect timings and allocations for various compilation phases in GHC, over the head.hackage package set. This can be used locally or triggered manually from the CI interface of the GHC repository, and is set up to run nightly against GHC’s
master branch. The resulting data can be loaded in PostgreSQL and queried at will. We plan on collecting more data about GHC’s behaviour when compiling all those packages and on relying more on the eventlog than GHC output parsing to do so, with two major goals in mind:
- have an easy way to track GHC’s performance over time and in particular provide some initial data about where a performance regression might be coming from;
- expose enough data to be able to identify “real world modules” where GHC doesn’t perform as well as we would like it to, which could in turn lead to an investigation and possibly a patch that improves GHC’s performance on the said module.