TL;DR Build your Haskell projects 10-15% faster with this one simple trick!
(Spoiler: the simple trick is to wait for the next major cabal-install release.)
In previous work (paid for by the Sovereign Tech Fund) we
did a lot of heavy lifting to make a major architectural change to Cabal. That
work is now paying off with practical benefits. This post covers follow-on
architectural improvements to cabal-install which then enable us to eliminate
redundant work in the configure phase, yielding significant reductions in
build times.
The changes will be available to everyone in the next major cabal-install
release. For a large project like pandoc (including all of its dependencies)
we measure a 10% (std.dev. 0.6pp) reduction in wall clock time for a 16-way
parallel build with --semaphore. No user changes are needed to take advantage
of this improvement.
History: Cabal and cabal-install
The genesis: the Cabal specification
First, there was Cabal. Its design was laid out in A Common Architecture for Building Applications
and Tools. Fundamentally, it defines the notion of a package, with
each package being built and installed with the following sequence of commands:
> hc Setup.hs
> ./Setup configure
> ./Setup build
> ./Setup install
Each package must be built in dependency order, with hc-pkg registering each
installed library into a package database.
Orchestrating the build of multiple packages
cabal-install was then born to plan and execute a build plan consisting of
many packages. With its solver, it determines a build plan, which is then
orchestrated by running the above sequence of commands for each package,
in dependency order.
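The orchestration described above can be sketched in a few lines of Haskell. This is only an illustration under simplifying assumptions: the `Pkg` type, `buildOrder` and `plan` are hypothetical names, not part of cabal-install, and the real tool works from a solver-produced plan rather than a bare list of names.

```haskell
import Data.List (partition)

-- Hypothetical, simplified package: a name plus the names of its dependencies.
data Pkg = Pkg { pkgName :: String, pkgDeps :: [String] }

-- The per-package command sequence from the Cabal specification.
setupSteps :: [String]
setupSteps = ["hc Setup.hs", "./Setup configure", "./Setup build", "./Setup install"]

-- Order packages so that each one comes after all of its dependencies.
-- Returns Left with the packages that can never become ready
-- (a dependency cycle, or a dependency that is not in the input list).
buildOrder :: [Pkg] -> Either [String] [String]
buildOrder = go []
  where
    -- 'done' holds the names scheduled so far, most recent first.
    go done [] = Right (reverse done)
    go done todo =
      case partition (all (`elem` done) . pkgDeps) todo of
        ([], stuck)   -> Left (map pkgName stuck)
        (ready, rest) -> go (reverse (map pkgName ready) ++ done) rest

-- Pair every package, in dependency order, with each command to run for it.
plan :: [Pkg] -> Either [String] [(String, String)]
plan pkgs = do
  order <- buildOrder pkgs
  pure [ (name, step) | name <- order, step <- setupSteps ]
```

For example, `buildOrder [Pkg "app" ["lib", "base"], Pkg "lib" ["base"], Pkg "base" []]` evaluates to `Right ["base", "lib", "app"]`.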
There is however one architectural mismatch: for the solver to be able to compute a build plan, it already needs a lot of information about the current system:
- What Haskell compiler are we using?
- What system libraries are available (pkgconfig-depends)?
- What build tools are available (build-tool-depends)?
This means that cabal-install already has in its hands most of the information
necessary for configuring a package; in particular it has already resolved all
the conditionals in every package description. We should thus be able to skip
most of the steps in the package’s ./Setup configure phase. However,
the command-line interface of ./Setup configure makes it practically
impossible to do so: passing a fully resolved dependency graph would require many
additions to the already bloated ConfigFlags datatype,
and a lot more data being serialised/deserialised.
Because of this limitation, cabal-install’s approach was to take its hard-won
build plan and convert it into ConfigFlags that specify exact dependency
versions and flag assignments. This amounts to passing ./Setup configure
an already fully constrained configuration; the configure step would then
re-probe the system, re-read package databases… only to re-discover exactly
what cabal-install already knew!
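To make the conversion concrete, here is a minimal sketch of turning one entry of a solved plan into ./Setup configure arguments. The `ResolvedPkg` type and `configureArgs` function are hypothetical; the arguments follow the style of the `--exact-configuration`, `--flags` and `--dependency` options of ./Setup configure, but this is not the exact set of arguments that cabal-install passes.

```haskell
-- Hypothetical, simplified view of one package in a solved build plan.
data ResolvedPkg = ResolvedPkg
  { rpFlags :: [(String, Bool)]    -- flag assignment chosen by the solver
  , rpDeps  :: [(String, String)]  -- (dependency name, chosen unit id)
  }

-- Render the solver's decisions as command-line arguments, so that the
-- configure step has no choices left to make.
configureArgs :: ResolvedPkg -> [String]
configureArgs rp =
  ["configure", "--exact-configuration"]
    ++ [ "--flags=" ++ unwords (map renderFlag (rpFlags rp))
       | not (null (rpFlags rp)) ]
    ++ [ "--dependency=" ++ name ++ "=" ++ unitId
       | (name, unitId) <- rpDeps rp ]
  where
    -- "+flag" switches a flag on, "-flag" switches it off.
    renderFlag (flag, on) = (if on then '+' else '-') : flag
```

Even with every choice pinned down like this, the configure step still has to look each dependency up again in the package database, which is the redundant work described above.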
A new architecture for cabal-install
The paradigm shift set out in our Sovereign Tech Fund proposal
is that cabal-install should be responsible for orchestrating the whole build
process instead of running the conceptually independent build systems provided by
each package. With cabal-install now in control, it can directly call Cabal
library functions, which in turn allows skipping steps in the configure phase
that waste time re-discovering information that cabal-install is already
aware of.
To implement such a change, we first needed to prepare the terrain. When invoking
an external executable such as the Setup executable (say via the process
library, as Cabal does), we can set the working directory and environment
variables and redirect input/output handles. This was not possible when calling
Cabal library functions directly, so we first needed to add Cabal library
support for setting the working directory and for choosing logging handles.
Once this was done, we could refactor cabal-install to call Cabal library
functions directly to build packages.
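For comparison, this is roughly what the external route looks like with the process library: the working directory, environment and output handles are all fields of the `CreateProcess` record. The `setupInvocation` helper and the `example` value are hypothetical names used only for illustration, and the sketch only describes the process without running it.

```haskell
import System.IO (Handle, stderr)
import System.Process (CreateProcess (..), StdStream (..), proc)

-- Describe (without running) an invocation of a package's Setup executable.
-- Note: when 'cwd' is set, whether a relative command path such as "./Setup"
-- is resolved against 'cwd' or the current directory is platform dependent.
setupInvocation
  :: FilePath            -- directory of the package to build
  -> [(String, String)]  -- complete environment for the child process
  -> Handle              -- where the child's output should be logged
  -> [String]            -- arguments, e.g. ["configure"]
  -> CreateProcess
setupInvocation pkgDir childEnv logHandle args =
  (proc "./Setup" args)
    { cwd     = Just pkgDir
    , env     = Just childEnv          -- replaces the inherited environment
    , std_out = UseHandle logHandle
    , std_err = UseHandle logHandle
    }

-- An example value with made-up inputs; pass it to 'createProcess' to run it.
example :: CreateProcess
example = setupInvocation "some-pkg" [("LANG", "C")] stderr ["configure"]
```

A library function called in-process gets none of this for free: it shares the caller's working directory and output handles unless its API lets the caller override them.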
Performance impact
This architectural change provides a solid foundation for further improvements.
Using a new --build-timings flag to cabal-install, we identified the two main
time sinks in the Cabal configure phase:
- (~50% of configure time) Re-configuring the compiler program database. The compiler and hc-pkg were already pre-configured, but other programs such as haddock, ar, ld etc. were re-configured anew for each package.
- (~40% of configure time) Re-probing the installed package database, via hc-pkg dump.
We can skip this extra work by pre-configuring the compiler ProgramDb
and keeping a running InstalledPackageIndex. These
two changes, taken together, reduce the time spent in the configure phase by
over 90%.
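The idea of configuring programs once and keeping a running package index can be sketched as a fold that threads shared state through the build. The types below are hypothetical stand-ins, not Cabal's ProgramDb or InstalledPackageIndex, and "building" a package is reduced to recording its name.

```haskell
import Control.Monad (foldM)

-- Hypothetical, simplified package: a name plus the names of its dependencies.
data Pkg = Pkg { pkgName :: String, pkgDeps :: [String] }

-- State shared by all packages in one build.
data BuildState = BuildState
  { programs  :: [(String, FilePath)]  -- configured once, before any package
  , installed :: [String]              -- running index, most recent first
  }

-- Look a program up in the shared database instead of probing for it again.
programPath :: BuildState -> String -> Maybe FilePath
programPath st name = lookup name (programs st)

-- "Build" one package: check its dependencies against the running index,
-- then add the package itself to the index for later packages to find.
buildOne :: BuildState -> Pkg -> Either String BuildState
buildOne st pkg =
  case filter (`notElem` installed st) (pkgDeps pkg) of
    []          -> Right st { installed = pkgName pkg : installed st }
    missing : _ -> Left (pkgName pkg ++ ": missing dependency " ++ missing)

-- Process packages in dependency order, threading the shared state through.
buildAll :: [(String, FilePath)] -> [Pkg] -> Either String BuildState
buildAll progs = foldM buildOne (BuildState progs [])
```

The point of the design is that nothing inside `buildOne` touches the outside world: the program database is an input, and the package index is updated incrementally rather than re-read for every package.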
While most of the time in builds is unsurprisingly spent… actually compiling
Haskell code [citation needed], the impact on full builds is still rather
significant. For example, when compiling aeson with -j1, we saw a reduction
in total build time of ~16.6% (std.dev. 1.9pp) in our benchmarks.
The fact that the configure phase is inherently serial also means that these
improvements have a notable impact when combined with the -jsem feature.
This is because the -jsem feature allows us to assign more capabilities to
the build phase. As per Amdahl’s law, this results in the
configure phase becoming more of a bottleneck. For example, when compiling
pandoc with cabal install pandoc -j16 --semaphore, we saw a reduction in
total build time of ~10% (std.dev. 0.6pp).
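The Amdahl's law argument can be made concrete with a small model. It assumes, purely for illustration, that the configure work is entirely serial and that the remaining work parallelises perfectly; the function names are hypothetical, and the numbers in the usage note are made up rather than taken from our benchmarks.

```haskell
-- Wall-clock time of a build with 'serial' units of serial work and
-- 'parallel' units of work spread evenly over n capabilities.
wallClock :: Double -> Double -> Int -> Double
wallClock serial parallel n = serial + parallel / fromIntegral n

-- Fraction of the wall-clock time that is spent in the serial part.
serialShare :: Double -> Double -> Int -> Double
serialShare serial parallel n = serial / wallClock serial parallel n
```

With 10 units of serial work and 90 units of parallel work, `serialShare 10 90 1` is 0.1, whereas `serialShare 10 90 16` is 0.64: the more capabilities the build phase gets, the larger the share of the total that any serial work takes up, and the more a reduction in that work matters.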
Further improvements
These improvements provide a small glimpse of what is possible after our changes
to cabal-install’s architecture. A more ambitious long-term goal would be for
cabal-install to manage a “giant build graph” on a finer granularity level
than whole Cabal components. For example, if package q depends only
on module P1 from package p, we could imagine starting to compile q after
compiling P1 but before we have finished compiling the rest of p. This
would unlock build-time reductions by increasing available parallelism,
and also enable more accurate progress and error reporting.
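As a rough illustration of the potential gain, the sketch below compares when q could start under the two granularities. It assumes, hypothetically, that the modules of p are compiled one after another in the listed order and that compile times are known; none of these names exist in cabal-install.

```haskell
-- (module name, compile time) for the modules of p, in compilation order.
type Module = (String, Double)

-- Package-level granularity: q waits for all of p.
startAfterPackage :: [Module] -> Double
startAfterPackage = sum . map snd

-- Module-level granularity: q waits only until the named module is done.
-- Returns Nothing if p has no module of that name.
startAfterModule :: String -> [Module] -> Maybe Double
startAfterModule name mods =
  case break ((== name) . fst) mods of
    (_, [])              -> Nothing
    (before, (_, t) : _) -> Just (sum (map snd before) + t)
```

With `[("P1", 2), ("P2", 3), ("P3", 4)]`, `startAfterPackage` gives 9.0 while `startAfterModule "P1"` gives `Just 2.0`, so in this made-up scenario q could start 7 time units earlier.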