Specialization is an optimization technique used by GHC to eliminate the performance overhead of ad-hoc polymorphism and enable other powerful optimizations. However, specialization is not free, since it requires more work by GHC during compilation and leads to larger executables. In fact, excessive specialization can result in significant increases in compilation cost and executable size with minimal runtime performance benefits. For this reason, GHC pessimistically avoids excessive specialization by default and may leave relatively low-cost performance improvements undiscovered in doing so.

Optimistic Haskell programmers hoping to take advantage of these missed opportunities are thus faced with the difficult task of discovering and enacting an optimal set of specializations for their program while balancing any performance improvements against the increased compilation costs and executable sizes. Until now, this dance was a clunky one, involving desperately wading through GHC Core dumps only to come up with a precarious, inefficient, unmotivated set of pragmas and/or GHC flags that seemed to improve performance.

In this two-part series of posts, I describe the recent work we have done to improve this situation and make optimal specialization of Haskell programs more of a science and less of a dark art. In this first post, I will

  • give a comprehensive introduction to GHC’s specialization optimization,
  • explore the various facilities that GHC provides for observing and controlling it, and
  • present a simple framework for thinking about the trade-offs of specialization.

In the next post of the series, I will

  • present the new tools and techniques we have developed to diagnose performance issues resulting from ad-hoc polymorphism,
  • demonstrate how these new tools can be used to systematically identify useful specializations, and
  • make sense of their impact in terms of the framework described in this post.

The intended audience of this post includes intermediate Haskell developers who want to know more about specialization and ad-hoc polymorphism in GHC, and advanced Haskell developers who are interested in systematic approaches to specializing their applications in ways that minimize compilation cost and executable sizes while maximizing performance gains.

This work was made possible thanks to Hasura, who have supported many of Well-Typed’s successful initiatives to improve tooling for commercial Haskell users.

I presented a summary of the content in this post on The Haskell Unfolder:

The Haskell Unfolder Episode 23: specialisation

Overloaded functions are common in Haskell, but they come with a cost. Thankfully, the GHC specialiser is extremely good at removing that cost. We can therefore write high-level, polymorphic programs and be confident that GHC will compile them into very efficient, monomorphised code. In this episode, we’ll demystify the seemingly magical things that GHC is doing to achieve this.

Ad-hoc polymorphism

In Haskell, an ad-hoc polymorphic or overloaded function is one whose type contains class constraints. For example, this f is an overloaded function:

f :: (Ord a, Num a) => a -> a -> a
f x y =
    if x < y then
        x + y
    else
        x - y

For some type a such that Ord a and Num a instances are provided, f takes two values of type a and evaluates to another a.

Importantly, unlike type arguments, those class constraints are not erased at runtime! Instead, they are passed to f just like any other value argument, meaning f at runtime is more like:

f :: Ord a -> Num a -> a -> a -> a
f ord_a num_a x y = ...

How does the definition of f change to represent this? And what do these ord_a and num_a values look like? This is how it works:

  • Instances are compiled to records, typically referred to as dictionaries, whose fields are the definitions provided in the instance.
  • Class functions (e.g. < in the body of f) become record selectors that are applied to the dictionaries to look up the appropriate definitions.

Thus, f at runtime is more like:

f :: Ord a -> Num a -> a -> a -> a
f ord_a num_a x y =
    if (<) ord_a x y then
        (+) num_a x y
    else
        (-) num_a x y

The previously-infix class operators are now applied in prefix position to select the appropriate definitions out of the dictionaries, which are then applied to the arguments.
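
We can write this model down as real, compilable Haskell. The names here (OrdDict, NumDict, fModel, and so on) are hypothetical, purely for illustration; the dictionaries GHC actually generates differ in their details:

-- Dictionaries are just records of class method implementations:
data OrdDict a = OrdDict { lt :: a -> a -> Bool }
data NumDict a = NumDict { plus :: a -> a -> a, minus :: a -> a -> a }

-- An instance becomes an ordinary record value:
ordDictInt :: OrdDict Int
ordDictInt = OrdDict { lt = (<) }

numDictInt :: NumDict Int
numDictInt = NumDict { plus = (+), minus = (-) }

-- The overloaded function takes dictionaries as ordinary arguments and
-- applies the record selectors to look up each method implementation:
fModel :: OrdDict a -> NumDict a -> a -> a -> a
fModel ord_a num_a x y =
    if lt ord_a x y then
        plus num_a x y
    else
        minus num_a x y

A call such as fModel ordDictInt numDictInt 1 2 then corresponds to a call of f at type Int.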

We can see this for ourselves by compiling the definition of f in a module F.hs and emitting the intermediate representation (in GHC’s Core language):

ghc F.hs -O -dno-typeable-binds -dsuppress-all -dsuppress-uniques -ddump-ds

The -O flag enables optimizations, and the -ddump-ds flag tells GHC to dump the Core representation of the program after desugaring, before optimizations. The other flags make the output more readable.

For a comprehensive introduction to GHC Core and the flags GHC accepts for viewing it, check out The Haskell Unfolder Episode 9: GHC Core.

The above command will output the following Core for f:

f = \ @a $dOrd $dNum x y ->
      case < $dOrd x y of {
        False -> - $dNum x y;
        True -> + $dNum x y
      }

The if has been transformed into a case (Core has no if construct). The $dOrd and $dNum arguments are the Ord a and Num a instance dictionaries, respectively. The < operator is applied in prefix position (as are all operators in Core) to the $dOrd dictionary to get the appropriate implementation of <, which is further applied to x and y. The - and + operators in the branches of the case are similar.

The extra allocations required to pass these implicit dictionary arguments and apply selectors to them do result in a measurable overhead, albeit one that is insignificant for most intents and purposes. As we will see, the real cost of ad-hoc polymorphism comes from the optimizations it prevents rather than the overhead it introduces.

Specialization

In this context, specialization refers to the removal of ad-hoc polymorphism. When we specialize an overloaded expression e :: C a => S a, we create a new binding eT :: S T, where T is some concrete type for which a C T instance exists. Here eT is the specialization of e at (or to) type T.

For example, we can manually create a specialization of f at type Int. The source definition stays exactly the same, only the type changes:

fInt :: Int -> Int -> Int
fInt x y =
    if x < y then
        x + y
    else
        x - y

At the Core level, the dictionaries that were passed as value arguments to f are now used directly in the body of fInt. If we add the definition of fInt to our example module and compile it as we did before, we get the following output:

f = \ @a $dOrd $dNum x y ->
      case < $dOrd x y of {
        False -> - $dNum x y;
        True -> + $dNum x y
      }

fInt
  = \ x y ->
      case < $fOrdInt x y of {
        False -> - $fNumInt x y;
        True -> + $fNumInt x y
      }

fInt no longer accepts dictionary arguments, and instead references the global Ord Int and Num Int dictionaries directly. In fact, this definition of fInt is exactly what the GHC specializer would create if it decided to specialize f to Int. We can see this for ourselves by manually instructing GHC to do the specialization using a SPECIALIZE pragma. Our whole module is now:

module F where

{-# SPECIALIZE f :: Int -> Int -> Int #-}

f :: (Ord a, Num a) => a -> a -> a
f x y =
    if x < y then
        x + y
    else
        x - y

fInt :: Int -> Int -> Int
fInt x y =
    if x < y then
        x + y
    else
        x - y

And the -ddump-ds Core output becomes:

fInt
  = \ x y ->
      case < $fOrdInt x y of {
        False -> - $fNumInt x y;
        True -> + $fNumInt x y
      }

$sf
  = \ x y ->
      case < $fOrdInt x y of {
        False -> - $fNumInt x y;
        True -> + $fNumInt x y
      }

f = \ @a $dOrd $dNum x y ->
      case < $dOrd x y of {
        False -> - $dNum x y;
        True -> + $dNum x y
      }

The GHC generated specialization is named $sf (all specializations that GHC generates are prefixed by $s). Note that our specialization (fInt) and the GHC generated specialization ($sf) are exactly equivalent!

Why is this an optimization?

The above transformation really is all that the GHC specializer does to our programs. It may not be immediately clear why it is a meaningful optimization at all. That is because specialization is an enabling optimization: The real benefit comes from the optimizations that it enables later in the pipeline, such as inlining.

Inlining is the replacement of defined (top-level or let-bound) variables with their definitions. Although f and its specialization $sf look similar, the key difference is that f includes calls to “unknown” functions passed as part of the dictionary arguments, while $sf includes calls to “known” functions contained in the $fOrdInt and $fNumInt dictionaries. Since GHC has access to the definitions of those dictionaries and the contained functions, they can be inlined, exposing yet more opportunities for optimization.

We can see this in action by comparing the fully optimized bindings of our example module to those just after desugaring. To do this, compile using the same command as above but add the -ddump-simpl flag, which tells GHC to dump the Core at the end of the Core optimization pipeline (also add -fforce-recomp to force recompilation, since we haven’t changed the code since our last compilation):

ghc F.hs -fforce-recomp -O -dno-typeable-binds -dsuppress-all -dsuppress-uniques -ddump-ds -ddump-simpl

The dumped output is:

==================== Desugar (after optimization) ====================
Result size of Desugar (after optimization)
  = {terms: 57, types: 37, coercions: 0, joins: 0/0}

fInt
  = \ x y ->
      case < $fOrdInt x y of {
        False -> - $fNumInt x y;
        True -> + $fNumInt x y
      }

$sf
  = \ x y ->
      case < $fOrdInt x y of {
        False -> - $fNumInt x y;
        True -> + $fNumInt x y
      }

f = \ @a $dOrd $dNum x y ->
      case < $dOrd x y of {
        False -> - $dNum x y;
        True -> + $dNum x y
      }

==================== Tidy Core ====================
Result size of Tidy Core
  = {terms: 44, types: 29, coercions: 0, joins: 0/0}

fInt
  = \ x y ->
      case x of { I# x1 ->
      case y of { I# y1 ->
      case <# x1 y1 of {
        __DEFAULT -> I# (-# x1 y1);
        1# -> I# (+# x1 y1)
      }
      }
      }

f = \ @a $dOrd $dNum x y ->
      case < $dOrd x y of {
        False -> - $dNum x y;
        True -> + $dNum x y
      }

------ Local rules for imported ids --------
"USPEC f @Int" forall $dNum $dOrd. f $dOrd $dNum = fInt

The output of the desugaring pass is in the “Desugar (after optimization)” section, while the fully optimized output is in the “Tidy Core” section. The name “Desugar (after optimization)” only means it is the desugared Core output after GHC’s simple optimizer has run. The simple optimizer only does very lightweight, pure transformations to the Core program. We will still refer to the Core output of this stage as “unoptimized”.

During the full optimization pipeline, GHC identified the equivalence between fInt and $sf and decided to remove $sf. The fully optimized binding for fInt is unboxing the Ints (pattern matching on the I# constructor) and using efficient primitive operations (<#, -#, +#), while the fully optimized binding for f is the same as the unoptimized binding. The optimizer simply couldn’t do anything with those opaque dictionaries in the way!

At the bottom of the output is the rewrite rule created by the SPECIALIZE pragma, which causes any call of f known to be at type Int to be rewritten as an application of fInt. This is what allows the rest of the program to benefit from the specialization. The rule simply discards the dictionary arguments $dOrd :: Ord Int and $dNum :: Num Int, which is safe because of global typeclass coherence: any dictionaries passed at those positions must have originally come from the same global instances.
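
For example, suppose a binding elsewhere in the program (a hypothetical g) calls f at Int:

g :: Int -> Int
g x = f x 10

In Core, this call starts out as (roughly) f @Int $fOrdInt $fNumInt x 10. Once the rule fires, it becomes fInt x 10, and g benefits from the specialization without any change of its own.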

In summary, by replacing the opaque dictionary arguments to f with references to the concrete Ord Int and Num Int dictionaries in fInt, GHC was able to do a lot more optimization later in the pipeline.

Automatic specialization

In our example module, we manually instructed GHC to generate a specialization of f at Int using a SPECIALIZE pragma. In reality, we often rely on GHC to figure out what specializations are necessary and generate them for us automatically. GHC needs to be careful though, since specialization requires the creation and optimization of more bindings, which increases compilation costs and executable sizes.

GHC uses several heuristics to avoid excessive automatic specialization by default. The heuristics are very pessimistic, which means GHC can easily miss valuable specialization opportunities that programmers may wish to manually address. This is precisely the manual effort that our recent work aims to assist, so before we go any further it’s important that we understand exactly when and why GHC decides specialization should (or should not) happen.

When does automatic specialization happen?

GHC will only attempt automatic specialization in exactly one scenario: An overloaded call at a concrete, statically known type is encountered (we’ll refer to such calls as “specializable” calls from now on). This means that automatic specialization will only ever be triggered at call sites, not definition sites. For instance, given our overloaded f, only the first of these two hypothetical bindings makes a specializable call:
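
-- Specializable: f is called at the concrete type Int, so GHC knows
-- statically which dictionaries the call receives.
callA :: Int
callA = f (1 :: Int) 2

-- Not specializable: the type is still polymorphic here, so callB simply
-- passes its own dictionary arguments along to f.
callB :: (Ord a, Num a) => a -> a
callB x = f x x

Even when a specializable call is encountered, there are other factors to consider, which the following example will demonstrate.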

Let’s add a binding foo to our example module F.hs from above:

foo :: (Integer, Integer) -> Integer
foo (x, y) = f x y

foo makes a specializable call to f at the concrete type Integer, so we might expect automatic specialization to happen. However, the inliner beats the specializer to the punch here, which is evident in the -ddump-simpl output:

$wfoo
  = \ ww ww1 ->
      case integerLt ww ww1 of {
        False -> integerSub ww ww1;
        True -> integerAdd ww ww1
      }

foo = \ ds -> case ds of { (ww, ww1) -> $wfoo ww ww1 }

Instead of specializing, GHC decided to eliminate the call entirely by inlining f, thus exposing other optimization opportunities (such as worker/wrapper), which GHC took advantage of. This is intended behavior: f is so small that GHC knows inlining it is very cheap and likely a performance win.

Another way we can observe the inlining decision by GHC here is via the -ddump-inlinings flag, which causes GHC to dump the names of any bindings it decides to inline. Compiling our module with

ghc F.hs -O -fforce-recomp -ddump-inlinings

results in output indicating that GHC did decide to inline f:

Inlining done: F.f

To inline or to specialize?

GHC prefers inlining over specialization, when possible, since inlining eliminates calls and doesn’t require creation of new bindings. However, excessive inlining is often even more dangerous than excessive specialization. So, even when a specializable call is deemed too costly to inline, GHC will still attempt to specialize it.

We can artificially create such a scenario in our example by adjusting what GHC calls the “unfolding use threshold”. An “unfolding” is, roughly, the definition of a binding that GHC uses when it decides to inline or specialize calls to that binding. The unfolding use threshold governs the maximum effective size¹ of unfoldings that GHC will inline, and it can be manually adjusted using the -funfolding-use-threshold flag. Let’s set the unfolding use threshold to -1, essentially making GHC think all inlining is very expensive, and check the -ddump-simpl output:

ghc F.hs -O -fforce-recomp -ddump-simpl -funfolding-use-threshold=-1

As we can see, GHC did specialize the call:

...
f_$sf1
  = \ x y ->
      case integerLt x y of {
        False -> integerSub x y;
        True -> integerAdd x y
      }

foo = \ ds -> case ds of { (ww, ww1) -> f_$sf1 ww ww1 }

------ Local rules for imported ids --------
"SPEC f @Integer" forall $dOrd $dNum. f $dOrd $dNum = f_$sf1
...

The name of the specialization (f_$sf1) and the rewrite rule indicate that GHC did successfully automatically specialize the overloaded call to f.

Interestingly, the Core terms for foo and the specialization f_$sf1 are alpha-equivalent to the terms we arrived at when GHC inlined the call and applied worker/wrapper instead², with the specialization playing the same role as the worker.

Cross-module automatic specialization

We have now discussed two prerequisites for automatic specialization of a call:

  • The call must be specializable (i.e. it must be a call to an overloaded binding at a known type).
  • Other optimizations, such as inlining, that remove the call or otherwise ruin the specializability of the call must not fire before specialization can occur.

In fact, for specializable calls which occur in the definition module of the overloaded binding (as was the case in our previous example), these are the only prerequisites. When the overloaded binding is imported from another module (as is most often the case), there are additional prerequisites which we’ll discuss now.

Exposed unfoldings and the INLINABLE pragma

GHC performs separate compilation (as opposed to whole program compilation), compiling one Haskell module at a time. When GHC compiles a module, it produces not only compiled code in an object file, but also an interface file (with suffix .hi). The interface file contains information about the module that GHC might need to reference when compiling other modules, such as the names and types of the bindings exported by the module. If certain criteria are met, GHC will include a binding’s unfolding in the module’s interface file so that it can be used later for cross-module inlining or specialization. Such unfoldings are referred to as exposed unfoldings.

Now, you might reasonably wonder: If unfoldings are used to do these powerful optimizations, why does GHC only expose unfoldings which meet some criteria? Why not expose all unfoldings? The reason is that during compilation, GHC holds the interfaces of every module in the program in memory. Thus, to keep GHC’s own default performance and memory usage reasonable, module interfaces need to be as small as possible while still producing well-optimized programs. One way that GHC achieves this is by limiting the size of unfoldings that get included in interface files so that only small unfoldings are exposed by default.

There’s another wrinkle here that impacts cross-module specialization: Even if GHC decides to expose an overloaded binding’s unfolding, and a specializable call to that binding occurs in another module, GHC will still never automatically specialize that call unless it has been given explicit permission to create the specialization. Such explicit permission can only be given in one of the following ways:

  • an INLINABLE or INLINE pragma on the definition of the overloaded binding, or
  • the -fspecialize-aggressively flag supplied when compiling the module containing the specializable call.

Let’s explore this fact by continuing with our example. Move foo, which makes a specializable call to f, to another module Foo.hs that has -funfolding-use-threshold set to -1 to fool the inliner as before:

{-# OPTIONS_GHC -funfolding-use-threshold=-1 #-}
module Foo where

import F

foo :: (Integer, Integer) -> Integer
foo (x, y) = f x y

Also remove everything from F.hs except f, for good measure:

module F where

f :: (Ord a, Num a) => a -> a -> a
f x y =
    if x < y then
        x + y
    else
        x - y

Since f is so small, we might expect GHC to expose its unfolding in the F.hi module interface by default. If we compile with just

ghc F.hs

we get the object file F.o and the interface file F.hi. We can determine whether GHC decided to expose the unfolding of f by viewing the contents of the interface file using GHC’s --show-iface option:

ghc --show-iface F.hi -dsuppress-all

Specific information for each binding in the module is listed towards the bottom of the output. The GHC Core of any exposed unfoldings will be displayed under their respective bindings. In this case, the information for f looks like this:

bcb4b04f3cbb5e6aa2f776d6226a0930
  f :: (Ord a, Num a) => a -> a -> a
  []

It only includes the type, no unfolding! This is because at GHC’s default optimization level of -O0, the -fomit-interface-pragmas and -fignore-interface-pragmas flags are enabled, which prevent unfoldings (among other things) from being included in and read from module interfaces. Recompile with optimizations enabled and check the module interface again:

ghc -O F.hs
ghc --show-iface F.hi -dsuppress-all

This time, GHC did expose the unfolding:

152dd20f273a86bea689edd6a298afe6
  f :: (Ord a, Num a) => a -> a -> a
  [...,
   Unfolding: Core: <vanilla>
              \ @a
                ($dOrd['Many] :: Ord a)
                ($dNum['Many] :: Num a)
                (x['Many] :: a)
                (y['Many] :: a) ->
              case < @a $dOrd x y of wild {
                False -> - @a $dNum x y;
                True -> + @a $dNum x y }]

Remember, we still haven’t given GHC explicit permission to specialize calls to f across modules, so we should expect the fully optimized Core of Foo.hs to still include the overloaded call to f. Let’s check:

ghc Foo.hs -O -dno-typeable-binds -dsuppress-all -dsuppress-uniques -ddump-simpl

The dumped Core includes:

$wfoo = \ ww ww1 -> f $fOrdInteger $fNumInteger ww ww1

foo = \ ds -> case ds of { (ww, ww1) -> $wfoo ww ww1 }

Indeed, GHC applied the worker/wrapper transformation to foo, but was not able to specialize the call to f, despite it meeting our previously discussed prerequisites for automatic specialization.

There is a warning flag in GHC that can notify us of such a case: -Wall-missed-specializations. Compile Foo.hs again, including this flag:

ghc Foo.hs -O -fforce-recomp -Wall-missed-specializations

This will output the following warning:

Foo.hs: warning: [-Wall-missed-specialisations]
    Could not specialise imported function ‘f’
    Probable fix: add INLINABLE pragma on ‘f’
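
Concretely, the fix is a one-line change to F.hs, placing the pragma next to the definition of f:

module F where

f :: (Ord a, Num a) => a -> a -> a
{-# INLINABLE f #-}
f x y =
    if x < y then
        x + y
    else
        x - y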

With the pragma in place, if we dump the Core of Foo.hs again, we’ll see that automatic specialization succeeds:

$sf
  = \ x y ->
      case integerLt x y of {
        False -> integerSub x y;
        True -> integerAdd x y
      }

foo = \ ds -> case ds of { (ww, ww1) -> $sf ww ww1 }

------ Local rules for imported ids --------
"SPEC/Foo f @Integer" forall $dOrd $dNum. f $dOrd $dNum = $sf

Removing the INLINABLE pragma on f and instead compiling Foo.hs with -fspecialize-aggressively has the same result.
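
That is, with no pragma in F.hs at all, the following variant of Foo.hs (a hypothetical modification of our module) also gets the specialization, since the flag gives GHC permission to specialize any imported overloaded binding whose unfolding is exposed:

{-# OPTIONS_GHC -funfolding-use-threshold=-1 -fspecialize-aggressively #-}
module Foo where

import F

foo :: (Integer, Integer) -> Integer
foo (x, y) = f x y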

The automatic specialization decision graph

We have now covered all the major prerequisites for automatic specialization. To summarize them, here is a decision graph illustrating the various ways that an arbitrary function call can trigger automatic specialization:

The automatic specialization decision graph

Now that we fully understand how, why, and when the GHC specializer works, we can move on to discussing the real problems that result from its behavior. Most of this discussion will be left for the next post in this series, but before concluding, I want to introduce something I call “the specialization spectrum”.

The specialization spectrum

Specialization is a very valuable compiler optimization, but I’ve mentioned many times throughout this post that excessive specialization can be a bad thing. This prompts a question: How do we know if we are appropriately benefitting from specialization? The meaning of “appropriately” here depends on application-specific requirements that dictate the desired size of our executables, how much we care about compilation costs, and how much we care about performance.

For example, if we want to maximize performance at all costs, we should make sure that we are generating and using the set of specializations that maximize the performance metrics we’re interested in, disregarding the increase in compilation costs and executable sizes.

Essentially, our goal is to find our ideal spot in the specialization spectrum.

The Specialization Spectrum
Idealized graph of performance against code size and compilation cost.

This is our search space, with performance on one axis and code size and compilation cost on the other. The plotted points represent important application-agnostic points in the spectrum. Those points are:

  • Baseline: Lowest performance and lowest cost. This point represents GHC’s default behavior where its heuristics will result in smaller code size and lower compilation cost but potentially miss specializations that would result in big performance wins.
  • Ideal: As the application authors, we get to choose the location of this point based on our priorities. Typically, we want this as “high and to the left” as possible.
  • Max performance: This point represents the optimal set of specializations, which will result in better runtime performance than any other set of specializations.
  • Max specialization: This point is the result of generating every³ possible specialization by enabling -fexpose-all-unfoldings and -fspecialize-aggressively. Importantly, this is not always equivalent to max performance! If we generate useless specializations that result in little to no performance improvements but do grow the code size, we can end up losing performance due to more code swapping in and out of CPU caches.

The dotted line illustrates an approximate “optimal path” representing the results we might see as we generate all specializations in order of decreasing performance improvement.

This framework makes it clear that this really is just an optimization problem, with all the normal issues of traditional optimization problems at play. Unfortunately, in the absence of good tools for exploring this spectrum, it is particularly easy for programmers to get lost and go down treacherous, suboptimal paths like this:

Typical path leading to high compilation cost but only medium performance.

Such cases are deceptive, making the programmer think they have landed in a good spot when they are actually stuck in a poor-performing local optimum. Fortunately, the tools and techniques we’ll discuss in the next post of this series will greatly simplify the search for an optimal point in the specialization spectrum.

Summary

This concludes our introductory exploration of specialization. Here’s what we have learned:

  • Calls to overloaded functions are compiled by passing dictionary values with a record of functions for each type class constraint.
  • Specialization removes type class dictionary arguments from an overloaded function and replaces references to them with references to a concrete dictionary instead.
  • Almost all of the benefit of specialization comes from the optimizations that it enables by replacing the opaque dictionary arguments with concrete dictionaries whose contents can be inlined.
  • GHC will only automatically specialize calls if a specific set of conditions holds. See the automatic specialization decision graph.
  • The specialization spectrum is a convenient framework for conceptualizing the impact of specialization on a program’s compilation cost and runtime performance.

In the next post of this series, we will apply all of what we have learned so far on some example applications, and demonstrate how the new tools we have developed can help us achieve optimal specialization and performance.

Footnotes

  1. The effective size of an unfolding can be thought of as the number of terms in the Core representation of the unfolding, plus or minus some discounts that are applied depending on where GHC is considering inlining the unfolding.↩︎

  2. This hints at a weak confluence of GHC Core and the reductions (i.e. optimizations) that the GHC optimizer applies to it.↩︎

  3. Even with something like this in a cabal.project file:

    package *
      ghc-options: -fexpose-all-unfoldings -fspecialize-aggressively

    Some overloaded calls may still not get specialized! This can occur if a chain of calls to overloaded functions includes a call to an overloaded function in a GHC boot library that cannot be reinstalled by Cabal, e.g. base, which does not have its unfolding exposed. The only way to specialize such calls is to build boot libraries from source with -fexpose-all-unfoldings and -fspecialize-aggressively, and include the snippet above in a cabal.project file.

    Additionally, some specific scenarios can cause overloaded calls to appear late in the optimization pipeline. To specialize those calls, -flate-specialise (British spelling required) is necessary, which runs another specialization pass at the end of GHC’s Core optimization pipeline.

    Further, even after the above, some overloaded calls may still survive without -fpolymorphic-specialisation (British spelling required), which is known to be unsound at the time of writing. Unfortunately, in complex applications, total elimination of overloaded calls is still quite a difficult goal to achieve.↩︎