Unrolling data with Backpack

Oleg Grenrus – Friday, 01 November 2019

A co-worker of mine had a random thought on IRC:

Is data type “unrolling” a valid performance optimization? Used anywhere?

Something like: data List a = Nil | Cons a {-# Unroll 10 #-} (List a)

I thought, that sounds like the vec package I wrote. One module there uses the GADT definition

data Vec (n :: Nat) a where
    VNil  :: Vec 'Z a
    (:::) :: a -> Vec n a -> Vec ('S n) a

and a clever way to define functions so GHC unrolls them. Another module uses the data family

data family   Vec (n :: Nat) a
data instance Vec 'Z     a = VNil
data instance Vec ('S n) a = a ::: !(Vec n a)

which in some cases optimizes better than the GADT version. However in neither version we can {-# UNPACK #-} the tail. GHC simply says

Ignoring unusable UNPACK pragma on the second argument of ‘:::’

It’s not “obvious” to the compiler that tail is a single constructor data type: technically it’s not even a single constructor in the GADT; and it’s not clear in data family case either, because of the degenerate Vec ('S Any) case. But a single contructor data type is a requirement for UNPACK to work.

There’s however another trick we can use: Backpack.

Backpack is a system for retrofitting Haskell with an applicative, mix-in module system. In short, we are able to write more efficient monomorphic code. For example, the description of unpacked-containers says:

This backpack mixin package supplies unpacked sets and maps exploiting backpack’s ability to unpack through signatures.

Unpacking is exactly what we want to do.

Backpack

Backpack is an underdocumented feature of GHC and Cabal. There are Edward Yang’s blog posts, his PhD thesis (which I only scanned through), and in addition to what’s in the source code, that’s all. We won’t let that stop us from experimenting!

The idea of Backpack is that you can write modules which are parametrised by other modules.

Short glossary:

Module is what you think it is.
Signature is a “type signature of module”.
Unit is a collection of modules and signatures. Think library.
Units can be open, contain “holes” (specified by signatures) to be filled by other modules. See OpenUnitId in Cabal docs.
Mixin is a unit with “inputs” and “outputs” possibly renamed.

One advice I can give: give types and signatures different names (i.e. avoid signature Str where data Str), until you are comfortable with Backpack concepts. That’s similar to giving your types and constructors different names (i.e. avoid data Unit = Unit). Rather say:

signature Sig where
  data S

module Impl where
  data S = MkS

Let’s write the Backpack version of Vec next …

Nil

List-like structure have two cases: nil and cons. We’ll start with a simple one: nil. We create an ordinary Haskell-module in an internal library (unit) of the package:

library v0
  exposed-modules: V0

And the Haskell source is below:

module V0 where

data Vec a = Nil
  deriving (Eq, Ord, Show, Functor, Foldable, Traversable)

instance Applicative Vec where
    pure _  = Nil
    _ <*> _ = Nil

Nil is essentially a new name for Proxy, if you are familiar with that.

Cons

The cons-case is a recursive case. The cons unit requires a module, and exposes a module. In other words it’s a mixin.

I suggestively named the signature of the required input module In, and exported output module Out.

library cons
  signatures:      In
  exposed-modules: Out

Signature In is like a module without any implementations or definitions. The module Out is not different from an ordinary Haskell module. We import In and can use whatever is defined by its signature. To be used for vectors, we are careful to implement Out in such a way that it itself satisfies the In signature again, but that fact is not checked until the later instantiation.

In.hsig

The In signature specifies an API that we expect the tail of the vector, and therefore the whole vector to satisfy. In this example it as simple as specifying:

the type of vectors, Vec,
a list of instances that Vec should satisfy.

signature In where

data Vec a

instance Eq a => Eq (Vec a)
instance Ord a => Ord (Vec a)
instance Show a => Show (Vec a)
instance Functor Vec
instance Foldable Vec
instance Traversable Vec
instance Applicative Vec

Out.hs

The output of the mixin is implementation of cons on vectors.

Note: at this stage nothing checks that Out implements In signature. The compatibility checks are done at instantiation phase. This reminds me of C++ templates, which are properly checked only when instantiated too.

Also note that we use UNPACK on the tail field !(In.Vec a). This actually works, as we’ll see later.

module Out where

import qualified In

data Vec a = !a ::: {-# UNPACK #-} !(In.Vec a)
  deriving (Eq, Ord, Functor, Foldable, Traversable)

infixr 5 :::

instance Applicative Vec where
    pure x = x ::: pure x
    f ::: g <*> x ::: y = f x ::: (g <*> y)

We also write a Show instance manually for a bit prettier output:

instance Show a => Show (Vec a) where
    showsPrec d (x ::: xs) = showParen (d > 5)
        $ showsPrec 6 x
        . showString " ::: "
        . showsPrec 5 xs

Non-empty vectors: v1, v2, v3

Now it’s time to instantiate the cons mixin. We already have a v0 vector. We can instantiate cons with V0 to get V1:

library v1
  mixins:             cons (Out as V1) requires (In as V0)
  reexported-modules: V1
  build-depends:      cons, v0

The v1 library depends on v0 (one smaller vector), and cons. The mixins field says:

take cons
rename its exported module from Out to V1
and instantiate the required signature In to V0.

Then, the Cabal Backpack magic will connect the dots, as v0 exposes V0.

And finally, we make V1 available by re-exporting it, using reexported-modules.

Note: for larger vectors, we require no more Haskell code to be written; we are only connecting existing units.

While larger vectors follow the same pattern, we need to pre-instantiate them manually: they cannot be created at the run-time.

library v2
  mixins:             cons (Out as V2) requires (In as V1)
  reexported-modules: V2
  build-depends:      cons, v1

library v3
  mixins:             cons (Out as V3) requires (In as V2)
  reexported-modules: V3
  build-depends:      cons, v2

Example: Some calculations

Before an actual example, we can define a library which will re-export all the modules we defined:

library
  build-depends:      v0, v1, v2, v3
  reexported-modules: V0, V1, V2, V3

Finally we can write a program that will use all these vectors:

executable main
  main-is:          Main.hs
  build-depends:
    , backpack-vector
    , base

Main.hs

First the necessary imports:

module Main (main) where

import qualified V0
import qualified V1
import qualified V2
import qualified V3

import Control.Applicative (liftA2)
import Data.List (mapAccumL)

Next we define a few values. We use pure and mapAccumL. Using actual constructors would be even less pretty, as we would have to use qualified names.

v0 :: V0.Vec Double
v0 = pure 0.0

v1 :: V1.Vec Double
v1 = pure 1.0

v2 :: V2.Vec Double
v2 = pure 2.0

-- 0, 1, 2
v3 :: V3.Vec Double
v3 = snd $ mapAccumL (\n _ -> (succ n, n)) 0 (pure ())

main :: IO ()
main = do
    print v0
    print v1
    print v2
    print $ liftA2 (*) v3 v3

Running this will print

Nil
1.0 ::: Nil
2.0 ::: 2.0 ::: Nil
0.0 ::: 1.0 ::: 4.0 ::: Nil

If we dump the Core of this module, we’ll see bindings like

-- RHS size: {terms: 4, types: 0, coercions: 0, joins: 0/0}
main11
main11 = $w$sshowSignedFloat $fShowDouble2 $fShowVec3 0.0##

-- RHS size: {terms: 4, types: 0, coercions: 0, joins: 0/0}
main10
main10 = $w$sshowSignedFloat $fShowDouble2 $fShowVec3 1.0##

-- RHS size: {terms: 4, types: 0, coercions: 0, joins: 0/0}
main9
main9 = $w$sshowSignedFloat $fShowDouble2 $fShowVec3 4.0##

A strong hint that computations are unrolled, and perfomed at compile-time.

Excursion: Prettier syntax

In example above we had to jump through a hoop to define an interesting (= non-uniform value) V3.Vec. It’s not nice.

One solution is to use pattern synonyms. Pattern synonyms cannot be defined in a type-class, but the helper functions could. Here’s a type-class:

class ConsPattern v t | v -> t, t -> v where
    match :: v a -> (a, t a)
    build :: a -> t a -> v a

Then we can add an instance of this class to each cons case. Note, we shouldn’t require the instance in the signature In (think why!)

instance ConsPattern Vec In.Vec where
    build = (:::)
    match (x ::: xs) = (x, xs)

Given ConsPattern type-class we can define an actual bidirectional pattern synonym.

pattern (:::) :: ConsPattern v t => a -> t a -> v a
pattern x ::: xs <- (match -> (x, xs))
  where x ::: xs = build x xs

With the pattern synonym we have recovered a nice syntax for constructing vectors:

v3a :: V3.Vec Double
v3a = 1 ::: 2 ::: 3 ::: Nil

But why stop here? Using source plugins, we can overload the list syntax! This is similar to what the OverloadedLists extension does but using a different desugaring. This approach is implemented in my overloaded library.

By adding few more instances

-- in V0.hs
instance Overloaded.Lists.Nil (Vec a) where
    nil = Nil

-- in Out.hs of cons
instance Overloaded.Lists.Cons a (In.Vec a) (Vec a) where
    cons = (:::)

and enabling the source plugin in Main module, we can write a vector as simply as

v3b :: V3.Vec Double
v3b = [5, 6, 7]

This small syntax digression is here to show that modern GHC has ways to improve the syntax of certain constructs considerably. Pattern synonyms are great and worth using where appropriate, but I’d not recommend using overloaded library outside of blog posts.

Memory representation

GHC starting with version 8.6 bundles a library called ghc-heap which allows to look into heap representation of the values. ghc-heap documentation for a version bundled with GHC-8.8.1 can be found next to GHC documentation. We are particularly interested in GHC.Exts.Heap module.

Using the module functionality we can write a small function to compute a size of fully evaluated values. From all of possible Closure constructors, we are only interested in ConstrClosure – data constructors. Data constructor closures can have data directly there (non-pointer arguments, dargs), or data referred to by pointers (pointer-arguments, pargs). Also the info table takes one Word.

calculateSize :: HasHeapRep a => a -> IO (Maybe Int)
calculateSize val = do
    c <- getClosureData val
    runMaybeT (go c)
  where
    go :: Closure -> MaybeT IO Int
    go (ConstrClosure _ pargs dargs _ _ _) = do
        ds <- traverse (liftIO . getBoxedClosureData >=> go) pargs
        return
            $ 1            -- info table
            + length dargs -- data words
            + length pargs -- pointers
            + sum ds       -- size of pointed data
    go x = do
        liftIO $ print x -- something unexpected
        MaybeT (return Nothing)

If we ask for sizes of v0, v1, v2 and v3 we get following results:

value	size in words
`v0 :: V0 Double`	2
`v1 :: V1 Double`	4
`v2 :: V2 Double`	7
`v3 :: V3 Double`	10

To understand these numbers, let’s draw a memory representation of V3.Vec:

V3.Vec a
+-------+-------+-------+-------+
|       |       |       |       |
|  hdr  |   *   |   *   |   *   |
|       |   |   |   |   |   |   |
+-------+---|---+---|---+---|---+
            |       |       |
            v       v       v

            a       a       a

V3.Vec Double has

1 word for info table
3 pointers (= words) to
3 boxed doubles, 2 words each: info table and an actual number.

In total 10 words. Similarly, V2.Vec Double takes 1 + 2 * 3 = 7 words, and V1.Vec Double 1 + 1 * 3 = 4 words. V0.Vec Double is 2 words, as all GHC heap values have to be at least 2 words in size (header + 1 word of payload). It’s not a problem here, as V0.Vec a should be a static value, i.e. pre-allocated and shared.

I wish there were a library which would draw that kind of diagrams, using ghc-heap for inspection.

We should also ask how many words the GADT or data family representations take. The answer is 17 for both. For each element, there are three words as with Backpack version, but also each cell has own info table and a pointer to the next one: it’s a linked list. Thus 5 words per element, and additional 2 words for the nil. The nil should be shared value, so effectively the size of the V3.Vec Double is “only” 15 words.

V3.Vec a                      V2.Vec a
+-------+-------+-------+    +-------+-------+-------+
|       |       |       |    |       |       |       |
|  hdr  |   *   |   *------> |  hdr  |   *   |   *------> V1.Vec a...
|       |   |   |       |    |       |   |   |       |
+-------+---|---+-------+    +-------+---|---+-------+
            |                            |
            v                            v

            a                            a

More memory, more pointers to chase, those must be slower. Next some benchmarks.

Benchmarks

We might have made strong claims that Backpack’d structure is efficient. Let’s compare it with other similar data-structures in a toy benchmark.

Given a hundred of vectors of size 3, sum them element-wise and multiply the elements at the end. For V3 defined above, that would be trivially:

benchV3 :: [V3.Vec Double] -> Double
benchV3 = foldl' (*) 1.0 . foldl' (liftA2 (+)) (pure 0.0)

where the right foldl' is over the list, and the left foldl' is over the Vec structure.

We can compare that with using bare lists, and maintaining the length variant. The code is still quite the same. We only change liftA2 to zipWith.

benchList :: [[Double]] -> Double
benchList = foldl' (*) 1.0 . foldl' (zipWith (+)) [0, 0, 0]

Another comparison is with Vec from the vec library. There are two variants, GADT- and data-family-based. We need to cheat a bit and use foldr while folding over vector itself. There is no foldl', as foldr should be as fast when the size is known.

benchGADT :: [GADT Double] -> Double
benchGADT = GADT.foldr (*) 1.0 . foldl' (GADT.zipWith (+)) (pure 0.0)

benchDF :: [DF Double] -> Double
benchDF = DF.foldr (*) 1.0 . foldl' (DF.zipWith (+)) (pure 0.0)

And the final contestant is manually written vector, the definition is straight-forward.

data Manual a = Manual !a !a !a
  deriving (Functor, Foldable, Traversable)

instance Applicative Manual where
    pure x = Manual x x x
    liftA2 f (Manual x1 x2 x3) (Manual y1 y2 y3) =
        Manual (f x1 y1) (f x2 y2) (f x3 y3)

benchManual :: [Manual Double] -> Double
benchManual = foldl' (*) 1.0 . foldl' (liftA2 (+)) (pure 0.0)

Next the results.

Results

The results of criterion benchmarks are below, we can observe the following:

Backpack run takes 500 nanoseconds, this is the baseline.
List version is about 10 times slower. The benchmark is chosen so GHC cannot fuse lists away, we see their costs.
GADT version is a little faster than List version, but still 8x slower than backpack version.
Data-family version is only 35% slower than Backpack, and 6 times faster than GADT version. I suspect that index-driven code-generation just works better for data-families. Note: that backpack and manual version have strict fields, where GADT and data-family version have lazy ones. Maybe the data-family could be made a little faster.
And finally, the manual version … exactly the same performance as with Backpack.

benchmarking Backpack
time                 501.2 ns   (500.9 ns .. 501.5 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 501.3 ns   (501.0 ns .. 501.9 ns)
std dev              1.257 ns   (675.9 ps .. 2.228 ns)

benchmarking List
time                 4.714 μs   (4.710 μs .. 4.719 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 4.724 μs   (4.716 μs .. 4.746 μs)
std dev              41.35 ns   (16.52 ns .. 80.19 ns)

benchmarking GADT
time                 4.023 μs   (4.021 μs .. 4.025 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 4.030 μs   (4.027 μs .. 4.037 μs)
std dev              15.62 ns   (6.974 ns .. 29.17 ns)

benchmarking DF
time                 677.9 ns   (677.3 ns .. 678.7 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 678.7 ns   (677.7 ns .. 680.5 ns)
std dev              4.253 ns   (2.624 ns .. 7.070 ns)

benchmarking Manual
time                 500.8 ns   (498.5 ns .. 505.0 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 500.7 ns   (498.9 ns .. 506.0 ns)
std dev              9.150 ns   (1.279 ns .. 18.06 ns)

Conclusion

We have now seen how one can use Backpack to implement Vec. The approach is different from GADTs and data family way, as the tail can be UNPACKed. One other variant, is to backpack the element, like Key is backpacked in unpacked-containers.

There are non-Backpack implementations of V0 … V4 in linear package. If you’d ever need V5, Backpack approach would give it with less copy & paste. Would you ever need V6?

One downside of Backpack approach is that we have to explicitly instantiate structures. With the GADT or data-family approach, whatever the run-time size argument is, the program will run (maybe very slowly, but still …). With the Backpack approach, there’s some maximum limit determined at compile time, with no way to surpass it at run-time.

The accompanying code for this post is on GitHub at phadej/vec-backpack.