A co-worker of mine had a random thought on IRC:
Is data type “unrolling” a valid performance optimization? Used anywhere?
Something like:
data List a = Nil | Cons a {-# Unroll 10 #-} (List a)
I thought, that sounds like the vec
package I wrote.
One module there uses the GADT definition
data Vec (n :: Nat) a where
VNil :: Vec 'Z a
(:::) :: a -> Vec n a -> Vec ('S n) a
and a clever way to define functions so GHC unrolls them. Another module uses the data family
data family Vec (n :: Nat) a
data instance Vec 'Z a = VNil
data instance Vec ('S n) a = a ::: !(Vec n a)
which in some cases optimizes better than the GADT version.
However in neither version we can {-# UNPACK #-}
the tail. GHC simply says
Ignoring unusable UNPACK pragma on the second argument of ‘:::’
It’s not “obvious” to the compiler that tail is a single constructor data type:
technically it’s not even a single constructor in the GADT; and it’s not clear in data family case either, because of the degenerate Vec ('S Any)
case.
But a single contructor data type is a requirement for UNPACK to work.
There’s however another trick we can use: Backpack.
Backpack is a system for retrofitting Haskell with an applicative, mix-in
module system. In short, we are able to write more efficient
monomorphic code.
For example, the description of unpacked-containers
says:
This backpack mixin package supplies unpacked sets and maps exploiting backpack’s ability to unpack through signatures.
Unpacking is exactly what we want to do.
Backpack
Backpack is an underdocumented feature of GHC and Cabal. There are Edward Yang’s blog posts, his PhD thesis (which I only scanned through), and in addition to what’s in the source code, that’s all. We won’t let that stop us from experimenting!
The idea of Backpack is that you can write modules which are parametrised by other modules.
Short glossary:
- Module is what you think it is.
- Signature is a “type signature of module”.
- Unit is a collection of modules and signatures. Think
library
. - Units can be open, contain “holes” (specified by signatures) to be filled
by other modules. See
OpenUnitId
inCabal
docs. - Mixin is a unit with “inputs” and “outputs” possibly renamed.
One advice I can give: give types and signatures different names (i.e. avoid signature Str where data Str
), until you are comfortable with Backpack concepts.
That’s similar to giving your types and constructors different names (i.e. avoid data Unit = Unit
).
Rather say:
Sig where
signature data S
module Impl where
data S = MkS
Let’s write the Backpack version of Vec
next …
Nil
List-like structure have two cases: nil and cons. We’ll start with a simple one: nil. We create an ordinary Haskell-module in an internal library (unit) of the package:
library v0
exposed-modules: V0
And the Haskell source is below:
module V0 where
data Vec a = Nil
deriving (Eq, Ord, Show, Functor, Foldable, Traversable)
instance Applicative Vec where
pure _ = Nil
<*> _ = Nil _
Nil
is essentially a new name for Proxy
, if you are familiar with that.
Cons
The cons-case is a recursive case.
The cons
unit requires a module, and exposes a module.
In other words it’s a mixin.
I suggestively named the signature of the required input module In
,
and exported output module Out
.
library cons
signatures: In
exposed-modules: Out
Signature In
is like a module without any implementations or definitions.
The module Out
is not different from an ordinary Haskell module. We
import In
and can use whatever is defined by its signature. To be used for
vectors, we are careful to implement Out
in such a way that it itself
satisfies the In
signature again, but that fact is not checked until
the later instantiation.
In.hsig
The In
signature specifies an API that we expect the tail of the vector,
and therefore the whole vector to satisfy.
In this example it as simple as specifying:
- the type of vectors,
Vec
, - a list of instances that
Vec
should satisfy.
In where
signature
data Vec a
instance Eq a => Eq (Vec a)
instance Ord a => Ord (Vec a)
instance Show a => Show (Vec a)
instance Functor Vec
instance Foldable Vec
instance Traversable Vec
instance Applicative Vec
Out.hs
The output of the mixin is implementation of cons on vectors.
Note: at this stage nothing checks that Out
implements In
signature.
The compatibility checks are done at instantiation phase.
This reminds me of C++ templates, which are properly checked only
when instantiated too.
Also note that we use UNPACK
on the tail field !(In.Vec a)
.
This actually works, as we’ll see later.
module Out where
import qualified In
data Vec a = !a ::: {-# UNPACK #-} !(In.Vec a)
deriving (Eq, Ord, Functor, Foldable, Traversable)
infixr 5 :::
instance Applicative Vec where
pure x = x ::: pure x
::: g <*> x ::: y = f x ::: (g <*> y) f
We also write a Show
instance manually for a bit prettier output:
instance Show a => Show (Vec a) where
showsPrec d (x ::: xs) = showParen (d > 5)
$ showsPrec 6 x
. showString " ::: "
. showsPrec 5 xs
Non-empty vectors: v1, v2, v3
Now it’s time to instantiate the cons
mixin.
We already have a v0
vector.
We can instantiate cons
with V0
to get V1
:
library v1
mixins: cons (Out as V1) requires (In as V0)
reexported-modules: V1
build-depends: cons, v0
The v1
library depends on v0
(one smaller vector), and cons
.
The mixins
field says:
- take
cons
- rename its exported module from
Out
toV1
- and instantiate the required signature
In
toV0
.
Then, the Cabal
Backpack magic will connect the dots, as v0
exposes V0
.
And finally, we make V1
available by re-exporting it, using
reexported-modules
.
Note: for larger vectors, we require no more Haskell code to be written; we are only connecting existing units.
While larger vectors follow the same pattern, we need to pre-instantiate them manually: they cannot be created at the run-time.
library v2
mixins: cons (Out as V2) requires (In as V1)
reexported-modules: V2
build-depends: cons, v1
library v3
mixins: cons (Out as V3) requires (In as V2)
reexported-modules: V3
build-depends: cons, v2
Example: Some calculations
Before an actual example, we can define a library which will re-export all the modules we defined:
library
build-depends: v0, v1, v2, v3
reexported-modules: V0, V1, V2, V3
Finally we can write a program that will use all these vectors:
executable main
main-is: Main.hs
build-depends:
, backpack-vector
, base
Main.hs
First the necessary imports:
module Main (main) where
import qualified V0
import qualified V1
import qualified V2
import qualified V3
import Control.Applicative (liftA2)
import Data.List (mapAccumL)
Next we define a few values. We use pure
and mapAccumL
.
Using actual constructors would be even less pretty, as we would
have to use qualified names.
v0 :: V0.Vec Double
= pure 0.0
v0
v1 :: V1.Vec Double
= pure 1.0
v1
v2 :: V2.Vec Double
= pure 2.0
v2
-- 0, 1, 2
v3 :: V3.Vec Double
= snd $ mapAccumL (\n _ -> (succ n, n)) 0 (pure ()) v3
main :: IO ()
= do
main print v0
print v1
print v2
print $ liftA2 (*) v3 v3
Running this will print
Nil
1.0 ::: Nil
2.0 ::: 2.0 ::: Nil
0.0 ::: 1.0 ::: 4.0 ::: Nil
If we dump the Core of this module, we’ll see bindings like
-- RHS size: {terms: 4, types: 0, coercions: 0, joins: 0/0}
main11
main11 = $w$sshowSignedFloat $fShowDouble2 $fShowVec3 0.0##
-- RHS size: {terms: 4, types: 0, coercions: 0, joins: 0/0}
main10
main10 = $w$sshowSignedFloat $fShowDouble2 $fShowVec3 1.0##
-- RHS size: {terms: 4, types: 0, coercions: 0, joins: 0/0}
main9
main9 = $w$sshowSignedFloat $fShowDouble2 $fShowVec3 4.0##
A strong hint that computations are unrolled, and perfomed at compile-time.
Excursion: Prettier syntax
In example above we had to jump through a hoop to define an interesting
(= non-uniform value) V3.Vec
. It’s not nice.
One solution is to use pattern synonyms. Pattern synonyms cannot be defined in a type-class, but the helper functions could. Here’s a type-class:
class ConsPattern v t | v -> t, t -> v where
match :: v a -> (a, t a)
build :: a -> t a -> v a
Then we can add an instance of this class to each cons
case.
Note, we shouldn’t require the instance in the signature In
(think why!)
instance ConsPattern Vec In.Vec where
= (:::)
build ::: xs) = (x, xs) match (x
Given ConsPattern
type-class we can define an actual bidirectional
pattern synonym.
pattern (:::) :: ConsPattern v t => a -> t a -> v a
pattern x ::: xs <- (match -> (x, xs))
where x ::: xs = build x xs
With the pattern synonym we have recovered a nice syntax for constructing vectors:
v3a :: V3.Vec Double
= 1 ::: 2 ::: 3 ::: Nil v3a
But why stop here? Using source plugins, we can overload the list syntax!
This is similar to what the OverloadedLists
extension does
but using a different desugaring. This approach is implemented in my overloaded
library.
By adding few more instances
-- in V0.hs
instance Overloaded.Lists.Nil (Vec a) where
= Nil
nil
-- in Out.hs of cons
instance Overloaded.Lists.Cons a (In.Vec a) (Vec a) where
= (:::) cons
and enabling the source plugin in Main
module, we can write a vector
as simply as
v3b :: V3.Vec Double
= [5, 6, 7] v3b
This small syntax digression is here to show that modern GHC has ways to improve the syntax of certain constructs considerably.
Pattern synonyms are great and worth using where appropriate, but I’d not recommend using overloaded
library outside of blog posts.
Memory representation
GHC starting with version 8.6 bundles a library called ghc-heap
which allows to look into heap representation of the values.
ghc-heap
documentation for a version bundled with GHC-8.8.1 can be found next to GHC documentation.
We are particularly interested in GHC.Exts.Heap
module.
Using the module functionality we can write a small function to
compute a size of fully evaluated values.
From all of possible Closure
constructors, we are only interested in
ConstrClosure
– data constructors.
Data constructor closures can have data directly there (non-pointer arguments, dargs
),
or data referred to by pointers (pointer-arguments, pargs
).
Also the info table takes one Word
.
calculateSize :: HasHeapRep a => a -> IO (Maybe Int)
= do
calculateSize val <- getClosureData val
c
runMaybeT (go c)where
go :: Closure -> MaybeT IO Int
ConstrClosure _ pargs dargs _ _ _) = do
go (<- traverse (liftIO . getBoxedClosureData >=> go) pargs
ds return
$ 1 -- info table
+ length dargs -- data words
+ length pargs -- pointers
+ sum ds -- size of pointed data
= do
go x $ print x -- something unexpected
liftIO MaybeT (return Nothing)
If we ask for sizes of v0
, v1
, v2
and v3
we get following results:
value | size in words |
---|---|
v0 :: V0 Double |
2 |
v1 :: V1 Double |
4 |
v2 :: V2 Double |
7 |
v3 :: V3 Double |
10 |
To understand these numbers, let’s draw a memory representation of V3.Vec
:
V3.Vec a
+-------+-------+-------+-------+
| | | | |
| hdr | * | * | * |
| | | | | | | |
+-------+---|---+---|---+---|---+
| | |
v v v
a a a
V3.Vec Double
has
- 1 word for info table
- 3 pointers (= words) to
- 3 boxed doubles, 2 words each: info table and an actual number.
In total 10 words. Similarly, V2.Vec Double
takes 1 + 2 * 3 = 7 words,
and V1.Vec Double
1 + 1 * 3 = 4 words.
V0.Vec Double
is 2 words, as all GHC heap values have to be at least 2 words in size (header + 1 word of payload).
It’s not a problem here, as V0.Vec a
should be a static value, i.e. pre-allocated and shared.
I wish there were a library which would draw that kind of diagrams,
using ghc-heap
for inspection.
We should also ask how many words the GADT or data family representations take.
The answer is 17 for both.
For each element, there are three words as with Backpack version,
but also each cell has own info table and a pointer to the next one: it’s a linked list.
Thus 5 words per element, and additional 2 words for the nil.
The nil should be shared value, so effectively the size of the V3.Vec Double
is “only” 15 words.
V3.Vec a V2.Vec a
+-------+-------+-------+ +-------+-------+-------+
| | | | | | | |
| hdr | * | *------> | hdr | * | *------> V1.Vec a...
| | | | | | | | | |
+-------+---|---+-------+ +-------+---|---+-------+
| |
v v
a a
More memory, more pointers to chase, those must be slower. Next some benchmarks.
Benchmarks
We might have made strong claims that Backpack’d structure is efficient. Let’s compare it with other similar data-structures in a toy benchmark.
Given a hundred of vectors of size 3, sum them element-wise and multiply
the elements at the end. For V3
defined above, that would be trivially:
benchV3 :: [V3.Vec Double] -> Double
= foldl' (*) 1.0 . foldl' (liftA2 (+)) (pure 0.0) benchV3
where the right foldl'
is over the list,
and the left foldl'
is over the Vec
structure.
We can compare that with using bare lists, and maintaining the length variant.
The code is still quite the same. We only change liftA2
to zipWith
.
benchList :: [[Double]] -> Double
= foldl' (*) 1.0 . foldl' (zipWith (+)) [0, 0, 0] benchList
Another comparison is with Vec
from the vec
library.
There are two variants, GADT- and data-family-based.
We need to cheat a bit and use foldr
while folding over vector itself.
There is no foldl'
, as foldr
should be as fast when the size is known.
benchGADT :: [GADT Double] -> Double
= GADT.foldr (*) 1.0 . foldl' (GADT.zipWith (+)) (pure 0.0)
benchGADT
benchDF :: [DF Double] -> Double
= DF.foldr (*) 1.0 . foldl' (DF.zipWith (+)) (pure 0.0) benchDF
And the final contestant is manually written vector, the definition is straight-forward.
data Manual a = Manual !a !a !a
deriving (Functor, Foldable, Traversable)
instance Applicative Manual where
pure x = Manual x x x
Manual x1 x2 x3) (Manual y1 y2 y3) =
liftA2 f (Manual (f x1 y1) (f x2 y2) (f x3 y3)
benchManual :: [Manual Double] -> Double
= foldl' (*) 1.0 . foldl' (liftA2 (+)) (pure 0.0) benchManual
Next the results.
Results
The results of criterion benchmarks are below, we can observe the following:
- Backpack run takes 500 nanoseconds, this is the baseline.
- List version is about 10 times slower. The benchmark is chosen so GHC cannot fuse lists away, we see their costs.
- GADT version is a little faster than List version, but still 8x slower than backpack version.
- Data-family version is only 35% slower than Backpack, and 6 times faster than GADT version. I suspect that index-driven code-generation just works better for data-families. Note: that backpack and manual version have strict fields, where GADT and data-family version have lazy ones. Maybe the data-family could be made a little faster.
- And finally, the manual version … exactly the same performance as with Backpack.
benchmarking Backpack
time 501.2 ns (500.9 ns .. 501.5 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 501.3 ns (501.0 ns .. 501.9 ns)
std dev 1.257 ns (675.9 ps .. 2.228 ns)
benchmarking List
time 4.714 μs (4.710 μs .. 4.719 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 4.724 μs (4.716 μs .. 4.746 μs)
std dev 41.35 ns (16.52 ns .. 80.19 ns)
benchmarking GADT
time 4.023 μs (4.021 μs .. 4.025 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 4.030 μs (4.027 μs .. 4.037 μs)
std dev 15.62 ns (6.974 ns .. 29.17 ns)
benchmarking DF
time 677.9 ns (677.3 ns .. 678.7 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 678.7 ns (677.7 ns .. 680.5 ns)
std dev 4.253 ns (2.624 ns .. 7.070 ns)
benchmarking Manual
time 500.8 ns (498.5 ns .. 505.0 ns)
1.000 R² (0.999 R² .. 1.000 R²)
mean 500.7 ns (498.9 ns .. 506.0 ns)
std dev 9.150 ns (1.279 ns .. 18.06 ns)
Conclusion
We have now seen how one can use Backpack to implement Vec
.
The approach is different from GADTs and data family way,
as the tail can be UNPACK
ed. One other variant,
is to backpack the element, like Key
is backpacked
in unpacked-containers
.
There are non-Backpack implementations of V0
… V4
in
linear
package.
If you’d ever need V5
, Backpack approach would give it with less
copy & paste. Would you ever need V6
?
One downside of Backpack approach is that we have to explicitly instantiate structures. With the GADT or data-family approach, whatever the run-time size argument is, the program will run (maybe very slowly, but still …). With the Backpack approach, there’s some maximum limit determined at compile time, with no way to surpass it at run-time.
The accompanying code for this post is on GitHub at phadej/vec-backpack.