Consider this falsify
property test that tries to verify the (obviously false) property
that all elements of all lists of up to 10 binary digits are the same (we will explain the details below;
hopefully the intent is clear):
prop_list :: Property ()
prop_list = do
    n  <- gen $ Gen.integral $ Range.between (0, 10)
    xs <- gen $ replicateM n $ Gen.int $ Range.between (0, 1)
    assert $ P.pairwise P.eq .$ ("xs", xs)
we might get a counter-example such as this:
failed after 9 shrinks
(xs !! 0) /= (xs !! 1)
xs : [0,1]
xs !! 0: 0
xs !! 1: 1
More interesting than the counter-example itself is how falsify
arrived
at that counter-example; if we look at the shrink history (--falsify-verbose
),
we see that the list shrunk as follows:
[1,1,0,1,0,1]
~> [1,1,0] -- shrink the list length
~> [0,1,0] -- shrink an element of the list
~> [0,1] -- shrink the list length again
The test runner is able to go back and forth between shrinking the length of
the list and shrinking the elements in the list. That is, we have integrated
shrinking (like in
hedgehog
: we do not specify a separate generator and shrinker), which is
internal: it works across monadic bind. The Python
Hypothesis library showed the world how to achieve
this. In this blog post we will introduce falsify, a new
library that provides property based testing in Haskell and has an approach to
shrinking that is inspired by Hypothesis
. As we shall see, however, the details
are quite different.
Background
In this first section we will discuss some of the background behind falsify
;
the next section will be a more tutorial-style introduction on how to use
it. This section is not meant to be an exhaustive discussion of the theory behind
falsify
, or how the theory differs from that of Hypothesis
; both of those
topics will be covered in a paper, currently under review. However, a basic
understanding of these principles will help to use the library more effectively,
and so that will be our goal in this first section.
Unit testing versus property based testing
In unit testing (for example using tasty-hunit), a test
for a function f
might look something like this:
test :: Assertion
test =
    unless (f input == expected) $
      assertFailure "not equal"
That is, we apply f
to specific input
, and then verify that we get an
expected
result. By contrast, in property based testing, we do not specify a
specific input, but instead generate a random input using some generator
genInput
, and then verify that the input and the output are related by some
property prop
:
test_property :: Property ()
test_property = do
    input <- gen $ genInput
    unless (prop input (f input)) $
      testFailed "property not satisfied"
This blog post is not intended as an introduction to property-based testing; merely observe that generation of input values is a critical ingredient in property based testing. However, if you are not familiar with the topic, or not yet convinced that you should be using it, I can highly recommend watching Testing the Hard Stuff and Staying Sane, and then reading How to Specify It!: A Guide to Writing Properties of Pure Functions, both by the world’s foremost property-based testing guru John Hughes.
The importance of shrinking
Suppose we want to test the (false) property that for all numbers x
and y
,
x - y == y - x
:
prop_shrinking :: Property ()
prop_shrinking = do
    x <- gen $ Gen.int $ Range.between (0, 99)
    y <- gen $ Gen.int $ Range.between (0, 99)
    unless (x - y == y - x) $
      testFailed "property not satisfied"
Since the property is false, we will get a counter-example. Without shrinking, one such counter-example might be
x = 38
y = 23
However, although that is indeed a counter-example to the property, it’s not a great counter-example. Why these specific numbers? Is there something special about them? To quote John Hughes in Experiences with QuickCheck: Testing the Hard Stuff and Staying Sane:
Random tests contain a great deal of junk—that is their purpose! Junk provokes unexpected behaviour and tests scenarios that the developer would never think of. But tests usually fail because of just a few features of the test case. Debugging a test that is 90% irrelevant is a nightmare; presenting the developer with a test where every part is known to be relevant to the failure, simplifies the debugging task enormously.
(emphasis mine). For our example, the numbers 38 and 23 are not particularly relevant to the failure; with shrinking, however, the counter-example we will get is
x = 0
y = 1
Indeed, this is the only counter-example we will ever get: 0
is the smallest
number possible (“least amount of detail”), and the only thing that is relevant
about the second number is that it’s not equal to the first.
Parsing versus generation
Generation of inputs relies on pseudo-random number generators (PRNGs). The typical interface to a PRNG is something like this:
next :: PRNG -> (Word, PRNG)
Given such an interface, we might define the type of generators as
newtype Gen a = Gen (PRNG -> (a, PRNG))
This covers generation, but not shrinking. The traditional approach in QuickCheck to shrinking is to pair a generator with a shrinking function, a function of type
shrink :: a -> [a]
This works, but it’s not without its problems; see my blog post Integrated
versus Manual Shrinking for an in-depth discussion.
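For illustration, a manual shrinking function for lists might look like this; this is a simplified sketch in the spirit of QuickCheck's shrinkList, not its actual definition:

-- Simplified sketch of a QuickCheck-style manual shrinker for lists:
-- try dropping one element, or shrinking one element, at a time.
shrinkList :: (a -> [a]) -> [a] -> [[a]]
shrinkList shrinkElem xs = concat
    [ [ take i xs ++ drop (i + 1) xs          -- drop the element at position i
      | i <- [0 .. length xs - 1]
      ]
    , [ take i xs ++ [x'] ++ drop (i + 1) xs  -- shrink the element at position i
      | (i, x) <- zip [0 ..] xs
      , x' <- shrinkElem x
      ]
    ]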
The key insight of the Hypothesis
library is that instead of shrinking generated
values, we instead shrink the samples produced by the PRNG. Suppose we
unfold a PRNG to a stream of random samples:
unfoldLinear :: PRNG -> [Word]
unfoldLinear prng =
    let (s, prng') = next prng
    in s : unfoldLinear prng'
Then we can shift our perspective: rather than thinking of generating random values from a PRNG we instead parse this stream of random samples:
newtype Parser a = Parser ([Word] -> (a, [Word]))
Instead of having a separate shrinking function, we now simply shrink the list
of samples, and then re-run the parser. This is the Hypothesis
approach in a
nutshell; parsers of course need to ensure that the produced value shrinks as
the samples are shrunk. For example, here is a very simple (proof of concept)
generator for Bool
:
parseBool :: Parser Bool
parseBool = Parser $ \(s:ss) ->
    ( if s >= maxBound `div` 2 then True else False
    , ss
    )
Assuming that the sample is chosen uniformly in the full Word
range, this
parser will choose uniformly between True
and False
; and as the sample
is shrunk towards zero, the boolean will shrink towards False
.
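To make "shrink the samples, then re-run the parser" concrete, here is a minimal sketch of such a shrinking loop; all names here are hypothetical, and real implementations (including Hypothesis itself) are considerably more sophisticated:

-- Minimal sketch (hypothetical names): repeatedly try to shrink one
-- sample, keeping a candidate only if re-running the parser against
-- the shrunk samples still yields a counter-example.
shrinkSamples ::
     ([Word] -> a)  -- the parser (assumed total here, for simplicity)
  -> (a -> Bool)    -- does the parsed value still fail the property?
  -> [Word]         -- initial (failing) sample stream
  -> [Word]         -- locally minimal sample stream
shrinkSamples parse stillFails = go
  where
    go ss = case filter (stillFails . parse) (candidates ss) of
              ss' : _ -> go ss'  -- commit to the first shrink that still fails
              []      -> ss      -- no candidate works: we are done
    -- one very simple shrinking pass: try halving each sample in turn
    candidates ss =
        [ take i ss ++ [s `div` 2] ++ drop (i + 1) ss
        | (i, s) <- zip [0 ..] ss
        , s > 0
        ]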
Streams versus trees
If you look at the definition of Gen
in QuickCheck
,
you will see it’s actually different to the definition we showed above:
newtype Gen a = Gen (PRNG -> a)
Like our definition above, this generator takes a PRNG as input, but it does not return an updated PRNG. This might seem confusing: suppose we are generating two numbers, as in our example above; how do we ensure those two numbers are generated from different PRNGs?
To solve this problem, we will need a PRNG that in addition to next
, also
provides a way to split the PRNG into two new PRNGs:
next :: PRNG -> (Word, PRNG) -- as before
split :: PRNG -> (PRNG, PRNG) -- new
Then to run two generators, we first split the PRNG:
both :: QcGen a -> QcGen b -> QcGen (a, b)
both (QcGen g1) (QcGen g2) = QcGen $ \prng ->
    let (l, r) = split prng
    in (g1 l, g2 r)
The advantage of this approach is laziness: we can produce the second value of
type b
without generating the value of type a
first. Indeed, if we never
demand the value of a
, we will not generate it at all! This is of critical
importance if we have generators for infinite values; for example, it is what
enables us to Generate Functions.
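To illustrate, here is a simplified sketch of this style of generator, instantiated with the splittable StdGen from System.Random (rather than QuickCheck's own PRNG); because no PRNG state is threaded from one element to the next, forcing a prefix of the stream forces nothing else:

import System.Random (StdGen, mkStdGen, random, split)

-- Sketch (hypothetical names) of a split-based generator type
newtype QcGen a = QcGen (StdGen -> a)

runQcGen :: Int -> QcGen a -> a
runQcGen seed (QcGen g) = g (mkStdGen seed)

-- An infinite stream of booleans: each element gets its own PRNG,
-- obtained by splitting, so we only generate what we demand.
bools :: QcGen [Bool]
bools = QcGen go
  where
    go :: StdGen -> [Bool]
    go prng =
        let (l, r) = split prng
            (b, _) = random l :: (Bool, StdGen)
        in b : go r

Evaluating take 3 (runQcGen 42 bools) generates exactly three booleans, and nothing else.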
The falsify
definition of Gen
If we apply the insight from Hypothesis
(that is, parse samples rather than
generate using PRNGs) to this new setting where splitting PRNGs is a
fundamental operation, we arrive at the definition of Gen
in falsify
. First,
unfolding a PRNG does not give us an infinite stream of samples, but rather an
infinite tree of samples:
data STree = STree Word STree STree

unfold :: PRNG -> STree
unfold prng =
    let (s, _) = next prng
        (l, r) = split prng
    in STree s (unfold l) (unfold r)
A generator is then a function that takes a part of a sample tree, parses it, and produces a value and an updated sample tree:
newtype Gen a = Gen (STree -> (a, [STree]))
This does not reintroduce dependencies between generators: each generator will be run against a different subtree, and update only that subtree. For example, here is how we might run two generators:
both :: Gen a -> Gen b -> Gen (a, b)
both (Gen g1) (Gen g2) = Gen $ \(STree s l r) ->
    let (a, ls) = g1 l
        (b, rs) = g2 r
    in ( (a, b)
       ,    [STree s l' r | l' <- ls]
         ++ [STree s l r' | r' <- rs]
       )
Note that we are focussing on the core concepts here, and are glossing over
various details. In particular, the actual definition
in falsify
has an additional constructor Minimal
, which is a finite
representation of the infinite tree that is zero everywhere. This is a key
component in making this work with infinite data structures; see upcoming paper
for an in-depth discussion. Users of the library however generally do not need
to be aware of this (indeed, the sample tree abstraction is not part of the
public API).
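To give a rough idea, such a tree might look like the following sketch; this is based on the description above, not falsify's actual internal definition, and STree' and view are hypothetical names:

-- Sketch only: a sample tree with an extra constructor that finitely
-- represents the infinite all-zero tree.
data STree' =
    STree' Word STree' STree'
  | Minimal  -- the tree that is zero everywhere

-- 'Minimal' unfolds on demand, behaving like STree' 0 Minimal Minimal
view :: STree' -> (Word, STree', STree')
view (STree' s l r) = (s, l, r)
view Minimal        = (0, Minimal, Minimal)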
Consequences of using sample trees
Arguably all of the key differences between Hypothesis
and falsify
stem from
the difference in representation of samples: a linear stream in Hypothesis
and an infinite tree in falsify
. In this section we will discuss two
consequences of this choice.
Shrinking the sample tree
First, we need to decide how to shrink a sample tree. In Hypothesis
, the sample
stream (known as a “choice sequence”) is subjected to all kinds of passes (15 and
counting, according to Test-Case Reduction via Test-Case Generation: Insights
from the Hypothesis Reducer), which shrink the sample stream
according to a lexicographical ordering; for example:

.., x', ..       <  .., x, ..        -- shrink an element (x' < x)
.., x, z, ..     <  .., x, y, z, ..  -- drop an element from the stream
.., y, z, x, ..  <  .., x, y, z, ..  -- sort part of the stream (y < z < x)
When we are dealing with infinite sample trees, such a total ordering does not exist. For example, consider the following two trees:
tree1 = STree ..            tree2 = STree ..
          (STree 1 ..)                (STree 2 ..)
          (STree 4 ..)                (STree 3 ..)
Sample 1
in tree1
is less than the corresponding sample 2
in tree2
,
but sample 4
in tree1
is greater than the corresponding sample 3
in
tree2
. Hence, we have neither tree1 < tree2
nor tree2 < tree1
: these two
trees are incomparable. Instead, falsify
works with a partial ordering;
instead of the multitude of shrinking passes of Hypothesis
, falsify
has
precisely one pass¹: shrink an individual sample in the tree.
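As a sketch, using the simple STree defined earlier and a hypothetical per-sample shrink function, that single pass might look like this (for an infinite tree the candidate list is infinite, but it is consumed lazily):

-- Sketch: all trees obtained by shrinking exactly one sample somewhere.
shrinkSTree :: (Word -> [Word]) -> STree -> [STree]
shrinkSTree f (STree s l r) = concat
    [ [STree s' l r | s' <- f s]              -- shrink the sample at the root
    , [STree s l' r | l' <- shrinkSTree f l]  -- or one sample on the left
    , [STree s l r' | r' <- shrinkSTree f r]  -- or one sample on the right
    ]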
Distributing samples to parsers
When we have a stream of values that we need to use for multiple parsers, we
need to decide which samples go to which parser. In Hypothesis
, this
essentially happens on a first-come-first-served basis: any samples left unused
by the first parser will be used by the next. As discussed, falsify
parsers do
not return “samples left unused”. Instead, the sample tree is split each time we
compose parsers, like we did in both
, shown above. In practice, this happens
primarily when using applicative <*>
or monadic >>=
.
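For intuition, here is a sketch of monadic bind for the toy Gen defined earlier (not falsify's actual implementation): like both, the left generator reads only the left subtree and the continuation reads only the right subtree, so samples are never redistributed between them:

-- Sketch: bind splits the sample tree, just like 'both' above.
bindGen :: Gen a -> (a -> Gen b) -> Gen b
bindGen (Gen g) k = Gen $ \(STree s l r) ->
    let (a, ls) = g l
        (b, rs) = runGen (k a) r
    in ( b
       ,    [STree s l' r | l' <- ls]
         ++ [STree s l r' | r' <- rs]
       )
  where
    runGen (Gen h) = h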
Predictability
These two differences are rather technical in nature; how do they affect users? Suppose we have a generator that produces a list and then a number:
listThenNum :: Gen ([Bool], Int)
listThenNum = do
    xs <- Gen.list ..
    n  <- Gen.int ..
    return (xs, n)
If we are using a stream of samples, Hypothesis
style, and then drop a random
sample from that stream, the generator for int
might suddenly be run against
an entirely different sample; it might increase in value! Similarly, if we
run that int
generator against the first sample left over by the list
generator, and if that list generator uses fewer samples as it shrinks, we might
also run int
against an unrelated sample, and its value might again increase.
This is not necessarily a problem; after all, we can then start to decrease
that new int
value again. However, that is only possible if the generated
value with the larger int
is still a counter-example to whatever property is
being tested. If that is not the case, then we might not be able to shrink the
list, and we might end up with a non-minimal counter-example. That can make
debugging more difficult (we haven’t gotten rid of all the “junk”), and it can
be difficult for users to understand why this might not shrink any further;
even if the library offers facilities for showing why shrinking stopped (for
example, showing which shrunk examples were rejected; verbose mode in
falsify
), it can still be quite puzzling why the library is trying to
increase a value during shrinking.
Neither of these problems can arise in falsify
: it never drops samples at all
(instead, only shrinking individual samples), and since monadic bind splits the
sample tree, we are guaranteed that the behaviour of int
is entirely
unaffected by the behaviour of list
. This makes the shrinking behaviour in
falsify
more predictable and easier to understand.²
Monadic bind
We mentioned above that QuickCheck
’s approach to shrinking has its problems,
without going into detail about what those problems are. Instead, we referred to
the blogpost Integrated versus Manual Shrinking; this
blog post discusses not only the problems in QuickCheck
, but also shows one
alternative approach, known as integrated shrinking, used by QuviQ
QuickCheck and made popular in the Haskell
world by the library hedgehog.
The problem with integrated shrinking is that it does not work across monadic bind. The linked blogpost explains this in great detail, but the essence of the problem is not hard to see. Consider the type of monadic bind:
(>>=) :: Gen a -> (a -> Gen b) -> Gen b
We cannot shrink the right hand side of (>>=)
independently from the left hand
side, because the right hand side is not a generator. We only have a generator
once we apply the supplied function to the result of the first generator. This
means that we cannot shrink these two generators independently: if, after
shrinking the right hand side, we go back and then shrink the left hand side, we
get an entirely different generator, and the shrinking we did previously is just
wasted.
In practice what this means is that once we start shrinking the right hand side, we will never go back anymore and shrink the left hand side. In the example from the introduction we first generated a list length, and then the elements of the list:
prop_list :: Property ()
prop_list = do
    n  <- gen $ Gen.integral $ Range.between (0, 10)
    xs <- gen $ replicateM n $ Gen.int $ Range.between (0, 1)
    assert $ P.pairwise P.eq .$ ("xs", xs)
With integrated shrinking, once we start shrinking elements from the list, we will never go back anymore and shrink the list length. With internal shrinking, however, we can go back and forth across monadic bind. This is the raison d’être of internal shrinking: it doesn’t matter that we cannot shrink the two generators independently, because we are not shrinking generators! Instead, we just shrink the samples that feed into those generators.
Selective functors
It is important to understand the limitations of internal shrinking: it is
certainly not a silver bullet. For example, consider this combinator that takes
two generators, flips a coin (generates a boolean, shrinking towards True
),
and then executes one of the two generators:
choose :: Gen a -> Gen a -> Gen a  -- Suboptimal definition
choose g g' = do
    b <- Gen.bool True
    if b then g else g'
This combinator works, but it’s not optimal. Suppose the initial value of b
is
False
, and so we use g'
; and let’s suppose furthermore that we spend some
time shrinking the sample tree using g'
. Consider what happens if b
now
shrinks to True
. When this happens we will now run g
against the sample
tree as it was left after shrinking with g'
. Although we can do that, it
very much depends on the specific details of g
and g'
whether it’s useful
to do it, and we will certainly lose the predictability we discussed
above.
We could try to make the two generators shrink independently from each other by simply running both of them, and using the boolean only to choose which result we want. After all, Haskell is lazy, and so this should be fine:
choose :: Gen a -> Gen a -> Gen a  -- Bad definition!
choose g g' = do
    x <- g
    y <- g'
    b <- Gen.bool True
    return $ if b then x else y
While it is true that generation using this definition of choose
will work
just fine (and laziness ensures that we will in fact only run whichever generator
is used), this combinator shrinks very poorly. The problem is that if we
generate a value but then don’t use it, the part of the sample tree that we
used to produce that value is irrelevant, and so by definition we can always
replace it by the sample tree that is zero everywhere. This means that if we
later want to switch to that generator, we will only be able to do so if the
absolute minimum value that the generator can produce happens to work for
whatever property we’re testing. This is an important lesson to remember:
Do not generate values and then discard them: such values will always shrink to their minimum. (Instead, don’t generate the value at all.)
To solve this problem, we need to make it visible to the library when we need a generator and when we do not, so that it can avoid shrinking that part of the sample tree while the generator is not in use. Selectively omitting effects is precisely what selective applicative functors give us. A detailed discussion of this topic would take us well outside the scope of this blog post; in the remainder of this section we will discuss the basics only.
Gen
is a selective functor, which means that it is an instance of
Selective, which has a single method called select
:
select :: Gen (Either a b) -> Gen (a -> b) -> Gen b
The intuition is that we run the first generator; if that produces Left a
, we
run the second generator to get a b
; if the first generator produces Right b
, we skip the second generator completely. Like for applicative <*>
and
monadic (>>=)
, the two generators are run against different subtrees of the
sample tree, but the critical difference is that we will not try to shrink the
right subtree for the second generator unless that generator is used.
If that all sounds a bit abstract, perhaps it suffices to say that any selective functor supports
ifS :: Selective f => f Bool -> f a -> f a -> f a
which we can use to implement choose
in a way that avoids reusing the sample
tree of the first generator for the second:
choose :: Gen a -> Gen a -> Gen a
choose = ifS (bool True)
Indeed, this is precisely the definition in the falsify
library
itself.
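For completeness, ifS itself can be derived from select; the following is the standard derivation, as found in the selective package, reproduced here as a sketch:

import Control.Selective (Selective, select)
import Data.Bool (bool)

-- 'branch' chooses between two effects based on an Either...
branch :: Selective f => f (Either a b) -> f (a -> c) -> f (b -> c) -> f c
branch x l r = fmap (fmap Left) x `select` fmap (fmap Right) l `select` r

-- ...and 'ifS' encodes the Bool as an Either () ()
ifS :: Selective f => f Bool -> f a -> f a -> f a
ifS x t e = branch (bool (Right ()) (Left ()) <$> x)
                   (const <$> t)
                   (const <$> e)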
Tutorial
With the background out of the way, let’s now consider how to actually use the
library. Probably the easiest way to get started is to use the tasty
integration. Here is a minimal template to get you started:
module Main (main) where
import Test.Tasty
import Test.Tasty.Falsify
main :: IO ()
main = defaultMain $ testGroup "MyTestSuite" [
      testProperty "myFirstProperty" prop_myFirstProperty
    ]

prop_myFirstProperty :: Property ()
prop_myFirstProperty = return ()
This depends on the tasty package, as well as
falsify of course. If you want, you can also use the
Test.Falsify.Interactive module to experiment with
falsify
in ghci
.
Getting started
Suppose we want to test that if we multiply a number by two, the result must be even. Here’s how we could do it:
prop_multiply2_even :: Property ()
prop_multiply2_even = do
    x <- gen $ Gen.int $ Range.withOrigin (-100, 100) 0
    unless (even (x * 2)) $ testFailed "not even"
Some observations:
- Property is a monad, so the usual combinators (such as unless) for monads are available.
- gen runs a generator, and adds the output of the generator to the test log. (The test log is only shown when the property fails.)
- Gen.int is an alias for Gen.integral, which can produce values for any Integral type. There is no analogue of QuickCheck’s Arbitrary class in falsify: like in hedgehog and in Hypothesis, every generator must be explicitly specified. For a justification of this choice, see Jacob Stanley’s excellent Lambda Jam 2017 presentation Gens N’ Roses: Appetite for Reduction (Jacob is the author of hedgehog).
- The specified Range tells the generator two things: in which range to produce a value, and how to shrink that value. In our example, withOrigin takes an “origin” as explicit value (here, 0), and the generator will shrink towards that origin.
- testFailed is the primitive way to make a test fail, but we shall see a better way momentarily.
Predicates
Suppose we mistakenly think we need to multiply a number by three to get an even number:
prop_multiply3_even :: Property ()
prop_multiply3_even = do
    x <- gen $ Gen.int $ Range.withOrigin (-100, 100) 0
    unless (even (x * 3)) $ testFailed "not even"
If we run this test, we will get a counter-example:
multiply3_even: FAIL
failed after 14 shrinks
not even
Logs for failed test run:
generated 1 at CallStack (from HasCallStack):
gen, called at demo/Demo/Blogpost.hs:217:10 in main:Demo.Blogpost
This counter-example is not awful: it gives us the counter-example (1), and that
counter-example is minimal. We can however do much better; the idiomatic way in
falsify
to test properties of values is to use a
Predicate. A predicate of type
Predicate '[a, b, c, ..]
is essentially a function
a -> b -> c -> .. -> Bool
but in such a way that it can produce a meaningful message if the predicate is not satisfied. Here’s how we might use it for our example:
prop_multiply3_even_pred :: Property ()
prop_multiply3_even_pred = do
    x <- gen $ Gen.int $ Range.withOrigin (-100, 100) 0
    assert $ P.even `P.dot` P.fn ("multiply3", (* 3)) .$ ("x", x)
Some comments:
- P.even, like even from the prelude, is a predicate that checks that its argument is even.
- P.dot, like (.) from the prelude, composes a predicate with a function. In addition to the function itself, you also specify the name of the function, so that that name can be used in error messages.
- (.$), like ($) from the prelude, applies a predicate to a named argument.
The use of predicates is not required, but can be very helpful indeed. For our running example, this will produce this test failure message instead:
multiply3_even_pred: FAIL
failed after 2 successful tests and 13 shrinks
not (even (multiply3 x))
x : 1
multiply3 x: 3
Ranges, Labelling
We saw the use of withOrigin
already, and earlier in this blog post we
used between; a generator such as
Gen.integral $ Range.between (10, 100)
will produce a value between 10 and 100 (inclusive), shrinking towards 10; it is also possible to flip the two bounds to shrink towards 100 instead.
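For example (a small sketch; genTowards100 is a hypothetical name):

-- Sketch: same range of values, but shrinking towards 100 instead of 10.
genTowards100 :: Gen Word
genTowards100 = Gen.integral $ Range.between (100, 10)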
The other very useful Range
constructor is skewedBy.
A generator such as
Gen.integral $ Range.skewedBy 5 (0, 100)
will produce values between (0, 100), like between
does, but skewed towards
zero; a negative skew value will instead skew towards 100 (but still shrink
towards zero). As an example use case, suppose that for a certain property we
need a list of Int
and an Int
, and sometimes that separate Int
should be a
member of the list, sometimes not:
prop_skew :: Double -> Property ()
prop_skew skew = do
    xs <- gen $ Gen.list rangeListLen $ Gen.integral rangeValues
    x  <- gen $ Gen.integral rangeValues
    collect "elem" [x `elem` xs]
  where
    rangeListLen, rangeValues :: Range Word
    rangeListLen = Range.between (0, 10)
    rangeValues  = Range.skewedBy skew (0, 100)
This example is a property that always passes, but we use
collect to collect some statistics; specifically, in
what percentage of the tests x
is an element of xs
. If we run this with
a skew of 0, we might see something like:
100000 successful tests
Label "elem":
94.6320% False
5.3680% True
In only about 5% of cases does the element appear in the list. There are various ways in which we could change that distribution of test data, but the simplest is to generate more values towards the lower end of the range; if we run the test with a skew of 5 we get
Label "elem":
41.8710% False
58.1290% True
Generators
Nearly all generators are built using prim as their basic building block, which returns the next sample from the sample tree. Higher-level generators split into two categories: “simple” (non-compound) generators that produce a value given some arguments, and generator combinators that take generators as input and produce new generators. Some important examples in the first category are:
- integral, which we discussed already
- bool, which produces a Bool, shrinking towards a choice of True or False
- elem, which picks a random element from a list, and shuffle, which shuffles a list
- etc.
The library also offers a number of generator combinators; here we list the most important ones:
- choose we saw when we discussed selective functors; it chooses (uniformly) between two generators, shrinking towards the first.
- list takes a range for the list length and a generator, and produces a list of values. Unlike the simple “pick a length and then call replicateM” approach from the example in the introduction, this generator can drop elements anywhere in the list (it does this by using the combinator mark to mark elements in the list; as the marks shrink towards “drop”, the element is removed, up to the specified Range).
- frequency, similar to the like-named function in QuickCheck, takes a list of generators and frequencies, and chooses a generator according to the specified frequency; this is another way in which we can tweak the distribution of test data (see the usage sketch after this list). The implementation of frequency ensures that the generators can shrink independently from each other. This could be defined just using Selective, but for improved performance it makes use of a low-level combinator called perturb; see also bindIntegral, which generalizes Selective’s bindS, and has significantly better performance than bindS.
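As a usage sketch for frequency, assuming a QuickCheck-like shape [(Word, Gen a)] -> Gen a (check the falsify documentation for the exact signature):

-- Sketch: 75% of values drawn from a small range, 25% from a wide one.
genMostlySmall :: Gen Word
genMostlySmall = Gen.frequency
    [ (3, Gen.integral $ Range.between (0, 10))
    , (1, Gen.integral $ Range.between (0, 1_000))
    ]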
Generating functions
One of the most impressive aspects of QuickCheck
is that it can generate, show
and shrink functions. This is due to a functional pearl by Koen Claessen called
Shrinking and showing functions; the presentation is
available on YouTube and is well worth a watch. We
have adapted the QuickCheck
approach (and simplified it slightly) to falsify
;
the generator is called fun. Here is an example
(Fn is a pattern synonym; you will essentially always need it
when generating functions):
prop_fn1 :: Property ()
prop_fn1 = do
    Fn (f :: [Int] -> Bool) <- gen $ Gen.fun $ Gen.bool False
    assert $
         P.eq
      `P.on` P.fn ("f", f)
      .$ ("x", [1, 2, 3])
      .$ ("y", [4, 5, 6])
This property says that for any function f :: [Int] -> Bool
, if we apply that
function to the list [1, 2, 3]
we must get the same result as when we apply
it to the list [4, 5, 6]
. Of course, that is not true, and when we run this
test, falsify
will give us a counter-example:
failed after 53 shrinks
(f x) /= (f y)
x : [1,2,3]
y : [4,5,6]
f x: True
f y: False
Logs for failed test run:
generated {[1,2,3]->True, _->False} at CallStack (from HasCallStack):
gen, called at demo/Demo/Blogpost.hs:244:32 in main:Demo.Blogpost
Notice the counter-example we get: a function that returns True
for the list
[1, 2, 3]
, and False
everywhere else.³ It truly is quite astonishing that this
works: we can see that the list [1, 2, 3]
is special by inspecting the
source code, but of course falsify
(or indeed, QuickCheck
; this is not
unique to falsify
) cannot! Instead, falsify
will generate a random
infinitely large description of functions from [Int] -> Bool
, that covers
all possible input lists, and then start shrinking this description, throwing
away values for inputs it doesn’t need, until a minimal test case remains. Truly
a testament to the power of laziness in Haskell.
For a more realistic example, let’s port an example from Koen Claessen’s
presentation to falsify
. This example is testing the (wrong) property that for
all functions f
and predicates p
,
map f . filter p == filter p . map f
In falsify
, we might express this as:
prop_mapFilter :: Property ()
prop_mapFilter = do
    Fn (f :: Int -> Int)  <- gen $ Gen.fun genInt
    Fn (p :: Int -> Bool) <- gen $ Gen.fun genBool
    xs :: [Int] <- gen $ Gen.list (Range.between (0, 100)) genInt
    assert $
         P.eq
      `P.split` (P.fn ("map f", map f), P.fn ("filter p", filter p))
      `P.split` (P.fn ("filter p", filter p), P.fn ("map f", map f))
      .$ ("xs", xs)
      .$ ("xs", xs)
  where
    genInt :: Gen Int
    genInt = Gen.int $ Range.between (0, 100)

    genBool :: Gen Bool
    genBool = Gen.bool False
We generate a random function f
, a random predicate p
, a random list
xs
, and then assert the property; and of course, falsify
will happily give
us a counter-example:
failed after 25 shrinks
(map f (filter p xs)) /= (filter p (map f xs))
xs : [96]
xs : [96]
filter p xs : [96]
map f xs : [0]
map f (filter p xs): [0]
filter p (map f xs): []
Logs for failed test run:
generated {_->0} at CallStack (from HasCallStack):
gen, called at demo/Demo/Blogpost.hs:254:30 in main:Demo.Blogpost
generated {96->True, _->False} at CallStack (from HasCallStack):
gen, called at demo/Demo/Blogpost.hs:255:30 in main:Demo.Blogpost
generated [96] at CallStack (from HasCallStack):
gen, called at demo/Demo/Blogpost.hs:256:20 in main:Demo.Blogpost
It generated a function that maps anything to 0
, a predicate that is
True
for 96
and False
for everything else, and a list containing only the
value 96
; this is indeed a nice counter-example, as the output from the
assert
explains.
Side note: in an ideal world that value 96
would be shrunk too. However, this
would require shrinking both the 96
in the generated list and the 96
in
the generated predicate at the same time. Like QuickCheck
, falsify
never
takes more than one shrink step at once, to ensure that shrinking is O(n)
and
avoid exponential explosion. Section Dependencies between commands
of my blog post “An in-depth look at quickcheck-state-machine
” discusses this
kind of problem in the context of quickcheck-state-machine
.
Testing shrinking
When we use internal (or indeed, integrated) shrinking, we don’t write a
separate shrinking function, but that doesn’t mean we cannot get shrinking
wrong. Shrinking never truly comes for free! As a simple example, consider
writing a generator that produces any value below a given maximum
(essentially, a more limited form of integral
). A first attempt might be:
below :: Word64 -> Gen Word64
below n = (`mod` n) <$> Gen.prim
While this generator does in fact produce values in the range 0 <= x < n
, it
does not shrink very well! As the value produced prim
shrinks, the value
produced by below
will cycle. We can discover this by writing a property that
tests the shrinking behaviour of below
, using
testShrinkingOfGen:
prop_below_shrinking :: Property ()
prop_below_shrinking = do
    n <- gen $ Gen.integral $ Range.between (1, 1_000)
    testShrinkingOfGen P.ge $ below n
This property will fail:
failed after 4 successful tests and 14 shrinks
original < shrunk
original: 0
shrunk : 1
Logs for failed test run:
generated 2 at CallStack (from HasCallStack):
gen, called at demo/Demo/Blogpost.hs:281:10 in main:Demo.Blogpost
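One way to repair the generator, sketched here under the assumption that Gen.prim produces a uniformly distributed Word64, is to replace the mod by a monotone mapping, so that smaller samples always yield smaller results:

-- Sketch of a better-shrinking variant: map the full Word64 range onto
-- [0, n) monotonically (approximately, though no longer exactly, uniform).
below' :: Word64 -> Gen Word64
below' n = scale <$> Gen.prim
  where
    scale :: Word64 -> Word64
    scale s = min (n - 1) (s `div` (maxBound `div` n))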
In addition to testing individual shrinking steps, we can also test that for a particular property and generator, we can generate a particular minimum using testMinimum. Let’s consider the naive list generator from the introduction one more time:
naiveList :: Range Int -> Gen a -> Gen [a]
naiveList r g = do
    n <- Gen.integral r
    replicateM n g
Suppose we want to verify that if we use this generator to test the (false)
property that “all elements of all lists are always equal” we should always
get either [0, 1]
or [1, 0]
as a counter-example; after all, those are
the two minimal counter-examples. We could test this as follows:
prop_naiveList_minimum :: Property ()
prop_naiveList_minimum =
    testMinimum (P.elem .$ ("expected", [[0,1], [1,0]])) $ do
      xs <- gen $ naiveList
                    (Range.between (0, 10))
                    (Gen.int (Range.between (0, 1)))
      case P.eval $ P.pairwise P.eq .$ ("xs", xs) of
        Left _   -> testFailed xs
        Right () -> return ()
The counter-example reported by falsify
is (somewhat shortened):
naiveList_minimum: FAIL
failed after 0 shrinks
minimum `notElem` expected
minimum : [0,0,1]
expected: [[0,1],[1,0]]
Logs for failed test run:
generated [0,0,1] at CallStack (from HasCallStack):
gen, called at demo/Demo/Blogpost.hs:294:13 in main:Demo.Blogpost
Logs for rejected potential next shrinks:
** Rejected run 0
generated [] at CallStack (from HasCallStack):
** Rejected run 3
generated [0] at CallStack (from HasCallStack):
** Rejected run 4
generated [0,0] at CallStack (from HasCallStack):
** Rejected run 8
generated [0,0,0] at CallStack (from HasCallStack):
This is telling us that the minimum value it produced was [0, 0, 1]
, instead
of one of the two lists that we expected. It also tells us what shrink steps
were rejected (because they weren’t counter-examples). This is informative,
because none of those shrink steps is [0, 1]
: the naive list generator,
unlike the real one (which does pass this property) cannot drop elements from
the start of the list.
The falsify
test suite uses
testShrinkingOfGen (and its generalization
testShrinking) as well as
testMinimum extensively to test falsify
’s own
generators.
Compatibility
Finally, falsify
offers two combinators shrinkWith
and fromShrinkTree which provide compatibility
with QuickCheck
style shrinking
shrinkWith :: (a -> [a]) -> Gen a -> Gen a
and hedgehog
style shrinking
fromShrinkTree :: Tree a -> Gen a
respectively. The implementation of these combinators depends on a minor generalization of the sample tree representation; the details are discussed in the paper.
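As a hypothetical usage sketch for shrinkWith, attaching a QuickCheck-style shrinking function to a falsify generator (genSmallInt and qcShrink are illustrative names):

-- Sketch: reuse a manual, QuickCheck-style shrinker with falsify.
genSmallInt :: Gen Int
genSmallInt = shrinkWith qcShrink (Gen.int (Range.between (0, 1_000)))
  where
    -- a typical manual shrinking function: try halving and decrementing
    qcShrink :: Int -> [Int]
    qcShrink x = [x `div` 2 | x > 1] ++ [x - 1 | x > 0]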
Conclusions
Shrinking is an essential component of any approach to property based testing.
In the Haskell world, two libraries offered two competing approaches to
shrinking: manual shrinking offered by QuickCheck
, where users are entirely
responsible for writing shrinkers for their generators, and integrated
shrinking, offered by hedgehog
. Integrated shrinking is nice, but does not work
well with monadic bind. The Python Hypothesis
library taught us how we can have
“internal” shrinking: like in integrated shrinking, we do not write a separate
generator and shrinker, but unlike in integrated shrinking, this approach does
work across monadic bind.
The Haskell falsify
library takes the core ideas of Hypothesis
and applies
them in the context of Haskell. As we have seen, however, the actual details of
how these two libraries work differ quite significantly. The falsify
approach is
more suitable to Haskell where we might deal with infinite data structures,
provides the user with more predictable shrinking, and provides the user with
tools for controlling generator independence (through the use of selective
functors).
Footnotes
Admittedly, it depends a bit on how you count: we can also replace any tree in a single step by the tree that is zero everywhere, which could be considered a separate “pass”.↩︎
Hypothesis does try to avoid redistributing samples during shrinking as part of its “hierarchical delta debugging”, essentially recovering some kind of tree structure, but this is not under the control of the user, and does not provide any guarantees.↩︎
It can also produce
{[4,5,6]->True, _->False}
, depending on the seed.↩︎