Testing and verifying concurrent systems is hard due to their non-deterministic nature — verifying behavior that changes with each execution is difficult. Race conditions thrive in the non-deterministic world of thread scheduling. Even more challenging is verifying timeliness constraints, i.e. ensuring that operations complete within specified deadlines or that service guarantees are maintained under load. Traditional testing approaches struggle with concurrency, and mocking strategies often fail to capture the subtle interactions between threads, time, and shared state that cause real production failures.
The io-sim
Haskell library, developed
by Well-Typed in partnership with engineers from IOG and Quviq, offers a
compelling solution to this problem. The library provides a pure simulation environment
for IO computations, enabling deterministic execution of concurrent code with
accurate time simulation and detailed execution traces. Unlike other testing
approaches, with io-sim
one is able to write highly concurrent, real-time
systems and verify their timeliness constraints in a deterministic manner, by
accurately simulating GHC’s runtime system (e.g. asynchronous exceptions,
timeouts & delays, etc.).
This blog post introduces and explores io-sim
through a practical example:
debugging an elevator controller that violates its response time requirements.
There’s also this great blog post announcing
io-sim
and it goes a bit more into detail about its features!
The Problem
Consider a simple elevator located in a three-floor building (ground, first , second). It takes roughly 1 second for the elevator to go up and down between each floor. The service requirement is: no passenger should wait more than 4 seconds from pressing the call button until the elevator doors open at their floor. It should be possible to test and verify this requirement when writing our elevator controller.
This ensures a reasonable quality of service and prevents frustration. Given the short distance between floors, 4 seconds is sensible. In the worst case, the elevator must travel from ground to second floor and back again.
Here’s a first attempt at modelling the system. Let’s start with the core data structures:
data Direction = Up | Down | None
deriving (Eq, Show)
data Floor = Ground | First | Second
deriving (Eq, Ord, Show, Enum)
data ElevatorState = ElevatorState
currentFloor :: Floor
{ moving :: Direction
, requests :: [Floor]
,deriving (Show) }
The elevator’s state tracks three things: where it currently is, which direction it’s moving (if any), and a queue of floor requests.
The system has two main components that run concurrently:
- An elevator controller that continuously processes the request queue
- Button press handler that adds new floor requests
Let’s look at the controller first:
-- | Initialize an empty elevator state.
--
-- The elevator starts on the ground floor
--
initElevator :: IO (TVar ElevatorState)
= newTVarIO $ ElevatorState Ground None []
initElevator
-- | Elevator controller logic.
--
-- 1. Read the current 'ElevatorState'
-- 2. Check if there are any requested floors
-- 3.
-- 3.1. Block waiting for new requests if there aren't any
-- 3.2. If there any requests, move to the floor at the top of the queue.
--
-- Straightforward FIFO elevator algorithm.
--
elevatorController :: TVar ElevatorState -> IO ()
= forever $ do
elevatorController elevatorVar -- Atomically get the next floor from the queue
<- atomically $ do
(nextFloor, dir) <- readTVar elevatorVar
state case requests state of
-> retry -- Block until a request arrives
[] :rs) -> do
(targetFloor-- Remove the floor from queue and start moving
let direction = getDirection (currentFloor state) targetFloor
$ state
writeTVar elevatorVar = direction, requests = rs }
{ moving return (targetFloor, direction)
putStrLn ("Going " ++ show dir ++ " to " ++ show nextFloor)
moveToFloor elevatorVar nextFloor
The moveToFloor
function simulates the physical movement of the elevator.
moveToFloor :: TVar ElevatorState -> Floor -> IO ()
= do
moveToFloor elevatorVar targetFloor <- readTVarIO elevatorVar
elevatorState /= targetFloor) $ do
when (currentFloor elevatorState -- Takes 1 second to move between floors
1000000 * numberOfFloorsToMove)
threadDelay ($
atomically ->
modifyTVar elevatorVar (\elevatorState' = targetFloor
elevatorState' { currentFloor = Idle
, moving
}
)putStrLn ("Arrived at " ++ show targetFloor)
The buttonPress
function handles both external calls (someone waiting for
the elevator) and internal requests (someone inside selecting a destination):
-- | Whenever a button is pressed this function is called.
--
-- There are two scenarios when a button is pressed:
--
-- 1. When a person is calling the elevator to a floor in order to enter it.
-- 2. When a person is inside the elevator and wants to instruct the elevator
-- to go to a particular floor.
--
buttonPress :: TVar ElevatorState -> Floor -> IO ()
floor = do
buttonPress elevatorVar putStrLn ("Pressing button to " ++ show floor)
$
atomically $ \state -> do
modifyTVar elevatorVar case requests state of
@(nextFloor:_)
rs| let mostRecentRequestedFloor = last rs
/= floor
, nextFloor || mostRecentRequestedFloor /= floor ->
= rs ++ [floor] }
state { requests | otherwise -> state
-> state { requests = [floor] } []
Consider the following example scenario and timeline:
- The elevator starts on the ground floor.
- Person A is on the first floor and presses the button to call the elevator to the first floor.
- While the elevator is going up, Person B arrives on the ground floor calls it to the ground floor.
- Elevator arrives at the first floor.
- Person A enters and presses the button to go to the second floor.
- Elevator goes to the ground floor to pick up Person B.
- Person B enters and presses the button to go to the first floor.
- Elevator goes to the second floor.
- Elevator goes to the first floor.
-- | This example mimicks the scenario above, pressing buttons in the right
-- order.
elevatorExample :: [Floor] -> IO ()
= do
elevatorExample floors <- initElevator
elevator
withAsync (elevatorController elevator)$ \controllerAsync -> do
-- Simulate multiple people pressing buttons simultaneously
forConcurrently_ floors (buttonPress elevator)10 * 1000000)
threadDelay (
cancel controllerAsync
First, Ground, Second, First] elevatorExample [
This function spawns the elevator controller and then simulates multiple button presses happening concurrently. Let’s trace through our example:
Pressing button to First
Going Up to First
Pressing button to Ground
Pressing button to Second
Pressing button to First
Arrived at First
Going Down to Ground
Arrived at Ground
Going Up to Second
Arrived at Second
Going Down to First
Arrived at First
Does such a simple implementation adhere to the specified time constraints? The answer is no, a FIFO elevator algorithm is easy to implement but can be inefficient if the requests are spread out across floors, leading to more travel time.
How would one go about to test/verify this? Testing timeliness constraints in concurrent IO is tricky, due to its non-deterministic nature and limited observability.
io-sim
: Deterministic IO Simulator
io-sim
closes the gap between the code that’s actually run in
production and the code that runs in tests. Combined with property based
testing techniques it is possible to simulate execution of a program for years
worth of simulated time and find reproducible, rare edge-case bugs.
io-sim
achieves this by taking advantage of the
io-classes
set of packages,
which offers a class-based API compatible with most of the core Haskell
packages, including mtl
. In general the APIs follow the base
or async
io-sim
is a time based, discrete event simulator. Which means, it provides a
granular execution trace that can be used from inspecting the commit order of
STM transactions to validating a high level, temporal logic property over some
abstract trace. The best part is that code requires minimal changes to use
io-sim
, just polymorphic type signatures that work with both IO
and
IOSim
monads. Here’s the elevator controller code refactored for testing
with io-sim
:
initElevator :: MonadSTM m => m (TVar m ElevatorState)
= ...
initElevator
elevatorController :: ( MonadSTM m
MonadDelay m
, MonadSay m
,
)=> TVar m ElevatorState -> m ()
=
elevatorController elevatorVar ...
"Going " ++ show dir ++ " to " ++ show nextFloor)
say (...
moveToFloor :: ( MonadSTM m
MonadDelay m
, MonadSay m
,
)=> TVar m ElevatorState -> Floor -> m ()
= do
moveToFloor elevatorVar targetFloor ...
"Arrived at " ++ show targetFloor)
say (
getDirection :: Floor -> Floor -> Direction
= ...
getDirection from to
buttonPress :: ( MonadSTM m
MonadSay m
,
)=> TVar m ElevatorState -> Floor -> m ()
floor = do
buttonPress elevatorVar "Pressing button to " ++ show floor)
say (...
elevatorExample :: ( MonadSTM m
MonadAsync m
, MonadDelay m
, MonadSay m
,
)=> [Floor]
-> m ()
= ... elevatorExample floors
Notice that only type signatures and IO
operations needed changes. The core
business logic remains identical. When instantiated to IO, say
becomes
putStrLn
, but in the IOSim
monad it produces traceable events.
main :: IO ()
= do
main let simpleExample :: [Floor]
= [First, Ground, Second, First]
simpleExample
-- Runs the 'elevatorExample' in IO. This outputs exactly the same output
-- as before
elevatorExample simpleExample
-- Runs the 'elevatorExample' in IOSim.
putStrLn . intercalate "\n"
. map show
. selectTraceEventsSayWithTime
-- ^ Extracts only the 'say' events from the 'SimTrace' and
-- attaches the timestamp for each event
--
-- selectTraceEventsSayWithTime :: SimTrace a -> [(Time, String)]
--
-- This function takes a 'SimTrace' and filters all 'EventSay'
-- traces. It also captures the time of the trace event.
$ runSimTrace (elevatorExample simpleExample)
-- ^ Runs example in 'IOSim'
--
-- runSimTrace :: (forall s. IOSim s a) -> SimTrace a
--
-- This function runs a IOSim program, yielding an execution trace.
Running the program above, the first noticeable thing is that when the program
runs in IO
, it actually takes 10 real seconds to run due to the
threadDelay
calls. However, when the program runs in IOSim
the output is
instantaneous. This is because io-sim
operates on simulated time rather than
wall-clock time, i.e. only the internal clock advances when threads execute
time-dependent operations like threadDelay
or timeout
s. Between these
operations, the simulation executes as if it had infinite CPU speed, i.e. all
computations at a given timestamp complete instantly, yet remain sequentially
ordered and deterministic.
(Time 0s,"Pressing button to First")
(Time 0s,"Going Up to First")
(Time 0s,"Pressing button to Ground")
(Time 0s,"Pressing button to Second")
(Time 0s,"Pressing button to First")
(Time 1s,"Arrived at First")
(Time 1s,"Going Down to Ground")
(Time 2s,"Arrived at Ground")
(Time 2s,"Going Up to Second")
(Time 4s,"Arrived at Second")
(Time 4s,"Going Down to First")
(Time 5s,"Arrived at First")
This particular scenario doesn’t violate the constraint. To find violations,
property-based testing can explore the space of possible request patterns. The
only problem is that our say
traces are strings which is not a very
functional way of tracing things.
contra-tracer: Structured Tracing
While say
provides basic, string-based tracing, real systems need structured
tracing of domain-specific events. String-based logging quickly becomes
inadequate when trying to verify complex properties or analyze system behavior
programmatically. Tracing strongly-typed events that can be filtered,
analyzed, and used in property tests is much better. The
contra-tracer
library
provides a contravariant tracing abstraction that integrates seamlessly with
io-sim
.
The key advantages of structured tracing:
- Type Safety: Events are strongly typed, preventing typos and logging errors.
- Composability: Tracers can be filtered, transformed, and combined.
- Testability: Events can be programmatically analyzed in tests.
All one needs to do is to have a custom trace type:
data ElevatorTrace = ButtonPress Floor
| Going Direction Floor
| ArrivedAt Floor
deriving (Eq, Show, Typeable)
And substitute all calls to say
for traceWith tracer (ButtonPress floor)
,
for example.
With structured tracing in place, extracting and analyzing traces becomes type-safe and straightforward:
-- | Extract typed elevator events with timestamps
extractElevatorEvents :: SimTrace a -> [(Time, ElevatorTrace)]
=
extractElevatorEvents selectTraceEventsDynamicWithTime
Property-Based Testing: Verifying Timing Constraints
The elevator system began with a clear requirement: no passenger should wait more than 4 seconds. The FIFO implementation seemed reasonable, but the elevator can end up travelling between the bottom and top floors whilst someone in the middle waits their turn.
With typed traces from contra-tracer
and deterministic simulation from
io-sim
, QuickCheck
can systematically explore the space of possible request
patterns and verify this property.
To verify our timing constraint, we need to:
- Generate random sequences of floor requests
- Run each sequence through the elevator simulation
- Check that every passenger gets service within 4 seconds
Let’s start with the test data generation:
-- | 'Floor' Arbitrary instance.
--
-- Randomly generate floors. The shrink instance is the most important here
-- since it will be responsible for generating a simpler counterexample.
--
instance Arbitrary Floor where
= elements [Ground, First, Second]
arbitrary Second = [Ground, First]
shrink First = [Ground]
shrink Ground = [] shrink
The shrinking strategy is important because when QuickCheck
finds a failing
case with many floors, it will try simpler combinations to find the minimal
reproduction of the original input.
To verify the property that no passenger waits more than 4 seconds for the elevator to arrive to its floor, one needs to track the button presses and measure how long until the elevator arrives.
The property works by maintaining a map of pending requests. Each
ButtonPress
adds an entry (keeping the earliest if multiple people request
the same floor), and each ArrivedAt
checks if that floor was requested and
whether the wait exceeded 4 seconds:
-- Traverse the event trace and check if there is any gap longer than 4s
-- between requests and the elevator arriving at the request's floor.
--
violatesFourSecondRule :: [(Time, ElevatorTrace)] -> Property
= counterexample (intercalate "\n" $ map show events)
violatesFourSecondRule events $ checkViolations events Map.empty
where
checkViolations :: [(Time, ElevatorTrace)] -> Map Floor DiffTime -> Property
-- Fail if there are pending requests
=
checkViolations [] pending "Elevator never arrived at: " ++ show pending)
counterexample (
(Map.null pending)Time t, event):rest) pending =
checkViolations ((case event of
-- Add request to the pending requests map. Note that if there's
-- already a request for a particular floor, overwriting the
-- timestamp is not the right thing to do because there's an older
-- request that shouldn't be forgotten.
--
ButtonPress floor ->
maybe (Just t) Just) floor pending)
checkViolations rest (Map.alter (
-- The elevator arrived at a floor. Check if it took more than 4
-- seconds to do so. If not continue and remove the request from
-- the pending map.
--
ArrivedAt floor ->
case Map.lookup floor pending of
Nothing ->
checkViolations rest pendingJust requestTime
| let time = t - requestTime
"Passenger waited "
counterexample ( ++ show time
++ " for the elevator to arrive to the "
++ show floor
++ " floor"
False
) | otherwise -> checkViolations rest (Map.delete floor pending)
-> checkViolations rest pending _
Then it is just a matter of running the example for randomly generated inputs,
extract the trace and use QuickCheck
to assert if the property is true or not.
prop_no_passenger_waits_4_seconds :: [Floor] -> Property
=
prop_no_passenger_waits_4_seconds floors -- Run the button press sequence and get the execution trace
--
let trace = extractElevatorEvents
$ runSimTrace
$ elevatorExample (Tracer (emit traceM)) floors
in violatesFourSecondRule trace
where
Running this property, QuickCheck
quickly finds a counterexample:
*** Failed! Falsified (after 8 tests and 2 shrinks):
[Second,Ground,First]
(Time 0s,ButtonPress Second)
(Time 0s,Going Up Second)
(Time 0s,ButtonPress Ground)
(Time 0s,ButtonPress First)
(Time 2s,ArrivedAt Second)
(Time 2s,Going Down Ground)
(Time 4s,ArrivedAt Ground)
(Time 4s,Going Up First)
(Time 5s,ArrivedAt First)
Passenger waited 5s for the elevator to arrive to the First floor
The counterexample is minimal thanks to QuickCheck
’s shrinking. Here, one
can imagine three passengers, pressing a button almost at the same time. Since
the elevator starts on the Ground
floor and it is the Second
floor passenger
that wins the race, the elevator starts going to the Second
floor and queues
the Ground
and the First
floor requests, by this order. It then takes 5
seconds in total for the elevator to arrive at the First
floor, violating
the timeliness requirement.
With property test in place, it is possible to iterate on better algorithms with
confidence. prop_no_passenger_waits_4_seconds
property will be able to assert
if any of the improvements actually meet the timing requirements.
Using io-sim
in the Real World
Real systems don’t explicitly block, they perform actual work that takes time.
To make such code testable with io-sim
, one can introduce a typeclass abstraction
(e.g. MonadElevator m
) with methods like moveElevator
. In production,
this would perform real hardware control; in tests, it would use threadDelay
to
simulate the operation’s duration.
In this elevator example, in a real system, there would be a sensor which would
inform the controller at what time the elevator arrive at a specific floor, at
which point the internal logic about the current floor of the elevator would be
updated. With suitable abstraction, that implementation could replace our simplification
using threadDelay
.
io-sim
can accurately simulate the standard IO
operations, but this additional
abstraction also introduces the challenge of verifying that the model accurately
describes the real-world interactions. For example, 1 second is actually a very
fast elevator, so our model and timeliness requirements may have to be modified
slightly.
That’s a topic left for another blog post!
Related Tools and Libraries
The Haskell ecosystem offers several libraries to test concurrent systems, each one addresses different aspects of the problem. Here are two of the most popular and known ones:
Each takes a slightly different approach to exploring thread schedules, invariants, or state-space, and all have proven useful in practice.
dejafu
explores all possible thread interleavings to find concurrency bugs.
The library offers a similar typeclass abstraction to io-classes
for
concurrency primitives, allowing testing code that uses threads, MVars and
STM.
quickcheck-state-machine
tests stateful programs using state machine models
with pre and post-conditions. The library can find race conditions through
parallel testing. It excels at testing APIs with complex state dependencies,
e.g. databases or file systems, but focuses on state correctness rather than
temporal properties.
io-sim
distinguishes itself by being the only time-based simulator. One
can’t easily ask “what happens when this operation takes 150ms instead of
15ms?” with dejafu
nor quickcheck-state-machine
. io-sim
enables testing
of timeout logic, retry mechanisms, timeliness constraints, etc. The ability
to compress years of simulated execution into seconds of test runtime makes
io-sim
particularly valuable for testing long-running systems where bugs
emerge only after extended operation.
Conclusion
The key insight is that io-sim
simulates the actual behavior of Haskell’s
runtime. STM transactions, thread scheduling, and time passing behave exactly
as in production, but deterministically.
For concurrent Haskell systems with timing requirements, e.g. network
protocols, distributed systems, or real-time controllers, io-sim
allows the
verification of time-sensitive properties. The library offers
much more than shown here, including thread scheduling exploration testing with
partial order reduction.
The complete code examples are available here.