Testing and verifying concurrent systems is hard due to their non-deterministic nature — verifying behavior that changes with each execution is difficult. Race conditions thrive in the non-deterministic world of thread scheduling. Even more challenging is verifying timeliness constraints, i.e. ensuring that operations complete within specified deadlines or that service guarantees are maintained under load. Traditional testing approaches struggle with concurrency, and mocking strategies often fail to capture the subtle interactions between threads, time, and shared state that cause real production failures.

The io-sim Haskell library, developed by Well-Typed in partnership with engineers from IOG and Quviq, offers a compelling solution to this problem. The library provides a pure simulation environment for IO computations, enabling deterministic execution of concurrent code with accurate time simulation and detailed execution traces. Unlike other testing approaches, with io-sim one is able to write highly concurrent, real-time systems and verify their timeliness constraints in a deterministic manner, by accurately simulating GHC’s runtime system (e.g. asynchronous exceptions, timeouts & delays, etc.).

This blog post introduces and explores io-sim through a practical example: debugging an elevator controller that violates its response time requirements.

There’s also this great blog post announcing io-sim and it goes a bit more into detail about its features!

The Problem

Consider a simple elevator located in a three-floor building (ground, first , second). It takes roughly 1 second for the elevator to go up and down between each floor. The service requirement is: no passenger should wait more than 4 seconds from pressing the call button until the elevator doors open at their floor. It should be possible to test and verify this requirement when writing our elevator controller.

This ensures a reasonable quality of service and prevents frustration. Given the short distance between floors, 4 seconds is sensible. In the worst case, the elevator must travel from ground to second floor and back again.

Here’s a first attempt at modelling the system. Let’s start with the core data structures:

data Direction = Up | Down | None
  deriving (Eq, Show)

data Floor = Ground | First | Second
  deriving (Eq, Ord, Show, Enum)

data ElevatorState = ElevatorState
  { currentFloor :: Floor
  , moving       :: Direction
  , requests     :: [Floor]
  } deriving (Show)

The elevator’s state tracks three things: where it currently is, which direction it’s moving (if any), and a queue of floor requests.

The system has two main components that run concurrently:

  • An elevator controller that continuously processes the request queue
  • Button press handler that adds new floor requests

Let’s look at the controller first:

-- | Initialize an empty elevator state.
--
-- The elevator starts on the ground floor
--
initElevator :: IO (TVar ElevatorState)
initElevator = newTVarIO $ ElevatorState Ground None []

-- | Elevator controller logic.
--
-- 1. Read the current 'ElevatorState'
-- 2. Check if there are any requested floors
-- 3.
--    3.1. Block waiting for new requests if there aren't any
--    3.2. If there any requests, move to the floor at the top of the queue.
--
-- Straightforward FIFO elevator algorithm.
--
elevatorController :: TVar ElevatorState -> IO ()
elevatorController elevatorVar = forever $ do
  -- Atomically get the next floor from the queue
  (nextFloor, dir) <- atomically $ do
    state <- readTVar elevatorVar
    case requests state of
      []               -> retry  -- Block until a request arrives
      (targetFloor:rs) -> do
        -- Remove the floor from queue and start moving
        let direction = getDirection (currentFloor state) targetFloor
        writeTVar elevatorVar $ state
          { moving = direction, requests = rs }
        return (targetFloor, direction)

  putStrLn ("Going " ++ show dir ++ " to " ++ show nextFloor)
  moveToFloor elevatorVar nextFloor

The moveToFloor function simulates the physical movement of the elevator.

moveToFloor :: TVar ElevatorState -> Floor -> IO ()
moveToFloor elevatorVar targetFloor = do
  elevatorState <- readTVarIO elevatorVar
  when (currentFloor elevatorState /= targetFloor) $ do
    -- Takes 1 second to move between floors
    threadDelay (1000000 * numberOfFloorsToMove)
    atomically $
      modifyTVar elevatorVar (\elevatorState' ->
        elevatorState' { currentFloor = targetFloor
                       , moving       = Idle
                       }
        )
  putStrLn ("Arrived at " ++ show targetFloor)

The buttonPress function handles both external calls (someone waiting for the elevator) and internal requests (someone inside selecting a destination):

-- | Whenever a button is pressed this function is called.
--
-- There are two scenarios when a button is pressed:
--
-- 1. When a person is calling the elevator to a floor in order to enter it.
-- 2. When a person is inside the elevator and wants to instruct the elevator
--    to go to a particular floor.
--
buttonPress :: TVar ElevatorState -> Floor -> IO ()
buttonPress elevatorVar floor = do
  putStrLn ("Pressing button to " ++ show floor)
  atomically $
    modifyTVar elevatorVar $ \state -> do
      case requests state of
        rs@(nextFloor:_)
          | let mostRecentRequestedFloor = last rs
          ,    nextFloor /= floor
            || mostRecentRequestedFloor /= floor ->
            state { requests = rs ++ [floor] }
          | otherwise -> state
        [] -> state { requests = [floor] }

Consider the following example scenario and timeline:

  1. The elevator starts on the ground floor.
  2. Person A is on the first floor and presses the button to call the elevator to the first floor.
  3. While the elevator is going up, Person B arrives on the ground floor calls it to the ground floor.
  4. Elevator arrives at the first floor.
  5. Person A enters and presses the button to go to the second floor.
  6. Elevator goes to the ground floor to pick up Person B.
  7. Person B enters and presses the button to go to the first floor.
  8. Elevator goes to the second floor.
  9. Elevator goes to the first floor.
-- | This example mimicks the scenario above, pressing buttons in the right
-- order.
elevatorExample :: [Floor] -> IO ()
elevatorExample floors = do
  elevator <- initElevator
  withAsync (elevatorController elevator)
    $ \controllerAsync -> do
        -- Simulate multiple people pressing buttons simultaneously
        forConcurrently_ floors (buttonPress elevator)
        threadDelay (10 * 1000000)
        cancel controllerAsync

elevatorExample [First, Ground, Second, First]

This function spawns the elevator controller and then simulates multiple button presses happening concurrently. Let’s trace through our example:

Pressing button to First
Going Up to First
Pressing button to Ground
Pressing button to Second
Pressing button to First
Arrived at First
Going Down to Ground
Arrived at Ground
Going Up to Second
Arrived at Second
Going Down to First
Arrived at First

Does such a simple implementation adhere to the specified time constraints? The answer is no, a FIFO elevator algorithm is easy to implement but can be inefficient if the requests are spread out across floors, leading to more travel time.

How would one go about to test/verify this? Testing timeliness constraints in concurrent IO is tricky, due to its non-deterministic nature and limited observability.

io-sim: Deterministic IO Simulator

io-sim closes the gap between the code that’s actually run in production and the code that runs in tests. Combined with property based testing techniques it is possible to simulate execution of a program for years worth of simulated time and find reproducible, rare edge-case bugs.

io-sim achieves this by taking advantage of the io-classes set of packages, which offers a class-based API compatible with most of the core Haskell packages, including mtl. In general the APIs follow the base or async

io-sim is a time based, discrete event simulator. Which means, it provides a granular execution trace that can be used from inspecting the commit order of STM transactions to validating a high level, temporal logic property over some abstract trace. The best part is that code requires minimal changes to use io-sim, just polymorphic type signatures that work with both IO and IOSim monads. Here’s the elevator controller code refactored for testing with io-sim:

initElevator :: MonadSTM m => m (TVar m ElevatorState)
initElevator = ...

elevatorController
  :: ( MonadSTM m
     , MonadDelay m
     , MonadSay m
     )
  => TVar m ElevatorState -> m ()
elevatorController elevatorVar =
  ...
  say ("Going " ++ show dir ++ " to " ++ show nextFloor)
  ...

moveToFloor
  :: ( MonadSTM m
     , MonadDelay m
     , MonadSay m
     )
  => TVar m ElevatorState -> Floor -> m ()
moveToFloor elevatorVar targetFloor = do
  ...
  say ("Arrived at " ++ show targetFloor)

getDirection :: Floor -> Floor -> Direction
getDirection from to = ...

buttonPress
  :: ( MonadSTM m
     , MonadSay m
     )
  => TVar m ElevatorState -> Floor -> m ()
buttonPress elevatorVar floor = do
  say ("Pressing button to " ++ show floor)
  ...

elevatorExample
  :: ( MonadSTM m
     , MonadAsync m
     , MonadDelay m
     , MonadSay m
     )
  => [Floor]
  -> m ()
elevatorExample floors = ...

Notice that only type signatures and IO operations needed changes. The core business logic remains identical. When instantiated to IO, say becomes putStrLn, but in the IOSim monad it produces traceable events.

main :: IO ()
main = do
  let simpleExample :: [Floor]
      simpleExample = [First, Ground, Second, First]

  -- Runs the 'elevatorExample' in IO. This outputs exactly the same output
  -- as before
  elevatorExample simpleExample

  -- Runs the 'elevatorExample' in IOSim.
  putStrLn . intercalate "\n"
           . map show
           . selectTraceEventsSayWithTime
           -- ^ Extracts only the 'say' events from the 'SimTrace' and
           -- attaches the timestamp for each event
           --
           -- selectTraceEventsSayWithTime :: SimTrace a -> [(Time, String)]
           --
           -- This function takes a 'SimTrace' and filters all 'EventSay'
           -- traces. It also captures the time of the trace event.
           $ runSimTrace (elevatorExample simpleExample)
           -- ^ Runs example in 'IOSim'
           --
           -- runSimTrace :: (forall s. IOSim s a) -> SimTrace a
           --
           -- This function runs a IOSim program, yielding an execution trace.

Running the program above, the first noticeable thing is that when the program runs in IO, it actually takes 10 real seconds to run due to the threadDelay calls. However, when the program runs in IOSim the output is instantaneous. This is because io-sim operates on simulated time rather than wall-clock time, i.e. only the internal clock advances when threads execute time-dependent operations like threadDelay or timeouts. Between these operations, the simulation executes as if it had infinite CPU speed, i.e. all computations at a given timestamp complete instantly, yet remain sequentially ordered and deterministic.

(Time 0s,"Pressing button to First")
(Time 0s,"Going Up to First")
(Time 0s,"Pressing button to Ground")
(Time 0s,"Pressing button to Second")
(Time 0s,"Pressing button to First")
(Time 1s,"Arrived at First")
(Time 1s,"Going Down to Ground")
(Time 2s,"Arrived at Ground")
(Time 2s,"Going Up to Second")
(Time 4s,"Arrived at Second")
(Time 4s,"Going Down to First")
(Time 5s,"Arrived at First")

This particular scenario doesn’t violate the constraint. To find violations, property-based testing can explore the space of possible request patterns. The only problem is that our say traces are strings which is not a very functional way of tracing things.

contra-tracer: Structured Tracing

While say provides basic, string-based tracing, real systems need structured tracing of domain-specific events. String-based logging quickly becomes inadequate when trying to verify complex properties or analyze system behavior programmatically. Tracing strongly-typed events that can be filtered, analyzed, and used in property tests is much better. The contra-tracer library provides a contravariant tracing abstraction that integrates seamlessly with io-sim.

The key advantages of structured tracing:

  • Type Safety: Events are strongly typed, preventing typos and logging errors.
  • Composability: Tracers can be filtered, transformed, and combined.
  • Testability: Events can be programmatically analyzed in tests.

All one needs to do is to have a custom trace type:

data ElevatorTrace = ButtonPress Floor
                   | Going Direction Floor
                   | ArrivedAt Floor
                   deriving (Eq, Show, Typeable)

And substitute all calls to say for traceWith tracer (ButtonPress floor), for example.

With structured tracing in place, extracting and analyzing traces becomes type-safe and straightforward:

-- | Extract typed elevator events with timestamps

extractElevatorEvents :: SimTrace a -> [(Time, ElevatorTrace)]
extractElevatorEvents =
  selectTraceEventsDynamicWithTime

Property-Based Testing: Verifying Timing Constraints

The elevator system began with a clear requirement: no passenger should wait more than 4 seconds. The FIFO implementation seemed reasonable, but the elevator can end up travelling between the bottom and top floors whilst someone in the middle waits their turn.

With typed traces from contra-tracer and deterministic simulation from io-sim, QuickCheck can systematically explore the space of possible request patterns and verify this property.

To verify our timing constraint, we need to:

  1. Generate random sequences of floor requests
  2. Run each sequence through the elevator simulation
  3. Check that every passenger gets service within 4 seconds

Let’s start with the test data generation:

-- | 'Floor' Arbitrary instance.
--
-- Randomly generate floors. The shrink instance is the most important here
-- since it will be responsible for generating a simpler counterexample.
--
instance Arbitrary Floor where
  arbitrary = elements [Ground, First, Second]
  shrink Second = [Ground, First]
  shrink First  = [Ground]
  shrink Ground   = []

The shrinking strategy is important because when QuickCheck finds a failing case with many floors, it will try simpler combinations to find the minimal reproduction of the original input.

To verify the property that no passenger waits more than 4 seconds for the elevator to arrive to its floor, one needs to track the button presses and measure how long until the elevator arrives.

The property works by maintaining a map of pending requests. Each ButtonPress adds an entry (keeping the earliest if multiple people request the same floor), and each ArrivedAt checks if that floor was requested and whether the wait exceeded 4 seconds:

-- Traverse the event trace and check if there is any gap longer than 4s
-- between requests and the elevator arriving at the request's floor.
--
violatesFourSecondRule :: [(Time, ElevatorTrace)] -> Property
violatesFourSecondRule events = counterexample (intercalate "\n" $ map show events)
                              $ checkViolations events Map.empty
  where
    checkViolations :: [(Time, ElevatorTrace)] -> Map Floor DiffTime -> Property
    -- Fail if there are pending requests
    checkViolations [] pending =
      counterexample ("Elevator never arrived at: " ++ show pending)
                     (Map.null pending)
    checkViolations ((Time t, event):rest) pending =
      case event of
        -- Add request to the pending requests map. Note that if there's
        -- already a request for a particular floor, overwriting the
        -- timestamp is not the right thing to do because there's an older
        -- request that shouldn't be forgotten.
        --
        ButtonPress floor ->
          checkViolations rest (Map.alter (maybe (Just t) Just) floor pending)

        -- The elevator arrived at a floor. Check if it took more than 4
        -- seconds to do so. If not continue and remove the request from
        -- the pending map.
        --
        ArrivedAt floor ->
          case Map.lookup floor pending of
            Nothing ->
              checkViolations rest pending
            Just requestTime
              | let time = t - requestTime
                counterexample (  "Passenger waited "
                               ++ show time
                               ++ " for the elevator to arrive to the "
                               ++ show floor
                               ++ " floor"
                               ) False
              | otherwise -> checkViolations rest (Map.delete floor pending)

        _ -> checkViolations rest pending

Then it is just a matter of running the example for randomly generated inputs, extract the trace and use QuickCheck to assert if the property is true or not.

prop_no_passenger_waits_4_seconds :: [Floor] -> Property
prop_no_passenger_waits_4_seconds floors =
  -- Run the button press sequence and get the execution trace
  --
  let trace = extractElevatorEvents
            $ runSimTrace
            $ elevatorExample (Tracer (emit traceM)) floors

   in violatesFourSecondRule trace

  where

Running this property, QuickCheck quickly finds a counterexample:

*** Failed! Falsified (after 8 tests and 2 shrinks):
[Second,Ground,First]
(Time 0s,ButtonPress Second)
(Time 0s,Going Up Second)
(Time 0s,ButtonPress Ground)
(Time 0s,ButtonPress First)
(Time 2s,ArrivedAt Second)
(Time 2s,Going Down Ground)
(Time 4s,ArrivedAt Ground)
(Time 4s,Going Up First)
(Time 5s,ArrivedAt First)
Passenger waited 5s for the elevator to arrive to the First floor

The counterexample is minimal thanks to QuickCheck’s shrinking. Here, one can imagine three passengers, pressing a button almost at the same time. Since the elevator starts on the Ground floor and it is the Second floor passenger that wins the race, the elevator starts going to the Second floor and queues the Ground and the First floor requests, by this order. It then takes 5 seconds in total for the elevator to arrive at the First floor, violating the timeliness requirement.

With property test in place, it is possible to iterate on better algorithms with confidence. prop_no_passenger_waits_4_seconds property will be able to assert if any of the improvements actually meet the timing requirements.

Using io-sim in the Real World

Real systems don’t explicitly block, they perform actual work that takes time. To make such code testable with io-sim, one can introduce a typeclass abstraction (e.g. MonadElevator m) with methods like moveElevator. In production, this would perform real hardware control; in tests, it would use threadDelay to simulate the operation’s duration.

In this elevator example, in a real system, there would be a sensor which would inform the controller at what time the elevator arrive at a specific floor, at which point the internal logic about the current floor of the elevator would be updated. With suitable abstraction, that implementation could replace our simplification using threadDelay.

io-sim can accurately simulate the standard IO operations, but this additional abstraction also introduces the challenge of verifying that the model accurately describes the real-world interactions. For example, 1 second is actually a very fast elevator, so our model and timeliness requirements may have to be modified slightly. That’s a topic left for another blog post!

The Haskell ecosystem offers several libraries to test concurrent systems, each one addresses different aspects of the problem. Here are two of the most popular and known ones:

Each takes a slightly different approach to exploring thread schedules, invariants, or state-space, and all have proven useful in practice.

dejafu explores all possible thread interleavings to find concurrency bugs. The library offers a similar typeclass abstraction to io-classes for concurrency primitives, allowing testing code that uses threads, MVars and STM.

quickcheck-state-machine tests stateful programs using state machine models with pre and post-conditions. The library can find race conditions through parallel testing. It excels at testing APIs with complex state dependencies, e.g. databases or file systems, but focuses on state correctness rather than temporal properties.

io-sim distinguishes itself by being the only time-based simulator. One can’t easily ask “what happens when this operation takes 150ms instead of 15ms?” with dejafu nor quickcheck-state-machine. io-sim enables testing of timeout logic, retry mechanisms, timeliness constraints, etc. The ability to compress years of simulated execution into seconds of test runtime makes io-sim particularly valuable for testing long-running systems where bugs emerge only after extended operation.

Conclusion

The key insight is that io-sim simulates the actual behavior of Haskell’s runtime. STM transactions, thread scheduling, and time passing behave exactly as in production, but deterministically.

For concurrent Haskell systems with timing requirements, e.g. network protocols, distributed systems, or real-time controllers, io-sim allows the verification of time-sensitive properties. The library offers much more than shown here, including thread scheduling exploration testing with partial order reduction.

The complete code examples are available here.