When profiling Haskell programs, time spent in foreign functions (functions defined in C) does not show up on normal time profiles, which can be problematic when debugging or optimizing performance of code that makes heavy use of the foreign function interface (FFI). In this blog post we present a new compiler plugin called trace-foreign-calls, which makes this time visible and available for analysis.

The trace-foreign-calls plugin as well as a simple analysis tool ghc-events-util are both available on GitHub.

Overview

Consider a C function

long slow_add(long a, long b) {
  while(b--) {
    a++;
  }
  return a;
}

with corresponding Haskell import

foreign import capi unsafe "test_cbits.h slow_add"
  c_slowAddIO :: CLong -> CLong -> IO CLong

and an application that invokes

main :: IO ()
main = do
    print =<< slowAddIO a b
    print =<< slowAddIO b a
  where
    a = 1_000_000_000
    b = 2_000_000_000

When we compile the application with the trace-foreign-calls plugin enabled, run it, and then look at the generated .eventlog using ghc-events-util, we will see:

 607.16ms   607.16ms       1922573  cap 1  trace-foreign-calls: call c_slowAddIO (capi unsafe "test_cbits.h slow_add")
   0.26ms     0.00ms     609077635  cap 1  trace-foreign-calls: return c_slowAddIO

 302.02ms   302.02ms     609336093  cap 1  trace-foreign-calls: call c_slowAddIO (capi unsafe "test_cbits.h slow_add")
   0.01ms     0.00ms     911353269  cap 1  trace-foreign-calls: return c_slowAddIO

The important column here is the first column, which for each event reports the time from that event to the next; in this case, from the call to the foreign call to its return. Perhaps ghc-events-util could be given a mode that is designed specifically for trace-foreign-events to make this output a bit more readable, but for now this general purpose output suffices.1

If we additionally compile with profiling enabled, we get an additional event for each foreign call, recording the cost-centre callstack:

   0.00ms     0.00ms       4643217  cap 1  trace-foreign-calls: call c_slowAddIO (capi unsafe "test_cbits.h slow_add")
 856.37ms   856.37ms       4643327  cap 1  heap prof sample 0, residency 1, cost centre stack:
                                           slowAddIO in Example at Example.hs:29:1-78
                                           main in Main at test/Main.hs:(8,1)-(19,21)
                                           runMainIO1 in GHC.Internal.TopHandler at <no location info>
   0.65ms     0.00ms     861018010  cap 1  trace-foreign-calls: return c_slowAddIO

   0.00ms     0.00ms     861672464  cap 1  trace-foreign-calls: call c_slowAddIO (capi unsafe "test_cbits.h slow_add")
 426.79ms   426.79ms     861672834  cap 1  heap prof sample 0, residency 1, cost centre stack:
                                           slowAddIO in Example at Example.hs:29:1-78
                                           main in Main at test/Main.hs:(8,1)-(19,21)
                                           runMainIO1 in GHC.Internal.TopHandler at <no location info>
   0.06ms     0.00ms    1288461103  cap 1  trace-foreign-calls: return c_slowAddIO

Note that we are abusing the “heap profile sample” event to record the cost-centre callstack to the foreign function (see “Conclusions and future work”, below).2

Dependencies

Suppose in example-pkg-A we have

foreign import capi "cbits.h xkcdRandomNumber"
  someFunInA :: IO CInt

and we use this function in example-pkg-B

main :: IO ()
main = do
    randomNumber <- someFunInA
    let bs = compress (BS.Char8.pack $ show randomNumber)
    print $ BS.Word8.unpack bs

where compress is from zlib. Although we are running main from example-pkg-B, in order to get information about someFunInA we need to enable the plugin when compiling example-pkg-A; the README.md describes how to enable the plugin for all dependencies. Indeed, when we do this, we see calls to libz as well:

   0.00ms     0.00ms        414047  cap 0  trace-foreign-calls: call someFunInA (capi safe "cbits.h xkcdRandomNumber")
   0.00ms     0.00ms        414607  cap 0  trace-foreign-calls: return someFunInA

(..)

   0.00ms     0.00ms        493076  cap 0  trace-foreign-calls: call c_zlibVersion (capi unsafe "zlib.h zlibVersion")
   0.00ms     0.00ms        493866  cap 0  trace-foreign-calls: return c_zlibVersion

Indeed, if we are willing to do a custom build of ghc, we can even enable the plugin on the boot libraries, which (amongst other things) makes the final print also visible:

   0.00ms     0.00ms        609576  cap 0  trace-foreign-calls: call unsafe_fdReady (ccall unsafe "fdReady")
   0.00ms     0.00ms        611846  cap 0  trace-foreign-calls: return unsafe_fdReady
   0.01ms     0.00ms        612286  cap 0  trace-foreign-calls: call c_write (capi unsafe "HsBase.h write")
   0.23ms     0.00ms        618236  cap 0  trace-foreign-calls: return c_write

Conclusions and future work

The trace-foreign-calls compiler plugin can be used to generate eventlog events for foreign function invocations, so that the time spent in foreign functions becomes visible; the ghc-events-util tool can be used to inspect these eventlogs.

The plugin works by renaming each foreign function import of foo to foo_uninstrumented, and then introducing a new wrapper function foo which emits some custom events to the eventlog before and after calling foo_uninstrumented. Since we want the plugin to work even on the GHC boot libraries, the wrapper tries to use only functionality from GHC.Prim, which limits what we can do. One consequence is that because the plugin reuses “heap profile sample” events to record the cost centre stacks for foreign functions, it is not currently possible to record both regular heap profile samples (that is, run the code with +RTS -p) and enable the plugin at the same time.

A better solution would be to add support for profiling foreign functions to ghc itself. This would involve creating new eventlog event types, and then upgrading existing time profiling tools to interpret these new events. Until then, however, time profiling of foreign function invocations is now at least possible.


  1. The first three columns are the time from each event to the next visible event (some events might be filtered out), the time from each event to the next actual event, and the time of the event since the start of the program.↩︎

  2. The samplefield will always be 0; the residency field is used to record the capability. The latter allows us to correlate events of concurrent foreign function invocations; the --res-is-cap command line option to ghc-events-util makes it understand this convention.↩︎