This week I gave a talk at the Tech Mesh conference:

Multi-core with less Pain: Deterministic Parallel Programming with Haskell.

Abstract:

You want to make your code run faster, so you think about changing it to run in parallel to use multiple cores. But most approaches to parallelism also force you to rewrite your program using explicit concurrency, which means you will always have to worry about race conditions, deadlocks, and other concurrency bugs. Deterministic parallelism frees you from concurrency bugs and gives a strong guarantee that your program remains deterministic: it will give the same result regardless of the number of cores or the scheduling used.

Haskell offers a variety of libraries for deterministic parallelism that enable concise high-level parallel programs. In this talk we will look at the idea of parallelism without concurrency and give an overview of the paradigms that Haskell offers in this area.

The main point I was making is that we can achieve parallel speedups without having to write explicitly concurrent programs ourselves. For a mainstream audience this is often a rather surprising and novel idea.
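To give a flavour of what that means in practice, here is a minimal sketch (not code from the talk or the article) using `parMap` from the `parallel` package. The function `expensive` is a hypothetical stand-in for real per-element work; the point is that the program is an ordinary pure map-and-sum with parallelism layered on as an annotation, so it gives the same answer on one core or many.

```haskell
import Control.Parallel.Strategies (parMap, rdeepseq)

-- Hypothetical stand-in for some real per-element work.
expensive :: Int -> Int
expensive n = length (show (product [1 .. toInteger n + 1000]))

-- 'parMap rdeepseq' evaluates the list elements in parallel,
-- but the program's meaning is identical to a plain 'map':
-- no locks, no threads to manage, no race conditions possible.
main :: IO ()
main = print (sum (parMap rdeepseq expensive [1 .. 100]))
```

Build with `ghc -O2 -threaded` and run with `+RTS -N` to use multiple cores; the result is the same whatever number of cores you give it.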

The talk was based on an article that Andres and I wrote for the Computing in Science & Engineering magazine (CiSE). As you can guess from the name, CiSE is aimed at an audience of scientists and engineers. CiSE ran a special issue on parallelism and concurrency in modern programming languages, featuring articles on the approaches taken in different languages (Clojure, Erlang, and Haskell). The guest editors set the authors the common problem of parallelising a simple numerical solver. You can read the guest editors' introduction to the special issue.

For those of you not subscribed to CiSE magazine, we are making our article available.

Funnily enough, I'm still waiting for my copy of CiSE to arrive, so I've not yet read the articles on Clojure and Erlang. I'll be interested to see how the other authors solved the same problem. The tricky aspect of the problem is that it has very fine-grained parallelism.

When we looked at the problem, we decided it was a good fit for data parallelism, and we wrote a solution using Repa. Repa handles the granularity issue for us automatically by chunking up the array across cores. If we'd used a more manual technique, we would have had to do quite a bit of chunking ourselves to get the granularity right, adding extra code that could have obscured the algorithm. As it was, the core of our optimised Repa version was only 5 lines long and pretty readable. (The code snippets are in the article and the full code is available on the CiSE website.)
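Our solver code isn't reproduced here, but the following sketch shows the general shape of a data-parallel computation in Repa. The per-element function `f` is a placeholder, not the actual solver step from the article; what matters is that `computeP` evaluates the delayed array in parallel and does the chunking across cores for us.

```haskell
import Data.Array.Repa as R

-- A placeholder element-wise step (not the solver from the article).
-- 'R.map' builds a delayed array; 'computeP' forces it in parallel,
-- automatically chunking the work across the available cores.
step :: Array U DIM1 Double -> IO (Array U DIM1 Double)
step arr = computeP (R.map f arr)
  where
    f x = x * x + 1  -- stand-in for the real per-element work

main :: IO ()
main = do
  let xs = fromListUnboxed (Z :. (1000000 :: Int)) [0 .. 999999 :: Double]
  ys <- step xs
  print (ys ! (Z :. 0))
```

Because the chunking lives inside `computeP` rather than in our code, the algorithm itself stays uncluttered, which is exactly why the core of our real solution stayed so short.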

We got performance results that are fairly competitive with C (using gcc and OpenMP; see the article for the details). The competitive side of me is also interested to find out how we compare on performance with the other solutions. After all, parallelism is about performance.