TL;DR: We add variables, let bindings, and explicit recursion via fixed
points to classic regular expressions. It turns out that the resulting
explicitly recursive, finitely described languages are well suited for
analysis and introspection.
It’s been almost a year since I touched
the kleene library,
and almost two years since I published it –
a good time to write a little about regular expressions.
I like regular expressions very much. They are a truly declarative way to
write down a grammar… as long as the grammar is expressible
in regular expressions.
Matthew Might, David Darais and Daniel Spiewak have written
a paper, Parsing with Derivatives: A Functional Pearl, published in the ICFP ’11 proceedings
[Might2011],
in which regular expressions are extended to handle context-free languages.
However, they rely on memoization, and – as the structures involved are infinite – also
on reference equality. In short, their approach is not implementable in idiomatic
Haskell.¹
There’s another technique that works for a subset of context-free languages.
In my opinion, it is very elegant, and it is at least not painfully slow.
The result is available on Hackage:
the rere library.
The idea is to treat regular expressions as a proper programming language,
and add constructions which proper languages should have:
variables and recursion.
This blog post will describe the approach taken by rere in more detail.
Regular expression recap
The abstract syntax of a regular expression (over the alphabet of unicode characters)
is given by the following “constructors”:
Null regexp: ∅
Empty string: ε
Characters: a, b, c, etc.
Concatenation: r · s
Alternation: r ∪ s
Kleene star: r*
The above can be translated directly into Haskell:
data RE
    = Empty
    | Eps
    | Ch Char
    | App RE RE
    | Alt RE RE
    | Star RE
In the rere implementation, instead of bare Char we use a set of characters, CharSet,
as recommended by Owens et al. in
Regular-expression derivatives re-examined [Owens2009].
This makes the implementation more efficient, as a common case of character
sets is explicitly taken into account.
We write them in curly braces: {abc}.
We can give declarative semantics to these constructors.
These will look like typing rules. A judgment u ∈ r
denotes that the regular expression r successfully
recognises the string u.
For example, the rule for application looks like:

    u ∈ r    v ∈ s
    --------------
     uv ∈ r · s

This rule states that if r recognises u,
and s recognises v, then the concatenation
expression r · s recognises the
concatenated string uv.
For alternation we have two rules, one for each of the alternatives:

      u ∈ r            u ∈ s
    ---------        ---------
    u ∈ r ∪ s        u ∈ r ∪ s
The rules resemble the structure of non-commutative intuitionistic linear logic,
if you are into such stuff.
Not only do you have to use everything exactly once; you have to use
everything in order:
there aren’t any substructural rules – no weakening, no contraction,
and even no exchange. I will omit the rest of the rules; you can look them up
(and think about how the rules for the Kleene star would resemble those for the ‘why not’ exponential ?r).
It’s a good idea to define smart versions of the constructors,
which simplify regular expressions as they are created.
For example, in the following Semigroup instance for concatenation,
<> is a smart version of App:
instance Semigroup RE where
    -- Empty annihilates
    Empty <> _ = Empty
    _ <> Empty = Empty
    -- Eps is the unit of <>
    Eps <> r = r
    r <> Eps = r
    -- otherwise use App
    r <> s = App r s
The smart version of Alt is called \/, and the smart version of
Star is called star.
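For illustration, a minimal version of \/ might look as follows (a sketch of mine; the actual rere implementation may normalise more aggressively, and it assumes an Eq instance on RE):

infixr 5 \/    -- looser than <>, so that r \/ s <> t parses as r \/ (s <> t)

-- A minimal smart alternation: Empty is the unit, and we exploit idempotence.
(\/) :: RE -> RE -> RE
Empty \/ s = s
r \/ Empty = r
r \/ s
    | r == s    = r
    | otherwise = Alt r s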
We can check that the simplifications performed by the smart
constructors are sound, by using the semantic rules.
For example, the simplification Eps <> r = r is justified by
the following equivalence of derivation trees:
If a string u is matched by ε · r,
then “the match” can be constructed in only one way: by applying the
concatenation rule, with ε matching the empty prefix. Therefore u is also matched by
bare r. If we introduced proof terms, we’d have
concrete evidence of the match as terms in this language.
There is, however, a problem: matching using declarative rules is not practical.
At several points in these rules, we have to guess.
We have to guess whether we should pick left or right branch, or
where we should split string to match concatenated regular expression.
For a practical implementation, we need
a syntax-directed approach. Interestingly, we then need just
two rules:
In the above rules, we use two operations:
The decision procedure “nullable” that tells whether a
regular expression can recognise the empty string,
and a mapping that, given a single
character c and a regular expression r,
computes a new regular expression, called the derivative of r
with respect to c.
Both operations are quite easy to map to Haskell. The function
nullable is defined as a straight-forward recursive function:
nullable :: RE -> Bool
nullable Empty     = False
nullable Eps       = True
nullable (Ch _)    = False
nullable (App r s) = nullable r && nullable s
nullable (Alt r s) = nullable r || nullable s
nullable (Star _)  = True
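For instance (doctest-style checks, assuming the definitions above):

-- >>> nullable (Star (Ch 'a'))
-- True
-- >>> nullable (App (Ch 'a') (Star (Ch 'a')))
-- False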
The Brzozowski derivative is best understood by considering
the formal language L that a regular expression represents:

    derivative c L = { u | c·u ∈ L }
In Haskell terms: derivative c r matches string str if and only if r matches c : str.
From this equivalence, we can more or less directly infer an implementation:
derivative :: Char -> RE -> RE
derivative _ Empty = Empty
derivative _ Eps   = Empty
derivative c (Ch x)
    | c == x    = Eps
    | otherwise = Empty
derivative c (App r s)
    | nullable r = derivative c s \/ derivative c r <> s
    | otherwise  = derivative c r <> s
derivative c (Alt r s)   = derivative c r \/ derivative c s
derivative c r0@(Star r) = derivative c r <> r0
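With these two functions, matching is just iterated differentiation. A minimal driver could look like this (my sketch; the library’s actual entry point may differ):

import Data.List (foldl')

-- Differentiate by each character in turn; accept if the final
-- expression is nullable.
match :: RE -> String -> Bool
match r str = nullable (foldl' (flip derivative) r str)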
We could try to show that the declarative and syntax directed
systems are equivalent, but I omit it here, because it’s been
done often enough in the literature
(though probably not in exactly this way and notation).
We can now watch how a regular expression “evolves” while matching
a string. For example, if we take the regular expression (ab)*,
which in code looks like
ex1 :: RE
ex1 = star (Ch 'a' <> Ch 'b')
then the following is how match ex1 "abab" proceeds:

    (ab)*  --a-->  b(ab)*  --b-->  (ab)*  --a-->  b(ab)*  --b-->  (ab)*

We can see that there’s implicitly a small finite state automaton,
with two states:
an initial state (ab)*
and a secondary state b(ab)*.
This is the approach taken by the kleene package to transform regular expressions
into finite state machines.
There is an additional character set optimization from
Regular-expression derivatives re-examined [Owens2009]
by Owens, Reppy and Turon, but in essence, the approach works as follows:
try all possible derivatives, and in the process collect all the states and construct a transition function.²
The string is accepted, as the matching process stops at the state (ab)*, which is nullable.
Variables and let-expressions
The first new construct we now add to regular-expressions are let expressions.
They alone do not add any matching power, but they are a prerequisite
for allowing recursive expressions.
We already used meta-variables in the rules in the previous section.
Let expressions allow us to internalise this notion.
The declarative rule for let expressions is:

       u ∈ s[x := r]
    --------------------
    u ∈ let x = r in s

Here, the notation s[x := r]
denotes substituting the variable x by the regular
expression r in the regular expression s.
To have let expressions in our implementation, we need to
represent variables and we must be able to perform substitution.
My tool of choice for handling variables and substitution in general
is the bound library.
But for the sake of keeping the blog post self-contained, we’ll
define the needed bits inline. We’ll reproduce a simple variant of
bound, which amounts to using de Bruijn indices and polymorphic
recursion.
We define our own datatype to represent variables
(which is isomorphic to Maybe):

data Var a
    = B      -- the bound variable
    | F a    -- free variables from the enclosing scope
With this, we can extend regular expressions with let.
First we make it a functor, i.e., change RE to RE a,
and then also add two new constructors: Var and Let:
data RE a
    = Empty
    | Eps
    | Ch Char
    | App (RE a) (RE a)
    | Alt (RE a) (RE a)
    | Star (RE a)
    | Var a
    | Let (RE a) (RE (Var a))
Note that we keep the argument a unchanged in all
recursive occurrences of RE, with the exception of
the body of the Let, where we use Var a instead,
indicating that we can use B in that body to refer
to the variable bound by the Let.
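The code below uses fmap over RE. A minimal sketch of the Functor instances (both could equally be derived with DeriveFunctor; note the polymorphic recursion in the Let case):

instance Functor Var where
    fmap _ B     = B
    fmap f (F x) = F (f x)

instance Functor RE where
    fmap _ Empty     = Empty
    fmap _ Eps       = Eps
    fmap _ (Ch c)    = Ch c
    fmap f (App r s) = App (fmap f r) (fmap f s)
    fmap f (Alt r s) = Alt (fmap f r) (fmap f s)
    fmap f (Star r)  = Star (fmap f r)
    fmap f (Var a)   = Var (f a)
    fmap f (Let r s) = Let (fmap f r) (fmap (fmap f) s)   -- body: shift under the binder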
In the actual rere library, the Let (and later Fix) constructors
additionally have an irrelevant Name field, which allows us to retain
the variable names and use them for pretty-printing.
I omit them from the presentation in this blog post.
Now, we can write a regular expression with repetitions,
like let x = a* in x·x instead of a*·a*; or in Haskell:
ex2 :: RE Void
ex2 = Let (star (Ch 'a')) (Var B <> Var B)
The use of Void as the parameter tells us that the expression is closed,
i.e., doesn’t contain any free variables.
We still need to extend nullable and derivative to work with
the new constructors.
For nullable, we’ll simply pass a function telling whether
variables in context are nullable.
The existing constructors just pass a context around:
nullable :: RE Void -> Bool
nullable = nullable' absurd

nullable' :: (a -> Bool) -> RE a -> Bool
nullable' _ Empty     = False
nullable' _ Eps       = True
nullable' _ (Ch _)    = False
nullable' f (App r s) = nullable' f r && nullable' f s
nullable' f (Alt r s) = nullable' f r || nullable' f s
nullable' _ (Star _)  = True
The cases for Var and Let
use and extend the context, respectively:
-- Var: look in the context
nullable' f (Var a) = f a

-- Let: - compute `nullable r`
--      - extend the context
--      - continue with `s`
nullable' f (Let r s) = nullable' (unvar (nullable' f r) f) s
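For instance, evaluating the nullability of ex2 proceeds roughly as follows (a sketch of the reduction):

--   nullable ex2
-- = nullable' absurd (Let (star (Ch 'a')) (Var B <> Var B))
-- = nullable' (unvar True absurd) (Var B <> Var B)   -- star (Ch 'a') is nullable
-- = True && True
-- = True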
The unvar function corresponds to maybe, but
transported to our Var type:
unvar :: r -> (a -> r) -> Var a -> r
unvar b _ B     = b
unvar _ f (F x) = f x
How to extend derivative to cover the new cases requires a bit
more thinking. The idea is similar: we want to add to the context
whatever we need to know about the variables.
The key insight is to replace every Let binding by two Let
bindings, one copying the original, and one binding to the
derivative of the let-bound variable. Because the number of let
bindings changes, we have to carefully re-index variables as we go.
Therefore, the context for derivative consists of
three pieces of information per variable:
whether the variable is nullable (we need it for derivative of App),
the variable denoting the derivative of the original variable,
the re-indexed variable denoting the original value.
The top-level function derivative :: Char -> RE Void -> RE Void
now makes use of a local helper function
derivative' :: (Eq a, Eq b) => (a -> (Bool, b, b)) -> RE a -> RE b
which takes this context. Note that, as discussed above,
derivative' changes the indices of the variables. However,
at the top-level, both a and b are Void, and the
environment can be trivially instantiated to the function
with empty domain.
The derivative' case for Var is simple: we just look up the
derivative of the Var in the context.
derivative' f (Var a) = Var (sndOf3 (f a))
The case for Let is quite interesting:
derivative' f (Let r s)
    = let_ (fmap (trdOf3 . f) r)           -- rename variables in r
    $ let_ (fmap F (derivative' f r))      -- binding for derivative of r
    $ derivative' (\case
        B   -> (nullable' (fstOf3 . f) r, B, F B)
        F x -> bimap (F . F) (F . F) (f x))
    $ s
...
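The triple projections fstOf3, sndOf3 and trdOf3 are not shown in this post; presumably they are the obvious ones:

fstOf3 :: (a, b, c) -> a
fstOf3 (x, _, _) = x

sndOf3 :: (a, b, c) -> b
sndOf3 (_, y, _) = y

trdOf3 :: (a, b, c) -> c
trdOf3 (_, _, z) = z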
As a formula, it looks like:

    derivative c (let x = r in s)
        = let x  = r
              x′ = derivative c r
          in  derivative c s        -- inside s, the derivative of x is x′
For our running example let x = a* in x·x, or
Let (star (Ch 'a')) (Var B <> Var B),
we call derivative' recursively with an argument of type
RE (Var a), corresponding to the one variable x,
and we get back an RE (Var (Var b)), corresponding to the two
variables x and x′.
The careful reader will also have noticed the smart constructor let_, which
does a number of standard rewritings on the fly
(which I explain in a Do you have a problem? Write a compiler! talk).
These are justified by the properties of substitution:
-- let-from-let
let x = (let y = a in b) in c
-- ==>
let y = a; x = b in c
-- inlining of cheap bindings
let x = a in b
-- ==>
b [x := a]    -- when a is cheap, i.e. Empty, Eps, Ch or Var
-- used once, special case
let x = a in x
-- ==>
a
-- unused binding
let x = a in b
-- ==>
b    -- when x is unused in b
And importantly, we employ a quick form of common-subexpression elimination (CSE):
let x = a in f x a
-- ==>
let x = a in f x x
This form of CSE is easy and fast to implement, as
we don’t introduce new lets, only consider what we already bound
and try to increase sharing.
It’s time for examples. Recall again ex2, which was defined as let x = a* in x·x, or
Let (star (Ch 'a')) (Var B <> Var B).
Let’s try to observe the match of a string of as step by step:
As our smart constructors are quite smart, the automaton stays
in its single state. The union comes from the derivative of App:
as r is nullable, we get derivative 'a' r \/ derivative 'a' r <> r.
And as derivative 'a' r = r, we don’t see any additional let bindings.
Recursion
Now we are ready for the main topic of the post: recursion.
We add one more constructor to our datatype of regular expressions:
data RE a
    ...
    | Fix (RE (Var a))
The Fix construct looks similar to Let, except that the bound variable is
semantically equivalent to the whole expression.
We can unroll such an expression by substituting it into
itself:

    fix x. r  =  r [x := fix x. r]
The Fix constructor subsumes the Kleene star, as
r* can now be expressed as
fix x. (ε ∪ r·x),
which feels like a very natural definition indeed.
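Indeed, a star smart constructor could be defined via Fix (a sketch, assuming the Functor instance from earlier and that the smart constructors generalise to the parametrised RE a; the library also keeps Star as a primitive):

starViaFix :: RE a -> RE a
starViaFix r = Fix (Eps \/ (fmap F r <> Var B))   -- fix x. (ε ∪ r·x)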
For example, ex1, previously defined using the Kleene star as (ab)*, could also be re-defined as
fix x. (ε ∪ a·b·x). That looks like:

ex3 :: RE Void
ex3 = Fix (Eps \/ Ch 'a' <> Ch 'b' <> Var B)
The problem is now the same as with Let: How to define nullable and derivative?
Fortunately, we have most of the required machinery already in place
from the addition of Var and Let.
Nullability of Fix relies on Kleene’s theorem to compute the
least fixed point of a monotonic recursive definition,
like in Parsing with Derivatives. The idea is to unroll Fix
once, and to pretend that the nullability of the recursive
occurrence of the bound variable in Fix is False:
nullable' :: (a -> Bool) -> RE a -> Bool
...
nullable' f (Fix r) = nullable' (unvar False f) r
In other words, we literally assume that the nullability of the new binding
is False, and see what comes out. We don’t need to iterate
more than once, as False will either flip to True right away, or will
never do so even with further unrollings.
Following a similar idea, our smart constructor fix_ is
capable of recognising an Empty fixed point by substituting
Empty for the recursive occurrence in the unrolling:
fix_ :: RE (Var a) -> RE a
fix_ r | (r >>>= unvar Empty Var) == Empty = Empty
...
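For example, fix x. (a·x) can never finish matching any string, so it denotes the empty language; assuming substitution (>>>=) re-applies the smart constructors (so that a·∅ collapses to ∅), fix_ detects this:

-- fix x. (a · x) recognises no string at all:
-- >>> fix_ (Ch 'a' <> Var B)
-- Empty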
This works because Empty is a bottom of the language-inclusion lattice
(just as False is a bottom of the Bool lattice).
The extension of derivative is again a bit more involved,
but it resembles what we did for Let:
As the body of a fix contains self-references x,
the derivative of a fix will also be a fix.
Thus, when we need to compute the derivative of x, we’ll use the variable x′ bound by the new fix.
Importantly, not all occurrences of x in the body of a fix
will turn into references to its derivative (e.g., if they appear to the right of an App,
or in a Star), so we need to save the value of x in a let binding –
how fortunate that we just introduced those …
Schematically, the transformation looks as follows:

    derivative c (fix x. r)
        = let x = fix x. r
          in  fix x′. derivative c r     -- inside r, the derivative of x is x′
In the rest, we will use a shorthand notation for a let binding to a fix, as in
let x = (fix x. r) in s.
We will write such a binding more succinctly as
let_rec x = r in s,
with the subscript indicating that the binding is recursive.
We prefer this notation over introducing a separate recursive binder, because in a
cascade of let expressions, we can have individual bindings being
recursive, but we still cannot forward-reference later bindings.
Applying the abbreviation to our derivation rule above yields:

    derivative c (fix x. r)  =  let_rec x = r in fix x′. derivative c r

Let’s compare this to the let case, rearranged slightly, to establish
the similarity:

    derivative c (let x = r in s)
        = let x = r in (let x′ = derivative c r in derivative c s)
Consequently, the implementation in Haskell also looks similar
to the Let case:
derivative' f r0@(Fix r)
    = let_ (fmap (trdOf3 . f) r0)
    $ fix_
    $ derivative' (\case
        B   -> (nullable' (fstOf3 . f) r0, B, F B)
        F x -> bimap (F . F) (F . F) (f x))
    $ r
Let’s see how it works in practice. We observe the step-by-step matching of
ex3 (which was ex1 re-defined using a fixed point rather than the
Kleene star) on abab:
We couldn’t wish for a better outcome. We see the same two-state ping-pong
behavior as we got using the Kleene star.
More examples
The fix / Fix construction is much more powerful than the Kleene star.
Let’s look at some examples …
aⁿbⁿ
Probably the simplest non-regular language is some number of as followed
by the same number of bs:

    { aⁿ bⁿ | n ∈ ℕ }

We can describe that language using our library, thanks to the presence
of fixed points:

    fix x. (ε ∪ a·x·b)

(note the variable x in between the literal symbols a and b).
Transcribed to Haskell code, this is:
ex4 :: RE Void
ex4 = Fix (Eps \/ Ch 'a' <> Var B <> Ch 'b')
And we can test the expression on a string in the language, for example "aaaabbbb":
Now things become more interesting. We can see how in the trace of this
not-so-regular expression, we obtain let bindings resembling the stack
of a pushdown automaton.
From the trace one can relatively easily see that if we “forget” one b at
the end of the input string, then the “state” b isn’t nullable,
so the string won’t be recognized.
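To make the expected behaviour concrete: with a match driver like the one sketched earlier (adapted to RE Void), we would see, doctest-style:

-- >>> match ex4 "aaaabbbb"
-- True
-- >>> match ex4 "aaaabbb"
-- False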
Left recursion
Previously in this post, we have rewritten (ab)*
as fix x. (ε ∪ a·b·x). But another option is to
use recursion on the left, i.e., to write fix x. (ε ∪ x·a·b) instead:
This automaton works as well. In fact, in some sense it works better
than the right-recursive one: we can see (as an artifact of variable naming)
that we get the derivatives as output of each step.
We do save the original expression in a let binding, but as it is unused
in the result, our smart constructors will drop it:
Arithmetic expressions
Another go-to example of context free grammars is arithmetic
expressions:
The Haskell version is slightly more inconvenient to write
due to the use of de Bruijn indices, but otherwise straightforward:
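A possible transcription (a sketch of mine, using the constructors from this post; the actual rere example may differ in details such as name hints):

ex5 :: RE Void
ex5 = Fix $  digits
          \/ (Var B <> Ch '+' <> Var B)
          \/ (Var B <> Ch '*' <> Var B)
          \/ (Ch '(' <> Var B <> Ch ')')
  where
    digits = Fix (digit <> (Eps \/ Var B))          -- one or more digits
    digit  = foldr1 (\/) [Ch c | c <- ['0'..'9']]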
Here is an (abbreviated) trace of matching an input
string:
One can see that the final state is nullable, and the word is therefore accepted:
one of the options in the final expression is itself nullable, because it is a union containing ε.
The other two options are “continuations” starting with + or *,
as the arithmetic expression could indeed continue with these two characters.
Conversion from context-free grammars
Can all context-free languages be expressed in this framework?
Is there some algorithm to rewrite
a usual context-free grammar into the formalism presented here?
The answer to both these questions is yes.
For example, the following unambiguous grammar for arithmetic expressions

    expr ::= mult '*' expr | mult
    mult ::= term '+' mult | term
    term ::= digits | '(' expr ')'

can be converted into a “recursive regular expression”.
And it works – it’s fascinating to see how the “state expression” evolves during a match:
In general,
the conversion from a context-free grammar to a recursive regular expression
makes use of the following theorem by Bekić [Bekic1984]:

    fix (x, y). (f(x, y), g(x, y))  =  (x̂, ŷ)
        where x̂ = fix x. f(x, fix y. g(x, y))
              ŷ = fix y. g(x̂, y)

We can use this bisection lemma to eliminate simultaneous recursion in a CFG,
reducing it to a recursive regular expression.
A CFG is a simultaneous fixed point of its productions, one equation per nonterminal.
We can assume that there’s at least one production, for the starting symbol.
If the CFG has only one production, then we can convert it to a recursive
regular expression directly, using a single fix.
Otherwise, we split the productions into the first one, x, and the rest, y
(a vector of distinct variables), extract the corresponding functions
f and g, and apply the lemma to eliminate x from the rest of the system.
By iterating this process, we can reduce the number of top-level
productions until we have a single production left and can apply
the base case.
Note on performance
Let me start by saying that I haven’t really measured very much.
The examples above run at interactive speed in GHCi,
and I suspect that collecting and pretty-printing the traces is
not free.
I wrote the parser for unambiguous arithmetic expressions based
on the example above, using parsec:
import Control.Applicative (some, (<|>))
import Control.Monad (void)
import qualified Text.Parsec as P
import qualified Text.Parsec.String as P

ex7parsec :: P.Parser ()
ex7parsec = expr
  where
    expr   = void $ P.try (mult >> P.char '*' >> expr) <|> mult
    mult   = void $ P.try (term >> P.char '+' >> mult) <|> term
    term   = P.try digits <|> void (P.char '(' *> expr *> P.char ')')
    digits = void $ some digit
    digit  = P.satisfy (\c -> c >= '0' && c <= '9')
Note that this parser only recognises, i.e., it doesn’t build a parse tree.
I also didn’t use good practices in writing parsec parsers,
rather translating the CFG as directly as possible.
The results are sobering. The recursive-regexp approach is
1000–10000 times slower (and gets slower the longer the input string is).
This is not really surprising, as the matching algorithm recomputes
a lot of things on each character, but still unfortunate.
We can get a 100× speedup (while still being 100× slower than parsec)
by introducing explicit sharing instead of Fix (and Let).
In the end, we are doing the same as Might, Darais and Spiewak,
with the difference that our public interface is non-opaque.
For this, we take the original regular expression type we started with,
and add a new constructor Ref:

data RR
    = Empty
    | Eps
    | Ch Char
    | App RR RR
    | Alt RR RR
    | Star RR
    | Ref Int RR   -- reference to a shared, possibly cyclic, subexpression
This structure can now be circular, as long as cycles use
Ref. Conversion from RE with Fix to RR is a direct mapping
of constructors; the interesting part happens with Fix, where we
have to use mfix (and a lazy state monad):
go :: R.RE RR -> State Int RR
go (R.Fix _ r) = mfix $ \res -> do
    i  <- newId
    r' <- go (fmap (unvar res id) r)
    return (Ref i r')
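The newId helper is not shown; presumably it is just a counter in the lazy State monad (a sketch):

import Control.Monad.State.Lazy (State, state)

newId :: State Int Int
newId = state $ \i -> (i, i + 1)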
The implementation is still relatively simple and, importantly, not unacceptably slow.
In a simple artificial benchmark of
parsing the arithmetic expression
1000*(2020+202)*(20+3)*((30+20)*10000)+123123123*12313,
the results are:
benchmarking parsec
time 22.31 μs (21.44 μs .. 23.66 μs)
benchmarking rere
time 237.2 ms (207.5 ms .. 266.1 ms)
benchmarking ref
time 6.029 ms (5.486 ms .. 6.801 ms)
benchmarking derp
time 20.31 μs (18.37 μs .. 22.07 μs)
The Haskell used in this post is simple enough that the library
can easily be ported to Miranda.³
With this version, our example runs in
% echo "bench" | time mira rere.m
True
mira rere.m 0,32s user 0,02s system 99% cpu 0,343 total
Perhaps surprisingly, it is not much slower than GHC.
JSON
A less artificial benchmark is parsing JSON.
I took the JSON syntax definition from https://www.crockford.com/mckeeman.html
and translated it directly to a CFG,
which is then translated into a recursive regular expression.
I used a simple 1611-byte JSON file (a package.json definition).
It takes a dozen microseconds for aeson to parse it.
The “optimised” implementation of this section needs a dozen seconds,
i.e., it is a million times slower.
The parser is slow, but acceptably so to be used in tests,
to verify the implementation of a “fast” production variant.
Unfortunately, the derp parser generated from the above regular expression
loops, so we cannot know how it would perform.
Intersection
We know that regular expressions are closed under intersection,
and it’s possible to define a conversion from an RE that makes
use of the And constructor to an RE that does not.
Context-free grammars are not closed under intersection,
but our recursive RE can still be extended with an additional And constructor,
and everything discussed above will continue to work.
data RE a
    = ...
    | And (RE a) (RE a)
We can also add a Full constructor that matches anything.
The extension of nullable and derivative is so simple that
you might think something will break; yet nothing does:
nullable' _ Full      = True
nullable' f (And r s) = nullable' f r && nullable' f s

derivative' _ Full      = Full
derivative' f (And r s) = derivative' f r /\ derivative' f s

(/\) :: Ord a => RE a -> RE a -> RE a
...
r /\ s = And r s
As an example, we consider the intersection of two context-free languages,

    { aⁿ bⁿ cᵐ }  ∧  { aᵐ bⁿ cⁿ }

whose intersection { aⁿ bⁿ cⁿ } is known to be not context-free.
However, we can simply match with it:
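A possible transcription (a sketch of mine, using the constructors from this post; the name ex8 is hypothetical):

ex8 :: RE Void
ex8 = anbnCs /\ asBnCn
  where
    -- aⁿ bⁿ c*
    anbnCs = Fix (Eps \/ Ch 'a' <> Var B <> Ch 'b') <> star (Ch 'c')
    -- a* bⁿ cⁿ
    asBnCn = star (Ch 'a') <> Fix (Eps \/ Ch 'b' <> Var B <> Ch 'c')

-- >>> match ex8 "aabbcc"
-- True
-- >>> match ex8 "aabbc"
-- False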
This is of course cheating somewhat, as And / ∧ occurs here only at the
top level, and not e.g. inside of Fix / fix. However, even there it
wouldn’t pose problems for the algorithm; it is just difficult to come up with
meaningful examples.
Also recall that we interpret fix as a least fixed point,
so for example, even though
fix x. x has every language as a fixed point, we have
the empty language as another fixed point, and that is the least one.
Therefore the above expression works (and indeed is automatically simplified to)
∅ / Empty.
However, And / ∧ adds expressive power to the language, so it cannot be
eliminated as it can be in pure regular expressions (where elimination causes a combinatorial
explosion of expression size, so it may not be a good idea there either).
There’s one clear problem with And, however: languages defined using
And cannot be easily generated. Assuming we use the first branch of an intersection
to generate a candidate string, it must still also match the second branch.
I don’t know whether the language inhabitation problem for this class of languages
is decidable (for CFGs it is, but we’re now outside of CFGs).
Consider the intersection of two simple regular expressions,
strings of as of odd and of even length:

    a(aa)*  ∧  (aa)*

Their intersection is empty, but that is not structurally obvious.
For example, if we match it against a string of as, the expression will stay in a single
non-nullable state, but it won’t simplify to Empty:
Conclusion
It was nice to combine known things in a new way, and the result is interesting.
We now know that we can add fixed points to regular expressions
or recursive types to non-commutative intuitionistic linear logic.
The systems are still well-behaved,
but many problems become harder.
As CFG equivalence is undecidable,
so is term synthesis in NCILL with recursive types
(as synthesizing a term would yield a proof of equivalence).
We can also use the recursive RE not only to match on strings,
but also to generate them.
Therefore we can use
it to statistically determine grammar equivalence (which is not a new idea).
Finally, this is not only for fun.
I’m trying to formalize the grammars of fields in .cabal files.
The parsec parsers are the definition, but we now
have more declarative definitions too, and we can compare the two using
QuickCheck.
Additionally we get nice-looking grammar definitions that are (hopefully) human-readable.
If you ever wondered what the complete and precise syntax for version ranges in .cabal files is, here
is what it looks like at the time of writing this post:
The notation is described in detail in the Cabal user manual,
where you can also find more grammar definitions.
Code in the Cabal Parsec instances is accumulating historical baggage,
and is written to produce helpful error messages, not necessarily with
clarity of the grammar in mind.
However, we can compare it (and its companion Pretty instance)
with its RE counterpart to find possible inconsistencies.
Also, Cabal has a history
of not handling whitespace well, either always requiring, completely forbidding,
or allowing it where it shouldn’t be allowed.
The RE-derived generator can be amended to produce slightly skewed strings,
for example inserting or removing whitespace, to help identify and overcome
such problems.
References
[Bekic1984]
Bekić, Hans: Definable operations in general algebras, and the theory of automata and flowcharts. In: Jones, C. B. (ed.): Programming languages and their definition: H. Bekič (1936–1982). Berlin, Heidelberg : Springer Berlin Heidelberg, 1984 — ISBN 978-3-540-38933-0, pp. 30–55. https://doi.org/10.1007/BFb0048939
[Might2011]
Might, Matthew; Darais, David; Spiewak, Daniel: Parsing with derivatives: A functional pearl. In: Proceedings of the 16th ACM Sigplan International Conference on Functional Programming, ICFP ’11. New York, NY, USA : Association for Computing Machinery, 2011 — ISBN 9781450308656, pp. 189–195. https://doi.org/10.1145/2034773.2034801
[Owens2009]
Owens, Scott; Reppy, John; Turon, Aaron: Regular-expression derivatives re-examined. In: Journal of Functional Programming 19 (2009), No. 2, pp. 173–190. https://doi.org/10.1017/S0956796808007090
1. There is an implementation in the derp package,
which uses Data.IORef and unsafePerformIO.
2. Using smart constructors in this approach, we obtain relatively small automata, but they are not minimal.
To actually obtain minimal ones, kleene can compare regular expressions for equivalence,
e.g. concluding that two syntactically different expressions are in fact equivalent.
3. Miranda was first released in 1985, and since January 2020 it is also available under
the BSD-2-Clause license, so anyone can play with it.
The lack of type classes makes the code a bit more explicit,
but otherwise it doesn’t look much different.