TL;DR: We add variables, let bindings, and explicit recursion via fixed
points to classic regular expressions. It turns out that the resulting
explicitly recursive, finitely described languages are well suited for
analysis and introspection.
It’s been almost a year since I touched
the kleene library,
and almost two years since I published it –
a good time to write a little about regular expressions.
I like regular expressions very much. They are a truly declarative way to
write down a grammar… as long as the grammar is expressible
as a regular expression.
Matthew Might, David Darais and Daniel Spiewak have written
a paper, Functional Pearl: Parsing with Derivatives, published in the ICFP ’11 proceedings
[Might2011],
in which regular expressions are extended to handle context-free languages.
However, they rely on memoization, and – as the structures involved are infinite – also
on reference equality. In short, their approach is not implementable in idiomatic
Haskell. 1
There’s another technique that works for a subset of context-free languages.
In my opinion, it is very elegant, and it is at least not painfully slow.
The result is available on Hackage:
the rere library.
The idea is to treat regular expressions as a proper programming language,
and to add the constructions which proper languages should have:
variables and recursion.
This blog post will describe the approach taken by rere in more detail.
Regular expression recap
The abstract syntax of a regular expression (over the alphabet of Unicode characters)
is given by the following “constructors”:

- Null regexp: $\emptyset$
- Empty string: $\varepsilon$
- Characters: $a$, $b$, etc.
- Concatenation: $r \cdot s$
- Alternation: $r \vee s$
- Kleene star: $r^\star$
The above can be translated directly into Haskell:
```haskell
data RE = Empty
        | Eps
        | Ch Char
        | App RE RE
        | Alt RE RE
        | Star RE
```
In the rere implementation, instead of a bare Char we use a set of characters, CharSet,
as recommended by Owens et al. in
Regular-expression derivatives re-examined [Owens2009].
This makes the implementation more efficient, as the common case of character
sets is explicitly taken into account.
We write them in curly braces: $\{a, b, c\}$.
We can give declarative semantics to these constructors.
These will look like typing rules. A judgment
$r \vdash s$ denotes that the regular expression $r$ successfully
recognises the string $s$.
For example, the rule for application now looks like:

$$\frac{r_1 \vdash s_1 \qquad r_2 \vdash s_2}{r_1 \cdot r_2 \vdash s_1\,s_2}$$

This rule states that if $r_1$ recognises $s_1$,
and $r_2$ recognises $s_2$, then the concatenation
expression $r_1 \cdot r_2$ recognises the
concatenated string $s_1\,s_2$.
For alternation we have two rules, one for each of the alternatives:

$$\frac{r_1 \vdash s}{r_1 \vee r_2 \vdash s} \qquad\qquad \frac{r_2 \vdash s}{r_1 \vee r_2 \vdash s}$$
The rules resemble the structure of non-commutative intuitionistic linear logic,
if you are into such stuff.
Not only do you have to use everything exactly once; you have to use
everything in order.
There aren’t any substructural rules: no weakening, no contraction,
and even no exchange. I will omit the rest of the rules; you can look them up
(and think about what the rules for the Kleene star would look like: reminiscent
of the “why not” exponential $?r$).
It’s a good idea to define smart versions of the constructors,
which simplify regular expressions as they are created.
For example, in the following Semigroup instance for concatenation,
<> is a smart version of App:
```haskell
instance Semigroup RE where
    -- Empty annihilates
    Empty <> _ = Empty
    _ <> Empty = Empty
    -- Eps is unit of <>
    Eps <> r   = r
    r <> Eps   = r
    -- otherwise use App
    r <> s     = App r s
```
The smart version of Alt is called \/, and the smart version of
Star is called star.
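Their definitions are not spelled out here; a plausible sketch, assuming a
derived Eq instance for RE and ignoring the cleverer rewrites the real library
performs, looks like this:

```haskell
-- A sketch of the smart alternation and star constructors.
-- Assumes `deriving Eq` on RE; the real rere versions do more.
(\/) :: RE -> RE -> RE
Empty \/ r = r                    -- Empty is the unit of alternation
r \/ Empty = r
r \/ s
    | r == s    = r               -- idempotence
    | otherwise = Alt r s

star :: RE -> RE
star Empty      = Eps             -- ∅* = ε
star Eps        = Eps             -- ε* = ε
star r@(Star _) = r               -- (r*)* = r*
star r          = Star r
```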
We can check that the simplifications performed by the smart
constructors are sound, by using the semantic rules.
For example, the simplification Eps <> r = r is justified by
the following equivalence of derivation trees:

$$\frac{\varepsilon \vdash \varepsilon \qquad r \vdash s}{\varepsilon \cdot r \vdash s} \qquad\Longleftrightarrow\qquad r \vdash s$$

If a string $s$ is matched by $\varepsilon \cdot r$,
then “the match” can be constructed in only one way: by applying the
concatenation rule, where the left subderivation matches $\varepsilon$
against the empty string. Therefore $s$ is also matched by
bare $r$. If we introduced proof terms, we’d have
concrete evidence of the match as terms in this language.
There is, however, a problem: matching using the declarative rules is not practical.
At several points in these rules, we have to guess:
whether we should pick the left or the right branch, or
where we should split a string to match a concatenated regular expression.
For a practical implementation, we need
a syntax-directed approach. Interestingly, we then need just
two rules:

$$\frac{\mathsf{nullable}(r)}{r \vdash \varepsilon} \qquad\qquad \frac{\partial_c\,r \vdash s}{r \vdash c\,s}$$
In the above rules, we use two operations:

- the decision procedure $\mathsf{nullable}$, which tells whether a
  regular expression can recognise the empty string, and
- a mapping that, given a single character $c$ and a regular expression $r$,
  computes a new regular expression $\partial_c\,r$, called the derivative of
  $r$ with respect to $c$.
Both operations are quite easy to map to Haskell. The function
nullable is defined as a straightforward recursive function:

```haskell
nullable :: RE -> Bool
nullable Empty     = False
nullable Eps       = True
nullable (Ch _)    = False
nullable (App r s) = nullable r && nullable s
nullable (Alt r s) = nullable r || nullable s
nullable (Star _)  = True
```
The Brzozowski derivative is best understood by considering
the formal language that regular expressions represent:

$$\mathcal{L}(\partial_c\,r) = \{\, s \mid c\,s \in \mathcal{L}(r) \,\}$$

In Haskell terms: derivative c r matches a string str if and only if r matches c : str.
From this equivalence, we can more or less directly infer an implementation:
```haskell
derivative :: Char -> RE -> RE
derivative _ Empty = Empty
derivative _ Eps   = Empty
derivative c (Ch x)
    | c == x    = Eps
    | otherwise = Empty
derivative c (App r s)
    | nullable r = derivative c s \/ derivative c r <> s
    | otherwise  = derivative c r <> s
derivative c (Alt r s)   = derivative c r \/ derivative c s
derivative c r0@(Star r) = derivative c r <> r0
```
We could try to show that the declarative and syntax-directed
systems are equivalent, but I omit this here, because it has been
done often enough in the literature
(though probably not in exactly this way and notation).
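To actually run a match, we only need to fold derivative over the input
and check nullability at the end. The following matcher is a minimal
sketch (rere’s actual top-level API may differ):

```haskell
-- Take derivatives character by character, then check whether the
-- final expression accepts the empty string.
match :: RE -> String -> Bool
match r = nullable . foldl (flip derivative) r
```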
We can now watch how a regular expression “evolves” while matching
a string. For example, if we take the regular expression
$(ab)^\star$,
which in code looks like

```haskell
ex1 :: RE
ex1 = star (Ch 'a' <> Ch 'b')
```

then the following is how match ex1 "abab" proceeds:
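The intermediate expressions ping-pong between two values
(hand-computed with the definitions above):

```
(ab)*      -- start
b·(ab)*    -- after 'a'
(ab)*      -- after 'b'
b·(ab)*    -- after 'a'
(ab)*      -- after 'b': nullable, so "abab" is accepted
```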
We can see that there is implicitly a small finite state automaton
with two states:
an initial state $(ab)^\star$
and a secondary state $b\,(ab)^\star$.
This is the approach taken by the kleene package to transform regular expressions
into finite state machines.
There is an additional character-set optimization from
Regular-expression derivatives re-examined [Owens2009]
by Owens, Reppy and Turon, but in essence the approach works as follows:
try all possible derivatives, and in the process collect all the states and construct a transition function. 2
The string is accepted because the matching process stops at the state $(ab)^\star$, which is nullable.
Variables and let-expressions
The first new construct we add to regular expressions is the let expression.
Let expressions alone do not add any matching power, but they are a prerequisite
for allowing recursive expressions.
We already used meta-variables in the rules in the previous section.
Let expressions allow us to internalise this notion.
The declarative rule for let expressions is:

$$\frac{r'[x \mapsto r] \vdash s}{\mathbf{let}\; x = r\; \mathbf{in}\; r' \vdash s}$$

Here, the notation $r'[x \mapsto r]$
denotes substituting the variable $x$ by the regular
expression $r$ in the regular expression $r'$.
To have let expressions in our implementation, we need to
represent variables and we must be able to perform substitution.
My tool of choice for handling variables and substitution in general
is the bound library.
But for the sake of keeping the blog post self-contained, we’ll
define the needed bits inline. We’ll reproduce a simple variant of
bound, which amounts to using de Bruijn indices and polymorphic
recursion.
We define our own datatype to represent variables
(which is isomorphic to Maybe):
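With B for the variable bound here and F wrapping anything free
(the derived instances are for convenience; the library may differ):

```haskell
-- Isomorphic to Maybe: B is the bound variable, F a is a free one.
data Var a = B | F a
  deriving (Eq, Show, Functor)
```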
With this, we can extend regular expressions with let.
First we make it a functor, i.e., change RE to RE a,
and then also add two new constructors: Var and Let:
```haskell
data RE a = Empty
          | Eps
          | Ch Char
          | App (RE a) (RE a)
          | Alt (RE a) (RE a)
          | Star (RE a)
          | Var a
          | Let (RE a) (RE (Var a))
```
Note that we keep the argument a unchanged in all
recursive occurrences of RE, with the exception of
the body of the Let, where we use Var a instead,
indicating that we can use B in that body to refer
to the variable bound by the Let.
In the actual rere library, the Let (and later Fix) constructors
additionally have an irrelevant Name field, which allows us to retain
the variable names and use them for pretty-printing.
I omit them from the presentation in this blog post.
Now we can write a regular expression with repetitions,
like $\mathbf{let}\; x = a^\star\; \mathbf{in}\; x \cdot x$ instead of $a^\star \cdot a^\star$; or in Haskell:

```haskell
ex2 :: RE Void
ex2 = Let (star (Ch 'a')) (Var B <> Var B)
```
The use of Void as the parameter tells us that the expression is closed,
i.e., doesn’t contain any free variables.
We still need to extend nullable and derivative to work with
the new constructors.
For nullable, we’ll simply pass a function telling whether
variables in context are nullable.
The existing constructors just pass a context around:
```haskell
nullable :: RE Void -> Bool
nullable = nullable' absurd

nullable' :: (a -> Bool) -> RE a -> Bool
nullable' _ Empty     = False
nullable' _ Eps       = True
nullable' _ (Ch _)    = False
nullable' f (App r s) = nullable' f r && nullable' f s
nullable' f (Alt r s) = nullable' f r || nullable' f s
nullable' _ (Star _)  = True
```
The cases for Var and Let
use and extend the context, respectively:
```haskell
-- Var: look in the context
nullable' f (Var a) = f a

-- Let: - compute `nullable r`
--      - extend the context
--      - continue with `s`
nullable' f (Let r s) = nullable' (unvar (nullable' f r) f) s
```
The unvar function corresponds to maybe, but
transported to our Var type:
```haskell
unvar :: r -> (a -> r) -> Var a -> r
unvar b _ B     = b
unvar _ f (F x) = f x
```
Extending derivative to cover the new cases requires a bit
more thought. The idea is similar: we want to add to the context
whatever we need to know about the variables.
The key insight is to replace every Let binding by two Let
bindings: one copying the original, and one binding the
derivative of the let-bound variable. Because the number of let
bindings changes, we have to carefully re-index variables as we go.
Therefore, the context for derivative consists of
three pieces of information per variable:

- whether the variable is nullable (we need this for the derivative of App),
- the variable denoting the derivative of the original variable,
- the re-indexed variable denoting the original value.
The top-level function derivative :: Char -> RE Void -> RE Void
now makes use of a local helper function
```haskell
derivative' :: (Eq a, Eq b) => (a -> (Bool, b, b)) -> RE a -> RE b
```
which takes this context. Note that, as discussed above,
derivative' changes the indices of the variables. However,
at the top-level, both a and b are Void, and the
environment can be trivially instantiated to the function
with empty domain.
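The context triples are taken apart with projection helpers, presumably
defined along these lines:

```haskell
-- Projections out of the (nullable, derivative, original) triples:
fstOf3 :: (a, b, c) -> a
fstOf3 (x, _, _) = x

sndOf3 :: (a, b, c) -> b
sndOf3 (_, y, _) = y

trdOf3 :: (a, b, c) -> c
trdOf3 (_, _, z) = z
```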
The derivative' case for Var is simple: we just look up the
derivative of the Var in the context.
```haskell
derivative' f (Var a) = Var (sndOf3 (f a))
```
The case for Let is quite interesting:
```haskell
derivative' f (Let r s)
    = let_ (fmap (trdOf3 . f) r)        -- rename variables in r
    $ let_ (fmap F (derivative' f r))   -- binding for derivative of r
    $ derivative'
        (\case
            B   -> (nullable' (fstOf3 . f) r, B, F B)
            F x -> bimap (F . F) (F . F) (f x))
    $ s
```
As a formula it looks like:

$$\partial_c\,(\mathbf{let}\; x = r\; \mathbf{in}\; s) \;=\; \mathbf{let}\; x = r\; \mathbf{in}\; \mathbf{let}\; x' = \partial_c\,r\; \mathbf{in}\; \partial_c\,s$$

(where $x'$ stands for the derivative of $x$ inside $\partial_c\,s$).
For our running example $\mathbf{let}\; x = a^\star\; \mathbf{in}\; x \cdot x$, or
Let (star (Ch 'a')) (Var B <> Var B),
we call derivative' recursively with an argument of type
RE (Var a), corresponding to the one variable $x$,
and we get back an RE (Var (Var b)), corresponding to the two
variables $x$ and $x'$.
The careful reader will also have noticed the smart constructor let_, which
performs a number of standard rewritings on the fly
(which I explain in my Do you have a problem? Write a compiler! talk).
These are justified by the properties of substitution:
```
-- let-from-let
let x = (let y = a in b) in c
-- ==>
let y = a; x = b in c
```
```
-- inlining of cheap bindings
let x = a in b
-- ==>
b [ x -> a ]   -- when a is cheap, i.e. Empty, Eps, Ch or Var
```
```
-- used once, special case
let x = a in x
-- ==>
a
```
```
-- unused binding
let x = a in b
-- ==>
b   -- when x is unused in b
```
And importantly, we employ a quick form of common-subexpression elimination (CSE):

```
let x = a in f x a
-- ==>
let x = a in f x x
```
This form of CSE is easy and fast to implement, as
we don’t introduce new lets; we only consider what we have already bound
and try to increase sharing.
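To give a flavour, here is a minimal sketch of a let_ smart constructor
covering just two of the rewrites above; the inlining, unused-binding and
CSE cases, which need substitution and occurrence checks, are elided, and
the real rere version differs:

```haskell
-- A minimal let_ sketch. Assumes derived Functor instances for
-- Var and RE (used by fmap for re-indexing).
let_ :: RE a -> RE (Var a) -> RE a
-- used once, special case: let x = r in x  ==>  r
let_ r (Var B)   = r
-- let-from-let: re-index the body so that y = F B and x = B
let_ (Let r s) t = let_ r (let_ s (fmap (unvar B (F . F)) t))
-- otherwise build a Let (inlining, unused-binding and CSE elided)
let_ r s         = Let r s
```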
It’s time for an example. Recall ex2, which was defined as
$\mathbf{let}\; x = a^\star\; \mathbf{in}\; x \cdot x$, or Let (star (Ch 'a')) (Var B <> Var B).
Let’s try to observe the match of a string of as step by step.
As our smart constructors are quite smart, the automaton essentially stays
in a single state. The union comes from the derivative of App:
as r is nullable, we get derivative 'a' r \/ derivative 'a' r <> r.
And as derivative 'a' r = r, we don’t see any additional let bindings.
Recursion
Now we are ready for the main topic of the post: recursion.
We add one more constructor to our datatype of regular expressions:
```haskell
data RE a
    ...
    | Fix (RE (Var a))
```
The Fix construct looks similar to Let, except that the bound variable is
semantically equivalent to the whole expression.
We can unroll each such expression by substituting it into
itself:

$$\mathbf{fix}\; x.\, r \;=\; r[x \mapsto \mathbf{fix}\; x.\, r]$$
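In code, unrolling is a single substitution. Neither subst nor unroll is
shown in this post; a self-contained sketch, assuming derived Functor
instances for Var and RE (the real library uses the machinery of bound
instead), could look as follows:

```haskell
-- Capture-avoiding substitution for the de Bruijn representation;
-- note the polymorphic recursion in the Let and Fix cases.
subst :: (a -> RE b) -> RE a -> RE b
subst _ Empty     = Empty
subst _ Eps       = Eps
subst _ (Ch c)    = Ch c
subst f (App r s) = App (subst f r) (subst f s)
subst f (Alt r s) = Alt (subst f r) (subst f s)
subst f (Star r)  = Star (subst f r)
subst f (Var a)   = f a
subst f (Let r s) = Let (subst f r) (subst (unvar (Var B) (fmap F . f)) s)
subst f (Fix r)   = Fix (subst (unvar (Var B) (fmap F . f)) r)

-- Unroll a fixed point once: fix x. r  ==>  r [x -> fix x. r]
unroll :: RE a -> RE a
unroll r0@(Fix r) = subst (unvar r0 Var) r
unroll r          = r
```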
The Fix constructor subsumes the Kleene star, as
$r^\star$ can now be expressed as
$\mathbf{fix}\; x.\, \varepsilon \vee r\,x$,
which feels like a very natural definition indeed.
For example, ex1, previously defined using the Kleene star as $(ab)^\star$, could also be re-defined as
$\mathbf{fix}\; x.\, \varepsilon \vee ab\,x$. That looks like:
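```haskell
-- ex1 re-defined with an explicit fixed point; this is the ex3 that
-- the matching example below refers to.
ex3 :: RE Void
ex3 = Fix (Eps \/ Ch 'a' <> Ch 'b' <> Var B)
```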
The problem is now the same as with Let: How to define nullable and derivative?
Fortunately, we have most of the required machinery already in place
from the addition of Var and Let.
Nullability of Fix relies on Kleene’s theorem to compute the
least fixed point of a monotonic recursive definition,
as in Parsing with Derivatives. The idea is to unroll Fix
once, pretending that the nullability of the recursive
occurrence of the bound variable in Fix is False:

```haskell
nullable' :: (a -> Bool) -> RE a -> Bool
...
nullable' f (Fix r) = nullable' (unvar False f) r
```
In other words, we literally assume that the nullability of the new binding
is False, and see what comes out. We don’t need to iterate
more than once: either False flips to True right away, or it
never will, even with further unrollings.
Following a similar idea, our smart constructor fix_ is
capable of recognising an Empty fixed point, by substituting
Empty for the recursive occurrence in the unrolling:

```haskell
fix_ :: RE (Var a) -> RE a
fix_ r | (r >>>= unvar Empty Var) == Empty = Empty
...
```
This works because Empty is the bottom of the language-inclusion lattice
(just as False is the bottom of the Bool lattice).
For example, $\mathbf{fix}\; x.\, x$ unrolls to Empty under this substitution,
and indeed it denotes the empty language.
The extension of derivative is again a bit more involved,
but it resembles what we did for Let:
as the body of a $\mathbf{fix}$ contains self-references $x$,
the derivative of a $\mathbf{fix}$ will also be a $\mathbf{fix}$.
Thus, when we need to compute the derivative of $x$, we’ll use a fresh variable $x'$.
Importantly, not all occurrences of $x$ in the body of a $\mathbf{fix}$
will turn into references to its derivative (e.g., if they appear to the right of an App,
or in a Star), so we also need to save the original value of $x$ in a let binding –
how fortunate that we just introduced those …
Schematically, the transformation looks as follows:

$$\partial_c\,(\mathbf{fix}\; x.\, r) \;=\; \mathbf{let}\; x = \mathbf{fix}\; x.\, r\; \mathbf{in}\; \mathbf{fix}\; x'.\, \partial_c\,r$$
In the rest of the post, we will use a shorthand notation for a let binding to a $\mathbf{fix}$, as in
$\mathbf{let}\; x = \mathbf{fix}\; x.\, r\; \mathbf{in}\; s$.
We will write such a binding more succinctly as
$\mathbf{let}_{\mathsf{rec}}\; x = r\; \mathbf{in}\; s$,
with the subscript indicating that the binding is recursive.
We prefer this notation over introducing a separate $\mathbf{letrec}$, because in a
cascade of let expressions we can have individual bindings being
recursive, but we still cannot forward-reference later bindings.
Applying the abbreviation to our derivation rule above yields:

$$\partial_c\,(\mathbf{fix}\; x.\, r) \;=\; \mathbf{let}_{\mathsf{rec}}\; x = r\; \mathbf{in}\; \mathbf{fix}\; x'.\, \partial_c\,r$$

Let’s compare this to the let case, rearranged slightly, to establish
the similarity:

$$\partial_c\,(\mathbf{let}\; x = r\; \mathbf{in}\; s) \;=\; \mathbf{let}\; x = r\; \mathbf{in}\; \mathbf{let}\; x' = \partial_c\,r\; \mathbf{in}\; \partial_c\,s$$
Consequently, the implementation in Haskell also looks similar
to the Let case:
```haskell
derivative' f r0@(Fix r)
    = let_ (fmap (trdOf3 . f) r0)
    $ fix_
    $ derivative'
        (\case
            B   -> (nullable' (fstOf3 . f) r0, B, F B)
            F x -> bimap (F . F) (F . F) (f x))
    $ r
```
Let’s see how it works in practice. We observe the step-by-step matching of
ex3, which was ex1 defined using a fixed point rather than the
Kleene star, on the string "abab":
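Hand-computing with the rules above (the library’s actual pretty-printed
trace may differ slightly):

```
fix x. ε ∨ ab·x                    -- start, ex3
let_rec x = ε ∨ ab·x in b·x        -- after 'a'
fix x. ε ∨ ab·x                    -- after 'b'
let_rec x = ε ∨ ab·x in b·x        -- after 'a'
fix x. ε ∨ ab·x                    -- after 'b': nullable, accepted
```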
We couldn’t wish for a better outcome. We see the same two-state ping-pong
behavior as we got using the Kleene star.
More examples
The $\mathbf{fix}$ / Fix construct is much more powerful than the Kleene star.
Let’s look at some examples …
aⁿbⁿ
Probably the simplest non-regular language is some number of $a$s followed
by the same number of $b$s: $\{\, a^n b^n \mid n \geq 0 \,\}$.
We can describe that language using our library, thanks to the presence
of fixed points: $\mathbf{fix}\; x.\, \varepsilon \vee a\,x\,b$
(note the variable $x$ in between the literal symbols).
Transcribed to Haskell code, this is:

```haskell
ex4 :: RE Void
ex4 = Fix (Eps \/ Ch 'a' <> Var B <> Ch 'b')
```
And we can test the expression on a string in the language, for example "aaaabbbb":
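Using the match sketch from earlier, now at type RE Void -> String -> Bool,
we would expect:

```haskell
-- >>> match ex4 "aaaabbbb"
-- True
-- >>> match ex4 "aaaabbb"    -- one 'b' too few
-- False
```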
Now things become more interesting. In the trace of this
not-so-regular expression, we obtain let bindings resembling the stack
of a pushdown automaton.
From the trace one can relatively easily see that if we “forget” one b at
the end of the input string, then the final “state” $b$ isn’t nullable,
so the string won’t be recognized.
Left recursion
Previously in this post, we have rewritten $(ab)^\star$
as $\mathbf{fix}\; x.\, \varepsilon \vee ab\,x$. But another option is to
use recursion on the left, i.e., to write $\mathbf{fix}\; x.\, \varepsilon \vee x\,ab$ instead:
This automaton works as well. In fact, in some sense it works better
than the right-recursive one: we can see (as an artifact of variable naming)
that we get the derivatives as the output of each step.
We do save the original expression in a $\mathbf{let}$, but as it is unused
in the result, our smart constructors will drop it:
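Hand-computing the first few derivatives (primes mark the renamed
variables; the library’s output may differ in presentation):

```
fix x. ε ∨ x·ab          -- start
fix x'. x'·ab ∨ b        -- after 'a' (the saved let is unused, dropped)
fix x''. x''·ab ∨ ε      -- after 'b' (the original, up to renaming)
fix x'''. x'''·ab ∨ b    -- after 'a'
fix x''''. x''''·ab ∨ ε  -- after 'b': nullable, accepted
```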
Arithmetic expressions
Another go-to example of context-free grammars is arithmetic
expressions:
The Haskell version is slightly more inconvenient to write
due to the use of de Bruijn indices, but otherwise straightforward:
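The grammar and code are not reproduced here. As an illustration of what
such a definition looks like with the constructors above, here is a sketch
for a simplified, right-recursive grammar (expr ::= term | term '+' expr;
term ::= factor | factor '*' term; factor ::= digit | '(' expr ')');
the grammar used by the library’s example may differ:

```haskell
-- A sketch for a simplified arithmetic grammar. `digit` and the
-- right-recursive shape of expr and term are assumptions made to
-- keep the de Bruijn indices short.
digit :: RE a
digit = foldr1 (\/) [ Ch d | d <- ['0' .. '9'] ]

ex5 :: RE Void
ex5 = Fix                                        -- B = expr
    $ Let (digit \/ Ch '(' <> Var B <> Ch ')')   -- factor = digit ∨ (expr)
    $ Let (Fix (Var (F B) \/ Var (F B) <> Ch '*' <> Var B))
                                                 -- term = fix t. factor ∨ factor·'*'·t
    $ Var B \/ Var B <> Ch '+' <> Var (F (F B))  -- expr = term ∨ term·'+'·expr
```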