Important: Pyodide takes time to initialize.
Initialization completion is indicated by a red border around the Run all button.
In the previous posts, I talked about converting a regular grammar to a
deterministic regular grammar (DRG) equivalent to a DFA and mentioned that
it was one of the ways to ensure that there is no exponential worst-case
for regular expression matching. However, this is not the only way to
avoid exponential worst-case due to backtracking in regular expressions.
A possible solution is the Thompson algorithm described here.
The idea is that, rather than backtracking, for any given NFA,
track all possible threads in parallel. This raises a question. Is the number
of such parallel threads bounded?
A regular expression of size n (excluding parentheses) can be represented
by an NFA with n states. Furthermore, any given parsing thread carries no
information other than the current node. That is, for an NFA with n states,
one never needs to track more than n parallel threads while parsing.
I recently found a rather elegant and tiny implementation of this in Python
here.
This is an attempt to document my understanding of this code.
As before, we start with the prerequisite imports.
Available Packages
These are packages that refer either to my previous posts or to pure Python
packages that I have compiled, and they are available at the locations below.
As before, install them if you need to run the program directly on your machine.
To install, simply download the wheel file (`pkg.whl`) and install using
`pip install pkg.whl`.
Here is the Python implementation from here
slightly modified to look similar to my post
on simple combinatory parsing. I have also named the anonymous lambdas to
make the code easier to understand.
The basic idea is to first construct the NFA processing graph, then let the
string processing take place through the graph. Each node (state in NFA) in
the graph requires no information beyond which nodes to transition to on a
given input token in order to accept or reject the remaining string. This is
leveraged in the match function below, where, based on the current input
token, all the states are queried to identify the next states. The key
property of the NFA simulation is that we need to maintain at most n states
in our processing list at any time, where n is the number of nodes in the
NFA. That is, for each input symbol, we have at most a constant number of
states to query.
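As a minimal illustration of this bounded simulation, here is a sketch with a hand-built transition table for the regular expression `a(b|c)*` (my own toy example, not the code discussed below):

```python
# A hand-built NFA for a(b|c)*: state 0 is the start, state 1 is accepting.
delta = {(0, 'a'): {1}, (1, 'b'): {1}, (1, 'c'): {1}}
accepting = {1}

def simulate(s):
    states = {0}  # the set of live parallel threads
    for c in s:
        # query every live state; the set can never exceed the NFA size
        states = set().union(*[delta.get((q, c), set()) for q in states])
    return bool(states & accepting)
```

No matter how long the input is, `states` never holds more than the two states of this NFA.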
The most important thing to understand about this algorithm for NFA graph
construction is that we start the construction from the end, that is, the
accepting state. Then, as each node is constructed, the new node is linked
to the next nodes in processing. The second most important thing to
understand about this algorithm for regular expression matching is that each
node represents the starting state for the remaining string. That is, we can
query the next nodes to check whether the current remaining string will be
accepted or rejected.
Match
Next, we build our NFA graph.
Lit
Let us take the simplest case: A literal such as a.
We represent the literal by a function that accepts the character, and returns
back a state in the NFA. The idea is that this NFA can accept or reject the
remaining string. So, it needs a continuation, which is given as the next
state. The NFA will continue with the next state only if the parsing of the
current symbol succeeds. So we have an inner function parse that does that.
An accepting state is just a sentinel, and we define it with Lit.
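Here is a minimal sketch of this construction (a reconstruction of the idea rather than the exact code: in particular, I let each constructor return a set of start states, and use a plain sentinel value for the accepting state):

```python
ACCEPT = 'ACCEPT'  # the accepting state: a plain sentinel

def Lit(char):
    def node(succs):
        # succs: the states to continue with after this literal
        def parse(c):
            # consume one token; continue only on a match
            return succs if c == char else set()
        return {parse}  # the start states of this NFA fragment
    return node

def match(rex, s):
    states = rex({ACCEPT})  # connect the NFA to the accepting sentinel
    for c in s:
        states = set().union(*[st(c) for st in states if st is not ACCEPT])
    return ACCEPT in states
```

With this, `match(Lit('a'), 'a')` holds, while `'b'` and the empty string are rejected.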
Epsilon
An epsilon matches nothing. That is, it passes any string it receives
without modification to the next state.
(It could also be written as `lambda s: s`.)
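Assuming a representation in which a regular expression is a function from the set of continuation states to a set of start states (my reconstruction), Epsilon is just the identity:

```python
def Epsilon():
    def node(succs):
        return succs  # pass the continuation through unchanged
    return node
```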
AndThen
AndThen is a meta construction that concatenates two regular expressions.
It accepts the given string if the given regular expressions (rex1, rex2)
accept the given string when placed end to end.
The rnfa is the node to connect to after nodes represented by
rex1 -> rex2.
That is, the final NFA should look like [rex1]->[rex2]->rnfa.
Note: I use a [] to represent a node, and no [] when I specify the
remaining NFA. For example, [xxx] represents the node in the NFA, while
xxx is the NFA starting with the node xxx
From the perspective of the node [rex1], it can accept the remaining string
if its own matching succeeds on the given string and the remaining string can
be matched by the NFA rex2(rnfa), i.e., the NFA produced by calling rex2
with the argument rnfa. That is, [rex1]->rex2(rnfa), which is the same as
the NFA constructed by the call rex1(rex2(rnfa)).
The functions are expanded to make it easy to understand. The node may as well
have had rex1(rex2(rnfa)) as the return value.
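A self-contained sketch of this concatenation, under my assumed representation where each regex constructor maps a set of continuation states to its set of start states (names follow the post; the details are reconstructed):

```python
ACCEPT = 'ACCEPT'  # sentinel accepting state

def Lit(char):
    def node(succs):
        def parse(c):
            return succs if c == char else set()
        return {parse}
    return node

def AndThen(rex1, rex2):
    def node(rnfa):
        # [rex1] -> rex2(rnfa), i.e. rex1(rex2(rnfa))
        return rex1(rex2(rnfa))
    return node

def match(rex, s):
    states = rex({ACCEPT})
    for c in s:
        states = set().union(*[st(c) for st in states if st is not ACCEPT])
    return ACCEPT in states
```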
OrElse
The OrElse construction allows either one of the regular expressions to
match the given symbol. That is, we are trying to represent parallel
processing of [rex1] -> rnfa || [rex2] -> rnfa, or equivalently
rex1(rnfa) || rex2(rnfa).
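Under the same assumed set-of-states representation, the parallel alternatives become a simple set union (again my reconstruction, not necessarily the exact code):

```python
ACCEPT = 'ACCEPT'  # sentinel accepting state

def Lit(char):
    def node(succs):
        def parse(c):
            return succs if c == char else set()
        return {parse}
    return node

def OrElse(rex1, rex2):
    def node(rnfa):
        # run rex1(rnfa) and rex2(rnfa) in parallel
        return rex1(rnfa) | rex2(rnfa)
    return node

def match(rex, s):
    states = rex({ACCEPT})
    for c in s:
        states = set().union(*[st(c) for st in states if st is not ACCEPT])
    return ACCEPT in states
```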
Star
Finally, Star is defined similarly to OrElse. Note that unlike a context-free
grammar, we do not allow unrestricted recursion. We only allow tail
recursion, in the star form.
In Star, we are trying to represent
rnfa || [rex] -> [Star(rex)] -> rnfa.
Since parse is a function that closes
over the passed-in rex and rnfa, we can use it directly. That is,
Star(rex)(rnfa) == parse. Hence, this becomes
rnfa || [rex] -> parse, which is written equivalently as
rnfa || rex(parse).
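Here is a sketch of Star under my assumed set-of-states representation. Instead of a self-referential parse function, this reconstruction ties the loop by letting rex transition back into a mutable set that contains its own start states; the effect is the same rnfa || rex(parse) alternation described above:

```python
ACCEPT = 'ACCEPT'  # sentinel accepting state

def Lit(char):
    def node(succs):
        def parse(c):
            return succs if c == char else set()
        return {parse}
    return node

def Star(rex):
    def node(rnfa):
        starts = set(rnfa)     # the rnfa alternative: skip the loop entirely
        starts |= rex(starts)  # loop rex back into this same start set
        return starts
    return node

def match(rex, s):
    states = rex({ACCEPT})
    for c in s:
        states = set().union(*[st(c) for st in states if st is not ACCEPT])
    return ACCEPT in states
```

Because the start set already contains the continuation, the empty string is accepted without any special case.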
Let us test this.
In the interest of code golfing, here is how to compress it. We use the
self-application combinator mkrec, with the mkrec(lambda _: ... _(_) ...)
pattern.
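The self-application pattern on its own, with a toy factorial (this is just the generic pattern, not the golfed matcher itself):

```python
mkrec = lambda f: f(f)

# The recursive call goes through self(self): mkrec hands the lambda to itself,
# so no name for the function is ever needed.
fact = mkrec(lambda self: lambda n: 1 if n == 0 else n * self(self)(n - 1))
```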
Testing it out
Easier Regular Literals
Typing constructors such as Lit every time you want to match a single token
is not very friendly. Can we make them a bit better?
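One option is a small helper that folds a plain string into a chain of Lits. Rex is my hypothetical name for such a helper; Lit, AndThen, and match are self-contained reconstructions in a set-of-states style:

```python
from functools import reduce

ACCEPT = 'ACCEPT'  # sentinel accepting state

def Lit(char):
    def node(succs):
        def parse(c):
            return succs if c == char else set()
        return {parse}
    return node

def AndThen(rex1, rex2):
    return lambda rnfa: rex1(rex2(rnfa))

def match(rex, s):
    states = rex({ACCEPT})
    for c in s:
        states = set().union(*[st(c) for st in states if st is not ACCEPT])
    return ACCEPT in states

def Rex(s):
    # fold 'abc' into AndThen(AndThen(Lit('a'), Lit('b')), Lit('c'))
    return reduce(AndThen, [Lit(c) for c in s])
```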
Using it.
Object Based Implementation
This machinery is a bit complex to understand due to the multiple levels of
closures. I have found such constructions easier to understand if I think of
them in terms of objects. So, here is an attempt.
First, the Re class that defines the interface.
Match
The match is slightly modified to account for the new Re interface.
Lit
We separate out the construction of the object from connecting it to the
remaining NFA (trans).
An accepting node is a node that requires no input. It is a simple sentinel.
Next, we define our matching algorithm. The idea is to start with the
constructed NFA as the single thread, feed it our string, and check whether
the result contains the accepted state.
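A sketch of what this object-based core might look like (class and method names follow the text; the details are my reconstruction):

```python
class Re:
    def trans(self, rnfa):
        """Connect to the remaining NFA; return this fragment's start states."""
        raise NotImplementedError

    def parse(self, c):
        """Return the set of next states on input token c."""
        return set()

class Accept(Re):
    pass  # sentinel: consumes no input, transitions nowhere

class Lit(Re):
    def __init__(self, char):
        self.char = char

    def trans(self, rnfa):
        self.rnfa = rnfa  # remember the continuation states
        return {self}

    def parse(self, c):
        return self.rnfa if c == self.char else set()

def match(rex, s):
    accept = Accept()
    states = rex.trans({accept})  # start with the constructed NFA
    for c in s:
        states = set().union(*[st.parse(c) for st in states])
    return accept in states
```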
Let us test this.
AndThen
Let us test this.
OrElse
Next, we want to match alternations.
The important point here is that we want to
pass on the next state if either of the parses succeeds.
Let us test this.
Star
Star now becomes much easier to understand.
Let us test this.
Epsilon
We also define an epsilon expression.
Let us test this.
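Putting the pieces of the object-based version together in one self-contained sketch (again a reconstruction of what the description suggests, not the post's exact code):

```python
class Re:
    def trans(self, rnfa):
        """Connect to the remaining NFA; return this fragment's start states."""
        raise NotImplementedError

    def parse(self, c):
        """Return the set of next states on input token c."""
        return set()

class Accept(Re):
    pass  # sentinel: consumes no input, transitions nowhere

class Lit(Re):
    def __init__(self, char):
        self.char = char

    def trans(self, rnfa):
        self.rnfa = rnfa
        return {self}

    def parse(self, c):
        return self.rnfa if c == self.char else set()

class Epsilon(Re):
    def trans(self, rnfa):
        return rnfa  # pass the continuation through unchanged

class AndThen(Re):
    def __init__(self, rex1, rex2):
        self.rex1, self.rex2 = rex1, rex2

    def trans(self, rnfa):
        return self.rex1.trans(self.rex2.trans(rnfa))

class OrElse(Re):
    def __init__(self, rex1, rex2):
        self.rex1, self.rex2 = rex1, rex2

    def trans(self, rnfa):
        return self.rex1.trans(rnfa) | self.rex2.trans(rnfa)

class Star(Re):
    def __init__(self, rex):
        self.rex = rex

    def trans(self, rnfa):
        starts = set(rnfa)                # skip the loop entirely, or
        starts |= self.rex.trans(starts)  # loop rex back into this same set
        return starts

def match(rex, s):
    accept = Accept()
    states = rex.trans({accept})
    for c in s:
        states = set().union(*[st.parse(c) for st in states])
    return accept in states
```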
We can have quite complicated expressions. Again, the test suite is from
here.
Parser
What about constructing regular expression literals? For that, let us start
with a simplified grammar.
This is of course a very limited grammar that supports only the basic
regular expression operations: concatenation, alternation, and star.
We can use an Earley parser to parse any given regular expression.
Let us define the basic machinery. The parse function parses
the regexp string into an AST, and to_re converts the AST
into the regexp structure.
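The post relies on the Earley parser from the earlier posts for this step. As a self-contained stand-in, here is a sketch that hand-parses the same simplified grammar with recursive descent and builds the matcher directly from set-of-states combinators (all of this is my reconstruction, not the post's code):

```python
ACCEPT = 'ACCEPT'  # sentinel accepting state

def Lit(char):
    def node(succs):
        def parse(c):
            return succs if c == char else set()
        return {parse}
    return node

def AndThen(rex1, rex2):
    return lambda succs: rex1(rex2(succs))

def OrElse(rex1, rex2):
    return lambda succs: rex1(succs) | rex2(succs)

def Star(rex):
    def node(succs):
        starts = set(succs)
        starts |= rex(starts)  # loop back into this same start set
        return starts
    return node

def match(rex, s):
    states = rex({ACCEPT})
    for c in s:
        states = set().union(*[st(c) for st in states if st is not ACCEPT])
    return ACCEPT in states

def parse_regex(s):
    # grammar: regex -> cex ('|' regex)? ; cex -> uex+ ;
    #          uex -> unit '*'? ; unit -> alpha | '(' regex ')'
    pos = 0
    def peek():
        return s[pos] if pos < len(s) else None
    def regex():
        nonlocal pos
        r = cex()
        if peek() == '|':
            pos += 1
            return OrElse(r, regex())
        return r
    def cex():
        r = uex()
        while peek() is not None and peek() not in '|)':
            r = AndThen(r, uex())
        return r
    def uex():
        nonlocal pos
        r = unit()
        if peek() == '*':
            pos += 1
            return Star(r)
        return r
    def unit():
        nonlocal pos
        if peek() == '(':
            pos += 1
            r = regex()
            assert peek() == ')', 'unbalanced parenthesis'
            pos += 1
            return r
        c = s[pos]
        pos += 1
        return Lit(c)
    return regex()
```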
The unit expression may be an alpha or a parenthesized expression.
The alpha gets converted to a Lit.
Check that it has worked.
Next, we write the exp and cex conversions. A cex gets turned into an AndThen.
Check that it has worked.
Next, we write the regex, which gets converted to an OrElse.
Check that it has worked.
Finally, the Star.
Check that it has worked.
Wrapping everything up.
Check that it has worked.
The runnable code for this post is available
here.
Artifacts
The runnable Python source for this notebook is available here.