# Efficient Matching with Regular Expressions

## Contents

**Important:** Pyodide takes time to initialize.
Initialization completion is indicated by a red border around the *Run all* button.

In the previous posts, I talked about converting a regular grammar to a deterministic regular grammar (DRG), equivalent to a DFA, and mentioned that this is one way to ensure that there is no exponential worst case for regular expression matching. However, it is not the only way to avoid the exponential worst case caused by backtracking in regular expressions. Another possible solution is the Thompson algorithm described here. The idea is that, rather than backtracking, for any given NFA, we track all possible threads in parallel. This raises a question: is the number of such parallel threads bounded?

A regular expression of size n, excluding parentheses, can be represented by an NFA with n states. Furthermore, any given parsing thread carries no information other than its current node. That is, for an NFA with n states, one never needs to track more than n parallel threads while parsing. I recently found a rather elegant and tiny implementation of this in Python here. This is an attempt to document my understanding of that code.

As before, we start with the prerequisite imports.

## Available Packages

These are packages that refer either to my previous posts or to pure Python packages that I have compiled; they are available at the locations below. As before, install them if you need to run the program directly on your machine. To install, simply download the wheel file (`pkg.whl`) and install it using `pip install pkg.whl`.

The imported modules:

## Functional Implementation

Here is the Python implementation from here, slightly modified to look similar to my post on simple combinatory parsing. I have also named the anonymous lambdas to make them easier to understand.

The basic idea is to first construct the NFA processing graph, then let the
string processing take place through the graph. Each node (a state in the NFA)
requires no more information than the nodes to transition to on
given input tokens in order to accept or reject the remaining string. This is leveraged
in the `match` function below, where, based on the current input token, all the
current states are queried to identify the next states. The main point of the NFA is
that we need to maintain at most `n` states in our processing list at any time,
where `n` is the number of nodes in the NFA. That is, for each input symbol, we
have a bounded number of states to query.

The most important thing to understand about this algorithm's NFA graph construction is that we start the construction from the end, that is, from the accepting state. Then, as each node is constructed, the new node is linked to the next nodes in processing order. The second most important thing to understand about this algorithm's regular expression matching is that each node represents the starting state for the remaining string. That is, we can query the next nodes to check whether the current remaining string will be accepted or rejected.

### Match
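The matcher can be sketched as follows. This is a minimal reconstruction under my own modeling assumptions (a state is a function from the current token to the set of next states; the `EOF`/`ACCEPT` sentinels and the end-of-input trick are illustrative, not necessarily what the original code does):

```python
EOF = None        # end-of-input marker; assumes None is never a real token
ACCEPT = 'ACCEPT' # sentinel that appears in the state set on acceptance

def accepting(c):
    # the accepting state: a sentinel state that only reacts to EOF
    return {ACCEPT} if c is EOF else set()

def match(rex, s):
    # rex(accepting) is the start state of the NFA for rex
    states = {rex(accepting)}
    for c in list(s) + [EOF]:
        # advance all live threads in parallel; keeping them in a set
        # bounds the number of threads by the number of NFA states
        states = set().union(*[state(c) for state in states])
    return ACCEPT in states
```

Because the live threads are kept in a set, duplicate threads collapse, which is what keeps the per-symbol work bounded.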

Next, we build our NFA graph.

### Lit

Let us take the simplest case: a literal such as `a`.
We represent the literal by a function that accepts the character and returns
a state in the NFA. The idea is that this NFA can accept or reject the
remaining string. So, it needs a continuation, which is given as the next
state. The NFA will continue with the next state only if the parsing of the
current symbol succeeds. So we have an inner function `parse` that does that.

An accepting state is just a sentinel, and we define it with `Lit`.
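A sketch of `Lit` along these lines (my reconstruction; a state is modeled as a function from the current token to the set of successor states, and `EOF`/`ACCEPT` are illustrative sentinels):

```python
def Lit(char):
    def node(rnfa):
        # parse is the NFA state: continue to rnfa only if parsing the
        # current symbol succeeds; otherwise this thread dies out
        def parse(c):
            return {rnfa} if c == char else set()
        return parse
    return node

# the accepting state is defined with Lit: it moves to the ACCEPT
# sentinel only when the end-of-input marker EOF is seen
EOF, ACCEPT = None, 'ACCEPT'
accepting = Lit(EOF)(ACCEPT)
```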

### Epsilon

An epsilon matches nothing. That is, it passes any string it receives,
without modification, to the next state.
(It could also be written as `lambda s: s`.)
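A sketch of epsilon in the same style (again a reconstruction, expanded into named functions for readability):

```python
def Epsilon():
    # epsilon consumes no input: connecting it in front of rnfa
    # yields rnfa itself -- the identity on NFA states
    def node(rnfa):
        return rnfa
    return node
```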

### AndThen

AndThen is a meta construction that concatenates two regular expressions.
It accepts the given string if the given regular expressions (`rex1`

, `rex2`

)
accepts the given string when placed end to end.
The `rnfa`

is the node to connect to after nodes represented by
`rex1 -> rex2`

.
That is, the final NFA should look like `[rex1]->[rex2]->rnfa`

.

Note: I use a `[]`

to represent a node, and no `[]`

when I specify the
remaining NFA. For example, `[xxx]`

represents the node in the NFA, while
`xxx`

is the NFA starting with the node `xxx`

From the perspective of the node `[rex1]`

, it can accept the remaining string
if its own matching succeeds on the given string and the remaining string can
be matched by the NFA `rex2(rnfa)`

i.e, the NFA produced by calling `rex2`

with argument `rnfa`

. That is, `[rex1]->rex2(rnfa)`

,
which is same as `rex1(rex2(rnfa))`

— the NFA constructed by calling the
function sequence `rex1(rex2(rnfa))`

.

The functions are expanded to make it easy to understand. The node may as well
have had `rex1(rex2(rnfa))`

as the return value.
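In sketch form (my reconstruction, expanded as described rather than returning `rex1(rex2(rnfa))` directly):

```python
def AndThen(rex1, rex2):
    def node(rnfa):
        # [rex1]->[rex2]->rnfa: first wire rex2 to the continuation,
        # then wire rex1 in front of the result
        nfa2 = rex2(rnfa)
        return rex1(nfa2)
    return node
```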

### OrElse

The OrElse construction allows either one of the regular expressions to
match the given symbol. That is, we are trying to represent parallel
processing of `[rex1] -> rnfa || [rex2] -> rnfa`

, or equivalently
`rex1(rnfa) || rex2(rnfa)`

.
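A possible sketch (reconstruction in the same state-as-function model):

```python
def OrElse(rex1, rex2):
    def node(rnfa):
        # both alternatives connect to the same continuation rnfa
        parse1, parse2 = rex1(rnfa), rex2(rnfa)
        def parse(c):
            # advance both alternatives in parallel and merge the threads
            return parse1(c) | parse2(c)
        return parse
    return node
```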

### Star

Finally, the Star is defined similar to OrElse. Note that unlike the context
free grammar, we do not allow unrestricted recursion. We only allow tail
recursion in the star form.
In Star, we are trying to represent
`rnfa || [rex] -> [Star(rex)] -> rnfa`

.
Since the `parse`

is a function that closes
over the passed in `rex`

and `rnfa`

, we can use that directly. That is
`Star(rex)(rnfa) == parse`

. Hence, this becomes
`rnfa || [rex] -> parse`

, which is written equivalently as
`rnfa || rex(parse)`

.
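In sketch form (my reconstruction; `parse` stands for the `[Star(rex)]` state itself):

```python
def Star(rex):
    def node(rnfa):
        # rnfa || rex(parse): either hand the token to rnfa (zero further
        # repetitions), or match rex once and loop back to this same state
        def parse(c):
            return rnfa(c) | rex(parse)(c)
        return parse
    return node
```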

Let us test this.
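Putting the pieces together, here is a self-contained version of the whole sketch with a quick check (all names and the `EOF`/`ACCEPT` sentinels are my illustrative reconstruction):

```python
EOF, ACCEPT = None, 'ACCEPT'

def Lit(char):
    def node(rnfa):
        return lambda c: {rnfa} if c == char else set()
    return node

Epsilon = lambda: lambda rnfa: rnfa
AndThen = lambda rex1, rex2: lambda rnfa: rex1(rex2(rnfa))

def OrElse(rex1, rex2):
    def node(rnfa):
        parse1, parse2 = rex1(rnfa), rex2(rnfa)
        return lambda c: parse1(c) | parse2(c)
    return node

def Star(rex):
    def node(rnfa):
        def parse(c):
            return rnfa(c) | rex(parse)(c)
        return parse
    return node

accepting = Lit(EOF)(ACCEPT)

def match(rex, s):
    states = {rex(accepting)}
    for c in list(s) + [EOF]:
        states = set().union(*[state(c) for state in states])
    return ACCEPT in states

# a(b|c)*
rex = AndThen(Lit('a'), Star(OrElse(Lit('b'), Lit('c'))))
print(match(rex, 'abcbc'), match(rex, 'a'), match(rex, 'ad'))
# True True False
```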

In the interest of code golfing, here is how to compress it. We use the
self-application combinator `mkrec`, and the `mkrec(lambda _: ... _(_) ...)`
pattern.

Testing it out.
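A sketch of what the pattern can look like for `Star` (my own reconstruction; the behaviour matches the named-recursion version above, but every name here is illustrative):

```python
mkrec = lambda f: f(f)

EOF, ACCEPT = None, 'ACCEPT'
Lit = lambda char: lambda rnfa: lambda c: {rnfa} if c == char else set()

# Star without naming the inner parse function: the lambda receives itself
# as _ and recreates the recursive reference via the self-application _(_)
Star = lambda rex: lambda rnfa: mkrec(
    lambda _: lambda c: rnfa(c) | rex(_(_))(c))

accepting = Lit(EOF)(ACCEPT)

def match(rex, s):
    states = {rex(accepting)}
    for c in list(s) + [EOF]:
        states = set().union(*[state(c) for state in states])
    return ACCEPT in states
```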

## Easier Regular Literals

Typing constructors such as `Lit` every time you want to match a single token is not very friendly. Can we make this a bit better?

Using it.
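One possible convenience (a sketch; the helper name `lit_seq` is mine) folds a plain string into a chain of `Lit`s joined with `AndThen`. Minimal versions of the two combinators are inlined so the cell stands alone:

```python
from functools import reduce

Lit = lambda char: lambda rnfa: lambda c: {rnfa} if c == char else set()
AndThen = lambda rex1, rex2: lambda rnfa: rex1(rex2(rnfa))

def lit_seq(s):
    # 'abc' becomes AndThen(Lit('a'), AndThen(Lit('b'), Lit('c')))
    return reduce(AndThen, [Lit(c) for c in s])
```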

## Object Based Implementation

This machinery is a bit complex to understand due to the multiple levels of closures. I find such constructions easier to understand when I think of them in terms of objects. So, here is an attempt. First, the `Re` class that defines the interface.
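One plausible shape for the interface (my reconstruction; the method name `trans` follows its use below for connecting a node to the remaining NFA, everything else is illustrative):

```python
class Re:
    """A regular expression node."""
    def trans(self, rnfa):
        # return the start state of the NFA for this regex, wired so
        # that on success control passes to the state rnfa
        raise NotImplementedError

class State:
    """An NFA state: advance returns the successor states on token c."""
    def advance(self, c):
        return set()
```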

### Match

The match is slightly modified to account for the new Re interface.

### Lit

We separate out the construction of the object from connecting it to the
remaining NFA (`trans`).

An accepting node is a node that requires no input. It is a simple sentinel.

Next, we define our matching algorithm. The idea is to start with the constructed NFA as the single thread, feed it our string, and check whether the result contains the accepted state. Let us test this.
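A sketch of the object-based `Lit`, the accepting sentinel, and the modified `match` (my reconstruction; the `EOF`/`ACCEPT` sentinels are illustrative):

```python
EOF, ACCEPT = None, 'ACCEPT'

class LitState:
    """An NFA node: consume the expected character, then hand over to rnfa."""
    def __init__(self, char, rnfa):
        self.char, self.rnfa = char, rnfa
    def advance(self, c):
        return {self.rnfa} if c == self.char else set()

class Lit:
    """Constructing the regex object is separated from connecting it
    to the remaining NFA, which happens in trans."""
    def __init__(self, char):
        self.char = char
    def trans(self, rnfa):
        return LitState(self.char, rnfa)

# the accepting node is a simple sentinel, reached only on end of input
accepting = LitState(EOF, ACCEPT)

def match(rex, s):
    # single starting thread: the NFA for rex wired to the accepting node
    states = {rex.trans(accepting)}
    for c in list(s) + [EOF]:
        states = set().union(*[st.advance(c) for st in states])
    return ACCEPT in states
```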

### AndThen

Let us test this.
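A sketch of the object-based concatenation (duck-typed here rather than subclassing `Re`, so the cell stands alone):

```python
class AndThen:
    """Concatenation: the final NFA is [rex1]->[rex2]->rnfa."""
    def __init__(self, rex1, rex2):
        self.rex1, self.rex2 = rex1, rex2
    def trans(self, rnfa):
        # wire rex2 to the continuation first, then rex1 in front of it
        return self.rex1.trans(self.rex2.trans(rnfa))
```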

### OrElse

Next, we want to match alternations. The important point here is that we want to pass on the next state if either of the parses succeeds.

Let us test this.
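A sketch of the object-based alternation (my reconstruction; both branches connect to the same continuation):

```python
class OrElseState:
    """Advance both alternatives in parallel and merge their threads."""
    def __init__(self, state1, state2):
        self.state1, self.state2 = state1, state2
    def advance(self, c):
        return self.state1.advance(c) | self.state2.advance(c)

class OrElse:
    """Alternation: both branches pass on to the same next state rnfa."""
    def __init__(self, rex1, rex2):
        self.rex1, self.rex2 = rex1, rex2
    def trans(self, rnfa):
        return OrElseState(self.rex1.trans(rnfa), self.rex2.trans(rnfa))
```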

### Star

Star now becomes much easier to understand.

Let us test this.
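A sketch of the object-based star (my reconstruction; the state object plays the role the recursive `parse` closure played in the functional version):

```python
class StarState:
    """rnfa || [rex] -> (this same state): either skip the starred
    expression entirely, or match rex once more and loop back here."""
    def __init__(self, rex, rnfa):
        self.rex, self.rnfa = rex, rnfa
    def advance(self, c):
        return self.rnfa.advance(c) | self.rex.trans(self).advance(c)

class Star:
    def __init__(self, rex):
        self.rex = rex
    def trans(self, rnfa):
        return StarState(self.rex, rnfa)
```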

### Epsilon

We also define an epsilon expression.

Let us test this.
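The object-based epsilon is a one-liner (sketch):

```python
class Epsilon:
    """Matches nothing: its NFA is just the continuation itself."""
    def trans(self, rnfa):
        return rnfa
```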

We can have quite complicated expressions. Again, the test suite is from here.

## Parser

What about constructing regular expressions from their textual representation? For that, let us start with a simplified grammar.
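One plausible shape for such a grammar (a sketch; the nonterminal names follow the conversions discussed below):

```
<regex> ::= <cex> '|' <regex>
          | <cex>
<cex>   ::= <exp> <cex>
          | <exp>
<exp>   ::= <unit> '*'
          | <unit>
<unit>  ::= <alpha>
          | '(' <regex> ')'
```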

This is of course a very limited grammar that supports only the basic regular expression operations: concatenation, alternation, and star. We can use an Earley parser for parsing any given regular expression.

Let us define the basic machinery. The `parse` function parses the regexp string to an AST, and `to_re` converts the AST to the regexp structure.

The unit expression may be an alpha or a parenthesized expression.

The alpha gets converted to a `Lit`.

Check that it has worked.

Next, we write the `exp` and `cex` conversions. `cex` gets turned into `AndThen`.

Check that it has worked.

Next, we write the `regex` conversion, which gets turned into `OrElse`.

Check that it has worked.

Finally, the `Star`.

Check that it has worked.

Wrapping everything up.
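As a stand-in for the Earley-based machinery (which I cannot show verbatim here), the following sketch uses a small recursive-descent parser for the same grammar: `parse` builds a tuple AST, and `to_re` converts it to the functional combinators. All names are illustrative:

```python
def parse(s):
    node, i = parse_regex(s, 0)
    assert i == len(s)
    return node

def parse_regex(s, i):
    # <regex> ::= <cex> '|' <regex> | <cex>
    lhs, i = parse_cex(s, i)
    if i < len(s) and s[i] == '|':
        rhs, i = parse_regex(s, i + 1)
        return ('or', lhs, rhs), i
    return lhs, i

def parse_cex(s, i):
    # <cex> ::= <exp> <cex> | <exp>
    lhs, i = parse_exp(s, i)
    if i < len(s) and s[i] not in '|)':
        rhs, i = parse_cex(s, i)
        return ('and', lhs, rhs), i
    return lhs, i

def parse_exp(s, i):
    # <exp> ::= <unit> '*' | <unit>
    unit, i = parse_unit(s, i)
    if i < len(s) and s[i] == '*':
        return ('star', unit), i + 1
    return unit, i

def parse_unit(s, i):
    # <unit> ::= <alpha> | '(' <regex> ')'
    if s[i] == '(':
        node, i = parse_regex(s, i + 1)
        assert s[i] == ')'
        return node, i + 1
    return ('lit', s[i]), i + 1

# minimal functional combinators in the style described earlier
EOF, ACCEPT = None, 'ACCEPT'
Lit = lambda char: lambda rnfa: lambda c: {rnfa} if c == char else set()
AndThen = lambda rex1, rex2: lambda rnfa: rex1(rex2(rnfa))

def OrElse(rex1, rex2):
    def node(rnfa):
        parse1, parse2 = rex1(rnfa), rex2(rnfa)
        return lambda c: parse1(c) | parse2(c)
    return node

def Star(rex):
    def node(rnfa):
        def loop(c):
            return rnfa(c) | rex(loop)(c)
        return loop
    return node

def to_re(node):
    # convert the AST to the regexp structure
    tag = node[0]
    if tag == 'lit': return Lit(node[1])
    if tag == 'and': return AndThen(to_re(node[1]), to_re(node[2]))
    if tag == 'or':  return OrElse(to_re(node[1]), to_re(node[2]))
    return Star(to_re(node[1]))  # 'star'

def match(rex, s):
    states = {rex(Lit(EOF)(ACCEPT))}
    for c in list(s) + [EOF]:
        states = set().union(*[state(c) for state in states])
    return ACCEPT in states
```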

check it has worked

The runnable code for this post is available here.

## Artifacts

The runnable Python source for this notebook is available here.