Important: Pyodide takes time to initialize.
Initialization completion is indicated by a red border around the Run all button.
In the previous posts, I talked about converting a regular grammar to a
deterministic regular grammar (DRG) equivalent to a DFA and mentioned that
it was one of the ways to ensure that there is no exponential worst-case
for regular expression matching. However, this is not the only way to
avoid exponential worst-case due to backtracking in regular expressions.
A possible solution is the Thompson algorithm described here.
The idea is that, rather than backtracking, for any given NFA,
track all possible threads in parallel. This raises a question. Is the number
of such parallel threads bounded?
A regular expression of size n (excluding parentheses) can be represented
by an NFA with n states. Furthermore, any given parsing thread carries no
information other than the current node. That is, for an NFA with n states,
one never needs to track more than n parallel threads while parsing.
I recently found a rather elegant and tiny implementation of this in Python
here.
This is an attempt to document my understanding of this code.
As before, we start with the prerequisite imports.
Available Packages
These are packages that refer either to my previous posts or to pure Python
packages that I have compiled, and they are available at the locations below.
As before, install them if you need to run the program directly on your machine.
To install, simply download the wheel file (`pkg.whl`) and install using
`pip install pkg.whl`.
Here is the Python implementation from here
slightly modified to look similar to my post
on simple combinatory parsing. I have also named the anonymous lambdas to
make the code easier to understand.
The basic idea is to first construct the NFA processing graph, then let the
string processing take place through the graph. Each node (state in NFA) in
the graph requires no information beyond which nodes to transition to on a
given input token in order to accept or reject the remaining string. This is
leveraged in the match function below, where, based on the current input
token, all the states are queried to identify the next states. The key
property of the NFA simulation is that we need to maintain at most n states
in our processing list at any time, where n is the number of nodes in the
NFA. That is, for each input symbol, we have at most a constant number of
states to query.
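As a minimal illustration of this bounded simulation, here is a sketch with a hand-built transition table for the regular expression `a(b|c)*` (my own toy example, not the code discussed below):

```python
# A hand-built NFA for a(b|c)*: state 0 is the start, state 1 is accepting.
delta = {(0, 'a'): {1}, (1, 'b'): {1}, (1, 'c'): {1}}
accepting = {1}

def simulate(s):
    states = {0}  # the set of live parallel threads
    for c in s:
        # query every live state; the set can never exceed the NFA size
        states = set().union(*[delta.get((q, c), set()) for q in states])
    return bool(states & accepting)
```

No matter how long the input is, `states` never holds more than the two states of this NFA.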
The most important thing to understand about this algorithm for NFA graph
construction is that we start the construction from the end, that is, the
accepting state. Then, as each node is constructed, the new node is linked
to the next nodes in processing. The second most important thing to
understand about this algorithm for regular expression matching is that each
node represents the starting state for the remaining string. That is, we can
query the next nodes to check whether the current remaining string will be
accepted or rejected.
Match
Next, we build our NFA graph.
Lit
Let us take the simplest case: A literal such as a.
We represent the literal by a function that accepts the character, and returns
back a state in the NFA. The idea is that this NFA can accept or reject the
remaining string. So, it needs a continuation, which is given as the next
state. The NFA will continue with the next state only if the parsing of the
current symbol succeeds. So we have an inner function parse that does that.
An accepting state is just a sentinel, and we define it with Lit.
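Here is a minimal sketch of this construction (a reconstruction of the idea rather than the exact code: in particular, I let each constructor return a set of start states, and use a plain sentinel value for the accepting state):

```python
ACCEPT = 'ACCEPT'  # the accepting state: a plain sentinel

def Lit(char):
    def node(succs):
        # succs: the states to continue with after this literal
        def parse(c):
            # consume one token; continue only on a match
            return succs if c == char else set()
        return {parse}  # the start states of this NFA fragment
    return node

def match(rex, s):
    states = rex({ACCEPT})  # connect the NFA to the accepting sentinel
    for c in s:
        states = set().union(*[st(c) for st in states if st is not ACCEPT])
    return ACCEPT in states
```

With this, `match(Lit('a'), 'a')` holds, while `'b'` and the empty string are rejected.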
Epsilon
An epsilon matches nothing. That is, it passes any string it receives
without modification to the next state.
(It could also be written as `lambda s: s`.)
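Assuming a representation in which a regular expression is a function from the set of continuation states to a set of start states (my reconstruction), Epsilon is just the identity:

```python
def Epsilon():
    def node(succs):
        return succs  # pass the continuation through unchanged
    return node
```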
AndThen
AndThen is a meta construction that concatenates two regular expressions.
It accepts the given string if the given regular expressions (rex1, rex2)
accept the given string when placed end to end.
The rnfa is the node to connect to after nodes represented by
rex1 -> rex2.
That is, the final NFA should look like [rex1]->[rex2]->rnfa.
Note: I use a [] to represent a node, and no [] when I specify the
remaining NFA. For example, [xxx] represents the node in the NFA, while
xxx is the NFA starting with the node xxx
From the perspective of the node [rex1], it can accept the remaining string
if its own matching succeeds on the given string and the remaining string can
be matched by the NFA rex2(rnfa), i.e., the NFA produced by calling rex2
with the argument rnfa. That is, [rex1]->rex2(rnfa), which is the same as
the NFA constructed by the call rex1(rex2(rnfa)).
The functions are expanded to make it easy to understand. The node may as well
have had rex1(rex2(rnfa)) as the return value.
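A self-contained sketch of this concatenation, under my assumed representation where each regex constructor maps a set of continuation states to its set of start states (names follow the post; the details are reconstructed):

```python
ACCEPT = 'ACCEPT'  # sentinel accepting state

def Lit(char):
    def node(succs):
        def parse(c):
            return succs if c == char else set()
        return {parse}
    return node

def AndThen(rex1, rex2):
    def node(rnfa):
        # [rex1] -> rex2(rnfa), i.e. rex1(rex2(rnfa))
        return rex1(rex2(rnfa))
    return node

def match(rex, s):
    states = rex({ACCEPT})
    for c in s:
        states = set().union(*[st(c) for st in states if st is not ACCEPT])
    return ACCEPT in states
```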
OrElse
The OrElse construction allows either one of the regular expressions to
match the given symbol. That is, we are trying to represent parallel
processing of [rex1] -> rnfa || [rex2] -> rnfa, or equivalently
rex1(rnfa) || rex2(rnfa).
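Under the same assumed set-of-states representation, the parallel alternatives become a simple set union (again my reconstruction, not necessarily the exact code):

```python
ACCEPT = 'ACCEPT'  # sentinel accepting state

def Lit(char):
    def node(succs):
        def parse(c):
            return succs if c == char else set()
        return {parse}
    return node

def OrElse(rex1, rex2):
    def node(rnfa):
        # run rex1(rnfa) and rex2(rnfa) in parallel
        return rex1(rnfa) | rex2(rnfa)
    return node

def match(rex, s):
    states = rex({ACCEPT})
    for c in s:
        states = set().union(*[st(c) for st in states if st is not ACCEPT])
    return ACCEPT in states
```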
Star
Finally, Star is defined similarly to OrElse. Note that unlike a context-free
grammar, we do not allow unrestricted recursion. We only allow tail
recursion, in the star form.
In Star, we are trying to represent
rnfa || [rex] -> [Star(rex)] -> rnfa.
Since parse is a function that closes
over the passed-in rex and rnfa, we can use it directly. That is,
Star(rex)(rnfa) == parse. Hence, this becomes
rnfa || [rex] -> parse, which is written equivalently as
rnfa || rex(parse).
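Here is a sketch of Star under my assumed set-of-states representation. Instead of a self-referential parse function, this reconstruction ties the loop by letting rex transition back into a mutable set that contains its own start states; the effect is the same rnfa || rex(parse) alternation described above:

```python
ACCEPT = 'ACCEPT'  # sentinel accepting state

def Lit(char):
    def node(succs):
        def parse(c):
            return succs if c == char else set()
        return {parse}
    return node

def Star(rex):
    def node(rnfa):
        starts = set(rnfa)     # the rnfa alternative: skip the loop entirely
        starts |= rex(starts)  # loop rex back into this same start set
        return starts
    return node

def match(rex, s):
    states = rex({ACCEPT})
    for c in s:
        states = set().union(*[st(c) for st in states if st is not ACCEPT])
    return ACCEPT in states
```

Because the start set already contains the continuation, the empty string is accepted without any special case.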
Let us test this.
In the interest of code golfing, here is how to compress it. We use the
self-application combinator mkrec, with the mkrec(lambda _: ... _(_) ...)
pattern.
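The self-application pattern on its own, with a toy factorial (this is just the generic pattern, not the golfed matcher itself):

```python
mkrec = lambda f: f(f)

# The recursive call goes through self(self): mkrec hands the lambda to itself,
# so no name for the function is ever needed.
fact = mkrec(lambda self: lambda n: 1 if n == 0 else n * self(self)(n - 1))
```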
Testing it out
Easier Regular Literals
Typing constructors such as Lit every time you want to match a single token
is not very friendly. Can we make them a bit better?
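One option is a small helper that folds a plain string into a chain of Lits. Rex is my hypothetical name for such a helper; Lit, AndThen, and match are self-contained reconstructions in a set-of-states style:

```python
from functools import reduce

ACCEPT = 'ACCEPT'  # sentinel accepting state

def Lit(char):
    def node(succs):
        def parse(c):
            return succs if c == char else set()
        return {parse}
    return node

def AndThen(rex1, rex2):
    return lambda rnfa: rex1(rex2(rnfa))

def match(rex, s):
    states = rex({ACCEPT})
    for c in s:
        states = set().union(*[st(c) for st in states if st is not ACCEPT])
    return ACCEPT in states

def Rex(s):
    # fold 'abc' into AndThen(AndThen(Lit('a'), Lit('b')), Lit('c'))
    return reduce(AndThen, [Lit(c) for c in s])
```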
Using it.
Object Based Implementation
This machinery is a bit complex to understand due to the multiple levels of
closures. I have found such constructions easier to understand if I think of
them in terms of objects. So, here is an attempt.
First, the Re class that defines the interface.
Match
The match is slightly modified to account for the new Re interface.
Lit
We separate out the construction of the object from connecting it to the
remaining NFA (trans).
An accepting node is a node that requires no input. It is a simple sentinel.
Next, we define our matching algorithm. The idea is to start with the
constructed NFA as the single thread, feed it our string, and check whether
the result contains the accepted state.
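A sketch of what this object-based core might look like (class and method names follow the text; the details are my reconstruction):

```python
class Re:
    def trans(self, rnfa):
        """Connect to the remaining NFA; return this fragment's start states."""
        raise NotImplementedError

    def parse(self, c):
        """Return the set of next states on input token c."""
        return set()

class Accept(Re):
    pass  # sentinel: consumes no input, transitions nowhere

class Lit(Re):
    def __init__(self, char):
        self.char = char

    def trans(self, rnfa):
        self.rnfa = rnfa  # remember the continuation states
        return {self}

    def parse(self, c):
        return self.rnfa if c == self.char else set()

def match(rex, s):
    accept = Accept()
    states = rex.trans({accept})  # start with the constructed NFA
    for c in s:
        states = set().union(*[st.parse(c) for st in states])
    return accept in states
```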
Let us test this.
AndThen
Let us test this.
OrElse
Next, we want to match alternations.
The important point here is that we want to
pass on the next state if either of the parses succeeds.
Let us test this.
Star
Star now becomes much easier to understand.
Let us test this.
Epsilon
We also define an epsilon expression.
Let us test this.
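Putting the pieces of the object-based version together in one self-contained sketch (again a reconstruction of what the description suggests, not the post's exact code):

```python
class Re:
    def trans(self, rnfa):
        """Connect to the remaining NFA; return this fragment's start states."""
        raise NotImplementedError

    def parse(self, c):
        """Return the set of next states on input token c."""
        return set()

class Accept(Re):
    pass  # sentinel: consumes no input, transitions nowhere

class Lit(Re):
    def __init__(self, char):
        self.char = char

    def trans(self, rnfa):
        self.rnfa = rnfa
        return {self}

    def parse(self, c):
        return self.rnfa if c == self.char else set()

class Epsilon(Re):
    def trans(self, rnfa):
        return rnfa  # pass the continuation through unchanged

class AndThen(Re):
    def __init__(self, rex1, rex2):
        self.rex1, self.rex2 = rex1, rex2

    def trans(self, rnfa):
        return self.rex1.trans(self.rex2.trans(rnfa))

class OrElse(Re):
    def __init__(self, rex1, rex2):
        self.rex1, self.rex2 = rex1, rex2

    def trans(self, rnfa):
        return self.rex1.trans(rnfa) | self.rex2.trans(rnfa)

class Star(Re):
    def __init__(self, rex):
        self.rex = rex

    def trans(self, rnfa):
        starts = set(rnfa)                # skip the loop entirely, or
        starts |= self.rex.trans(starts)  # loop rex back into this same set
        return starts

def match(rex, s):
    accept = Accept()
    states = rex.trans({accept})
    for c in s:
        states = set().union(*[st.parse(c) for st in states])
    return accept in states
```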
We can have quite complicated expressions. Again, the test suite is from
here.
Parser
What about constructing regular expression literals? For that, let us start
with a simplified grammar.
This is of course a very limited grammar that supports only the basic
regular expression operations: concatenation, alternation, and star.
We can use an Earley parser to parse any given regular expression.
Let us define the basic machinery. The parse function parses
the regexp string into an AST, and to_re converts the AST
into the regexp structure.
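The post relies on the Earley parser from the earlier posts for this step. As a self-contained stand-in, here is a sketch that hand-parses the same simplified grammar with recursive descent and builds the matcher directly from set-of-states combinators (all of this is my reconstruction, not the post's code):

```python
ACCEPT = 'ACCEPT'  # sentinel accepting state

def Lit(char):
    def node(succs):
        def parse(c):
            return succs if c == char else set()
        return {parse}
    return node

def AndThen(rex1, rex2):
    return lambda succs: rex1(rex2(succs))

def OrElse(rex1, rex2):
    return lambda succs: rex1(succs) | rex2(succs)

def Star(rex):
    def node(succs):
        starts = set(succs)
        starts |= rex(starts)  # loop back into this same start set
        return starts
    return node

def match(rex, s):
    states = rex({ACCEPT})
    for c in s:
        states = set().union(*[st(c) for st in states if st is not ACCEPT])
    return ACCEPT in states

def parse_regex(s):
    # grammar: regex -> cex ('|' regex)? ; cex -> uex+ ;
    #          uex -> unit '*'? ; unit -> alpha | '(' regex ')'
    pos = 0
    def peek():
        return s[pos] if pos < len(s) else None
    def regex():
        nonlocal pos
        r = cex()
        if peek() == '|':
            pos += 1
            return OrElse(r, regex())
        return r
    def cex():
        r = uex()
        while peek() is not None and peek() not in '|)':
            r = AndThen(r, uex())
        return r
    def uex():
        nonlocal pos
        r = unit()
        if peek() == '*':
            pos += 1
            return Star(r)
        return r
    def unit():
        nonlocal pos
        if peek() == '(':
            pos += 1
            r = regex()
            assert peek() == ')', 'unbalanced parenthesis'
            pos += 1
            return r
        c = s[pos]
        pos += 1
        return Lit(c)
    return regex()
```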
The unit expression may be an alpha or a parenthesized expression.
The alpha gets converted to a Lit.
Check that it has worked.
Next, we write the exp and cex conversions. A cex gets turned into an AndThen.
Check that it has worked.
Next, we write the regex, which gets converted to an OrElse.
Check that it has worked.
Finally, the Star.
Check that it has worked.
Wrapping everything up.
Check that it has worked.
The runnable code for this post is available
here.
Artifacts
The runnable Python source for this notebook is available here.