Recursive descent parsing with Parsing Expression (PEG) and Context Free (CFG) Grammars

PEG Recognizer
PEG Parser
Context Free Parser
Artifacts

Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

In the previous post, I showed how to write a simple recursive descent parser by hand – that is using a set of mutually recursive procedures. Actually, I lied when I said context-free. The common hand-written parsers are usually an encoding of a kind of grammar called Parsing Expression Grammar or PEG for short. The difference between PEG and a CFG is that PEG does not admit ambiguity. In particular, PEG uses ordered choice in its alternatives. Due to the ordered choice, the ordering of alternatives is important. A few interesting things about PEG:

We know that L(PEG) is not a subset of L(CFG) (There are languages that can be expressed with a PEG that can not be expressed with a CFG – for example, \(a^nb^nc^n\)).
We do not know if L(PEG) is a superset of CFL. However, given that all PEGs can be parsed in \(O(n)\), and the best general CFG parsers can only reach \(O(n^{(3-\frac{e}{3})})\) due to the equivalence to boolean matrix multiplication¹ ².
We do know that L(PEG) is at least as large as deterministic CFL.
We also know that an LL(1) grammar can be interpreted either as a CFG or a PEG and it will describe the same language. Further, any LL(k) grammar can be translated to L(PEG), but reverse is not always true – it will work only if the PEG lookahead pattern can be reduced to a DFA.

The problem with what we did in the previous post is that it is a rather naive implementation. In particular, there could be a lot of backtracking, which can make the runtime explode. One solution to that is incorporating memoization. Since we start with automatic generation of parser from a grammar (unlike previously, where we explored a handwritten parser first), we will take a slightly different tack in writing the algorithm.

PEG Recognizer

The idea behind a simple PEG recognizer is that, you try to unify the string you want to match with the corresponding key in the grammar. If the key is not present in the grammar, it is a literal, which needs to be matched with string equality. If the key is present in the grammar, get the corresponding productions (rules) for that key, and start unifying each rule one by one on the string to be matched.

For unifying rules, the idea is similar. We take each token in the rule, and try to unify that token with the string to be matched. We rely on unify_key for doing the unification of the token. if the unification fails, we return empty handed.

Let us define a grammar to test it out.

The driver:

PEG Parser

When we implemented the unify_key, we made an important decision, which was that, we return as soon as a match was found. This is what distinguishes a PEG parser from a general CFG parser. In particular it means that rules have to be ordered. That is, the following grammar wont work:

ifmatch = IF (expr) THEN stmts
        | IF (expr) THEN stmts ELSE stmts

It has to be written as the following so that the rule that can match the longest string will come first.

ifmatch = IF (expr) THEN stmts ELSE stmts
        | IF (expr) THEN stmts

If we now parse an if statement without else using the above grammar, such a IF (a=b) THEN return 0, the first rule will fail, and the parse will start for the second rule again at IF. This backtracking is unnecessary as one can see that IF (a=b) THEN return 0 is already parsed by all terms in the second rule. What we want is to save the old parses so that we can simply return the already parsed result. That is,

   if seen((token, text, at)):
       return old_result

Fortunately, Python makes this easy using functools.lru_cache which provides cheap memoization to functions. Adding memoizaion, saving results, and reorganizing code, we have our PEG parser.

The driver:

What we have here is only a subset of PEG grammar. A PEG grammar can contain

Sequence: e1 e2
Ordered choice: e1 / e2
Zero-or-more: e*
One-or-more: e+
Optional: e?
And-predicate: &e – match e but do not consume any input
Not-predicate: !e

We are yet to provide e*, e+, and e?. However, these are only conveniences. One can easily modify any PEG that uses them to use grammar rules instead. The effect of predicates on the other hand can not be easily produced. However, the lack of predicates does not change³ the class of languages that such grammars can match, and even without the predicates, our PEG can be useful for easily representing a large category of programs.

Note: This implementation will blow the stack pretty fast if we attempt to parse any expressions that are reasonably large (where some node in the derivation tree has a depth of 500) because Python provides very limited stack. One can improve the situation slightly by inlining the unify_rule().

This gets us to derivation trees with at a depth of 1000 (or more if we increase the sys.setrecursionlimit()). We can also turn this to a completely iterative solution if we simulate the stack (formal arguments, locals, return value) ourselves rather than relying on the Python stack frame.

Context Free Parser

It is fairly easy to turn this parser into a context-free grammar parser instead. The main idea is to keep a list of parse points, and advance them one at a time.

Driver

Let us define a simple viewer

Driver

The above can only work with binary trees. Here is another that can work with all trees.

Using

This implementation is quite limited in that we have lost the ability to memoize (can be added back), and can not handle left recursion. See the Earley parser for a parser without these drawbacks.

Note: I recently found a very tiny PEG parser described here.

Note: It has been five years since I wrote this post, and I have had some time to better understand the research behind PEG, CFG, LR(k), and LL(k). At this point, I would say that this parser is an implementation of the TDPL (Top Down Parsing Language) from Alexander Birman in 1970⁴. The main difference between TDPL and PEG is that PEG allows a few convenience operators such as *, +, ?, &, ! and grouping which are taken from EBNF and regular expression syntax. However, the resulting language is equivalent in recognition power to TDPL and GTDPL³, and PEGs can be reduced to GTDPL, which in turn can be reduced to TDPL.

However, more importantly, the language expressed with PEG grammars coincide with the language expressed by Context Free Grammars within the LL(1) set. That is, the algorithm expressed below should actually be considered an LL(1) parser when the grammar is LL(1). That is, any rule in a given definition can be unambiguously chosen based only on the next token in the stream. This is important because in the literature, LL(1) parsing requires computation of first sets. This parser shows you how to do LL(1) parsing without parse tables.

Artifacts

The runnable Python source for this notebook is available here.

The installable python wheel pegparser is available here.

Valiant, Leslie G. “General context-free recognition in less than cubic time.” Journal of computer and system sciences 10.2 (1975): 308-315. ↩
Lee, Lillian. “Fast context-free grammar parsing requires fast boolean matrix multiplication.” Journal of the ACM (JACM) 49.1 (2002): 1-15. ↩
Ford, Bryan. “Parsing expression grammars: a recognition-based syntactic foundation.” Proceedings of the 31st ACM SIGPLAN-SIGACT symposium on Principles of programming languages. 2004. https://pdos.csail.mit.edu/~baford/packrat/popl04/peg-popl04.pdf ↩ ↩²
Birman, Alexander (1970). The TMG Recognition Schema. ACM Digital Library (phd). Princeton University. ↩

Contents

PEG Recognizer

PEG Parser

Context Free Parser

Artifacts