# Recursive descent parsing with Parsing Expression Grammars (PEG)

In the previous post, I showed how to write a simple recursive descent parser by hand – that is using a set of mutually recursive procedures. Actually, I lied when I said context-free. The common hand-written parsers are usually an encoding of a kind of grammar called *Parsing Expression Grammar* or *PEG* for short.

The difference between *PEG* and a *CFG* is that *PEG* does not admit ambiguity. In particular, *PEG* uses ordered choice in its alternatives. Due to the ordered choice, the ordering of alternatives is important.

A few interesting things about *PEG*:

- We know that
*L(PEG)*is not a subset of*L(CFG)*(There are languages that can be expressed with a*PEG*that can not be expressed with a*CFG*– for example, ). - We do not know if
*L(PEG)*is a superset of*CFL*. However, given that all PEGs can be parsed in , and the best general*CFG*parsers can only reach due to the equivalance to boolean matrix multiplication (Valiant 1975)(Lee 2002). - We do know that
*L(PEG)*is at least as large as deterministic*CFL*. - We also know that an
*LL(1)*grammar can be interpreted either as a*CFG*or a*PEG*and it will describe the same language. Further, any*LL(k)*grammar can be translated to*L(PEG)*, but reverse is not always true – it will work only if the PEG lookahead pattern can be reduced to a DFA.

The problem with what we did in the previous post is that it is a rather naive implementation. In particular, there could be a lot of backtracking, which can make the runtime explore. One solution to that is incorporating memoization. Since we start with automatic generation of parser from a grammar (unlike previously, where we explored a handwritten parser first), we will take a slightly different tack in writing the algorithm.

The idea behind a simple *PEG* parser is that, you try to unify the string you want to match with the corresponding key in the grammar. If the key is not present in the grammar, it is a literal, which needs to be matched with string equality.
If the key is present in the grammar, get the corresponding productions (rules) for that key, and start unifying each rule one by one on the string to be matched.

```
def unify_key(key, text, at):
if key not in grammar:
return (True, at + len(key)) if text[at:].startswith(key) else (False, at)
rules = grammar[key]
for rule in rules:
res, l = unify_rule(rule, text, at)
if res: return (res, l)
return (False, 0)
```

For unifying rules, the idea is similar. We take each token in the rule, and try to unify that token with the string to be matched. We rely on `unify_key`

for doing the unification of the token. if the unification fails, we return empty handed.

```
def unify_rule(rule, text, at):
for token in rule:
result, at = unify_key(token, text, at)
if not result: return (False, at)
return (True, at)
```

When we implemented the `unify_key`

, we made an important decision, which was that, we return as soon as a match was found. This is what distinguishes a `PEG`

parser from a general `CFG`

parser. In particular it means that rules have to be ordered.
That is, the following grammar wont work:

```
ifmatch = IF (expr) THEN stmts
| IF (expr) THEN stmts ELSE stmts
```

It has to be written as the following so that the rule that can match the longest string will come first.

```
ifmatch = IF (expr) THEN stmts ELSE stmts
| IF (expr) THEN stmts
```

If we now parse an `if`

statement without else using the above grammar, such a `IF (a=b) THEN return 0`

, the first rule will fail, and the parse will start for the second rule again at `IF`

. This backtracking is unnecessary as one can see that `IF (a=b) THEN return 0`

is already parserd by all terms in the second rule. What we want is to save the old parses so that we can simply return the already parsed result. That is,

```
if seen((token, text, at)):
return old_result
```

Fortunately, Python makes this easy using `functools.lru_cache`

which provides cheap memoization to functions. Adding memoizaion, and reorganizing code, we have our *PEG* parser.

```
import sys
import functools
term_grammar = {
'expr': [
['term', 'add_op', 'expr'],
['term']],
'term': [
['fact', 'mul_op', 'term'],
['fact']],
'fact': [
['digits'],
['(','expr',')']],
'digits': [
['digit','digits'],
['digit']],
'digit': [[str(i)] for i in list(range(10))],
'add_op': [['+'], ['-']],
'mul_op': [['*'], ['/']]
}
class peg_parse:
def __init__(self, grammar):
self.grammar = {k:[tuple(l) for l in rules] for k,rules in grammar.items()}
def literal_match(self, part, text, tfrom):
return (tfrom + len(part), (part, [])) if text[tfrom:].startswith(part) else (tfrom, None)
@functools.lru_cache(maxsize=None)
def unify_key(self, key, text, tfrom=0):
if key not in self.grammar: return self.literal_match(key, text, tfrom)
rules = self.grammar[key]
rets = (self.unify_line(rule, text, tfrom) for rule in rules)
tfrom, res = next((ret for ret in rets if ret[1] is not None), (tfrom, None))
return (tfrom, (key, res) if res is not None else None)
@functools.lru_cache(maxsize=None)
def unify_line(self, parts, text, tfrom):
results = []
for part in parts:
tfrom, res = self.unify_key(part, text, tfrom)
if res is None: return tfrom, None
results.append(res)
return tfrom, results
def main(to_parse):
result = peg_parse(term_grammar).unify_key('expr', to_parse)
assert (len(to_parse) - result[0]) == 0
print(result[1])
if __name__ == '__main__': main(sys.argv[1])
```

What we have here is only a subset of *PEG* grammar. A *PEG* grammar can contain

- Sequence: e1 e2
- Ordered choice: e1 / e2
- Zero-or-more: e*
- One-or-more: e+
- Optional: e?
- And-predicate: &e – match
`e`

but do not consume any input - Not-predicate: !e

We are yet to provide *e**, *e+*, and *e?*. However, these are only conveniences. One can easily modify any *PEG* that uses them to use grammar rules instead. The effect of predicates on the other hand can not be easily produced. However, the lack of predicates does not change (Ford 2004) the class of languages that such grammars can match, and even without the predicates, our *PEG* can be useful for easily representing a large category of programs.