In the previous post, I showed how to write a simple recursive descent parser by hand – that is using a set of mutually recursive procedures. Actually, I lied when I said context-free. The common hand-written parsers are usually an encoding of a kind of grammar called Parsing Expression Grammar or PEG for short.

The difference between PEG and a CFG is that PEG does not admit ambiguity. In particular, PEG uses ordered choice in its alternatives. Due to the ordered choice, the ordering of alternatives is important.

A few interesting things about PEG:

  • We know that L(PEG) is not a subset of L(CFG) (There are languages that can be expressed with a PEG that can not be expressed with a CFG – for example, ).
  • We do not know if L(PEG) is a superset of CFL. However, given that all PEGs can be parsed in , and the best general CFG parsers can only reach due to the equivalance to boolean matrix multiplication (Valiant 1975)(Lee 2002).
  • We do know that L(PEG) is at least as large as deterministic CFL.
  • We also know that an LL(1) grammar can be interpreted either as a CFG or a PEG and it will describe the same language. Further, any LL(k) grammar can be translated to L(PEG), but reverse is not always true – it will work only if the PEG lookahead pattern can be reduced to a DFA.

The problem with what we did in the previous post is that it is a rather naive implementation. In particular, there could be a lot of backtracking, which can make the runtime explore. One solution to that is incorporating memoization. Since we start with automatic generation of parser from a grammar (unlike previously, where we explored a handwritten parser first), we will take a slightly different tack in writing the algorithm.

The idea behind a simple PEG parser is that, you try to unify the string you want to match with the corresponding key in the grammar. If the key is not present in the grammar, it is a literal, which needs to be matched with string equality. If the key is present in the grammar, get the corresponding productions (rules) for that key, and start unifying each rule one by one on the string to be matched.

def unify_key(key, text, at):
   if key not in grammar:
       return (True, at + len(key)) if text[at:].startswith(key) else (False, at)
   rules = grammar[key]
   for rule in rules:
       res, l = unify_rule(rule, text, at)
       if res: return (res, l)
   return (False, 0)

For unifying rules, the idea is similar. We take each token in the rule, and try to unify that token with the string to be matched. We rely on unify_key for doing the unification of the token. if the unification fails, we return empty handed.

def unify_rule(rule, text, at):
    for token in rule:
          result, at = unify_key(token, text, at)
          if not result: return (False, at)
    return (True, at)

When we implemented the unify_key, we made an important decision, which was that, we return as soon as a match was found. This is what distinguishes a PEG parser from a general CFG parser. In particular it means that rules have to be ordered. That is, the following grammar wont work:

ifmatch = IF (expr) THEN stmts
        | IF (expr) THEN stmts ELSE stmts

It has to be written as the following so that the rule that can match the longest string will come first.

ifmatch = IF (expr) THEN stmts ELSE stmts
        | IF (expr) THEN stmts

If we now parse an if statement without else using the above grammar, such a IF (a=b) THEN return 0, the first rule will fail, and the parse will start for the second rule again at IF. This backtracking is unnecessary as one can see that IF (a=b) THEN return 0 is already parserd by all terms in the second rule. What we want is to save the old parses so that we can simply return the already parsed result. That is,

   if seen((token, text, at)):
       return old_result

Fortunately, Python makes this easy using functools.lru_cache which provides cheap memoization to functions. Adding memoizaion, and reorganizing code, we have our PEG parser.

import sys
import functools

term_grammar = {
    'expr': [
        ['term', 'add_op', 'expr'],
    'term': [
        ['fact', 'mul_op', 'term'],
    'fact': [
    'digits': [
    'digit': [[str(i)] for i in list(range(10))],
    'add_op': [['+'], ['-']],
    'mul_op': [['*'], ['/']]

class peg_parse:
    def __init__(self, grammar):
        self.grammar = {k:[tuple(l) for l in rules] for k,rules in grammar.items()}

    def literal_match(self, part, text, tfrom):
        return (tfrom + len(part), (part, [])) if text[tfrom:].startswith(part) else (tfrom, None)

    def unify_key(self, key, text, at=0):
        if key not in self.grammar:
            return (at + len(key), (key, [])) if text[at:].startswith(key) else (at, None)
        rules = self.grammar[key]
        for rule in rules:
            l, res = self.unify_rule(rule, text, at)
            if res is not None: return l, (key, res)
        return (0, None)

    def unify_rule(self, parts, text, tfrom):
        results = []
        for part in parts:
            tfrom, res = self.unify_key(part, text, tfrom)
            if res is None: return tfrom, None
        return tfrom, results

def main(to_parse):
    result = peg_parse(term_grammar).unify_key('expr', to_parse)
    assert (len(to_parse) - result[0]) == 0

if __name__ == '__main__': main(sys.argv[1])

What we have here is only a subset of PEG grammar. A PEG grammar can contain

  • Sequence: e1 e2
  • Ordered choice: e1 / e2
  • Zero-or-more: e*
  • One-or-more: e+
  • Optional: e?
  • And-predicate: &e – match e but do not consume any input
  • Not-predicate: !e

We are yet to provide e*, e+, and e?. However, these are only conveniences. One can easily modify any PEG that uses them to use grammar rules instead. The effect of predicates on the other hand can not be easily produced. However, the lack of predicates does not change (Ford 2004) the class of languages that such grammars can match, and even without the predicates, our PEG can be useful for easily representing a large category of programs.