Previously, we had discussed how a simple PEG parser, and a CFG parser can be constructed. At that time, I had mentioned that left-recursion was still to be implemented. Here is one way to implement left recursion correctly for the CFG parser.
For ease of reference, here was our original parser.
Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.
This will of course fail when given a grammar such as below:
The problem here is that
<E> is left recursive. So, a naive implementation such as above does not know when to stop recursing when
it tries to unify
<E>. The solution here is to look for any means to identify that the recursion has gone on longer than necessary.
Here is one such technique. The idea is to look at the minimum number of characters necessary to complete parsing if we use a
particular rule. If the rule requires more characters than what is present in the input string, then we know not to use that rule.
First, we compute the minimum length of each nonterminal statically. The minimum length of a nonterminal is defined as the length of minimum number of terminal symbols needed to satisfy that nonterminal from the grammar. The length of a terminal is obviously the length of that terminal symbol. From this definition, the minimum length of a nonterminal is the minimum of the minimum lengths of
any of the rules corresponding to it. The minimum length of a rule is the sum of the minimum lengths of each of its token symbols. If the same symbol is encountered again while computing, we return
(Another way to think about the minimum length is as the length of the minimal string produced when the grammar is used as a producer starting from the given nonterminal. The reason for
infinity for recursion becomes clear — the producer cannot terminate.).
The length of multiple keys can be computed as follows
Now, all it remains is to intelligently stop parsing whenever the minimum length of the remaining parts to parse becomes larger than the length of the remaining text.
It correctly accepts valid strings
and correctly rejects invalid ones
Note that our implementation relies on there being a minimal length. What if there are empty string derivations? Unfortunately, our parser can fail in these scenarios:
The issue is empty strings causing the minimal length to be zero. So, we are unable to make progress. One option is to completely remove empty strings from the grammar. While that is a better option than refactoring out left recursion, it is a bit unsatisfying. Is there a better way?
Another option is to look for a different stopping condition. The idea is that in left recursion, the left recursions always have to make progress (the non-progress-making left recursions can be generated from the single progress making left recursion, so we can discard the non-progress-making left recursions). That means that one would never have more number of recursions of the same key than there are remaining letters. Here is an attempt to implement this stopping condition.
This seems to work. However, one question remains unanswered. One could in principle use the second stopping condition on its own, without the first one. So, why use the first stopping condition at all? The reason is that, in my experiments at least, the second condition is much more expensive than the first So, we only use the second when we absolutely need to.
A few more stopping conditions can be applied. For example, if one has a hashmap at each character position for the nonterminals applied (starting) at that position, then
- No single nonterminal may be applied more times than the remaining length of the string
- If a single nonterminal has been applied multiple times consecutively, then the number of such applications can be removed from the remaining string length.
- Similarly, if a set of nonterminals are applied again and again, with no other symbols outside the set in the repetitions, then the number of such groups can be removed from the remaining string length.
Can we apply the same technique on a PEG parser? Here is one implementation
Note that restricting the recursion using input length has been well known from the sixities1. The latest research by Frost et. al 2 suggests a limit of
m * (1 + |s|) where
m is the number of nonterminals in the grammar and
|s| is the length of input.