Simple Parser For Context Free Languages with Forking Stacks
We previously saw how to parse simple context-free languages using combinatory parsers. In another post, I also detailed how one can write a simple PEG parser and even a context-free parser. However, these parsers are unsatisfying in one sense: they do not handle left recursion correctly.
The essential idea in those parsers is to make use of the Python stack for recursive descent style parsing. Using the language stack, however, comes with the restriction that one is limited to a depth first strategy. One cannot suspend an exploration at will and resume it later. How much does it take to implement left recursion? My post on CFG parsing hints at a way. What we need to do is to handle the stack on our own, and do a breadth first search of the possible parses.
That is, we consider multiple parses at once (as multiple threads), and prioritize the most optimistic parsing scenarios so that they get examined first.
We use the following heuristic: the threads that are farthest along in parsing (in terms of the number of characters parsed) get the highest priority. Next, among threads that are equally far along, the thread whose first item is the shallowest gets the highest priority.
Here is a simple implementation. It takes the input chars (lst) and the starting grammar. Then it extracts the most promising parse thread and explores its first element, which may produce a new set of threads representing alternative parse directions.
Our grammar is
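For concreteness, here is a small left-recursive arithmetic grammar in the dictionary format the parser can consume. This is a sketch of my own; the exact grammar used in the original notebook may differ. Nonterminals are angle-bracketed strings, and each key maps to a list of alternative expansions:

```python
# A sketch of a left-recursive arithmetic grammar (assumed format:
# nonterminals are '<...>' keys, each mapping to a list of expansions,
# and every other symbol is a terminal string).
grammar = {
    '<E>': [['<E>', '+', '<T>'], ['<T>']],
    '<T>': [['<T>', '*', '<F>'], ['<F>']],
    '<F>': [['(', '<E>', ')'], ['1']],
}
```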
The match method assumes that our queue contains the parsing threads. It extracts the topmost (current) thread and explores the possible threads that can result from it. If any thread results in a complete parse, we yield it. If not, we check whether we can continue parsing with the thread, and if we can, we add it back to the heap.
The explore() method is fairly simple. It checks whether the given element is a terminal or a nonterminal. If it is a terminal, it is checked for a match; if matched, the current parsing point is updated and the thread is returned. If it does not match, the current thread is discarded. If the given element is a nonterminal, the parsing can proceed via any of the possible expansions of the nonterminal. So, the parsing thread is split into as many new threads, the nonterminal is replaced with a particular expansion in each thread, and the new threads are returned.
The driver would be
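The pieces described above can be sketched as follows. This is my own minimal reconstruction rather than the notebook's exact code: the thread representation (a position plus a stack of symbols), the priority tuple (negated position first, then stack depth), and the names explore, match, and parse are all assumptions on my part.

```python
import heapq

def is_nonterminal(sym):
    # assumption: nonterminals are written as '<...>'
    return sym.startswith('<') and sym.endswith('>')

def explore(grammar, lst, pos, stack):
    # expand the first element of the thread's stack
    sym, rest = stack[0], stack[1:]
    if not is_nonterminal(sym):
        # terminal: try to match it against the input at pos
        if lst[pos:pos + len(sym)] == sym:
            return [(pos + len(sym), rest)]
        return []  # mismatch: this thread dies
    # nonterminal: fork one thread per alternative expansion
    return [(pos, alt + rest) for alt in grammar[sym]]

def match(grammar, lst, start):
    # heap entries: (-chars parsed, stack depth, tie-breaker, stack).
    # Negating the position makes heapq act as a max-heap on progress;
    # the counter keeps heapq from ever comparing two stacks.
    counter = 0
    heap = [(0, 1, counter, [start])]
    while heap:
        neg_pos, _depth, _, stack = heapq.heappop(heap)
        pos = -neg_pos
        if not stack:
            if pos == len(lst):
                yield pos  # a complete parse of the whole input
            continue
        for new_pos, new_stack in explore(grammar, lst, pos, stack):
            counter += 1
            heapq.heappush(
                heap, (-new_pos, len(new_stack), counter, new_stack))

def parse(lst, grammar, start):
    # driver: succeed on the first complete parse
    for _ in match(grammar, lst, start):
        return True
    return False
```

With a left-recursive grammar such as {'&lt;E&gt;': [['&lt;E&gt;', '+', '&lt;T&gt;'], ['&lt;T&gt;']], '&lt;T&gt;': [['1']]}, parse('1+1', grammar, '&lt;E&gt;') succeeds; note, though, that this version exhibits exactly the problem discussed next: on an unparseable prefix it never runs out of threads.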
While this can certainly handle left recursion, there is a new problem. The issue is that, in the case of left recursion combined with an incomplete prefix, the threads simply multiply without any means of actually parsing the input.
That is, given the usual grammar:
and the input
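In the same dictionary format, the usual left-recursive grammar and the problematic input might look like the sketch below. The names are mine, and the input '1+' is the incomplete prefix discussed just after this:

```python
# the usual left-recursive expression grammar (sketch)
E_grammar = {
    '<E>': [['<E>', '+', '<T>'], ['<T>']],
    '<T>': [['1']],
}
# an incomplete prefix: a '1', a '+', and then nothing to finish with
my_input = '1+'
```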
<E> will keep getting expanded again and again, generating new threads, so the parser will never terminate.
So, what we need is a way to discard invalid potential parses immediately. In our case, we can see that once we have reached the end of 1+, where there are no more characters to parse, we can no longer accept an expansion that contains even a single nonempty terminal symbol. We can turn this restriction into code as below:
First, we initialize the cost; this is a global variable for the purpose of this post.
That is, we find the minimum expansion length of each key and store it beforehand.
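The minimum expansion length can be computed as a fixpoint over the grammar. The sketch below is my own; the function name compute_cost and the exact computation in the notebook are assumptions:

```python
def compute_cost(grammar):
    # Minimum number of terminal characters each nonterminal must
    # eventually produce, computed as a fixpoint: start at infinity
    # and lower each key's cost until nothing changes.
    cost = {k: float('inf') for k in grammar}
    changed = True
    while changed:
        changed = False
        for key, alternatives in grammar.items():
            for alt in alternatives:
                # terminals contribute their length; nonterminals their cost
                c = sum(cost[t] if t in grammar else len(t) for t in alt)
                if c < cost[key]:
                    cost[key] = c
                    changed = True
    return cost
```

For the grammar {'&lt;E&gt;': [['&lt;E&gt;', '+', '&lt;T&gt;'], ['&lt;T&gt;']], '&lt;T&gt;': [['1']]}, both keys get a minimum expansion length of 1.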
Next, we update our explore so that if the minimum expansion length of any of the potential threads is larger than the number of characters remaining, that thread is discarded.
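Putting it together, a pruned explore drops any forked thread whose minimum remaining cost exceeds the unparsed suffix. The sketch below is self-contained so the pruning check can be seen in context; it restates the match and explore described above with names of my own choosing, so it is a reconstruction rather than the notebook's code:

```python
import heapq

def is_nonterminal(sym):
    # assumption: nonterminals are written as '<...>'
    return sym.startswith('<') and sym.endswith('>')

def min_cost(grammar):
    # fixpoint: minimum terminal characters each nonterminal must produce
    cost = {k: float('inf') for k in grammar}
    changed = True
    while changed:
        changed = False
        for key, alts in grammar.items():
            for alt in alts:
                c = sum(cost[t] if t in grammar else len(t) for t in alt)
                if c < cost[key]:
                    cost[key], changed = c, True
    return cost

def thread_cost(cost, stack):
    # minimum number of characters this stack must still consume
    return sum(cost[s] if s in cost else len(s) for s in stack)

def explore(grammar, cost, lst, pos, stack):
    sym, rest = stack[0], stack[1:]
    if not is_nonterminal(sym):
        if lst[pos:pos + len(sym)] == sym:
            return [(pos + len(sym), rest)]
        return []
    threads = [(pos, alt + rest) for alt in grammar[sym]]
    # prune: discard any thread that must consume more than what remains
    return [(p, s) for (p, s) in threads
            if thread_cost(cost, s) <= len(lst) - p]

def match(grammar, lst, start):
    cost = min_cost(grammar)
    counter, heap = 0, [(0, 1, 0, [start])]
    while heap:
        neg_pos, _depth, _, stack = heapq.heappop(heap)
        pos = -neg_pos
        if not stack:
            if pos == len(lst):
                yield pos
            continue
        for new_pos, new_stack in explore(grammar, cost, lst, pos, stack):
            counter += 1
            heapq.heappush(
                heap, (-new_pos, len(new_stack), counter, new_stack))

def parse(lst, grammar, start):
    for _ in match(grammar, lst, start):
        return True
    return False
```

With the pruning in place, the incomplete prefix '1+' now exhausts the heap and parse terminates with False, while complete inputs such as '1+1' still parse.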
With this, we are now ready to parse any context-free language. Using the driver above:
As you can see, we can successfully use left recursive grammars. Note that this is still a recognizer. Turning it into a parser is not very difficult, and may be handled in a future post.
The complete code is available here.
I implemented it to scratch an itch, without first checking the literature for similar parsing techniques. However, now that I have implemented it, this technique seems similar to GLL. While my implementation is inefficient, a few avenues of optimization, such as standard memoization (packrat) techniques and a GSS (fairly easy to implement, in that it is about maintaining the rule structure as a tree with common prefixes), can help the situation.
The runnable Python source for this notebook is available here.