TLDR; This tutorial is a complete implementation of GLL Parser in Python
including SPPF parse tree extraction ^{1}.
The Python interpreter is embedded so that you can work through the
implementation steps.
A GLL parser is a generalization of LL parsers. The first generalized LL
parser was reported by Grune and Jacob ^{2} (11.2) from a
masters thesis report in 1993 (another possibly earlier paper looking at
generalized LL parsing is from Lang in 1974 ^{3} and
another from Bouckaert et al. ^{4}).
However, a better known generalization
of LL parsing was described by Scott and Johnstone ^{5}. This
post follows the later parsing technique.
In this post, I provide a complete
implementation and a tutorial on how to implement a GLL parser in Python.
We previously discussed
Earley parser which is a general context-free parser. GLL
parser is another general context-free parser that is capable of parsing
strings that conform to any given context-free grammar.
The algorithm is a generalization of the traditional recursive descent parsing
style. In traditional recursive descent parsing, the programmer uses the
call stack for keeping track of the parse context. This approach, however,
fails when there is left recursion. The problem is that recursive
descent parsers cannot advance the parsed index as it is not immediately
clear how many recursions are required to parse a given string. Bounding
of recursion as we discussed before
is a reasonable solution. However, it is very inefficient.
GLL parsing offers a solution. The basic idea behind GLL parsing is to
maintain the call stack programmatically, which allows us to iteratively
deepen the parse for any nonterminal at any given point. This combined with
sharing of the stack (GSS) and generation of parse forest (SPPF) makes the
GLL parsing very efficient. Furthermore, unlike Earley, CYK, and GLR parsers,
GLL parser operates by producing a custom parser for a given grammar. This
means that one can actually debug the recursive descent parsing program
directly. Hence, using GLL can be much more friendly to the practitioner.
Similar to Earley, GLR, CYK, and other general context-free parsers, the worst
case for parsing is \(O(n^3)\) . However, for LL(1) grammars, the parse time
is \(O(n)\) .
Important:Pyodide takes time to initialize.
Initialization completion is indicated by a red border around Run all button.
Definitions
For this post, we use the following terms:
The alphabet is the set all of symbols in the input language. For example,
in this post, we use all ASCII characters as alphabet.
A terminal is a single alphabet symbol. Note that this is slightly different
from usual definitions (done here for ease of parsing). (Usually a terminal is
a contiguous sequence of symbols from the alphabet. However, both kinds of
grammars have a one to one correspondence, and can be converted easily.)
For example, x is a terminal symbol.
A nonterminal is a symbol outside the alphabet whose expansion is defined
in the grammar using rules for expansion.
For example, <term> is a nonterminal in the below grammar.
A rule is a finite sequence of terms (two types of terms: terminals and
nonterminals) that describe an expansion of a given terminal. A rule is
also called an alternative expansion.
For example, [<term>+<expr>] is one of the expansion rules of the nonterminal <expr>.
A definition is a set of rules that describe the expansion of a given nonterminal.
For example, [[<digit>,<digits>],[<digit>]] is the definition of the nonterminal <digits>
A context-free grammar is composed of a set of nonterminals and
corresponding definitions that define the structure of the nonterminal.
The grammar given below is an example context-free grammar.
A terminal derives a string if the string contains only the symbols in the
terminal. A nonterminal derives a string if the corresponding definition
derives the string. A definition derives the string if one of the rules in
the definition derives the string. A rule derives a string if the sequence
of terms that make up the rule can derive the string, deriving one substring
after another contiguously (also called parsing).
A derivation tree is an ordered tree that describes how an input string is
derived by the given start symbol. Also called a parse tree.
A derivation tree can be collapsed into its string equivalent. Such a string
can be parsed again by the nonterminal at the root node of the derivation
tree such that at least one of the resulting derivation trees would be the
same as the one we started with.
The yield of a tree is the string resulting from collapsing that tree.
An epsilon rule matches an empty string.
Prerequisites
As before, we start with the prerequisite imports.
Available Packages
These are packages that refer either to my previous posts or to pure python
packages that I have compiled, and is available in the below locations. As
before, install them if you need to run the program directly on the machine.
To install, simply download the wheel file (`pkg.whl`) and install using
`pip install pkg.whl`.
We need the fuzzer to generate inputs to parse and also to provide some
utilities
We use the display_tree() method in earley parser for displaying trees.
We use the random choice to extract derivation trees from the parse forest.
As before, we use the fuzzingbook grammar style.
Here is an example grammar for arithmetic expressions, starting at <start>.
A terminal symbol has exactly one character
(Note that we disallow empty string ('') as a terminal symbol).
Secondly, as per traditional implementations,
there can only be one expansion rule for the <start> symbol. We work around
this restriction by simply constructing as many charts as there are expansion
rules, and returning all parse trees.
Traditional Recursive Descent
Consider how you will parse a string that conforms to the following grammar
In traditional recursive descent, we write a parser in the following fashion
Using it
What if there is recursion? Here is another grammar with recursion
In traditional recursive descent, we write a parser in the following fashion
Using it
The problem happens when there is a left recursion. For example, the following
grammar contains a left recursion even though it recognizes the same language
as before.
Naive Threaded Recognizer
The problem with left recursion is that in traditional recursive descent
style, we are forced to follow a depth first exploration, completing the
parse of one entire rule before attempting then next rule. We can work around
this by managing the call stack ourselves. The idea is to convert each
procedure into a case label, save the previous label in the stack
(managed by us) before a sub procedure. When the exploration
is finished, we pop the previous label off the stack, and continue where we
left off.
We also need a way to hold the call stack. The call stack is actually stored
as a linked list with the current stack_top on the top. With multiple
alternatives being explored together, we actually have a tree, but
the leaf nodes only know about their parent (not the reverse).
For convenience, we use a wrapper for the call-stack, where we define a few
book keeping functions. First the initialization of the call stack.
Adding a thread simply means appending the label, current stack top, and
current parse index to the threads. We can also retrieve threads.
Next, we define how returns are handed. That is, before exploring a new
sub procedure, we have to save the return label in the stack, which
is handled by register_return(). The current stack top is added as a child
of the return label.
When we have finished exploring a given procedure, we return back to the
original position in the stack by poping off the prvious label.
Using it.
This unfortunately has a problem. The issue is that, when a string does not
parse, the recursion along with the epsilon rule means that there is always a
thread that keeps spawning new threads.
The GSS Graph
The way to solve it is to use something called a graph-structured stack^{6}.
A naive conversion of recursive descent parsing to generalized recursive
descent parsing can be done by maintaining independent stacks for each thread.
However, this approach is has problems as we saw previously, when it comes to
left recursion. The GSS converts the tree structured stack to a graph.
The GSS Node
A GSS node is simply a node that can contain any number of children. Each
child is actually an edge in the graph.
(Each GSS Node is of the form \(L_i^j\) where \(j\) is the index of the
character consumed. However, we do not need to know the internals of the label
here).
The GSS container
Next, we define the graph container. We keep two structures. self.graph
which is the shared stack, and self.P which is the set of labels that went
through a fn_return, i.e. pop operation.
A wrapper for book keeping functions.
GLL+GSS add_thread (add)
Our add_thread increases a bit in complexity. We now check if a thread already
exists before starting a new thread.
next_thread is same as before
GLL+GSS register_return (create)
A major change in this method. We now look for pre-existing
edges before appending edges (child nodes).
GLL+GSS fn_return (pop)
A small change in fn_return. We now save all parsed indexes at
every label when the parse is complete.
With GSS, we finally have a true GLL recognizer.
Here is the same recognizer unmodified, except for checking the parse
ending. Here, we check whether the start symbol is completely parsed
only when the threads are complete.
Using it.
GLL Parser
A recognizer is of limited utility. We need the parse tree if we are to
use it in practice. Hence, We will now see how to convert this recognizer to a
parser.
Utilities.
We start with a few utilities.
Symbols in the grammar
Here, we extract all terminal and nonterminal symbols in the grammar.
(Note we do not use this at present)
We need to compute the expected first character of a rule suffix.
To verify, we define an expression grammar.
SPPF Graph
We use a data-structure called Shared Packed Parse Forest to represent
the parse forest. We cannot simply use a parse tree because there may be
multiple possible derivations of the same input string (possibly even an
infinite number of them). The basic idea here is that multiple derivations
(even an infinite number of derivations) can be represented as links in the
graph.
The SPPF graph contains four kinds of nodes. The dummy node represents an
empty node, and is the simplest. The symbol node represents the parse of a
nonterminal symbol within a given extent (i, j).
Since there can be multiple derivations for a nonterminal
symbol, each derivation is represented by a packed node, which is the third
kind of node. Another kind of node is the intermediate node. An intermediate
node represents a partially parsed rule, containing a prefix rule and a suffix
rule. As in the case of symbol nodes, there can be many derivations for a rule
fragment. Hence, an intermediate node can also contain multiple packed nodes.
A packed node in turn can contain symbol, intermediate, or dummy nodes.
SPPF Node
SPPF Dummy Node
The dummy SPPF node is used to indicate the empty node at the end of rules.
SPPF Symbol Node
j and i are the extents.
Each symbol can contain multiple packed nodes each
representing a different derivation. See getNodeP
Note. In the presence of ambiguous parsing, we choose a derivation
at random. So, run the to_tree() multiple times to get all parse
trees. If you want a better solution, see the
forest generation in earley parser
which can be adapted here too.
SPPF Intermediate Node
Has only two children max (or 1 child).
SPPF Packed Node
The GLL parser
We can now build our GLL parser. All procedures change to include SPPF nodes.
We first define our initialization
GLL+GSS+SPPF add_thread (add)
GLL+GSS+SPPF fn_return (pop)
GLL+GSS+SPPF register_return (create)
GLL+GSS+SPPF utilities.
We also need the to produce SPPF nodes correctly.
getNode(x, i) creates and returns an SPPF node labeled (x, i, i+1) or
(epsilon, i, i) if x is epsilon
getNodeP(X::= alpha.beta, w, z) takes a grammar slot X ::= alpha . beta
and two SPPF nodes w, and z (z may be dummy node $).
the nodes w and z are not packed nodes, and will have labels of form
(s, j, k) and (r, k, i)
We can now use all these to generate trees.
We need trees
Using it
Building the parser with GLL
At this point, we are ready to build our parser compiler.
Compiling an empty rule
We start with compiling an epsilon rule.
Using it.
Compiling a Terminal Symbol
Compiling a Nonterminal Symbol
Using it.
Compiling a Rule
n_alt is the position of rule.
Using it.
Compiling a Definition
Note that if performance is important, you may want to check if the current
input symbol at parser.I[cur_idx] is part of the following, where X is a
nonterminal and p is a rule fragment. Note that if you care about the
performance, you will want to pre-compute first[p] for each rule fragment
rule[j:] in the grammar, and first and follow sets for each symbol in the
grammar. This should be checked before parser.add_thread.
Given that removing this check does not affect the correctness of the
algorithm, I have chosen not to add it.
Using it.
A template.
Compiling a Grammar
Using it
Running it
SPPF Parse Forest
Previously, we examined how to extract a single parse tree. However, this in
insufficient in many cases. Given that context-free grammars can contain
ambiguity we want to extract all possible parse trees. To do that, we need to
keep track of all choices we make.
ChoiceNode
The ChoiceNode is a node in a linked list of choices. The idea is that
whenever there is a choice between exploring different derivations, we pick
the first candidate, and make a note of that choice. Then, during further
explorations of the child nodes, if more choices are necessary, those choices
are marked in nodes linked from the current node.
_chosen contains the current choice
next holds the next choice done using _chosen
total holds he total number of choices for this node.
The chosen() returns the current candidate.
A ChoiceNode has exhausted its choices if the current candidate chosen
does not exist.
At the end of generation of a single tree, we increment the candidate number
in the last node in the linked list. Then, we check if the last node can
provide another candidate to explore. If the node has not exhausted its
candidates, then we have nothing more to do. However, if the node has
exhausted its candidates, we look for possible candidates in its parent.
EnhancedExtractor
The EnhancedExtractor classes uses the choice linkedlist to explore possible
parses.
Whenever there is a choice to be made, we look at the current node in the
choices linked list. If a previous iteration has exhausted all candidates,
we have nothing left. In that case, we simply return None, and the updated
linkedlist
While extracting, we have a choice. Should we allow infinite forests, or
should we have a finite number of trees with no direct recursion? A direct
recursion is when there exists a parent node with the same nonterminal that
parsed the same span. We choose here not to extract such trees. They can be
added back after parsing.
This is a recursive procedure that inspects a node, extracts the path required
to complete that node. A single path (corresponding to a nonterminal) may
again be composed of a sequence of smaller paths. Such paths are again
extracted using another call to extract_a_node() recursively.
What happens when we hit on one of the node recursions we want to avoid? In
that case, we return the current choice node, which bubbles up to
extract_a_tree(). That procedure increments the last choice, which in turn
increments up the parents until we reach a choice node that still has options
to explore.
What if we hit the end of choices for a particular choice node (i.e, we have
exhausted paths that can be taken from a node)? In this case also, we return
the current choice node, which bubbles up to extract_a_tree().
That procedure increments the last choice, which bubbles up to the next choice
that has some unexplored paths.
The extract_a_tree extracts one parse tree at a time, and keeps track of the choices.
Running it
A larger example
A few more examples
2
3
4
5
Expression
1
2
Since we use a random choice to get different parse trees in ambiguity, click
the run button again and again to get different parses.
A few more examples
We assign format parse tree so that we can refer to it from this module
Note: There is a bug in the SPPF EnhancedExtractor as of now (thanks Michael)
will extract the wrong tree for ssss. I have not solved the issue so far. If you find the issue, please drop me a note.
Note: There is now (2024) a reference implementation for GLL from the authors. It is available at https://github.com/AJohnstone2007/referenceImplementation. Unfortunately, I did not have access to this when I was developing this post, which means that there might be bugs (such as above) in my code, and in case of such bugs please refer to this repository for an authoritative implementation.
Artifacts
The runnable Python source for this notebook is available here.
The installable python wheel gllparser is available here.
Elizabeth Scott, Adrian Johnstone. “GLL parse-tree generation.” Science of Computer Programming 78.10 (2013): 1828-1844. ↩
Dick Grune and Ceriel J.H. Jacobs “Parsing Techniques A Practical Guide” 2008 ↩
Bernard Lang. “Deterministic techniques for efficient non-deterministic parsers.” International Colloquium on Automata, Languages, and Programming. Springer, Berlin, Heidelberg, 1974. ↩
M. Bouckaert, Alain Pirotte, M. Snelling. “Efficient parsing algorithms for general context-free parsers.” Information Sciences 8.1 (1975): 1-26. ↩
Elizabeth Scott, Adrian Johnstone. “GLL parsing.” Electronic Notes in Theoretical Computer Science 253.7 (2010): 177-189. ↩
Masaru Tomita. Efficient parsing for natural language: a fast algorithm for practical systems. Kluwer Academic Publishers, Boston, 1986. ↩