TLDR; This tutorial is a complete implementation of GLL Parser in Python including SPPF parse tree extraction 1. The Python interpreter is embedded so that you can work through the implementation steps.

A GLL parser is a generalization of LL parsers. The first generalized LL parser was reported by Grune and Jacob 2 (11.2) from a masters thesis report in 1993 (another possibly earlier paper looking at generalized LL parsing is from Lang in 1974 3 and another from Bouckaert et al. 4). However, a better known generalization of LL parsing was described by Scott and Johnstone 5. This post follows the later parsing technique. In this post, I provide a complete implementation and a tutorial on how to implement a GLL parser in Python.

We previously discussed Earley parser which is a general context-free parser. GLL parser is another general context-free parser that is capable of parsing strings that conform to any given context-free grammar. The algorithm is a generalization of the traditional recursive descent parsing style. In traditional recursive descent parsing, the programmer uses the call stack for keeping track of the parse context. This approach, however, fails when there is left recursion. The problem is that recursive descent parsers cannot advance the parsed index as it is not immediately clear how many recursions are required to parse a given string. Bounding of recursion as we discussed before is a reasonable solution. However, it is very inefficient.

GLL parsing offers a solution. The basic idea behind GLL parsing is to maintain the call stack programmatically, which allows us to iteratively deepen the parse for any nonterminal at any given point. This combined with sharing of the stack (GSS) and generation of parse forest (SPPF) makes the GLL parsing very efficient. Furthermore, unlike Earley, CYK, and GLR parsers, GLL parser operates by producing a custom parser for a given grammar. This means that one can actually debug the recursive descent parsing program directly. Hence, using GLL can be much more friendly to the practitioner.

Similar to Earley, GLR, CYK, and other general context-free parsers, the worst case for parsing is \(O(n^3)\) . However, for LL(1) grammars, the parse time is \(O(n)\) .

Synopsis

import gllparser as P
my_grammar = {'<start>': [['1', '<A>'],
                          ['2']
                         ],
              '<A>'    : [['a']]}
my_parser = P.compile_grammar(my_grammar)
for tree in my_parser.parse_on(text='1a', start_symbol='<start>'):
    print(P.format_parsetree(tree))

Contents

  1. Synopsis
  2. Definitions
    1. Prerequisites
  3. Traditional Recursive Descent
  4. Naive Threaded Recognizer
  5. The GSS Graph
    1. The GSS Node
    2. The GSS container
    3. GLL+GSS add_thread (add)
    4. GLL+GSS register_return (create)
    5. GLL+GSS fn_return (pop)
  6. GLL Parser
  7. Utilities.
    1. Symbols in the grammar
    2. First, Follow, Nullable sets
    3. First of a rule fragment.
  8. SPPF Graph
    1. SPPF Node
    2. SPPF Dummy Node
    3. SPPF Symbol Node
    4. SPPF Intermediate Node
    5. SPPF Packed Node
  9. The GLL parser
    1. GLL+GSS+SPPF add_thread (add)
    2. GLL+GSS+SPPF fn_return (pop)
    3. GLL+GSS+SPPF register_return (create)
    4. GLL+GSS+SPPF utilities.
  10. Building the parser with GLL
    1. Compiling an empty rule
    2. Compiling a Terminal Symbol
    3. Compiling a Nonterminal Symbol
    4. Compiling a Rule
    5. Compiling a Definition
    6. Compiling a Grammar
  11. Running it
  12. SPPF Parse Forest
    1. ChoiceNode
    2. EnhancedExtractor
  13. Running it
    1. 2
    2. 3
    3. 4
    4. 5
  14. Expression
    1. 1
    2. 2
  15. A few more examples
  16. Artifacts

Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

Definitions

For this post, we use the following terms:

  • The alphabet is the set all of symbols in the input language. For example, in this post, we use all ASCII characters as alphabet.

  • A terminal is a single alphabet symbol. Note that this is slightly different from usual definitions (done here for ease of parsing). (Usually a terminal is a contiguous sequence of symbols from the alphabet. However, both kinds of grammars have a one to one correspondence, and can be converted easily.)

    For example, x is a terminal symbol.

  • A nonterminal is a symbol outside the alphabet whose expansion is defined in the grammar using rules for expansion.

    For example, <term> is a nonterminal in the below grammar.

  • A rule is a finite sequence of terms (two types of terms: terminals and nonterminals) that describe an expansion of a given terminal. A rule is also called an alternative expansion.

    For example, [<term>+<expr>] is one of the expansion rules of the nonterminal <expr>.

  • A definition is a set of rules that describe the expansion of a given nonterminal.

    For example, [[<digit>,<digits>],[<digit>]] is the definition of the nonterminal <digits>

  • A context-free grammar is composed of a set of nonterminals and corresponding definitions that define the structure of the nonterminal.

    The grammar given below is an example context-free grammar.

  • A terminal derives a string if the string contains only the symbols in the terminal. A nonterminal derives a string if the corresponding definition derives the string. A definition derives the string if one of the rules in the definition derives the string. A rule derives a string if the sequence of terms that make up the rule can derive the string, deriving one substring after another contiguously (also called parsing).

  • A derivation tree is an ordered tree that describes how an input string is derived by the given start symbol. Also called a parse tree.
  • A derivation tree can be collapsed into its string equivalent. Such a string can be parsed again by the nonterminal at the root node of the derivation tree such that at least one of the resulting derivation trees would be the same as the one we started with.

  • The yield of a tree is the string resulting from collapsing that tree.

  • An epsilon rule matches an empty string.

Prerequisites

As before, we start with the prerequisite imports.

Available Packages These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.
  1. simplefuzzer-0.0.1-py2.py3-none-any.whl from "The simplest grammar fuzzer in the world".
  2. earleyparser-0.0.1-py2.py3-none-any.whl from "Earley Parser".

We need the fuzzer to generate inputs to parse and also to provide some utilities



We use the display_tree() method in earley parser for displaying trees.



We use the random choice to extract derivation trees from the parse forest.



As before, we use the fuzzingbook grammar style. Here is an example grammar for arithmetic expressions, starting at <start>. A terminal symbol has exactly one character (Note that we disallow empty string ('') as a terminal symbol). Secondly, as per traditional implementations, there can only be one expansion rule for the <start> symbol. We work around this restriction by simply constructing as many charts as there are expansion rules, and returning all parse trees.

Traditional Recursive Descent

Consider how you will parse a string that conforms to the following grammar



In traditional recursive descent, we write a parser in the following fashion



Using it



What if there is recursion? Here is another grammar with recursion



In traditional recursive descent, we write a parser in the following fashion



Using it



The problem happens when there is a left recursion. For example, the following grammar contains a left recursion even though it recognizes the same language as before.



Naive Threaded Recognizer

The problem with left recursion is that in traditional recursive descent style, we are forced to follow a depth first exploration, completing the parse of one entire rule before attempting then next rule. We can work around this by managing the call stack ourselves. The idea is to convert each procedure into a case label, save the previous label in the stack (managed by us) before a sub procedure. When the exploration is finished, we pop the previous label off the stack, and continue where we left off.



We also need a way to hold the call stack. The call stack is actually stored as a linked list with the current stack_top on the top. With multiple alternatives being explored together, we actually have a tree, but the leaf nodes only know about their parent (not the reverse). For convenience, we use a wrapper for the call-stack, where we define a few book keeping functions. First the initialization of the call stack.



Adding a thread simply means appending the label, current stack top, and current parse index to the threads. We can also retrieve threads.



Next, we define how returns are handed. That is, before exploring a new sub procedure, we have to save the return label in the stack, which is handled by register_return(). The current stack top is added as a child of the return label.



When we have finished exploring a given procedure, we return back to the original position in the stack by poping off the prvious label.



Using it.



This unfortunately has a problem. The issue is that, when a string does not parse, the recursion along with the epsilon rule means that there is always a thread that keeps spawning new threads.



The GSS Graph

The way to solve it is to use something called a graph-structured stack 6. A naive conversion of recursive descent parsing to generalized recursive descent parsing can be done by maintaining independent stacks for each thread. However, this approach is has problems as we saw previously, when it comes to left recursion. The GSS converts the tree structured stack to a graph.

The GSS Node

A GSS node is simply a node that can contain any number of children. Each child is actually an edge in the graph.

(Each GSS Node is of the form \(L_i^j\) where \(j\) is the index of the character consumed. However, we do not need to know the internals of the label here).



The GSS container

Next, we define the graph container. We keep two structures. self.graph which is the shared stack, and self.P which is the set of labels that went through a fn_return, i.e. pop operation.



A wrapper for book keeping functions.



GLL+GSS add_thread (add)

Our add_thread increases a bit in complexity. We now check if a thread already exists before starting a new thread.



next_thread is same as before



GLL+GSS register_return (create)

A major change in this method. We now look for pre-existing edges before appending edges (child nodes).



GLL+GSS fn_return (pop)

A small change in fn_return. We now save all parsed indexes at every label when the parse is complete.



With GSS, we finally have a true GLL recognizer. Here is the same recognizer unmodified, except for checking the parse ending. Here, we check whether the start symbol is completely parsed only when the threads are complete.



Using it.



GLL Parser

A recognizer is of limited utility. We need the parse tree if we are to use it in practice. Hence, We will now see how to convert this recognizer to a parser.

Utilities.

We start with a few utilities.

Symbols in the grammar

Here, we extract all terminal and nonterminal symbols in the grammar.



Using it



First, Follow, Nullable sets

To optimize GLL parsing, we need the First, Follow, and Nullable sets. (Note we do not use this at present)

Here is a nullable grammar.



The definition is as follows.



Using



First of a rule fragment.

(Note we do not use this at present) We need to compute the expected first character of a rule suffix.



To verify, we define an expression grammar.



SPPF Graph

We use a data-structure called Shared Packed Parse Forest to represent the parse forest. We cannot simply use a parse tree because there may be multiple possible derivations of the same input string (possibly even an infinite number of them). The basic idea here is that multiple derivations (even an infinite number of derivations) can be represented as links in the graph.

The SPPF graph contains four kinds of nodes. The dummy node represents an empty node, and is the simplest. The symbol node represents the parse of a nonterminal symbol within a given extent (i, j). Since there can be multiple derivations for a nonterminal symbol, each derivation is represented by a packed node, which is the third kind of node. Another kind of node is the intermediate node. An intermediate node represents a partially parsed rule, containing a prefix rule and a suffix rule. As in the case of symbol nodes, there can be many derivations for a rule fragment. Hence, an intermediate node can also contain multiple packed nodes. A packed node in turn can contain symbol, intermediate, or dummy nodes.

SPPF Node



SPPF Dummy Node

The dummy SPPF node is used to indicate the empty node at the end of rules.



SPPF Symbol Node

j and i are the extents. Each symbol can contain multiple packed nodes each representing a different derivation. See getNodeP Note. In the presence of ambiguous parsing, we choose a derivation at random. So, run the to_tree() multiple times to get all parse trees. If you want a better solution, see the forest generation in earley parser which can be adapted here too.



SPPF Intermediate Node

Has only two children max (or 1 child).



SPPF Packed Node



The GLL parser

We can now build our GLL parser. All procedures change to include SPPF nodes. We first define our initialization



GLL+GSS+SPPF add_thread (add)



GLL+GSS+SPPF fn_return (pop)



GLL+GSS+SPPF register_return (create)



GLL+GSS+SPPF utilities.



We also need the to produce SPPF nodes correctly.

getNode(x, i) creates and returns an SPPF node labeled (x, i, i+1) or (epsilon, i, i) if x is epsilon

getNodeP(X::= alpha.beta, w, z) takes a grammar slot X ::= alpha . beta and two SPPF nodes w, and z (z may be dummy node $). the nodes w and z are not packed nodes, and will have labels of form (s, j, k) and (r, k, i)



We can now use all these to generate trees.



We need trees



Using it



Building the parser with GLL

At this point, we are ready to build our parser compiler.

Compiling an empty rule

We start with compiling an epsilon rule.



Using it.



Compiling a Terminal Symbol



Compiling a Nonterminal Symbol



Using it.



Compiling a Rule

n_alt is the position of rule.



Using it.



Compiling a Definition

Note that if performance is important, you may want to check if the current input symbol at parser.I[cur_idx] is part of the following, where X is a nonterminal and p is a rule fragment. Note that if you care about the performance, you will want to pre-compute first[p] for each rule fragment rule[j:] in the grammar, and first and follow sets for each symbol in the grammar. This should be checked before parser.add_thread.



Given that removing this check does not affect the correctness of the algorithm, I have chosen not to add it.



Using it.



A template.



Compiling a Grammar



Using it



Running it



SPPF Parse Forest

Previously, we examined how to extract a single parse tree. However, this in insufficient in many cases. Given that context-free grammars can contain ambiguity we want to extract all possible parse trees. To do that, we need to keep track of all choices we make.

ChoiceNode

The ChoiceNode is a node in a linked list of choices. The idea is that whenever there is a choice between exploring different derivations, we pick the first candidate, and make a note of that choice. Then, during further explorations of the child nodes, if more choices are necessary, those choices are marked in nodes linked from the current node.

  • _chosen contains the current choice
  • next holds the next choice done using _chosen
  • total holds he total number of choices for this node.


The chosen() returns the current candidate.



A ChoiceNode has exhausted its choices if the current candidate chosen does not exist.



At the end of generation of a single tree, we increment the candidate number in the last node in the linked list. Then, we check if the last node can provide another candidate to explore. If the node has not exhausted its candidates, then we have nothing more to do. However, if the node has exhausted its candidates, we look for possible candidates in its parent.



EnhancedExtractor

The EnhancedExtractor classes uses the choice linkedlist to explore possible parses.



Whenever there is a choice to be made, we look at the current node in the choices linked list. If a previous iteration has exhausted all candidates, we have nothing left. In that case, we simply return None, and the updated linkedlist



While extracting, we have a choice. Should we allow infinite forests, or should we have a finite number of trees with no direct recursion? A direct recursion is when there exists a parent node with the same nonterminal that parsed the same span. We choose here not to extract such trees. They can be added back after parsing.

This is a recursive procedure that inspects a node, extracts the path required to complete that node. A single path (corresponding to a nonterminal) may again be composed of a sequence of smaller paths. Such paths are again extracted using another call to extract_a_node() recursively.

What happens when we hit on one of the node recursions we want to avoid? In that case, we return the current choice node, which bubbles up to extract_a_tree(). That procedure increments the last choice, which in turn increments up the parents until we reach a choice node that still has options to explore.

What if we hit the end of choices for a particular choice node (i.e, we have exhausted paths that can be taken from a node)? In this case also, we return the current choice node, which bubbles up to extract_a_tree(). That procedure increments the last choice, which bubbles up to the next choice that has some unexplored paths.



The extract_a_tree extracts one parse tree at a time, and keeps track of the choices.



Running it



A larger example



A few more examples

2



3



4



5



Expression

1



2

Since we use a random choice to get different parse trees in ambiguity, click the run button again and again to get different parses.



A few more examples



We assign format parse tree so that we can refer to it from this module



Note: There is a bug in the SPPF EnhancedExtractor as of now (thanks Michael)

gamma_2 = { "<S>": [ ["<S>", "<S>", "<S>"], ["<S>", "<S>"], ["s"],] }
ee = EnhancedExtractor(forest)
ee.extract_a_tree()

will extract the wrong tree for ssss. I have not solved the issue so far. If you find the issue, please drop me a note.

Note: There is now (2024) a reference implementation for GLL from the authors. It is available at https://github.com/AJohnstone2007/referenceImplementation. Unfortunately, I did not have access to this when I was developing this post, which means that there might be bugs (such as above) in my code, and in case of such bugs please refer to this repository for an authoritative implementation.

Artifacts

The runnable Python source for this notebook is available here.

The installable python wheel gllparser is available here.

  1. Elizabeth Scott, Adrian Johnstone. “GLL parse-tree generation.” Science of Computer Programming 78.10 (2013): 1828-1844. 

  2. Dick Grune and Ceriel J.H. Jacobs “Parsing Techniques A Practical Guide” 2008 

  3. Bernard Lang. “Deterministic techniques for efficient non-deterministic parsers.” International Colloquium on Automata, Languages, and Programming. Springer, Berlin, Heidelberg, 1974. 

  4. M. Bouckaert, Alain Pirotte, M. Snelling. “Efficient parsing algorithms for general context-free parsers.” Information Sciences 8.1 (1975): 1-26. 

  5. Elizabeth Scott, Adrian Johnstone. “GLL parsing.” Electronic Notes in Theoretical Computer Science 253.7 (2010): 177-189. 

  6. Masaru Tomita. Efficient parsing for natural language: a fast algorithm for practical systems. Kluwer Academic Publishers, Boston, 1986.