Earley Parser
Contents
 Synopsis
 Definitions
 Summary
 The Column Data Structure
 The State Data Structure
 The Basic Parser Interface
 The Chart Parser
 Derivation trees
 Ambiguous Parsing
 Almost Infinite Parse Trees
 Leo Optimizations
The Earley parsing algorithm was invented by Jay Earley^{1} in 1970. It can be used to parse strings that conform to a context-free grammar. The algorithm uses a chart for parsing – that is, it is implemented as a dynamic program relying on solving simpler subproblems.
Earley parsers are very appealing to a practitioner because they can use any context-free grammar for parsing a string, and from the parse forest generated, one can recover all (even an infinite number of) parse trees that correspond to the given grammar. Unfortunately, this style of parsing pays for its generality by being slightly expensive: it takes \(O(n^3)\) time to parse in the worst case. However, if the grammar is unambiguous, it can parse in \(O(n^2)\) time, and it parses all LR(k) grammars in linear time^{2} – \(O(n)\).
This implementation of the Earley parser correctly handles the epsilon case as given by Aycock et al.^{3} Further, the `LeoParser` class incorporates Leo's optimizations^{2}.
Another detailed explanation of Earley parsing is by Loup Vaillant. Further, a fast industrial strength Earley parser implementation is Marpa.
This post is written as runnable Python source. You can download the notebook directly here. If the file is downloaded as `earleyparser.py`, it can be imported into your projects using `import earleyparser`.
Synopsis
```python
import earleyparser as P

my_grammar = {'<start>': [['1', '<A>'],
                          ['2']],
              '<A>': [['a']]}
my_parser = P.EarleyParser(my_grammar)
for tree in my_parser.parse_on(text='1a', start_symbol='<start>'):
    print(P.format_parsetree(tree))
```
Definitions
For this post, we use the following terms:
- The alphabet is the set of all symbols in the input language. For example, in this post, we use all ASCII characters as the alphabet.
- A terminal is a single alphabet symbol. Note that this is slightly different from the usual definition (done here for ease of parsing). (Usually a terminal is a contiguous sequence of symbols from the alphabet. However, both kinds of grammars have a one-to-one correspondence, and can be converted easily.) For example, `x` is a terminal symbol.
- A nonterminal is a symbol outside the alphabet whose expansion is defined in the grammar using rules for expansion. For example, `<term>` is a nonterminal in the grammar below.
- A rule is a finite sequence of terms (of two types: terminals and nonterminals) that describes an expansion of a given nonterminal. For example, `[<term>+<expr>]` is one of the expansion rules of the nonterminal `<expr>`.
- A definition is a set of rules that describe the expansion of a given nonterminal. For example, `[[<digit>,<digits>],[<digit>]]` is the definition of the nonterminal `<digits>`.
- A context-free grammar is composed of a set of nonterminals and corresponding definitions that define the structure of the nonterminal. The grammar given below is an example context-free grammar.
- A terminal derives a string if the string contains only the symbols in the terminal. A nonterminal derives a string if the corresponding definition derives the string. A definition derives a string if one of the rules in the definition derives the string. A rule derives a string if the sequence of terms that make up the rule can derive the string, deriving one substring after another contiguously (also called parsing).
- A derivation tree is an ordered tree that describes how an input string is derived by the given start symbol. It is also called a parse tree.
- A derivation tree can be collapsed into its string equivalent. Such a string can be parsed again by the nonterminal at the root node of the derivation tree such that at least one of the resulting derivation trees would be the same as the one we started with.
As before, we use the fuzzingbook grammar style. Here is an example grammar for arithmetic expressions, starting at `<start>`.
A terminal symbol has exactly one character. (Note that we disallow the empty string (`''`) as a terminal symbol.) Secondly, as per traditional implementations, there can only be one expansion rule for the `<start>` symbol. We work around this restriction by simply constructing as many charts as there are expansion rules, and returning all parse trees.
Here is another grammar that targets the same language. Unlike the first grammar, this grammar produces ambiguous parse results.
Summary
An Earley parser executes the following steps for parsing:
Use `<start>` as the entry into parsing. At this point, we want to parse the given string by the nonterminal `<start>`. The definition of `<start>` contains the possible expansion rules that can match the given string. Each expansion rule can be thought of as a parsing path, with contiguous substrings of the given input string matched by the particular terms in the rule.

When given a nonterminal to match the string, the essential idea is to get the rules in the definition, and add them to the current set of parsing paths to try with the given string. Within the parsing path, we have a parsed index which denotes the progress of parsing that particular path (i.e the point till which the string until now has been recognized by that path, and any parents of this path). When a rule is newly added, this parsed index is set to zero.

We next look at our set of possible parsing paths, and check if any of these paths start with a nonterminal. If one is found, then for that parsing path to be completed with the given string, that nonterminal has to be recognized first. So, we add the expansion rules corresponding to that nonterminal to the list of possible parsing paths. We do this recursively.

Now, examine the current letter in the input. Then select all parsing paths that have that particular letter at the parsed index. These expressions can now advance one step to the next index. We add such parsing paths to the set of parsing paths to try for the next character.

While doing this, if any parsing path has finished parsing, we fetch its corresponding nonterminal and advance all parsing paths that have that nonterminal at the parsing index.

Continue recursively until the parsing path corresponding to `<start>` has finished.
The chart parser depends on a chart (a table) for parsing. The columns correspond to the characters in the input string. Each column represents a set of states, and corresponds to the legal rules to follow from that point on.
Say we start with the following grammar:
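The grammar itself is not shown above, so here is a reconstruction inferred from the chart states in the walkthrough below. The rule for `<C>` is an assumption, since the example input never exercises it.

```python
# Reconstructed example grammar for the chart walkthrough.
# The <C> expansion is hypothetical; the other rules appear in the
# chart states shown in the text.
SAMPLE_GRAMMAR = {
    '<start>': [['<A>', '<B>']],
    '<A>': [['a', '<B>', 'c'], ['a', '<A>']],
    '<B>': [['b', '<C>'], ['<D>']],
    '<C>': [['c']],
    '<D>': [['d']],
}
```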
The Earley parser produces a table of possible parse paths at each letter index of the table. Given an input `adcd`, we seed column 0 with:

```
<start>: . <A> <B>
```

where the `.` represents the parsing index (also called the dot). This indicates that we are at the start, and the next step is to identify `<A>`. After this rule is processed, the column would contain two more states:

```
<A>: . a <B> c
<A>: . a <A>
```

which represent two parsing paths to complete `<A>`.
After processing column 0, we would find the following in column 1 (which corresponds to the input character `a`):

```
<A>: a . <B> c
<A>: a . <A>
<B>: . b <C>
<B>: . <D>
<A>: . a <B> c
<A>: . a <A>
<D>: . d
```
Similarly, the next column (column 2, corresponding to `d`) would contain the following:

```
<D>: d .
<B>: <D> .
<A>: a <B> . c
```
Next, column 3, corresponding to `c`, would contain:

```
<A>: a <B> c .
<start>: <A> . <B>
<B>: . b <C>
<B>: . <D>
<D>: . d
```
Finally, column 4 (`d`) would contain this at the end of processing:

```
<D>: d .
<B>: <D> .
<start>: <A> <B> .
```
This is how the table or the chart – from where the parsing gets its name: chart parsing – gets filled.
The Column Data Structure
The column contains a set of states. Each column corresponds to a character (or a token, if tokens are used). Note that the states in a column correspond to the parsing expression that will occur once that character has been read. That is, the first column corresponds to the parsing expression when no characters have been read.

The column allows for adding states, and checks to prevent duplication of states. Why do we need to prevent duplication? The problem is left recursion. We need to detect and curtail left recursion, which is indicated by non-unique states.
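A minimal sketch of such a `Column` class, assuming a `State` structure as described in the next section, might look like this. The details of the post's actual implementation may differ.

```python
# A sketch of the Column data structure: a column keeps an ordered list
# of states and rejects duplicates, which curtails left recursion.
class Column:
    def __init__(self, index, letter):
        self.index, self.letter = index, letter
        self.states, self._unique = [], {}

    def add(self, state):
        # If the state was already added, return the existing copy
        # instead of adding it again.
        if state in self._unique:
            return self._unique[state]
        self._unique[state] = state
        self.states.append(state)
        state.e_col = self  # record where this state currently ends
        return state
```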
The State Data Structure
A state represents a parsing path (which corresponds to the nonterminal, and the expansion rule that is being followed) with the current parsed index. Each state contains the following:
- name: The nonterminal that this rule represents.
- expr: The rule that is being followed.
- dot: The point up to which parsing has happened in the rule.
- s_col: The starting column for this rule.
- e_col: The ending column for this rule.
The convenience methods `finished()`, `advance()` and `at_dot()` should be self-explanatory. For example,
That is, the next symbol to be parsed is `<D>`, and if we advance it,
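A sketch of the `State` class with the fields and convenience methods described above (the real implementation may differ in details such as equality and hashing):

```python
# A sketch of the State data structure: a rule being followed, with a
# dot marking how far parsing has progressed.
class State:
    def __init__(self, name, expr, dot, s_col=None, e_col=None):
        self.name, self.expr = name, tuple(expr)
        self.dot, self.s_col, self.e_col = dot, s_col, e_col

    def finished(self):
        return self.dot >= len(self.expr)

    def at_dot(self):
        # The symbol just after the dot, or None once finished.
        return self.expr[self.dot] if self.dot < len(self.expr) else None

    def advance(self):
        # A new state with the dot moved one position forward.
        return State(self.name, self.expr, self.dot + 1, self.s_col)

    def __repr__(self):
        return '%s: %s . %s' % (self.name,
                                ' '.join(self.expr[:self.dot]),
                                ' '.join(self.expr[self.dot:]))
```

For example, `State('<A>', ['a', '<B>', 'c'], 1)` prints as `<A>: a . <B> c`.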
The Basic Parser Interface
We start with a bare minimum interface for a parser. It should allow one to parse a given text using a given nonterminal (which should be present in the grammar).
We now initialize the Earley parser, which is a parser.
Nonterminals Deriving Empty Strings
The Earley parser handles nullable nonterminals separately. A nullable nonterminal is a nonterminal that can derive an empty string. That is, at least one of its expansion rules must derive an empty string. An expansion rule derives an empty string if all of its tokens can derive the empty string. This means no terminal symbols (assuming we do not have zero-width terminal symbols), and all nonterminal symbols can derive the empty string.
In this implementation, we first initialize the list of first-level nullable nonterminals that contain an empty expansion. That is, they directly derive the empty string. Next, we remove any expansion rule that contains a terminal symbol, as such expansion rules will not result in empty strings. Next, we start with our current list of nullable nonterminals, take one at a time, and remove them from the current expansion rules. If any expansion rule becomes empty, the corresponding nonterminal is added to the nullable nonterminal list. This continues until all nullable nonterminals are processed.
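The computation described above can be sketched as a fixpoint iteration. The function name is an assumption; the post's implementation may structure this differently.

```python
# A sketch of computing the set of nullable nonterminals.
def nullable_nonterminals(grammar):
    def is_nt(t):
        return t.startswith('<') and t.endswith('>')

    # Nonterminals with a directly empty expansion.
    nullable = {k for k in grammar if [] in grammar[k]}
    # Rules containing a terminal can never derive the empty string.
    rules = {k: [r for r in grammar[k] if all(is_nt(t) for t in r)]
             for k in grammar}
    while True:
        # A nonterminal is nullable if some rule consists entirely of
        # nullable nonterminals.
        new = {k for k in rules
               if any(all(t in nullable for t in r) for r in rules[k])}
        if new == nullable:
            return nullable
        nullable = new
```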
An example
Checking
The Chart Parser
Earley parser is a chart parser. That is, it relies on a table of solutions to smaller problems. This table is called a chart (hence the name of such parsers – chart parsers).
The Chart Construction
Here, we begin the chart construction by seeding the chart with columns representing the tokens or characters. Consider our example grammar again. The starting point is:

```
<start>: . <A> <B>
```

We add this state to `chart[0]` to start the parse. Note that the term after the dot is `<A>`, which will need to be recursively inserted into the column. We will see how to do that later.
Note: In traditional Earley parsing, the starting nonterminal always has a single expansion rule. However, in many cases, you want to parse a fragment, and this rule makes it cumbersome to use Earley parsing. Hence, we have opted to allow any nonterminal to be used as the starting nonterminal, irrespective of whether it has a single rule or not. Interestingly, this does not have an impact on the parsing itself, but on the extraction of results. In essence, we seed all expansion rules of the current start symbol into the chart at column 0. We will take care of that difference while building parse trees.
We seed our initial state in the example
Then, we complete the chart. The idea here is to process one character or one element at a time. At each character, we examine the current parse paths (states) and continue forward any parse path that successfully parses the letter. We process any state that is present in the current column in the following fashion.
There are three main methods we use: `predict()`, `scan()`, and `complete()`.
Predict
If, in the current state, the term after the dot is a nonterminal, `predict()` is called. It adds the expansion of the nonterminal to the current column.
If the term is nullable, then we simply advance the current state, and add that to the current column. This fix to the original Earley parsing was suggested by Aycock et al.^{3}.
If we look at our example, we have seeded the first column with `. <A> <B>`. Now, `fill_chart()` will find that the next term is `<A>` and call `predict()`, which will then add the expansions of `<A>`.
Next, we apply predict.
As you can see, the two rules of `<A>` have been added to the current column.
Scan
The `scan()` method is called if the next symbol in the current state is a terminal symbol. If the state matches the next letter, it moves the dot one position, and adds the new state to the next column.
For example, consider this state:

```
<B>: . b c
```

If we scan the next column's letter, and that letter is `b`, then it matches the next symbol. So, we can advance the state by one symbol, and add it to the next column:

```
<B>: b . c
```
Here is our continuing example.
As you can see, the state at index 1 in `chart[0]` that was waiting for `a` has advanced one letter after consuming `a`, and has been added to `chart[1]`.
Complete
The `complete()` method is called if a particular state has finished the rule during execution. It first extracts the start column of the finished state; then, among the unfinished states in that start column, it finds those that were waiting for this current state's nonterminal (that is, we can now go back and continue parsing those rules). Next, it advances them by one position, and adds the new states to the current column.
For example, say the states we have are:

```
<A>: a . <B> c
<B>: b c .
```

The state `<B>: b c .` is complete, and we need to advance any state that has `<B>` at the dot by one index, which gives `<A>: a <B> . c`.
How do we determine the parent states? During predict, we added the predicted child states to the same column as that of the inspected state. So, the parent states will be found in the starting column of the current state, with the same symbol at `at_dot()` as the name of the completed state.
We advance all such parents (producing new states) and add the new states to the current column.
Here is our example. We start parsing `ad`. So, we have three columns. Next, we populate column 1, which corresponds to the letter `a`.
You can see that the two states are waiting on `<A>` and `<B>` respectively at `at_dot()`. Hence, we run predict again to add the corresponding rules of `<A>` and `<B>` to the current column.
As you can see, we have a list of states that are waiting for `b`, `a` and `d`.
Our next letter is:
We scan to populate column 2.
As we expected, only `<D>` could advance to the next column (`chart[2]`) after reading `d`. Finally, we use complete, so that we can advance the parents of the `<D>` state above.
As you can see, that led to `<B>` being complete, and since `<B>` is complete, `<A>` also becomes complete.
Filling The Chart
In the below algorithm, whenever `at_dot()` is at a nonterminal symbol, the expansion rules of that nonterminal are added to the current column (`predict()`), since each rule represents one valid parsing path. If, on the other hand, `at_dot()` indicates that processing finished for that nonterminal, we look up the parent symbols and advance their parsing state (`complete()`). If we find that we are at a terminal symbol, we simply check if the current state can advance to parsing the next character (`scan()`).
We can now recognize the given string as part of the language represented by the grammar.
The chart above only shows completed entries. The parenthesized expression
indicates the column just before the first character was recognized, and the
ending column.
Notice how the `<start>` nonterminal shows the dot at the end. That is, it has been fully parsed.
Derivation trees
We use the following procedures to translate the parse forest to individual trees.
parse_prefix
Here is an example of using it.
parse_on
Our `parse_on()` method is slightly different from usual Earley implementations in that we accept any nonterminal symbol, not just nonterminal symbols with a single expansion rule. We accomplish this by computing a different chart for each expansion.
parse_paths
The `parse_paths()` method tries to unify the given expression in `named_expr` with the parsed string. For that, it extracts the last symbol in `named_expr` and checks if it is a terminal symbol. If it is, then it checks the chart at `til` to see if the letter corresponding to the position matches the terminal symbol. If it does, we extend our start index by the length of the symbol.

If the symbol was a nonterminal symbol, then we retrieve the parsed states at the current end column index (`til`) that correspond to the nonterminal symbol, and collect the start indexes. These are the end column indexes for the remaining expression.

Given our list of start indexes, we obtain the parse paths from the remaining expression. If we can obtain any, then we return the parse paths. If not, we return an empty list.
Example
That is, the parse path for `<start>` given the input `adcd` included recognizing the expression `<A><B>`. This was recognized by two states: `<A>` from input(0) to input(2), which further involved recognizing the rule `a<B>c`, and the next state `<B>` from input(3), which involved recognizing the rule `<D>`.
parse_forest
The `parse_forest()` method takes the states that represent completed parses, and determines the possible ways that their expressions correspond to the parsed expression. As we noted, it is here that we take care of multiple expansion rules for the start symbol. (The `_parse_forest()` accepts a single state, and is the main driver that corresponds to the traditional implementation.)
For example, say we are parsing `1+2+3`, and the state has `[<expr>,+,<expr>]` in `expr`. It could have been parsed as either `[{<expr>:1+2},+,{<expr>:3}]` or `[{<expr>:1},+,{<expr>:2+3}]`.
Example
extract_trees
We show how to extract a single tree first, and then generalize it to all trees.
We need a way to display parse trees.
Displaying the tree
Example
Ambiguous Parsing
Ambiguous grammars can produce multiple derivation trees for some given string.
In the above example, the `a_grammar` can parse `1+2+4` as either `[1+2]+4` or `1+[2+4]`.
That is, we need to extract all derivation trees.
We enhance our `extract_trees()` as below.
Example
Using the same example,
Almost Infinite Parse Trees
There is a problem with our `extract_trees()` method. The issue is that it is too eager. The parse forest can have an infinite number of trees, and at this time we effectively try to extract all of them at the same time. So, in the case of such grammars, our `extract_trees()` will fail. Here are two example grammars.
An example run.
The problem is that our implementation of `extract_trees()` is eager. That is, it attempts to extract all inner parse trees before it can construct the outer parse tree. When there is a self-reference, this results in infinite recursion.
Here is a simple extractor that avoids this problem. The idea here is that we
randomly and lazily choose a node to expand, which avoids the infinite
recursion.
At this point, we also need a simple way to collapse the derivation tree to the original string.
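Such a collapse can be sketched as follows, assuming derivation trees are `(symbol, children)` tuples as in the fuzzingbook style (the post's own helper may differ in details):

```python
# Collapse a derivation tree back into the string it derives.
# Terminal nodes have no children; a childless nonterminal stands for
# an empty expansion.
def tree_to_string(tree):
    symbol, children = tree
    if not children:
        return '' if symbol.startswith('<') else symbol
    return ''.join(tree_to_string(c) for c in children)
```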
indirect reference
However, `SimpleExtractor` has a problem. The issue is that since we rely on randomness for exploration, it gives no guarantees on the uniqueness of the returned trees. Hence, we need a way to keep track of the explored paths. Our next class, `EnhancedExtractor`, can do that. In `EnhancedExtractor`, different exploration paths form a tree of nodes.
First, we define a data structure to keep track of explorations.
- `_chosen` contains the current choice.
- `next` holds the next choice, done using `_chosen`.
- `total` holds the total number of choices for this node.
The data structure is initialized in the constructor.
Given an array and a choice node, `choose_path()` returns the element in the array corresponding to the next choice node if it exists, or produces a new choice node, and returns that element.
While extracting, we have a choice. Should we allow infinite forests, or should we have a finite number of trees with no direct recursion? A direct recursion is when there exists a parent node with the same nonterminal that parsed the same span. We choose here not to extract such trees. They can be added back after parsing.
This is a recursive procedure that inspects a node and extracts the path required to complete that node. A single path (corresponding to a nonterminal) may again be composed of a sequence of smaller paths. Such paths are again extracted using another call to `extract_a_node()` recursively.
What happens when we hit one of the node recursions we want to avoid? In that case, we return the current choice node, which bubbles up to `extract_a_tree()`. That procedure increments the last choice, which in turn increments up the parents until we reach a choice node that still has options to explore.
What if we hit the end of choices for a particular choice node (i.e., we have exhausted the paths that can be taken from a node)? In this case also, we return the current choice node, which bubbles up to `extract_a_tree()`. That procedure increments the last choice, which bubbles up to the next choice that has some unexplored paths.
The `extract_a_tree()` is a depth-first extractor of a single tree. It tries to extract a tree, and if the extraction returns None, it means that a particular choice was exhausted, or we hit a recursion. In that case, we increment the choice, and explore a new path.
Note that the `EnhancedExtractor` only extracts nodes that are not directly recursive. That is, if it finds a node with a nonterminal that covers the same span as that of a parent node with the same nonterminal, it skips the node.
Leo Optimizations
One of the problems with the original Earley parser is that while it can parse strings using arbitrary context-free grammars, its performance on right-recursive grammars is quadratic. That is, it takes \(O(n^2)\) runtime and space for parsing with right-recursive grammars. For example, consider the parsing of the following string by two different grammars, `LR_GRAMMAR` and `RR_GRAMMAR`.
Here is our input string
To see the problem, we need to enable logging. Here is the logged version of parsing with the `LR_GRAMMAR`.
As can be seen from the parsing log for each letter, the number of states with the representation `<A>: a <A> . (i, j)` increases at each stage, and these are simply left over from the previous letter. They do not contribute anything more to the parse other than to simply complete these entries. However, they take up space, and require resources for inspection, contributing a factor of \(n\) to the analysis.
Joop Leo^{2} found that this inefficiency can be avoided by detecting right recursion. The idea is that before starting the completion step, check whether the current item has a deterministic reduction path. If such a path exists, add a copy of the topmost element of the deterministic reduction path to the current column, and return. If not, perform the original completion step.

Definition: An item is said to be on the deterministic reduction path above \([A \rightarrow \gamma.,i]\) if it is \([B \rightarrow \alpha A.,k]\) with \([B \rightarrow \alpha.A,k]\) being the only item in \(I_i\) with the dot in front of \(A\), or if it is on the deterministic reduction path above \([B \rightarrow \alpha A.,k]\). An item on such a path is called the topmost one if there is no item on the deterministic reduction path above it^{2}.
Finding a deterministic reduction path is as follows:

Given a complete state, represented by `<A> : seq_1 . (s, e)` where `s` is the starting column for this rule, and `e` the current column, there is a deterministic reduction path above it if two constraints are satisfied:

1. There exists a single item of the form `<B> : seq_2 . <A> (k, s)` in column `s`.
2. That item should be the only item in `s` with the dot in front of `<A>`.

The resulting item is of the form `<B> : seq_2 <A> . (k, e)`, which is simply the item from (1) advanced, and is considered above `<A>:.. (s, e)` in the deterministic reduction path. The `seq_1` and `seq_2` are arbitrary symbol sequences.
This forms the following chain of links, with `<A>:.. (s_1, e)` being the child of `<B>:.. (s_2, e)`, etc. Here is one way to visualize the chain:

```
<C> : seq_3 <B> . (s_3, e)
  | constraints satisfied by <C> : seq_3 . <B> (s_3, s_2)
<B> : seq_2 <A> . (s_2, e)
  | constraints satisfied by <B> : seq_2 . <A> (s_2, s_1)
<A> : seq_1 . (s_1, e)
```
Essentially, what we want to do is to identify potential deterministic right recursion candidates, perform completion on them, and throw away the result. We do this until we reach the top. See Grune et al.^{4} for further information.
Note that the completions are in the same column (`e`), with each candidate's constraints satisfied in further and further earlier columns (as shown below):

```
<C> : seq_3 . <B> (s_3, s_2) --> <C> : seq_3 <B> . (s_3, e)
<B> : seq_2 . <A> (s_2, s_1) --> <B> : seq_2 <A> . (s_2, e)
<A> : seq_1 . (s_1, e)
```
Following this chain, the topmost item is the item `<C>:.. (s_3, e)` that does not have a parent. The topmost item, which needs to be saved, is called a transitive item by Leo, and it is associated with the nonterminal symbol that started the lookup. The transitive item needs to be added to each column we inspect.
Here is the skeleton for the parser `LeoParser`.
We first save our original `complete()`.
Next, we update our `Column` class with the ability to add transitive items.
Note that, while Leo asks the transitive items to be added to the set \(I_k\), there is no actual requirement for the transitive states to be added to the states list. The transitive items are only intended for memoization and not for the `fill_chart()` method. Hence, we track them separately.
Remember the picture we drew of the deterministic path?

```
<C> : seq_3 <B> . (s_3, e)
  | constraints satisfied by <C> : seq_3 . <B> (s_3, s_2)
<B> : seq_2 <A> . (s_2, e)
  | constraints satisfied by <B> : seq_2 . <A> (s_2, s_1)
<A> : seq_1 . (s_1, e)
```
We define a function `uniq_postdot()` that, given the item `<A> : seq_1 . (s_1, e)`, returns a `<B> : seq_2 . <A> (s_2, s_1)` that satisfies the constraints mentioned in the above picture.
We next define the function `get_top()` that is the core of deterministic reduction, which gets the topmost state above the current state (`<A>`).
Once we have the machinery in place, `deterministic_reduction()` itself is simply a wrapper to call `get_top()`.
Now, both LR and RR grammars should work within \(O(n)\) bounds.
Examples
We verify that our parser works correctly on `LR_GRAMMAR` too.
We have fixed the complexity bounds. However, because we are saving only the topmost item of a right recursion, we need to fix our parser to be aware of our fix while extracting parse trees.
We first change the definition of `add_transitive()` so that results of deterministic reduction can be identified later.
We also need a `back()` method to create the constraints.
We update `copy()` to make `TState` items instead.
We now modify the `LeoParser` to keep track of the chain of constraints that we mentioned earlier.
Next, we update the `uniq_postdot()` so that it tracks the chain of links.
We next define a method `expand_tstate()` that, when given a `TState`, generates all the intermediate links that we threw away earlier for a given end column.
We define a `rearrange()` method to generate a reversed table where each column contains states that start at that column.
Here is the rearranged table.
We save the result of `rearrange()` before going into `parse_forest()`.
Finally, during `parse_forest()`, we first check to see if it is a transitive state, and if it is, expand it to the original sequence of states using `traverse_constraints()`.
This completes our implementation of `LeoParser`.
Parse Examples
Now, this is still somewhat slow. Why is that? Note that recognition is \(O(n^2)\) and actual parsing is \(O(n^3)\). That is, using `parse_prefix()` to check whether a text can be parsed by a given grammar will be much faster than extracting a parse tree. A second issue is that we are building this over Python implemented on top of WASM. Python on its own is fairly slow. In our experiments, translating the Earley parser to Java line by line resulted in an improvement of over 300 times.
The runnable Python source for this post is available here.
1. Earley, Jay. "An efficient context-free parsing algorithm." Communications of the ACM 13.2 (1970): 94-102.
2. Leo, Joop M.I.M. "A general context-free parsing algorithm running in linear time on every LR(k) grammar without using lookahead." Theoretical Computer Science 82.1 (1991): 165-176.
3. Aycock, John, and R. Nigel Horspool. "Practical Earley parsing." The Computer Journal 45.6 (2002): 620-630.
4. Grune, Dick, and Ceriel J.H. Jacobs. "Introduction to Parsing." Parsing Techniques. Springer, New York, NY, 2008. 61-102.