Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.
The Earley parsing algorithm was invented by Jay Earley 1 in 1970. It can be used to parse strings that conform to a context-free grammar. The algorithm uses a chart for parsing – that is, it is implemented as a dynamic program relying on solving simpler sub-problems.
Earley parsers are very appealing for a practitioner because they can use any context-free grammar for parsing a string, and from the parse forest generated, one can recover all (even an infinite number) of parse trees that correspond to the given grammar. Unfortunately, this style of parsing pays for generality by being slightly expensive. It takes \(O(n^3)\) time to parse in the worst case. However, if the grammar is unambiguous, it can parse in \(O(n^2)\) time, and all LR(k) grammars in linear time2.
For a much more complete implementation and full recovery of parsing forests using iterative solutions, see our parsing implementation in the fuzzingbook (See the solved exercises). A very detailed explanation of Earley parsing is by Loup Vaillant. Further, a fast industrial strength Earley parser implementation is Marpa.
For this post, we use the following terms:
- The alphabet is the set all of symbols in the input language. For example, in this post, we use all ASCII characters as alphabet.
A terminal is a single alphabet symbol. Note that this is slightly different from usual definitions (done here for ease of parsing). (Usually a terminal is a contiguous sequence of symbols from the alphabet. However, both kinds of grammars have a one to one correspondence, and can be converted easily.)
xis a terminal symbol.
A nonterminal is a symbol outside the alphabet whose expansion is defined in the grammar using rules for expansion.
<term>is a nonterminal in the below grammar.
A rule is a finite sequence of terms (two types of terms: terminals and nonterminals) that describe an expansion of a given terminal.
[<term>+<expr>]is one of the expansion rules of the nonterminal
A definition is a set of rules that describe the expansion of a given nonterminal.
[[<digit>,<digits>],[<digit>]]is the definition of the nonterminal
A context-free grammar is composed of a set of nonterminals and corresponding definitions that define the structure of the nonterminal.
The grammar given below is an example context-free grammar.
A terminal derives a string if the string contains only the symbols in the terminal. A nonterminal derives a string if the corresponding definition derives the string. A definition derives the string if one of the rules in the definition derives the string. A rule derives a string if the sequence of terms that make up the rule can derive the string, deriving one substring after another contiguously (also called parsing).
- A derivation tree is an ordered tree that describes how an input string is derived by the given start symbol. Also called a parse tree.
- A derivation tree can be collapsed into its string equivalent. Such a string can be parsed again by the nonterminal at the root node of the derivation tree such that at least one of the resulting derivation trees would be the same as the one we started with.
As before, we use the fuzzingbook grammar style.
Here is an example grammar for arithmetic expressions, starting at
A terminal symbol has exactly one character
(Note that we disallow empty string (
'') as a terminal symbol).
Secondly, as per traditional implementations,
there can only be one expansion rule for the
<start> symbol. We work around
this restriction by simply constructing as many charts as there are expansion
rules, and returning all parse trees.
Here is another grammar that targets the same language. Unlike the first grammar, this grammar produces ambiguous parse results.
An Earley parser executes the following steps for parsing:
<start> as the entry into parsing. At this point, we want to parse the
given string by the nonterminal
<start>. The definition of
contains the possible expansion rule that can match the given string. Each
expansion rule can be thought of as a parsing path, with contiguous
substrings of the given input string matched by the particular terms in the
When given a nonterminal to match the string, the essential idea is to get the rules in the definition, and add them to the current set of parsing paths to try with the given string. Within the parsing path, we have a parsed index which denotes the progress of parsing that particular path (i.e the point till which the string until now has been recognized by that path, and any parents of this path). When a rule is newly added, this parsed index is set to zero.
We next look at our set of possible parsing paths, and check if any of these paths start with a nonterminal. If one is found, then for that parsing path to be completed with the given string, that nonterminal has to be recognized first. So, we add the expansion rules corresponding to that nonterminal to the list of possible parsing paths. We do this recursively.
Now, examine the current letter in the input. Then select all parsing paths that have that particular letter at the parsed index. These expressions can now advance one step to the next index. We add such parsing paths to the set of parsing paths to try for the next character.
While doing this, any parsing paths have finished parsing, fetch its corresponding nonterminal and advance all parsing paths that have that nonterminal at the parsing index.
Continue recursively until the parsing path corresponding to
The chart parser depends on a chart (a table) for parsing. The columns correspond to the characters in the input string. Each column represents a set of states, and corresponds to the legal rules to follow from that point on.
Say we start with the following grammar:
Earley parser produces a table of possible parse paths at each letter index of
the table. Given an input
adcd, we seed the column
<start>: | <A> <B>
| represents the parsing index (also called the dot). This indicates
that we are at the starting, and the next step is to identify
<A>. After this
rule is processed, the column would contain two more states
<A>: | a <B> <c> <A>: | a <A>
which represents two parsing paths to complete
After processing of column
0 (which corresponds to input character
would find the following in column
1 (which corresponds to the input character
<A>: a | <B> c <A>: a | <A> <B>: | b <C> <B>: | <D> <A>: | a <B> c <A>: | a <A> <D>: | d
Similarly, the next column (column
2 corresponding to
d) would contain the following.
<D>: | d <B>: <D> | <A>: a <B> | c
3 corresponding to
c would contain:
<A>: a <B> c | <start>: <A> | <B> <B>: | <b> <C> <B>: | <D> <D>: | d
d) would contain this at the end of processing.
<D>: d | <B>: <D> | <start>: <A> <B> |
This is how the table or the chart – from where the parsing gets its name: chart parsing – gets filled.
The column contains a set of states. Each column corresponds to a character (or a token if tokens are used). Note that the states in a column corresponds to the parsing expression that will occur once that character has been read. That is, the first column will correspond to the parsing expression when no characters have been read.
The column allows for adding states, and checks to prevent duplication of states. Why do we need to prevent duplication? The problem is left recursion. We need to detect and curtail left recursion, which is indicated by non-unique states.
A state represents a parsing path (which corresponds to the nonterminal, and the expansion rule that is being followed) with the current parsed index. Each state contains the following:
- name: The nonterminal that this rule represents.
- expr: The rule that is being followed
- dot: The point till which parsing has happened in the rule.
- s_col: The starting point for this rule.
- e_col: The ending point for this rule.
The convenience methods
at_dot() should be
self explanatory. For example,
That is, the next symbol to be parsed is
<D>, and if we advance it,
We start with a bare minimum interface for a parser. It should allow one to parse a given text using a given nonterminal (which should be present in the grammar).
We now initialize the Earley parser, which is a parser.
Earley parser handles nullable nonterminals separately. A nullable nonterminal is a nonterminal that can derive an empty string. That is at least one of the expansion rules must derive an empty string. An expansion rule derives an empty string if all of the tokens can derive the empty string. This means no terminal symbols (assuming we do not have zero width terminal symbols), and all nonterminal symbols can derive empty string.
In this implementation, we first initialize the list of first level nullable nonterminals that contain an empty expansion. That is, they directly derive the empty string. Next, we remove any expansion rule that contains a token as these expansion rules will not result in empty strings. Next, we start with our current list of nullable nonterminals, take one at a time, and remove them from the current expansion rules. If any expansion rule becomes empty, the corresponding nonterminal is added to the nullable nonterminal list. This continues until all nullable nonterminals are processed.
Earley parser is a chart parser. That is, it relies on a table of solutions to smaller problems. This table is called a chart (hence the name of such parsers – chart parsers).
Here, we begin the chart construction by seeding the chart with columns representing the tokens or characters. Consider our example grammar again. The starting point is,
<start>: | <A> <B>
We add this state to the
chart to start the parse. Note that the term
after dot is
<A>, which will need to be recursively inserted to the column.
We will see how to do that later.
Note: In traditional Earley parsing, the starting nonterminal always have
a single expansion rule. However, in many cases, you want to parse a fragment
and this rule makes it cumbersome to use Earley parsing. Hence, we have
opted to allow any nonterminal to be used as the starting nonterminal
irrespective of whether it has a single rule or not.
Interestingly, this does not have an impact on the parsing itself, but in
the extraction of results.
In essence, we seed all expansion rules into of the current start symbol
to the chart at
column 0. We will take care of that difference while
building parse trees.
We seed our initial state in the example
Then, we complete the chart. The idea here is to process one character or one element at a time. At each character, we examine the current parse paths (states) and continue forward any parse path that successfully parses the letter. We process any state that is present in the current column in the following fashion.
There are three main methods we use:
If in the current state, the term after the dot is a nonterminal,
predict() is called. It
adds the expansion of the nonterminal to the current column.
If the term is nullable, then we simply advance the current state, and add that to the current column. This fix to the original Earley parsing was suggested by Aycock et al.3.
If we look our example, we have seeded the first column with
| <A> <B>. Now,
fill_chart() will find that the next term is
<A> and call
which will then add the expansions of
Next, we apply predict.
As you can see, the two rules of
<A> has been added to
the current column.
scan() method is called if the next symbol in the current state is a terminal symbol. If the
state matches the next term, moves the dot one position, and adds the new
state to the column.
For example, consider this state.
<B>: | b c
If we scan the next column’s letter, and that letter is
b, then it matches the
next symbol. So, we can advance the state by one symbol, and add it to the next
<B>: b | c
Here is our continuing example.
As you can see, the
chart that was waiting for
advanced one letter after consuming
a, and has been added to
complete() method is called if a particular state has finished the rule
during execution. It first extracts the start column of the finished state, then
for all states in the start column that is not finished, find the states that
were parsing this current state (that is, we can go back to continue to parse
those rules now). Next, shift them by one position, and add them to the current
For example, say the state we have is:
<A>: a | <B> c <B>: b c |
<B> b c | is complete, and we need to advance any state that
<B> at the dot to one index forward, which is
<A>: a <B> | c
How do we determine the parent states? During predict, we added the predicted child states to the same column as that of the inspected state. So, the states will be found in the starting column of the current state, with the same symbol at_dot as that of the name of the completed state.
We advance all such parents (producing new states) and add the new states to the current column.
Here is our example. We start parsing
ad. So, we have three columns.
Next, we populate column 1 which corresponds to letter
You can see that the two states are waiting on
Hence, we run predict again to add the corresponding rules of
to the current column.
As you can see, we have a list of states that are waiting
Our next letter is:
We scan to populate
As we expected, only
<D> could advance to the next column (
Finally, we use complete, so that we can advance the parents of the
<D> state above.
As you can see, that led to
<B> being complete, and since
<A> also becomes complete.
Filling the chart
In the below algorithm, whenever the
at_dot() is at a nonterminal
symbol, the expansion rules of that nonterminal are added to the current
predict()) since each rule represents one valid parsing path. If on the
at_dot() indicates processing finished for that nonterminal, we
lookup the parent symbols and advance their parsing state (
complete()). If we
find that we are at a terminal symbol, we simply check if the current state can
advance to parsing the next character (
We can now recognize the given string as part of the language represented by the grammar.
The chart above only shows completed entries. The parenthesized expression
indicates the column just before the first character was recognized, and the
Notice how the
<start> nonterminal shows the dot at the end. That is, fully parsed.
We use the following procedures to translate the parse forest to individual trees.
Here is an example of using it.
parse_on() method is slightly different from usual Earley implementations
in that we accept any nonterminal symbol, not just nonterminal symbols with a
single expansion rule. We accomplish this by computing a different chart for
The parse_paths() method tries to unify the given expression in
the parsed string. For that, it extracts the last symbol in
checks if it is a terminal symbol. If it is, then it checks the chart at
see if the letter corresponding to the position matches the terminal symbol.
If it does, extend our start index by the length of the symbol.
If the symbol was a nonterminal symbol, then we retrieve the parsed states
at the current end column index (
til) that correspond to the nonterminal
symbol, and collect the start index. These are the end column indexes for
the remaining expression.
Given our list of start indexes, we obtain the parse paths from the remaining expression. If we can obtain any, then we return the parse paths. If not, we return an empty list.
That is, the parse path for
<start> given the input
recognizing the expression
<A><B>. This was recognized by the two states:
<A> from input(0) to input(2) which further involved recognizing the rule
a<B>c, and the next state
<B> from input(3) which involved recognizing the
parse_forest() method takes the states which represents completed
parses, and determines the possible ways that its expressions corresponded to
the parsed expression. As we noted, it is here that we take care of multiple
expansion rules for start symbol. (The
_parse_forest() accepts a single
state, and is the main driver that corresponds to traditional implementation,)
For example, say we are parsing
1+2+3, and the
expr. It could have been parsed as either
We show how to extract a single tree first, and then generalize it to all trees.
We need a way to display parse trees.
Displaying the tree
Ambiguous grammars can produce multiple derivation trees for some given string.
In the above example, the
a_grammar can parse
1+2+4 in as either
That is, we need to extract all derivation trees.
We enhance our
extract_trees() as below.
Using the same example,
Almost infinite parse trees
There is a problem with our
extract_trees() method. The issue is that it is
too eager. The parse forest can have an infinite number of trees, and at this
time we effectively try to extract all at the same time. So, in case of
such grammars our
extract_trees() will fail. Here are two example grammars.
An example run.
The problem is that, our implementation of
extract_trees() is eager.
That is, it attempts to extract all inner parse trees before it can construct
the outer parse tree. When there is a self reference, this results in recursion.
Here is a simple extractor that avoids this problem. The idea here is that we
randomly and lazily choose a node to expand, which avoids the infinite
At this point, we also need a simple way to collapse the derivation tree to the original string
SimpleExtractor has a problem. The issue is that since we rely on
randomness for exploration, it gives no guarantees on the uniqueness of the
returned trees. Hence, we need a way to keep track of the explored paths.
our next class
EnahncedExtractor can do that. In
different exploration paths form a tree of nodes.
First we define a data-structure to keep track of explorations.
_chosencontains the current choice
nextholds the next choice done using
totalholds he total number of choices for this node.
Initialization of the data-structure in the constructor.
Given an array and a choice node,
choose_path() returns the element
in array corresponding to the next choice node if it exists, or produces
a new choice nodes, and returns that element.
While extracting, we have a choice. Should we allow infinite forests, or should we have a finite number of trees with no direct recursion? A direct recursion is when there exists a parent node with the same nonterminal that parsed the same span. We choose here not to extract such trees. They can be added back after parsing.
This is a recursive procedure that inspects a node, extracts the path required to complete that node. A single path (corresponding to a nonterminal) may again be composed of a sequence of smaller paths. Such paths are again extracted using another call to extract_a_node() recursively.
What happens when we hit on one of the node recursions we want to avoid?
In that case, we return the current choice node, which bubbles up to
extract_a_tree(). That procedure increments the last choice, which in
turn increments up the parents until we reach a choice node that still has
options to explore.
What if we hit the end of choices for a particular choice node
(i.e, we have exhausted paths that can be taken from a node)? In this case also,
we return the current choice node, which bubbles up to
That procedure increments the last choice, which bubbles up to the next choice
that has some unexplored paths.
extract_a_tree() is a depth first extractor of a single tree. It tries to
extract a tree, and if the extraction returns None, it means that a particular
choice was exhausted, or we hit on a recursion. In that case, we increment the
choice, and explore a new path.
Note that the
EnhancedExtractor only extracts nodes that are not directly
recursive. That is, if it finds a node with a nonterminal that covers the same
span as that of a parent node with the same nonterminal, it skips the node.
One of the problems with the original Earley parser is that while it can parse
strings using arbitrary Context Free Grammars, its performance on
right-recursive grammars is quadratic. That is, it takes \(O(n^2)\) runtime and
space for parsing with right-recursive grammars. For example, consider the
parsing of the following string by two different grammars
Here is our input string
To see the problem, we need to enable logging. Here is the logged version of parsing with the
As can be seen from the parsing log for each letter, the number of states with
<A>: a <A> | (i, j) increases at each stage, and these are
simply a left over from the previous letter. They do not contribute anything
more to the parse other than to simply complete these entries. However, they
take up space, and require resources for inspection, contributing a factor of \(n\) in analysis.
Joop Leo2 found that this inefficiency can be avoided by detecting right recursion. The idea is that before starting the completion step, check whether the current item has a deterministic reduction path. If such a path exists, add a copy of the topmost element of the deterministic reduction path to the current column, and return. If not, perform the original completion step. Definition: An item is said to be on the deterministic reduction path above \([A \rightarrow \gamma.,i]\) if it is \([B \rightarrow \alpha A.,k]\) with \([B \rightarrow \alpha.A,k]\) being the only item in \(I_i\) with the dot in front of \(A\), or if it is on the deterministic reduction path above \([B \rightarrow \alpha A.,k]\). An item on such a path is called topmost one if there is no item on the deterministic reduction path above it2.
Finding a deterministic reduction path is as follows:
Given a complete state, represented by
<A> : seq_1 | (s, e) where
s is the
starting column for this rule, and
e the current column, there is a
deterministic reduction path above it if two constraints are satisfied.
- There exist a single item in the form
<B> : seq_2 | <A> (k, s)in column
- That should be the single item in s with dot in front of `1
The resulting item is of the form
<B> : seq_2 <A> | (k, e), which is simply
item from (1) advanced, and is considered above
<A>:.. (s, e) in the
deterministic reduction path. The
seq_2 are arbitrary symbol sequences.
This forms the following chain of links, with
<A>:.. (s_1, e) being the child
<B>:.. (s_2, e) etc.
Here is one way to visualize the chain:
<C> : seq_3 <B> | (s_3, e) | constraints satisfied by <C> : seq_3 | <B> (s_3, s_2) <B> : seq_2 <A> | (s_2, e) | constraints satisfied by <B> : seq_2 | <A> (s_2, s_1) <A> : seq_1 | (s_1, e)
Essentially, what we want to do is to identify potential deterministic right recursion candidates, perform completion on them, and throw away the result. We do this until we reach the top. See Grune et al.4 for further information.
Note that the completions are in the same column (e), with each candidates with constraints satisfied in further and further earlier columns (as shown below):
<C> : seq_3 | <B> (s_3, s_2) --> <C> : seq_3 <B> | (s_3, e) | <B> : seq_2 | <A> (s_2, s_1) --> <B> : seq_2 <A> | (s_2, e) | <A> : seq_1 | (s_1, e)
Following this chain, the topmost item is the item
<C>:.. (s_3, e) that does
not have a parent. The topmost item needs to be saved is called a transitive
item by Leo, and it is associated with the non-terminal symbol that started the
lookup. The transitive item needs to be added to each column we inspect.
Here is the skeleton for the parser
We first save our original complete
First, we update our
Column class with the ability to add transitive items.
Note that, while Leo asks the transitive to be added to the set \(I_k\) there is
no actual requirement for the transitive states to be added to the states list.
The transitive items are only intended for memoization and not for the
fill_chart() method. Hence, we track them separately.
Remember the picture we drew of the deterministic path?
<C> : seq_3 <B> | (s_3, e) | constraints satisfied by <C> : seq_3 | <B> (s_3, s_2) <B> : seq_2 <A> | (s_2, e) | constraints satisfied by <B> : seq_2 | <A> (s_2, s_1) <A> : seq_1 | (s_1, e)
We define a function
uniq_postdot() that given the item
<A> := seq_1 | (s_1, e),
<B> : seq_2 | <A> (s_2, s_1) that satisfies the constraints
mentioned in the above picture.
We next define the function
get_top() that is the core of deterministic
reduction which gets the topmost state above the current state
Once we have the machinery in place,
deterministic_reduction() itself is
simply a wrapper to call
Now, both LR and RR grammars should work within \(O(n)\) bounds.
We verify that our parser works correctly on
We have fixed the complexity bounds. However, because we are saving only the topmost item of a right recursion, we need to fix our parser to be aware of our fix while extracting parse trees.
We first change the definition of
add_transitive() so that results of deterministic reduction can be identified later.
We also need a
back() method to create the constraints.
copy() to make
TState items instead.
We now modify the
LeoParser to keep track of the chain of constrains that we mentioned earlier.
Next, we update the
uniq_postdot() so that it tracks the chain of links.
We next define a method
expand_tstate() that, when given a
all the intermediate links that we threw away earlier for a given end column.
We define a
rearrange() method to generate a reversed table where each column contains states that start at that column.
Here is the rearranged table.
We save the result of rearrange before going into
parse_forest(), we first check to see if it is a transitive
state, and if it is, expand it to the original sequence of states using
This completes our implementation of
Now this is still somewhat slow. Why is that? Note that recognition is
$O(n^2)$ and actual parsing is $O(n^3)$. That is, using
check whether a text can be parsed by a given grammar will be much faster than
extracting a parse tree. A second issue is that we are building this over
Python implemented on top of WASM. Python on its own is fairly slow. On our
experiments, translating the earley parser to Java line by line
resulted in an improvement over 300 times.
The runnable Python source for this post is available here.
Earley, Jay. “An efficient context-free parsing algorithm.” Communications of the ACM 13.2 (1970): 94-102. ↩
Grune, Dick, and Ceriel JH Jacobs. “Introduction to Parsing.” Parsing Techniques. Springer, New York, NY, 2008. 61-102. ↩