Earley Parser
TLDR; This tutorial is a complete implementation of an Earley parser in Python, including Leo’s optimizations. The Python interpreter is embedded so that you can work through the implementation steps.
The Earley parsing algorithm was invented by Jay Earley [1] in 1970. It can be used to parse strings that conform to a context-free grammar. The algorithm uses a chart for parsing – that is, it is implemented as a dynamic program relying on solving simpler sub-problems.
Earley parsers are very appealing for a practitioner because they can use any context-free grammar for parsing a string, and from the parse forest generated, one can recover all (even an infinite number) of parse trees that correspond to the given grammar. Unfortunately, this style of parsing pays for generality by being slightly expensive. It takes \(O(n^3)\) time to parse in the worst case. However, if the grammar is unambiguous, it can parse in \(O(n^2)\) time, and all LR(k) grammars in linear time [2] – \(O(n)\).
This implementation of the Earley parser correctly handles the epsilon case as given by Aycock et al. [3]. Further, the `LeoParser` class incorporates Leo’s optimizations [2].
Another detailed explanation of Earley parsing is by Loup Vaillant. Further, a fast industrial strength Earley parser implementation is Marpa.
This post is written as runnable Python source. You can download the notebook directly here. If the file is downloaded as `earleyparser.py`, it can be imported into your projects using `import earleyparser`.
Synopsis
```python
import earleyparser as P
my_grammar = {'<start>': [['1', '<A>'],
                          ['2']],
              '<A>': [['a']]}
my_parser = P.EarleyParser(my_grammar)
for tree in my_parser.parse_on(text='1a', start_symbol='<start>'):
    print(P.format_parsetree(tree))
```
Contents
- Synopsis
- Definitions
- Summary
- The Column Data Structure
- The State Data Structure
- The Basic Parser Interface
- The Chart Parser
- Derivation trees
- Ambiguous Parsing
- Almost Infinite Parse Trees
- Leo Optimizations
- Artifacts
Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around the Run all button.
Definitions
For this post, we use the following terms:
- The alphabet is the set of all symbols in the input language. For example, in this post, we use all ASCII characters as the alphabet.
- A terminal is a single alphabet symbol. Note that this is slightly different from the usual definition (done here for ease of parsing). (Usually a terminal is a contiguous sequence of symbols from the alphabet. However, both kinds of grammars have a one to one correspondence, and can be converted easily.) For example, `x` is a terminal symbol.
- A nonterminal is a symbol outside the alphabet whose expansion is defined in the grammar using rules for expansion. For example, `<term>` is a nonterminal in the grammar below.
- A rule is a finite sequence of terms (two types of terms: terminals and nonterminals) that describe an expansion of a given nonterminal. For example, `[<term>+<expr>]` is one of the expansion rules of the nonterminal `<expr>`.
- A definition is a set of rules that describe the expansion of a given nonterminal. For example, `[[<digit>,<digits>],[<digit>]]` is the definition of the nonterminal `<digits>`.
- A context-free grammar is composed of a set of nonterminals and corresponding definitions that define the structure of the nonterminal. The grammar given below is an example context-free grammar.
- A terminal derives a string if the string contains only the symbols in the terminal. A nonterminal derives a string if the corresponding definition derives the string. A definition derives the string if one of the rules in the definition derives the string. A rule derives a string if the sequence of terms that make up the rule can derive the string, deriving one substring after another contiguously (also called parsing).
- A derivation tree is an ordered tree that describes how an input string is derived by the given start symbol. Also called a parse tree.
- A derivation tree can be collapsed into its string equivalent. Such a string can be parsed again by the nonterminal at the root node of the derivation tree such that at least one of the resulting derivation trees would be the same as the one we started with.
As before, we use the fuzzingbook grammar style.
Here is an example grammar for arithmetic expressions, starting at `<start>`. A terminal symbol has exactly one character (note that we disallow the empty string `''` as a terminal symbol). Secondly, as per traditional implementations, there can be only one expansion rule for the `<start>` symbol. We work around this restriction by simply constructing as many charts as there are expansion rules, and returning all parse trees.
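For reference, here is a sketch of such a grammar in the fuzzingbook dictionary format; the exact rules in the original notebook may differ slightly:

```python
grammar = {
    '<start>': [['<expr>']],
    '<expr>': [['<term>', '+', '<expr>'],
               ['<term>', '-', '<expr>'],
               ['<term>']],
    '<term>': [['<fact>', '*', '<term>'],
               ['<fact>', '/', '<term>'],
               ['<fact>']],
    '<fact>': [['<digits>'],
               ['(', '<expr>', ')']],
    '<digits>': [['<digit>', '<digits>'],
                 ['<digit>']],
    '<digit>': [[str(i)] for i in range(10)],
}
```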
Here is another grammar that targets the same language. Unlike the first grammar, this grammar produces ambiguous parse results.
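An ambiguous variant could look like the following sketch (the name `a_grammar` matches the one referenced later in this post; the exact rules may differ):

```python
a_grammar = {
    '<start>': [['<expr>']],
    '<expr>': [['<expr>', '+', '<expr>'],
               ['<expr>', '-', '<expr>'],
               ['<expr>', '*', '<expr>'],
               ['<expr>', '/', '<expr>'],
               ['(', '<expr>', ')'],
               ['<digits>']],
    '<digits>': [['<digit>', '<digits>'],
                 ['<digit>']],
    '<digit>': [[str(i)] for i in range(10)],
}
```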
Summary
An Earley parser executes the following steps for parsing:

- Use `<start>` as the entry into parsing. At this point, we want to parse the given string by the nonterminal `<start>`. The definition of `<start>` contains the possible expansion rules that can match the given string. Each expansion rule can be thought of as a parsing path, with contiguous substrings of the given input string matched by the particular terms in the rule.
- When given a nonterminal to match the string, the essential idea is to get the rules in the definition, and add them to the current set of parsing paths to try with the given string. Within the parsing path, we have a parsed index which denotes the progress of parsing that particular path (i.e., the point up to which the string has been recognized by that path, and any parents of this path). When a rule is newly added, this parsed index is set to zero.
- We next look at our set of possible parsing paths, and check if any of these paths start with a nonterminal. If one is found, then for that parsing path to be completed with the given string, that nonterminal has to be recognized first. So, we add the expansion rules corresponding to that nonterminal to the list of possible parsing paths. We do this recursively.
- Now, examine the current letter in the input. Then select all parsing paths that have that particular letter at the parsed index. These paths can now advance one step to the next index. We add such parsing paths to the set of parsing paths to try for the next character.
- While doing this, if any parsing path has finished parsing, fetch its corresponding nonterminal, and advance all parsing paths that have that nonterminal at the parsing index.
- Continue recursively until the parsing path corresponding to `<start>` has finished.
The chart parser depends on a chart (a table) for parsing. The columns correspond to the characters in the input string. Each column represents a set of states, and corresponds to the legal rules to follow from that point on.
Say we start with the following grammar:
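The grammar can be reconstructed from the chart states shown below; the name `sample_grammar` is illustrative:

```python
sample_grammar = {
    '<start>': [['<A>', '<B>']],
    '<A>': [['a', '<B>', 'c'],
            ['a', '<A>']],
    '<B>': [['b', '<C>'],
            ['<D>']],
    '<C>': [['c']],
    '<D>': [['d']],
}
```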
The Earley parser produces a table of possible parse paths at each letter index of the table. Given an input `adcd`, we seed column `0` with:

```
<start>: | <A> <B>
```

where the `|` represents the parsing index (also called the dot). This indicates that we are at the start, and the next step is to identify `<A>`. After this rule is processed, the column would contain two more states:

```
<A>: | a <B> c
<A>: | a <A>
```

which represent two parsing paths to complete `<A>`.
After processing of column `0` (which corresponds to the start of the parse), we would find the following in column `1` (which corresponds to the end of the parse for the literal `a`):

```
<A>: a | <B> c
<A>: a | <A>
<B>: | b <C>
<B>: | <D>
<A>: | a <B> c
<A>: | a <A>
<D>: | d
```
Similarly, the next column (column `2`, corresponding to `d`) would contain the following:

```
<D>: d |
<B>: <D> |
<A>: a <B> | c
```
Next, column `3`, corresponding to `c`, would contain:

```
<A>: a <B> c |
<start>: <A> | <B>
<B>: | b <C>
<B>: | <D>
<D>: | d
```
Finally, column `4` (`d`) would contain this at the end of processing:

```
<D>: d |
<B>: <D> |
<start>: <A> <B> |
```
This is how the table or the chart – from where the parsing gets its name: chart parsing – gets filled.
The Column Data Structure
The column contains a set of states. Each column corresponds to a character (or a token if tokens are used). Note that the states in a column correspond to the parsing expressions that apply once that character has been read. That is, the first column corresponds to the parsing expressions when no characters have been read.
The column allows for adding states, and checks to prevent duplication of states. Why do we need to prevent duplication? The problem is left recursion. We need to detect and curtail left recursion, which is indicated by non-unique states.
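Here is a minimal sketch of the `Column` class, assuming duplicate detection by the state's name, rule, dot, and start column (the notebook's implementation may differ in details):

```python
class Column:
    def __init__(self, index, letter):
        self.index, self.letter = index, letter
        self.states, self._unique = [], {}

    def add(self, state):
        # duplicate states are returned as-is; this curtails the
        # infinite loop that left recursion would otherwise cause
        key = (state.name, state.expr, state.dot, state.s_col.index)
        if key in self._unique:
            return self._unique[key]
        self._unique[key] = state
        self.states.append(state)
        state.e_col = self
        return state
```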
The State Data Structure
A state represents a parsing path (which corresponds to the nonterminal, and the expansion rule that is being followed) with the current parsed index. Each state contains the following:
- name: The nonterminal that this rule represents.
- expr: The rule that is being followed.
- dot: The point up to which parsing has happened in the rule.
- s_col: The starting column for this rule.
- e_col: The ending column for this rule.
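A sketch of the `State` class with these fields and its convenience methods:

```python
class State:
    def __init__(self, name, expr, dot, s_col, e_col=None):
        self.name, self.expr, self.dot = name, expr, dot
        self.s_col, self.e_col = s_col, e_col

    def __str__(self):
        rule = [str(t) for t in self.expr]
        return self.name + ' := ' + ' '.join(
            rule[:self.dot] + ['|'] + rule[self.dot:])

    def copy(self):
        return State(self.name, self.expr, self.dot, self.s_col, self.e_col)

    def finished(self):
        # the dot has moved past the last term of the rule
        return self.dot >= len(self.expr)

    def advance(self):
        # a new state with the dot moved one term forward
        return State(self.name, self.expr, self.dot + 1, self.s_col)

    def at_dot(self):
        # the term just after the dot, or None if finished
        return self.expr[self.dot] if self.dot < len(self.expr) else None
```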
The convenience methods `finished()`, `advance()` and `at_dot()` should be self explanatory. For example, for a state such as `<B>: | <D>`, the next symbol to be parsed (`at_dot()`) is `<D>`, and if we advance it, the dot moves past `<D>`, finishing the rule.
The Basic Parser Interface
We start with a bare minimum interface for a parser. It should allow one to parse a given text using a given nonterminal (which should be present in the grammar).
We now initialize the Earley parser, which is a `Parser`.
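A sketch of the interface and the subclass (the `log` keyword is an assumption, used later to trace the chart):

```python
class Parser:
    def __init__(self, grammar, log=False):
        self._grammar = grammar
        self.log = log

    def recognize_on(self, text, start_symbol):
        raise NotImplementedError

    def parse_on(self, text, start_symbol):
        raise NotImplementedError

class EarleyParser(Parser):
    """A chart parser; methods are added incrementally below."""
```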
Nonterminals Deriving Empty Strings
The Earley parser handles nullable nonterminals separately. A nullable nonterminal is a nonterminal that can derive an empty string. That is, at least one of its expansion rules must derive an empty string. An expansion rule derives an empty string if all of its tokens can derive the empty string. This means it can contain no terminal symbols (assuming we do not have zero-width terminal symbols), and all its nonterminal symbols must be able to derive the empty string.
In this implementation, we first initialize the list of first-level nullable nonterminals, i.e., those that directly contain an empty expansion and hence directly derive the empty string. Next, we remove any expansion rule that contains a terminal symbol, as such expansion rules cannot result in empty strings. Next, we take our current list of nullable nonterminals one at a time, and remove each from the remaining expansion rules. If any expansion rule becomes empty, the corresponding nonterminal is added to the nullable nonterminal list. This continues until all nullable nonterminals are processed.
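A sketch of this fixed-point computation (a standalone helper; the notebook may implement it as a method instead):

```python
def nullable_nonterminals(grammar):
    # first level: nonterminals with a directly empty expansion
    nullable = {key for key in grammar if [] in grammar[key]}
    changed = True
    while changed:
        changed = False
        for key in grammar:
            if key in nullable:
                continue
            # a rule derives the empty string if every one of its
            # tokens is a nullable nonterminal; terminals never qualify
            if any(all(t in nullable for t in rule) for rule in grammar[key]):
                nullable.add(key)
                changed = True
    return nullable

class EarleyParser(EarleyParser):
    def __init__(self, grammar, log=False):
        super().__init__(grammar, log)
        # cache the nullable set for use in predict() later
        self.epsilon = nullable_nonterminals(grammar)
```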
An example
Checking
The Chart Parser
Earley parser is a chart parser. That is, it relies on a table of solutions to smaller problems. This table is called a chart (hence the name of such parsers – chart parsers).
The Chart Construction
Here, we begin the chart construction by seeding the chart with columns representing the tokens or characters. Consider our example grammar again. The starting point is:

```
<start>: | <A> <B>
```

We add this state to `chart[0]` to start the parse. Note that the term after the dot is `<A>`, which will need to be recursively inserted into the column. We will see how to do that later.
Note: In traditional Earley parsing, the starting nonterminal always has a single expansion rule. However, in many cases you want to parse a fragment, and this restriction makes it cumbersome to use Earley parsing. Hence, we have opted to allow any nonterminal to be used as the starting nonterminal, irrespective of whether it has a single rule or not. Interestingly, this does not have an impact on the parsing itself, but only on the extraction of results. In essence, we seed all expansion rules of the current start symbol into the chart at column `0`. We will take care of that difference while building parse trees.
We seed our initial state in the example
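A sketch of the seeding step (`fill_chart()` is defined further below):

```python
class EarleyParser(EarleyParser):
    def chart_parse(self, tokens, start, alt):
        # one column per input token, plus column 0 for the start
        chart = [Column(i, tok) for i, tok in enumerate([None, *tokens])]
        # seed the chart with the start symbol and one expansion rule
        chart[0].add(State(start, tuple(alt), 0, chart[0]))
        return self.fill_chart(chart)
```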
Then, we complete the chart. The idea here is to process one character or one element at a time. At each character, we examine the current parse paths (states) and continue forward any parse path that successfully parses the letter. We process any state that is present in the current column in the following fashion.
There are three main methods we use: `predict()`, `scan()`, and `complete()`.
Predict
If, in the current state, the term after the dot is a nonterminal, `predict()` is called. It adds the expansion of the nonterminal to the current column.
If the term is nullable, then we simply advance the current state, and add that to the current column. This fix to the original Earley parsing was suggested by Aycock et al. [3].
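A sketch of `predict()`, assuming the nullable set computed earlier is stored in `self.epsilon`:

```python
class EarleyParser(EarleyParser):
    def predict(self, col, sym, state):
        # one new parsing path per expansion rule of sym
        for alt in self._grammar[sym]:
            col.add(State(sym, tuple(alt), 0, col))
        # the Aycock et al. fix: a nullable symbol can be skipped
        # without consuming any input
        if sym in self.epsilon:
            col.add(state.advance())
```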
If we look at our example, we have seeded the first column with `| <A> <B>`. Now, `fill_chart()` will find that the next term is `<A>` and call `predict()`, which will then add the expansions of `<A>`.
Next, we apply predict.
As you can see, the two rules of `<A>` have been added to the current column.
Scan
The `scan()` method is called if the next symbol in the current state is a terminal symbol. If the letter in the next column matches this terminal, `scan()` moves the dot one position, and adds the new state to that column.
For example, consider this state:

```
<B>: | b c
```

If we scan the next column’s letter, and that letter is `b`, then it matches the next symbol. So, we can advance the state by one symbol, and add it to the next column:

```
<B>: b | c
```
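A sketch of `scan()`:

```python
class EarleyParser(EarleyParser):
    def scan(self, col, state, letter):
        # advance into col only if its letter matches the terminal
        if letter == col.letter:
            col.add(state.advance())
```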
Here is our continuing example.
As you can see, the `state[1]` in `chart[0]` that was waiting for `a` has advanced one letter after consuming `a`, and has been added to `chart[1]`.
Complete
The `complete()` method is called if a particular state has finished the rule during execution. It first extracts the start column of the finished state, then, for all states in the start column that are not finished, finds the states that were parsing this current state (that is, we can go back to continue to parse those rules now). Next, it shifts their dot by one position, and adds them to the current column.
For example, say the states we have are:

```
<A>: a | <B> c
<B>: b c |
```

The state `<B>: b c |` is complete, and we need to advance any state that has `<B>` at the dot one index forward, giving `<A>: a <B> | c`.
How do we determine the parent states? During predict, we added the predicted child states to the same column as that of the inspected state. So, the states will be found in the starting column of the current state, with the same symbol at_dot as that of the name of the completed state.
We advance all such parents (producing new states) and add the new states to the current column.
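A sketch of `complete()` (named `earley_complete()` internally as well, since the Leo optimization later replaces the default):

```python
class EarleyParser(EarleyParser):
    def complete(self, col, state):
        return self.earley_complete(col, state)

    def earley_complete(self, col, state):
        # the parents were predicted into the column where this state
        # started, and are waiting for this state's nonterminal
        parent_states = [st for st in state.s_col.states
                         if st.at_dot() == state.name]
        for st in parent_states:
            col.add(st.advance())
```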
Here is our example. We start parsing `ad`. So, we have three columns. Next, we populate column 1, which corresponds to the letter `a`.
You can see that the two states are waiting on `<A>` and `<B>` respectively at `at_dot()`. Hence, we run predict again to add the corresponding rules of `<A>` and `<B>` to the current column.
As you can see, we have a list of states that are waiting for `b`, `a` and `d`.
Our next letter is `d`. We scan it to populate column `2`. As we expected, only `<D>` could advance to the next column (`chart[2]`) after reading `d`. Finally, we use complete, so that we can advance the parents of the `<D>` state above.
As you can see, that led to `<B>` being complete, and since `<B>` is complete, `<A>` also becomes complete.
Filling The Chart
In the below algorithm, whenever `at_dot()` is at a nonterminal symbol, the expansion rules of that nonterminal are added to the current column (`predict()`), since each rule represents one valid parsing path. If, on the other hand, `at_dot()` indicates that processing finished for that nonterminal, we look up the parent symbols and advance their parsing state (`complete()`). If we find that we are at a terminal symbol, we simply check if the current state can advance by parsing the next character (`scan()`).
We can now recognize the given string as part of the language represented by the grammar.
The chart above only shows completed entries. The parenthesized expression indicates the column just before the first character was recognized, and the ending column. Notice how the `<start>` nonterminal shows the dot at the end. That is, it has been fully parsed.
Derivation trees
We use the following procedures to translate the parse forest to individual trees.
parse_prefix
Here is an example of using it.
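A sketch of what `parse_prefix()` could look like, assuming it returns the last recognized position together with the completed states:

```python
class EarleyParser(EarleyParser):
    def parse_prefix(self, text, start_symbol):
        # one chart per expansion rule of the start symbol
        tables = [self.chart_parse(text, start_symbol, alt)
                  for alt in self._grammar[start_symbol]]
        # search from the longest prefix down
        for col_index in range(len(text), -1, -1):
            states = [st for chart in tables
                      for st in chart[col_index].states
                      if st.finished() and st.name == start_symbol
                      and st.s_col.index == 0]
            if states:
                return col_index, states
        return -1, []

ep = EarleyParser(sample_grammar)
cursor, states = ep.parse_prefix('adcd', '<start>')
print(cursor)  # 4: the complete input was recognized
```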
parse_on
Our `parse_on()` method is slightly different from usual Earley implementations in that we accept any nonterminal symbol, not just nonterminal symbols with a single expansion rule. We accomplish this by computing a different chart for each expansion.
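A sketch of `parse_on()` (the helpers `_parse_forest()` and `extract_trees()` are described next):

```python
class EarleyParser(EarleyParser):
    def parse_on(self, text, start_symbol):
        # compute a separate chart for each expansion of the start symbol
        for alt in self._grammar[start_symbol]:
            chart = self.chart_parse(text, start_symbol, alt)
            for st in chart[-1].states:
                if (st.finished() and st.name == start_symbol
                        and st.s_col.index == 0):
                    forest = self._parse_forest(chart, st)
                    yield from self.extract_trees(forest)
```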
parse_paths
The `parse_paths()` method tries to unify the given expression in `named_expr` with the parsed string. For that, it extracts the last symbol in `named_expr` and checks if it is a terminal symbol. If it is, then it checks the chart at `til` to see if the letter corresponding to the position matches the terminal symbol. If it does, we extend our start index by the length of the symbol.

If the symbol was a nonterminal symbol, then we retrieve the parsed states at the current end column index (`til`) that correspond to the nonterminal symbol, and collect their start indexes. These are the end column indexes for the remaining expression.

Given our list of start indexes, we obtain the parse paths from the remaining expression. If we can obtain any, then we return the parse paths. If not, we return an empty list.
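A sketch following this description; the `'t'`/`'n'` tags mark whether a path element is a terminal or a nonterminal state:

```python
class EarleyParser(EarleyParser):
    def parse_paths(self, named_expr, chart, frm, til):
        def paths(state, start, k, e):
            if not e:  # nothing left to match: we must be back at frm
                return [[(state, k)]] if start == frm else []
            return [[(state, k)] + r
                    for r in self.parse_paths(e, chart, frm, start)]

        *expr, var = named_expr
        if var not in self._grammar:   # a terminal: match the letter
            starts = ([(var, til - len(var), 't')]
                      if til > 0 and chart[til].letter == var else [])
        else:                          # finished states ending at til
            starts = [(s, s.s_col.index, 'n') for s in chart[til].states
                      if s.finished() and s.name == var]
        return [p for s, start, k in starts
                for p in paths(s, start, k, expr)]
```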
Example
That is, the parse path for `<start>` given the input `adcd` included recognizing the expression `<A><B>`. This was recognized by the two states: `<A>` from input(0) to input(2), which further involved recognizing the rule `a<B>c`, and the next state `<B>` from input(3), which involved recognizing the rule `<D>`.
parse_forest
The `parse_forest()` method takes the states which represent completed parses, and determines the possible ways that their expressions corresponded to the parsed string. As we noted, it is here that we take care of multiple expansion rules for the start symbol. (The `_parse_forest()` accepts a single state, and is the main driver; it corresponds to the traditional implementation.)
For example, say we are parsing `1+2+3`, and the state has `[<expr>,+,<expr>]` in `expr`. It could have been parsed as either `[{<expr>:1+2},+,{<expr>:3}]` or `[{<expr>:1},+,{<expr>:2+3}]`.
Example
extract_trees
We show how to extract a single tree first, and then generalize it to all trees.
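A sketch of a single-tree extractor that always picks the first path:

```python
class EarleyParser(EarleyParser):
    def extract_a_tree(self, forest_node):
        name, paths = forest_node
        if not paths:
            return (name, [])
        # pick the first path; each element expands into a subtree
        return (name, [self.extract_a_tree(self.forest(*p))
                       for p in paths[0]])

    def extract_trees(self, forest):
        yield self.extract_a_tree(forest)
```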
We need a way to display parse trees.
Displaying the tree
Example
Ambiguous Parsing
Ambiguous grammars can produce multiple derivation trees for some given string.
In the above example, the `a_grammar` can parse `1+2+4` as either `[1+2]+4` or `1+[2+4]`.
That is, we need to extract all derivation trees.
We enhance our `extract_trees()` as below.
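A sketch of the generalized extractor, which enumerates every combination of subtrees along every path:

```python
import itertools

class EarleyParser(EarleyParser):
    def extract_trees(self, forest_node):
        name, paths = forest_node
        if not paths:
            yield (name, [])
            return
        for path in paths:
            # all combinations of the subtrees along this path
            ptrees = [list(self.extract_trees(self.forest(*p)))
                      for p in path]
            for p in itertools.product(*ptrees):
                yield (name, list(p))
```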
Example
Using the same example,
Almost Infinite Parse Trees
There is a problem with our `extract_trees()` method. The issue is that it is too eager. The parse forest can have an infinite number of trees, and at this time we effectively try to extract all of them at the same time. So, in the case of such grammars, our `extract_trees()` will fail. Here are two example grammars.
An example run.
The problem is that our implementation of `extract_trees()` is eager. That is, it attempts to extract all inner parse trees before it can construct the outer parse tree. When there is a self reference, this results in unbounded recursion. Here is a simple extractor that avoids this problem. The idea here is that we randomly and lazily choose a node to expand, which avoids the infinite recursion.
At this point, we also need a simple way to collapse the derivation tree to the original string.
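A sketch of such a collapse, assuming nonterminal names are bracketed as in `<expr>`:

```python
def tree_to_string(tree):
    symbol, children = tree
    if children:
        return ''.join(tree_to_string(c) for c in children)
    # a leaf: terminals contribute their symbol, nonterminals
    # that derived the empty string contribute nothing
    return '' if symbol.startswith('<') and symbol.endswith('>') else symbol
```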
Indirect reference
However, `SimpleExtractor` has a problem. The issue is that, since we rely on randomness for exploration, it gives no guarantees on the uniqueness of the returned trees. Hence, we need a way to keep track of the explored paths. Our next class, `EnhancedExtractor`, can do that. In `EnhancedExtractor`, different exploration paths form a tree of nodes.
First we define a data-structure to keep track of explorations.
- `_chosen` contains the current choice
- `next` holds the next choice done using `_chosen`
- `total` holds the total number of choices for this node
Initialization of the data-structure in the constructor.
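A sketch of the `ChoiceNode`:

```python
class ChoiceNode:
    def __init__(self, parent, total):
        self._p, self._chosen = parent, 0
        self._total, self.next = total, None

    def chosen(self):
        return self._chosen

    def increment(self):
        # move to the next choice; on exhaustion, bubble up
        self.next = None
        self._chosen += 1
        if self._chosen >= self._total:
            if self._p is None:
                return None       # all choices everywhere exhausted
            self._p = self._p.increment()
            if self._p is None:
                return None
            self._chosen = 0
        return self
```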
Given an array and a choice node, `choose_path()` returns the element in the array corresponding to the next choice node if it exists, or produces a new choice node and returns that element.
While extracting, we have a choice. Should we allow infinite forests, or should we have a finite number of trees with no direct recursion? A direct recursion is when there exists a parent node with the same nonterminal that parsed the same span. We choose here not to extract such trees. They can be added back after parsing.
This is a recursive procedure that inspects a node, extracts the path required to complete that node. A single path (corresponding to a nonterminal) may again be composed of a sequence of smaller paths. Such paths are again extracted using another call to extract_a_node() recursively.
What happens when we hit one of the node recursions we want to avoid? In that case, we return the current choice node, which bubbles up to `extract_a_tree()`. That procedure increments the last choice, which in turn increments up the parents until we reach a choice node that still has options to explore.

What if we hit the end of choices for a particular choice node (i.e., we have exhausted the paths that can be taken from a node)? In this case also, we return the current choice node, which bubbles up to `extract_a_tree()`. That procedure increments the last choice, which bubbles up to the next choice that has some unexplored paths.
The `extract_a_tree()` is a depth-first extractor of a single tree. It tries to extract a tree, and if the extraction returns None, it means that a particular choice was exhausted, or we hit a recursion. In that case, we increment the choice, and explore a new path.
Note that the `EnhancedExtractor` only extracts nodes that are not directly recursive. That is, if it finds a node with a nonterminal that covers the same span as that of a parent node with the same nonterminal, it skips the node.
Leo Optimizations
One of the problems with the original Earley parser is that, while it can parse strings using arbitrary context-free grammars, its performance on right-recursive grammars is quadratic. That is, it takes \(O(n^2)\) runtime and space for parsing with right-recursive grammars. For example, consider the parsing of the following string by two different grammars, `LR_GRAMMAR` and `RR_GRAMMAR`.
Here is our input string.

To see the problem, we need to enable logging. Here is the logged version of parsing with the `LR_GRAMMAR`.
As can be seen from the parsing log for each letter, the number of states with representation `<A>: a <A> | (i, j)` increases at each stage, and these are simply left over from the previous letter. They do not contribute anything more to the parse other than to simply complete these entries. However, they take up space, and require resources for inspection, contributing a factor of \(n\) in the analysis.
Joop Leo [2] found that this inefficiency can be avoided by detecting right recursion. The idea is that, before starting the completion step, we check whether the current item has a deterministic reduction path. If such a path exists, we add a copy of the topmost element of the deterministic reduction path to the current column, and return. If not, we perform the original completion step.

Definition: An item is said to be on the deterministic reduction path above \([A \rightarrow \gamma \cdot, i]\) if it is \([B \rightarrow \alpha A \cdot, k]\) with \([B \rightarrow \alpha \cdot A, k]\) being the only item in \(I_i\) with the dot in front of \(A\), or if it is on the deterministic reduction path above \([B \rightarrow \alpha A \cdot, k]\). An item on such a path is called topmost if there is no item on the deterministic reduction path above it [2].
Finding a deterministic reduction path works as follows:

Given a complete state, represented by `<A> : seq_1 | (s, e)`, where `s` is the starting column for this rule and `e` the current column, there is a deterministic reduction path above it if two constraints are satisfied:

1. There exists an item of the form `<B> : seq_2 | <A> (k, s)` in column `s`.
2. That item is the only item in `s` with the dot in front of `<A>`.

The resulting item is of the form `<B> : seq_2 <A> | (k, e)`, which is simply the item from (1) advanced, and is considered above `<A>:.. (s, e)` in the deterministic reduction path. Here, `seq_1` and `seq_2` are arbitrary symbol sequences.

This forms a chain of links, with `<A>:.. (s_1, e)` being the child of `<B>:.. (s_2, e)`, and so on. Here is one way to visualize the chain:
```
<C> : seq_3 <B> | (s_3, e)
  |  constraints satisfied by <C> : seq_3 | <B> (s_3, s_2)
<B> : seq_2 <A> | (s_2, e)
  |  constraints satisfied by <B> : seq_2 | <A> (s_2, s_1)
<A> : seq_1 | (s_1, e)
```
Essentially, what we want to do is to identify potential deterministic right recursion candidates, perform completion on them, and throw away the result. We do this until we reach the top. See Grune et al. [4] for further information.
Note that the completions are all in the same column (`e`), with each candidate's constraints satisfied in earlier and earlier columns (as shown below):
```
<C> : seq_3 | <B> (s_3, s_2)  -->  <C> : seq_3 <B> | (s_3, e)
            |
<B> : seq_2 | <A> (s_2, s_1)  -->  <B> : seq_2 <A> | (s_2, e)
            |
<A> : seq_1 | (s_1, e)
```
Following this chain, the topmost item is the item `<C>:.. (s_3, e)` that does not have a parent. This topmost item needs to be saved; Leo calls it a transitive item, and it is associated with the nonterminal symbol that started the lookup. The transitive item needs to be added to each column we inspect.
Here is the skeleton for the parser `LeoParser`. We first save our original `complete()`.
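A sketch of the skeleton; `deterministic_reduction()` is defined below:

```python
class LeoParser(EarleyParser):
    def complete(self, col, state):
        return self.leo_complete(col, state)

    def leo_complete(self, col, state):
        detred = self.deterministic_reduction(state)
        if detred is not None:
            # add a copy of the topmost transitive item instead of
            # walking the whole right-recursive completion chain
            col.add(detred.copy())
        else:
            self.earley_complete(col, state)
```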
Next, we update our `Column` class with the ability to add transitive items. Note that, while Leo asks for the transitive item to be added to the set \(I_k\), there is no actual requirement for the transitive states to be added to the states list. The transitive items are only intended for memoization and not for the `fill_chart()` method. Hence, we track them separately.
Remember the picture we drew of the deterministic path?

```
<C> : seq_3 <B> | (s_3, e)
  |  constraints satisfied by <C> : seq_3 | <B> (s_3, s_2)
<B> : seq_2 <A> | (s_2, e)
  |  constraints satisfied by <B> : seq_2 | <A> (s_2, s_1)
<A> : seq_1 | (s_1, e)
```
We define a function `uniq_postdot()` that, given the item `<A> := seq_1 | (s_1, e)`, returns a `<B> : seq_2 | <A> (s_2, s_1)` that satisfies the constraints mentioned in the above picture.
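A sketch of `uniq_postdot()`:

```python
class LeoParser(LeoParser):
    def uniq_postdot(self, st_A):
        col_s1 = st_A.s_col
        # states in the start column waiting on <A>
        parent_states = [s for s in col_s1.states
                         if s.expr and s.at_dot() == st_A.name]
        if len(parent_states) > 1:
            return None   # not deterministic
        # the dot must be just before the last symbol, so that
        # advancing over <A> completes the rule
        matching = [s for s in parent_states if s.dot == len(s.expr) - 1]
        return matching[0] if matching else None
```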
We next define the function `get_top()`, the core of deterministic reduction, which gets the topmost state above the current state `<A>`.
Once we have the machinery in place, `deterministic_reduction()` itself is simply a wrapper to call `get_top()`.
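The wrapper:

```python
class LeoParser(LeoParser):
    def deterministic_reduction(self, state):
        return self.get_top(state)
```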
Now, both LR and RR grammars should work within \(O(n)\) bounds.
Examples
We verify that our parser works correctly on `LR_GRAMMAR` too.
We have fixed the complexity bounds. However, because we are saving only the topmost item of a right recursion, we need to fix our parser to be aware of our fix while extracting parse trees.
We first change the definition of `add_transitive()` so that results of deterministic reduction can be identified later. We also need a `back()` method to create the constraints, and we update `copy()` to make `TState` items instead.
We now modify the `LeoParser` to keep track of the chain of constraints that we mentioned earlier. Next, we update `uniq_postdot()` so that it tracks the chain of links.
We next define a method `expand_tstate()` that, when given a `TState`, generates all the intermediate links that we threw away earlier for a given end column.
We define a `rearrange()` method to generate a reversed table where each column contains states that start at that column.
Here is the rearranged table.
We save the result of `rearrange()` before going into `parse_forest()`.
Finally, during `parse_forest()`, we first check to see if a state is a transitive state, and if it is, we expand it to the original sequence of states using `traverse_constraints()`. This completes our implementation of `LeoParser`.
Parse Examples
Now, this is still somewhat slow. Why is that? Note that recognition is \(O(n^2)\) and actual parsing is \(O(n^3)\). That is, using `parse_prefix()` to check whether a text can be parsed by a given grammar will be much faster than extracting a parse tree. A second issue is that we are building this on Python implemented on top of WASM. Python on its own is fairly slow. In our experiments, translating the Earley parser to Java line by line resulted in an improvement of over 300 times.
Artifacts
The runnable Python source for this notebook is available here.
The installable Python wheel `earleyparser` is available here.
- [1] Earley, Jay. “An efficient context-free parsing algorithm.” Communications of the ACM 13.2 (1970): 94-102.
- [2] Leo, Joop M.I.M. “A general context-free parsing algorithm running in linear time on every LR(k) grammar without using lookahead.” Theoretical Computer Science 82.1 (1991): 165-176.
- [3] Aycock, John, and R. Nigel Horspool. “Practical Earley parsing.” The Computer Journal 45.6 (2002): 620-630.
- [4] Grune, Dick, and Ceriel J.H. Jacobs. “Introduction to Parsing.” Parsing Techniques. Springer, New York, NY, 2008. 61-102.