Important:Pyodide takes time to initialize.
Initialization completion is indicated by a red border around Run all button.
TLDR; This tutorial is a complete implementation of some of the
shift-reduce parsers in Python. We build LR(0) parser, SLR(1) Parser and the
canonical LR(1) parser, and show how to extract the parse trees.
Python code snippets are provided throughout so that you can
work through the implementation steps.
A shift-reduce parser is a bottom-up parser that works by shifting input
symbols on to a stack, run matching procedures to identify what is on the
stack, and attempts to reduce the symbols on the top of the stack to
other symbols based on the matched production rules.
LR parsers are a class of bottom-up shift-reduce parsers^{1}
that rely on constructing an automaton called LR-Automata (typically a parsing
table) for their operation.
The L in the LR parsing stands for scanning
the input left-to-right, and the R stands for constructing a rightmost
derivation. This contrasts with LL parsers
which are again left-to-right but construct the leftmost derivation.
(Intuitively, LL parsers process and output the parse tree with pre-order
traversal (root, left, right) where as LR
outputs post order traversal, (left, right, root).)
Such parsers are shift-reduce parsers because the operation of the LR parser
is to repeatedly shift an input symbol (left-to-right) into the stack,
and based on the parsing-table, decide to either reduce the stack or shift
more input symbols on to the stack. The parsing table, in this case, is
constructed based on the production rules of the grammar.
The Right-most in this case refers to the nonterminal that the parsing
strategy decides to rewrite next. That is, in right-most strategy, the symbols
on the top of the stack are reduced first.
To illustrate this, consider the following grammar:
In the case of leftmost derivation, the sentence 0 + 1 * 1 would be parsed
as follows, starting from <E>. The period (.) shows the current parse
point in the rule, and the pipe (|) indicates the input stream. We show the
top down parsing. We assume here that the parser simply uses the first rule
to parse, and uses a lookahead when necessary. See my LL(1) post for more
advanced techniques.
As you can see, for top-down parsing, we needed to sort of unwrap the
nonterminals from the start symbol before starting to match. Hence, we
are forced to unwrap the leftmost nonterminal (if the next symbol in the
production is a nonterminal, otherwise, simply match with the next input
symbol) to match with the next input symbol from the stream.
That is, the leftmost nonterminal is always rewritten first.
Below is the bottom-up right most parsing. We will be discussing how the
rules are chosen later in this post.
As you can see, we construct nonterminals from the symbols on the stack.
If the current token that was just shifted matches a nonterminal, then we
rewrite the stack top with that nonterminal. So, the stack top is always
considered for next action. Since we consider the stack to be the
representation of a partially parsed rule, and the top of the stack is the
right most part of the parsed rule, we say it is rightmost first.
Definitions
For this post, we use the following terms:
The alphabet is the set all of symbols in the input language. For example,
in this post, we use all ASCII characters as alphabet.
A terminal is a single alphabet symbol. Note that this is slightly different
from usual definitions (done here for ease of parsing). (Usually a terminal is
a contiguous sequence of symbols from the alphabet. However, both kinds of
grammars have a one to one correspondence, and can be converted easily.)
For example, x is a terminal symbol.
A nonterminal is a symbol outside the alphabet whose expansion is defined
in the grammar using rules for expansion.
For example, <term> is a nonterminal in the below grammar.
A rule (also production rule) is a finite sequence of terms (two types
of terms: terminals and nonterminals) that describe an expansion of a given
terminal. A rule is also called an alternative expansion.
For example, [<term>+<expr>] is one of the expansion rules of the nonterminal <expr>.
A definition is a set of rules that describe the expansion of a given nonterminal.
For example, [[<digit>,<digits>],[<digit>]] is the definition of the nonterminal <digits>
A context-free grammar is composed of a set of nonterminals and
corresponding definitions that define the structure of the nonterminal.
A terminalderives a string if the string contains only the symbols in the
terminal. A nonterminal derives a string if the corresponding definition
derives the string. A definition derives the string if one of the rules in
the definition derives the string. A rule derives a string if the sequence
of terms that make up the rule can derive the string, deriving one substring
after another contiguously (also called parsing).
A derivation tree is an ordered tree that describes how an input string is
derived by the given start symbol. Also called a parse tree.
A derivation tree can be collapsed into its string equivalent. Such a string
can be parsed again by the nonterminal at the root node of the derivation
tree such that at least one of the resulting derivation trees would be the
same as the one we started with.
An epsilon rule matches an empty string, and an epsilon transition is a
transition that does not consume any tokens.
A recognizer checks whether the input string can be matched by a given
grammar. That is, given the starting nonterminal symbol, can any set of
expansions result in the given input string?
A parser is a recognizer that additionally returns corresponding
parser trees for the given input string and grammar.
A parse table is a representation of the LR automation that is derived
from the production rules of the grammar.
A DFA ([deterministic-finite-automaton[(https://en.wikipedia.org/wiki/Deterministic_finite_automaton)) is a state machine of the
representation of a state machine that consumes input symbols to decide
which state to shift to.
A NFA (nondeterministic-finite-automaton) is a state machine that allows
both epsilon transitions as well as transitions to multiple states for the
same symbol.
Prerequisites
As before, we start with the prerequisite imports.
Note: The following libraries may need to be installed separately.
Available Packages
These are packages that refer either to my previous posts or to pure python
packages that I have compiled, and is available in the below locations. As
before, install them if you need to run the program directly on the machine.
To install, simply download the wheel file (`pkg.whl`) and install using
`pip install pkg.whl`.
Since this notebook serves both as a web notebook as well as a script
that can be run on the command line, we redefine canvas if it is not
defined already. The __canvas__ function is defined externally when it is
used as a web notebook.
Importing the fuzzer for a few simple utilities.
We use the display_tree() method in earley parser for displaying trees.
Pydot is needed for drawing
As before, we use the fuzzingbook grammar
style. For example, given below is a simple grammar for nested parentheses.
Equivalently,
<P> := '(' <P> ')'
| '(' <D> ')'
<D> := 0 | 1
We have seen LL parsers
in the previous posts, culminating with the
GLL parser.
The main difference between an LR parser and an LL parser is that an LL
parser uses the current nonterminal and the next symbol to determine the
production rule to apply. That is, starting from the top-most, that is the
start symbol, it recursively expands the rules until it finds a rule that
starts with the token that was currently read in from the stream.
In contrast, the LR parser acts bottom up. It starts by trying to recognize
bottom most tokens, and push the recognized tokens into the stack, recursively
building more higher level tokens.
We say that the LR parser uses the
viable-prefix and the next symbol to determine the next action to
take. The viable prefix is the portion of the input string that has been
recognized so far. This recognition is accomplished by the LR automaton,
which we describe next.
Let us consider the following grammar
E -> ( D + E )
E -> D
D -> 1
To apply LR parsing to it, we need a starting definition such that the
start symbol has only one production rule as its definition. So, we add
a production rule (called augmentation) to conform.
S` -> E
For a naive translation of the grammar into the automata,
we start with the starting production rule
S' -> E. Since we are starting to parse, let us indicate the parse point
also with .'. This would be before E is parsed. That is, S’ -> .E.
That is, the period (.’) represents the current parse point.
If we somehow parsed E now, then we would transition to an accept state.
This is represented by S' -> E.. However, since the transition is a
nonterminal, it can’t happen by reading
the corresponding symbol E from the input stream. It has to happen through
another path. Hence, we indicate this by a dashed line. Next, when the parse
is at S' -> .E, any of the expansions of E can now be parsed. So, we add
each expansion of E as \(\epsilon\) transition away. These are
E := . ( D + E ) and E := . D. Continuing in this fashion, we have:
Notice that this state machine is a nondeterministic finite automaton (NFA).
That is, we have epsilon transitions. Furthermore the
NFA is not complete. For example, what happens when 6. D := 1 .
is complete? Then, the parse needs to transition to a state that has just
completed D. For example, 8. E := ( D . + E ) or 7. E := D .. Here is how
it looks like.
As before, the dashed arrows represent non-terminal transitions that are
actually completed through other paths. The red arrows represent reductions.
You will notice that there can be multiple reductions from the same
item. At this point, this NFA over-approximates our grammar.
The right reduction needs to be chosen based on the prefix so far, and we
will see how to do this later.
Building the NFA
Let us try and build these dynamically.
We first build an NFA of the grammar. For that, we begin by adding a new
state <> to grammar.
Augment Grammar with Start
A sample grammar.
Test
A utility procedure to extract all symbols in a grammar.
Test
The State Data-structure
For building an NFA, all we need is to start with start item, and then
recursively identify the transitions. First, we define the state
data structure.
A state for an NFA is simply a production rule and a parse position.
It can be tested this way
The NFA class
Next, we build an NFA out of a given grammar. An NFA is composed of
different states connected together by transitions.
NFA Initialization routines
NFA Start State (create_start)
Let us test this.
We can list the grammar production rules as follows:
Let us use the grammar.
Advance the state of parse by one token
Let us test this.
NFA find_transitions
Next, given a state, we need to find all other states reachable from it.
The first one is simply a transition from the current to the next state
when moving the parsing point one location. For example,
'<S>::= <A> | <B>'
is connected to the below by the key <A>.
'<S>::= <A> <B> |'
Note that we are building an NFA. Hence, epsilon
transitions are allowed. This is what we use when we are
starting to parse nonterminal symbols. For example, when
we have a state with the parsing just before <B>, for e.g.
'<S>::= <A> | <B>'
Then, we add a new state
'<A>::= | a',
and connect this new state to the previous one with an \(\epsilon\)
transition.
Combining both procedures and processing the state itself for its transitions.
If the dot is before a symbol, then we add the transition to the
advanced state with that symbol as the transition. If the key at
the dot is a nonterminal, then add all expansions of that nonterminal
as epsilon transfers.
Let us test this.
Defining a utility method. This is only used to display the graph.
Given a key, we want to get all states where this key has just been parsed.
We use this method to identify where to go back to, after parsing a specific
key.
Let us test this.
NFA build_nfa
We can now build the complete NFA.
Testing the build_nfa
Constructing a graph
To display the NFA (and DFAs) we need a graph. We construct this out of the
table we built previously.
Viewing the NFA
LR0 Automata
An LR(0) automaton is composed of multiple states, and each state represents a set
of items that indicate the parsing progress. The states are connected together
using transitions which are composed of the terminal and nonterminal symbols
in the grammar.
To construct the LR(0) automaton, we start with the initial state containing the
augmented start symbol (if necessary), and we apply closure to expand the
context. For the closure, we simply merge all epsilon transitions to the current
item.
Closure
A closure represents all possible parse paths at a given
point. The idea is to look at the current parse progress; Identify any
nonterminals that need to be expanded next, and add the production rules of that
nonterminal to the current item, with parse point (dot) at the beginning.
For example, given the first state, where * represent the parse progress
<S`> := * <E>
Applying closure, we expand <E> further.
<S`> := * <E>
<E> := * ( <D> + <E> )
<D> := * 1
No more nonterminals to expand. Hence, this is the closure of the first state.
Consider what happens when we apply a transition of ( to this state.
<E> := ( * <D> + <E> )
<D> := * 1
Here is how we can compute a closure
Test it.
This gives us the following graph with each closure, and the transitions indicated. Note that
the nonterminal transitions are dashed.
This is the basic automaton. However, you may notice that there are two types
of nodes in this diagram. The first one represents partial parses which
contain the dot at a position other than the end, and the second one
represents a complete parse of a rule with the dot at the end. You will also
note that the complete parse nodes seem to have red outgoing arrows, and
at least in one, multiple red outgoing arrows. That is, it is not a true
DFA. The next state to transition to is actually chosen based on the path
the input string took through the DFA with the help of a stack.
Compiling DFA States
So, how should we represent these states? If you look at state 0, it would
be possible to represent it as a procedure as below. We first save in to the
stack the current state, then extract the current symbol, and depending on
whether the symbol is one of the expected, we shift to the next state.
Note that we ignore the blue dashed arrows (nonterminal symbols) because
they were included just to indicate that the transition happens by another
path. Another note is that each state represents one row of the automation
table.
This is the first reduction state. In a reduction state, what we do is to
look at the stack if the corresponding rule was popped off the stack.
From a reduction state, we can transition to any state that has just
parsed the RHS (head) of the reduction.
Note that while the above does not contain multiple reductions, it is possible
that a state can contain multiple reductions on more complex (e.g. LR(1))
grammars. But otherwise, the general parsing is as above.
Building the DFA
Given a grammar, we will next consider how to build such a DFA.
For DFA, unlike an NFA, a state is no longer a single item. So, let us define
item separately.
Item
DFAState
A DFAState contains many items.
LR(0) DFA
We define our DFA initialization. We also define two utilities for
creating new items and DFA states.
DFA Compute the closure
We have discussed computing the closure before. The idea is to identify any
nonterminals that are next to be parsed, and recursively expand them, adding
to the items.
DFA Start State (create_start)
The start item is similar to before. The main difference is that
rather than returning multiple states, we return a single state containing
multiple items.
Let us test this.
Advance the state of parse by the given token
Unlike in NFA, where we had only one item, and hence, there
was only one advancing possible, here we have multiple items.
Hence, we are given a token by which to advance, and return
all items that advance by that token, advanced.
Let us test this.
DFA find_transitions
Next, we define the transitions. Unlike in the case of NFA where we had only a
single item, we have multiple items. So, we look through each possible
token (terminals and nonterminals)
Let us test this.
add_reduce
This is different from NFA because we are iterating over items, but we need
to add reduction to the dfastate
Similarly the graph related utilities also change a bit.
LR0DFA build_dfa
Bringing all these together, let us build the DFA. (Compare to build_nfa).
Let us test building the DFA.
Let us try graphing
LR0Recognizer
We can now provide the complete parser that relies on this automata.
Testing it.
LR0Parser
We’ll next implement the LR(0) parser, which includes parse tree extraction.
Parse tree extraction involves building a tree structure that represents the
syntactic structure of the input string according to the grammar rules.
Now, let us build parse trees
Notice that we have used a quite simple grammar. For reference, this
was our g1 grammar.
E -> ( D + E )
E -> D
D -> 1
The interesting fact about this grammar was that
you could look at the current symbol, and decide which of these rules
to apply. That is, if the current symbol was ( then rule 0 applies,
and if the symbol was 1, then rule 1 applies.
What if you have a grammar where that is impossible?
Here is one such grammar
E -> D + E
E -> D
D -> 1
As you can see, it is no longer clear which rule of <E> to apply when we
have a <D> parsed. To decide on such cases, we need to go up one level
complexity.
Let us build the parse table.
As you can see, on State 2, we have two possible choices – s4 and r:1.
This is called a shift/reduce conflict. The issue is that when we come to
state 2, that is.
<E>::= <D> | + <E>
<E>::= <D> |
We have two possible choices. We can either reduce to <E> or shift +.
To determine which one to act upon, we need a lookahead. If the
next token is +, then we should shift it to stack. If not,
we should reduce to <E>. This is what SLR parsers do.
SLR1 Automata
SLR(1) parsers, or Simple LR(1) parsers, are an improvement over LR(0) parsers.
They use lookahead to resolve some conflicts that occur in LR(0) parsers.
A lookahead is the next token in the input that hasn’t been processed yet.
By considering this lookahead token, SLR(1) parsers can make more informed
decisions about which production to use when reducing.
For using SLR1 automation, we require first and follow sets. This has been
discussed previously. Hence, providing the code here directly.
First and follow
The definition is as follows.
Testing it.
Building the DFA
DFA Initialization routines
Next, we build the dfa. There is only one change. (See CHANGED)
When reducing, we only reduce if the token lookahead is in the follow set.
Let us see if it works.
We can also view it as before.
SLR1Parser
There is no difference in the parser.
Let us try parsing with it.
But is this enough? Can we parse all useful grammars this way?
Consider this grammar.
S -> E + T
S -> T
E -> + T
E -> 1
T -> E
Let us see if it works.
You will notice a conflict in State 3. [‘s8’, ‘r:4’]
The question is whether to shift +
and go to State 8, or to reduce with rule r:4.
To resolve this, we need the full LR(1) parser.
LR1 Automata
LR(1) parsers, or Canonical LR(1) parsers, are the most powerful in the LR parser family.
They are needed when SLR(1) parsers are not sufficient to handle certain complex grammars.
LR(1) parsers differ from SLR(1) parsers in that they incorporate lookahead information
directly into the parser states, allowing for even more precise parsing decisions.
Building the DFA
LR1Item
The LR1 item is similar to the Item, except that it contains a lookahead.
This also is the most important difference between LR(0) and SLR(1) on one
hand and LR(1) on the other. SLR uses LR(0) items which mean exactly one item
per production rule + parse dot. However, in the case of LR(1) you can have
multiple items with the same LR(0) core–that is, production rule and parse
point–but with different lookahead. One may ask, what if use the LR(0) items
but add possible lookaheads as extra information to it? This gets you LALR(1).
LR1DFA class
We also need update on create_item etc to handle the lookahead.
The advance_item is called from advance(item) which does not require
changes.
The compute_closure now contains the lookahead.
This is the biggest procedure change from LR0DFA.
The main difference is in computing the lookahead.
The lines added and modified from LR0DFA are indicated in the procedure.
Here, say we are computing the closure of A -> Alpha . <B> <Beta>.
Remember that when we create new items for closure, we have to provide it with
a lookahead.
So, To compute the closure at <B>, we create items with lookahead which are
characters that can follow <B>. This need not be just the first(<Beta>)
but also what may follow <Beta> if <Beta> is nullable. This would be the
lookahead of the item A -> Alpha . <B> <Beta> which we already have, let us
say this is l. So, we compute first(<Beta> l) for lookahead.
LR1DFA building DFA
This method create_start changes to include the ‘$’ as lookahead.
Another major change, we no longer add a reduction to all follows of item.name
instead, we restrict it to just item.lookahead.
Note. A possible confusion here is about the treatment of lookaheads,
in add_reduction. In LR0DFA.add_reduction, we add a reduction link for
all terminal symbols to each item. In SLR1DFA.add_reduction, we add a
reduction link for all terminal symbols that are in the follow(item.name)
Here, we may hence expect the lookaheads to be a subset of follow(item.name)
and to add the reduction for each lookahead of a given item.
The difference is in the LR1Item compared to Item, where LR1Item contains one
lookahead token per item. That is, there could be multiple items with same
parse points, corresponding to each possible lookahead. Since there is one
lookahead per LR1Item rather than multiple lookaheads per Item, we only need
to add one reduction per item.
LR1Parser
the parse class does not change.
Parsing
We can also view it as before.
Test the parser with some input strings
LALR1 Automata
One of the problems with LR(1) parsing is that while powerful, (all
deterministic context-free languages can be expressed as an LR(1) grammar)
the number of states required is very large because we incorporate lookahead
information into items. That is, each lookahead symbol in combination with the
LR(0) core becomes an LR1 item. Hence, the number of such LR(1) items can be
as large as the number of LR(0) items multiplied by the number of terminal
symbols in the grammar. Hence, even for a simple language such as C, an LR(1)
grammar may result in a table that occupies 1 MB of RAM. While this is not
too onerous for modern computers, one may ask if there is some trade off that
we can make such that we can parse with a more larger set of DCFG grammars
while still keeping the memory requirements similar to the SLR(1) table.
The LALR is one such technique. The idea is that similar to SLR(1) and LR(1)
we maintain the lookahead. But unlike SLR(1) (and like LR(1)) we lookahead
per production rule rather than per nonterminal. However, for any given state
in the, automata, we merge those items in the state that has the same LR(0)
core (but with different lookahead) into one item with a set of lookahead
symbols.
LALR1 Item
As we stated above, we keep a set of lookahead symbols instead of one.
LALR1 DFA
LALR1DFA creating a new state is similar to before, but with an additional
wrinkle. We update the lookahead symbols when creating a new state if the
LR(0) core is the same. Furthermore, we merge those states that differ only
on the lookahead symbols of their items.
LALR1 Closure
Note how we have to keep reprocessing the items until the lookahead symbols
are completely processed. This is because the first_of_rule() return can
change with new lookahead symbols on current item.
The parser itself is no different from other parsers.
For LALR(1), it took me some time to understand that we do not merge items
with same LR(0) cores globally. For example see this grammar
State 5 and State 8. The item A -> d. has different lookahead symbols (a and c) in these states.
References
Artifacts
The runnable Python source for this notebook is available here.
Dick Grune and Ceriel J.H. Jacobs “Parsing Techniques A Practical Guide” 2008 ↩