Contents

  1. Prerequisites
  2. Building the NFA
    1. Augment Grammar with Start
    2. The State Data-structure
    3. The NFA class
      1. NFA Initialization routines
      2. NFA Start State (create_start)
      3. Advance the state of parse by one token
      4. NFA find_transitions
      5. NFA build_nfa
  3. Constructing a graph
  4. LR0 Automata
    1. Closure
      1. Compiling DFA States
    2. Building the DFA
      1. Item
      2. DFAState
      3. LR(0) DFA
        1. DFA Compute the closure
        2. DFA Start State (create_start)
        3. Advance the state of parse by the given token
        4. DFA find_transitions
        5. add_reduce
        6. LR0DFA build_dfa
  5. LR0Recognizer
  6. LR0Parser
  7. SLR1 Automata
    1. First and follow
    2. Building the DFA
      1. DFA Initialization routines
      2. SLR1Parser
  8. LR1 Automata
    1. Building the DFA
      1. LR1Item
      2. LR1DFA class
        1. LR1DFA building DFA
      3. LR1Parser
    2. Artifacts


TL;DR: This tutorial is a complete implementation of shift-reduce parsers in Python. We build an LR(0) parser, an SLR(1) parser, and the canonical LR(1) parser, and show how to extract the parse trees. Python code snippets are provided throughout so that you can work through the implementation steps. An LR parser is a bottom-up parser. The L stands for scanning the input left-to-right, and the R stands for constructing a rightmost derivation. This contrasts with LL parsers, which also scan left-to-right but construct the leftmost derivation. It is a shift-reduce parser because it operates by repeatedly shifting input symbols (left to right) onto a stack, and, whenever the symbols on top of the stack match the body of some production rule, reducing them to the head of that rule.

Prerequisites

As before, we start with the prerequisite imports. Note: The following libraries may need to be installed separately.

Available Packages: These are packages that come either from my previous posts or from pure Python packages that I have compiled, and are available at the locations below. As before, install them if you need to run the program directly on your machine. To install, simply download the wheel file (`pkg.whl`) and install it using `pip install pkg.whl`.
  1. simplefuzzer-0.0.1-py2.py3-none-any.whl from "The simplest grammar fuzzer in the world".
  2. earleyparser-0.0.1-py2.py3-none-any.whl from "Earley Parser".
  3. pydot-1.4.1-py2.py3-none-any.whl

Since this notebook serves both as a web notebook and as a script that can be run on the command line, we redefine canvas if it is not defined already. The __canvas__ function is defined externally when it is used as a web notebook.
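
A minimal sketch of such a fallback; the real __canvas__ is injected by the web notebook runtime, so this no-op stand-in is an assumption.

if '__canvas__' not in globals():
    def __canvas__(dot_src):
        pass  # the notebook version would render the dot source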



Importing the fuzzer for a few simple utilities.



We use the display_tree() method from the earleyparser package for displaying trees.



Pydot is needed for drawing.



As before, we use the fuzzingbook grammar style. For example, given below is a simple grammar for nested parentheses.
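
In the fuzzingbook convention, a grammar is a dictionary that maps each nonterminal to a list of alternatives, each alternative being a list of symbols. A sketch of the nested parentheses grammar in that style (the variable names are my own):

# Nonterminal -> list of rules; each rule is a list of symbols.
paren_grammar = {
    '<P>': [['(', '<P>', ')'],
            ['(', '<D>', ')']],
    '<D>': [['0'], ['1']]
}
paren_start = '<P>'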



Equivalently,

<P> := '(' <P> ')'
     | '(' <D> ')'
<D> := 0 | 1

The main difference between an LR parser and an LL parser is that an LL parser uses the current nonterminal and the next symbol to determine the production rule to apply. In contrast, the LR parser uses the current viable prefix and the next symbol to determine the next action to take.

The viable prefix is the portion of the input string that has been recognized so far. This recognition is accomplished by the LR automaton, which we describe next.

But before that, let us start slow. Let us consider the following grammar

E -> ( D + E )
E -> D
D -> 1

Let us also introduce an augmented rule

S' -> E

If we are going for a naive translation of the grammar into an automaton, this is what we could do. That is, we start with the starting production rule S' -> E. Since we are starting to parse, let us indicate the parse point also, which would be before E is parsed. That is, S' -> .E. The period represents the current parse point. If we somehow parsed E now, then we would transition to an accept state. This is represented by S' -> E.. However, since the transition is a nonterminal, it can't happen by reading the corresponding symbol E from the input stream. It has to happen through another path. Hence, we indicate this by a dashed line. Next, when the parse is at S' -> .E, any of the expansions of E can now be parsed. So, we add each expansion of E an \(\epsilon\) transition away. These are E := . ( D + E ) and E := . D. Continuing in this fashion, we have:



Notice that this NFA is not complete. For example, what happens when D := 1 . is complete? Then, the parse needs to transition to a state that has just completed D. For example, E := ( D . + E ) or E := D .. Here is how it looks.



As before, the dashed arrows represent non-terminal transitions that are actually completed through other paths. The red arrows represent reductions. You will notice that there can be multiple reductions from the same item. At this point, this NFA over-approximates our grammar. The right reduction needs to be chosen based on the prefix so far, and we will see how to do this later.

Building the NFA

Let us try and build these dynamically. We first build an NFA of the grammar. For that, we begin by adding a new start symbol <> to the grammar.

Augment Grammar with Start
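
A sketch of the augmentation, assuming fuzzingbook-style grammars; following the item listings later in this post, the new rule also appends the end-of-input marker $. The helper name is an assumption.

def add_start_state(g, start, new_start='<>'):
    # New start symbol <> with a single rule: old start followed by $.
    return {**g, new_start: [[start, '$']]}, new_start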



A sample grammar.
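
This could be the g1 grammar from the discussion above, along with its augmented version (the variable names are assumptions):

g1 = {
    '<E>': [['(', '<D>', '+', '<E>', ')'],
            ['<D>']],
    '<D>': [['1']]
}
g1_start = '<E>'
# augmented version used by the automata below
g1a, g1a_start = add_start_state(g1, g1_start)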



Test



A utility procedure to extract all symbols in a grammar.
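
One possible sketch; a symbol is a terminal exactly when it is not a key of the grammar dictionary.

def symbols(g):
    terminals = {t for alternatives in g.values()
                 for rule in alternatives for t in rule if t not in g}
    return sorted(terminals), sorted(g.keys())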



The State Data-structure

For building an NFA, all we need is to start with the start item and then recursively identify the transitions. First, we define the state data structure. A state for an NFA is simply a production rule and a parse position.
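
A sketch of such a State; the field names and the name := parsed | unparsed display format are assumptions modeled on the item listings in this post.

class State:
    counter = 0  # used to hand out unique state ids

    def __init__(self, name, expr, dot):
        self.name, self.expr, self.dot = name, tuple(expr), dot
        self.sid, State.counter = State.counter, State.counter + 1

    def at_dot(self):
        # the symbol just after the parse point, or None if finished
        return self.expr[self.dot] if self.dot < len(self.expr) else None

    def finished(self):
        return self.at_dot() is None

    def __repr__(self):
        return '%s := %s | %s' % (self.name,
                ' '.join(self.expr[:self.dot]),
                ' '.join(self.expr[self.dot:]))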



It can be tested this way:



The NFA class

Next, we build an NFA out of a given grammar. An NFA is composed of different states connected together by transitions.

NFA Initialization routines
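
A sketch of the initialization. The interning in create_state guarantees that the same (rule, dot) pair always yields the same State object. The incremental class NFA(NFA) idiom in later cells is how notebooks of this kind typically grow a class cell by cell; the attribute names are assumptions.

class NFA:
    def __init__(self, g, start):
        self.grammar, self.start = g, start
        self.my_states = {}    # (name, expr, dot) -> interned State
        self.transitions = []  # (from_state, token, to_state)

    def create_state(self, name, expr, dot):
        key = (name, tuple(expr), dot)
        if key not in self.my_states:
            self.my_states[key] = State(name, expr, dot)
        return self.my_states[key]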



NFA Start State (create_start)
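
Since the start symbol may have several alternatives, a sketch of create_start returns one state per alternative, with the dot at the beginning:

class NFA(NFA):
    def create_start(self):
        return [self.create_state(self.start, rule, 0)
                for rule in self.grammar[self.start]]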



Let us test this. We can list the grammar production rules as follows:



Let us use the grammar.



Advance the state of parse by one token
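
Advancing simply moves the dot one position to the right (a sketch):

class NFA(NFA):
    def advance(self, state):
        return self.create_state(state.name, state.expr, state.dot + 1)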



Let us test this.



NFA find_transitions

Next, given a state, we need to find all other states reachable from it. The first one is simply a transition from the current to the next state when moving the parsing point one location. For example,

'<S>::= <A> | <B>'

is connected to the below by the key <B>.

'<S>::= <A> <B> |'


Note that we are building an NFA. Hence, epsilon transitions are allowed. This is what we use when we are starting to parse nonterminal symbols. For example, when we have a state with the parsing point just before <B>, e.g.

'<S>::= <A> | <B>'

Then, we add a new state

'<B>::= | b',

and connect this new state to the previous one with an \(\epsilon\) transition.



We now combine both procedures, processing a state for its transitions. If the dot is before a symbol, then we add a transition to the advanced state with that symbol as the transition key. If the symbol at the dot is a nonterminal, then we add all expansions of that nonterminal as epsilon transitions.
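
A sketch combining both: a symbol transition to the advanced state, plus epsilon transitions (the empty string here) to the expansions whenever the symbol at the dot is a nonterminal.

class NFA(NFA):
    def find_transitions(self, state):
        key = state.at_dot()
        if key is None: return []          # complete item: no transitions
        new_states = [(key, self.advance(state))]
        if key in self.grammar:            # nonterminal at the dot
            new_states += [('', self.create_state(key, rule, 0))
                           for rule in self.grammar[key]]
        return new_states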



Let us test this.



Defining a utility method. This is only used to display the graph. Given a key, we want to get all states where this key has just been parsed. We use this method to identify where to go back to, after parsing a specific key.
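
A sketch (the method name is an assumption): a state has just parsed key when the dot sits immediately after it.

class NFA(NFA):
    def get_all_states_with_dot_after_key(self, key):
        return [s for s in self.my_states.values()
                if s.dot > 0 and s.expr[s.dot - 1] == key]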



Let us test this.



NFA build_nfa

We can now build the complete NFA.
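
A sketch: a standard worklist loop that starts from the start states and follows find_transitions until no new states appear.

class NFA(NFA):
    def build_nfa(self):
        queue = self.create_start()
        seen = set()
        while queue:
            state, *queue = queue
            if state.sid in seen: continue
            seen.add(state.sid)
            for token, new_state in self.find_transitions(state):
                self.transitions.append((state, token, new_state))
                queue.append(new_state)
        return self.transitions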



Testing the build_nfa



Constructing a graph

To display the NFA (and DFAs) we need a graph. We construct this out of the table we built previously.
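
A sketch using pydot; epsilon edges are drawn dashed. It assumes the (state, token, state) transition list built above; the function name is an assumption.

import pydot

def to_graph(transitions):
    g = pydot.Dot('automaton', graph_type='digraph')
    for frm, token, to in transitions:
        for s in (frm, to):
            g.add_node(pydot.Node(str(s.sid), label=repr(s), shape='box'))
        g.add_edge(pydot.Edge(str(frm.sid), str(to.sid),
                   label=token if token else 'ε',
                   style='solid' if token else 'dashed'))
    return g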



Viewing the NFA



LR0 Automata

An LR(0) automaton is composed of multiple states, and each state represents a set of items that indicate the parsing progress. The states are connected together by transitions, which are labeled by the terminal and nonterminal symbols of the grammar. To construct the LR(0) automaton, we start with the initial state containing the augmented start symbol (if necessary), and we apply closure to expand the context. For the closure, we simply merge into the current state all items reachable through epsilon transitions.

Closure

A closure represents all possible parse paths at a given point. The idea is to look at the current parse progress, identify any nonterminals that need to be expanded next, and add the production rules of those nonterminals to the current item set, with the parse point (dot) at the beginning. For example, given the first state, where * represents the parse progress

<S`> := * <E>

Applying closure, we expand <E> further.

<S`> := * <E>
<E>  := * ( <D> + <E> )
<D>  := * 1

No more nonterminals to expand. Hence, this is the closure of the first state. Consider what happens when we apply a transition of ( to this state.

<E> := ( * <D> + <E> )
<D> := * 1

This gives us the following graph with each closure, and the transitions indicated. Note that the nonterminal transitions are dashed.



This is the basic automaton. However, you may notice that there are two types of nodes in this diagram. The first represents partial parses, which contain the dot at a position other than the end; the second represents a complete parse of a rule, with the dot at the end. You will also note that the complete parse nodes have red outgoing arrows, and at least one of them has multiple red outgoing arrows. That is, it is not a true DFA. The next state to transition to is actually chosen based on the path the input string took through the DFA, with the help of a stack.

Compiling DFA States

So, how should we represent these states? If you look at state 0, it would be possible to represent it as a procedure, as sketched after the item listing below. We first push the current state onto the stack, then extract the current symbol, and, depending on whether the symbol is one of those expected, shift to the next state. Note that we ignore the blue dashed arrows (nonterminal symbols) because they were included just to indicate that the transition happens by another path. Note also that each state represents one row of the automaton table.

<>::= . <E> $
<E>::= . ( <D> + <E> ) 
<E>::= . <D> 
<D>::= . 1
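
For instance, state 0 could be hand-compiled into something like the following sketch. The state numbering (1 for the state after shifting (, 2 for <D>::= 1 .) is an assumption made purely for illustration.

def state_0(stack, tokens):
    stack.append(0)          # save the current state
    sym = tokens.pop(0)      # extract the current symbol
    if sym == '(': return 1  # shift ( : goto <E>::= ( . <D> + <E> )
    if sym == '1': return 2  # shift 1 : goto <D>::= 1 .
    raise SyntaxError('expected ( or 1, got %r' % (sym,))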


<E>::= ( . <D> + <E> )
<D>::= . 1


<D>::= 1 .

This is the first reduction state. In a reduction state, we pop the states corresponding to the body of the reduced rule off the stack. From a reduction state, we can transition to any state that has just parsed the head (LHS) of the reduction; which one is determined by the state exposed at the top of the stack, as sketched below.
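
Continuing the hand-compiled sketch (state numbers remain assumptions): the reduction state for <D>::= 1 . pops one state for the one-symbol rule body and then transitions on <D> from whatever state is exposed at the top of the stack.

def state_2(stack, tokens):
    stack.append(2)  # save the current state
    # reduce by <D>::= 1 : pop one state per symbol of the rule body
    stack.pop()
    top = stack[-1]
    if top == 0: return 3  # 0 --<D>--> <E>::= <D> .
    if top == 1: return 5  # 1 --<D>--> <E>::= ( <D> . + <E> )
    if top == 7: return 3  # 7 --<D>--> <E>::= <D> .
    raise SyntaxError('no goto on <D> from state %d' % top)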



<E>::= <D> .


<>::= <E> . $


<E>::= ( <D> . + <E> )


<>::= <E> $ .


<E>::= ( <D> + . <E> ) 
<E>::= . ( <D> + <E> ) 
<E>::= . <D>
<D>::= . 1


<E>::= ( <D> + <E> . )


<E>::= ( <D> + <E> ) .


Let us now verify that our parser works.



Note that while the above does not contain multiple reductions in any state, a state can contain multiple reductions for more complex (e.g. LR(1)) grammars. But otherwise, the general parsing proceeds as above.

Building the DFA

Given a grammar, we will next consider how to build such a DFA. In a DFA, unlike an NFA, a state is no longer a single item. So, let us define the item separately.

Item
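
Since an item carries exactly the same information as an NFA state, one simple sketch is to reuse State:

class Item(State):
    pass  # same fields: name, expr, dot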



DFAState

A DFAState contains many items.
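
A sketch; shifts holds the outgoing transitions (including the gotos on nonterminals) and reductions maps a terminal to the rule to reduce by. The field names are assumptions.

class DFAState:
    counter = 0

    def __init__(self, items):
        self.items = items    # the set of Items in this state
        self.shifts = {}      # token -> successor DFAState
        self.reductions = {}  # terminal -> (name, expr) to reduce by
        self.sid, DFAState.counter = DFAState.counter, DFAState.counter + 1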



LR(0) DFA

We define our DFA initialization. We also define two utilities for creating new items and DFA states.
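
A sketch of the initialization and the two utilities; the names create_item and create_state follow the later prose in this post, the rest are assumptions. Items and states are interned so that equal item sets yield the same DFAState.

class LR0DFA:
    def __init__(self, g, start):
        self.grammar, self.start = g, start
        self.terminals, self.nonterminals = symbols(g)
        self.my_items = {}   # (name, expr, dot) -> interned Item
        self.my_states = {}  # frozenset of item ids -> DFAState

    def create_item(self, name, expr, dot):
        key = (name, tuple(expr), dot)
        if key not in self.my_items:
            self.my_items[key] = Item(name, expr, dot)
        return self.my_items[key]

    def create_state(self, items):
        key = frozenset(i.sid for i in items)
        if key not in self.my_states:
            self.my_states[key] = DFAState(items)
        return self.my_states[key]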



DFA Compute the closure

We have discussed computing the closure before. The idea is to identify any nonterminals that are next to be parsed, and recursively expand them, adding to the items.
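
A sketch: keep expanding nonterminals at the dot until no new items turn up.

class LR0DFA(LR0DFA):
    def compute_closure(self, items):
        to_process, seen, closure = list(items), set(), []
        while to_process:
            item, *to_process = to_process
            if item.sid in seen: continue
            seen.add(item.sid)
            closure.append(item)
            key = item.at_dot()
            if key in self.grammar:  # nonterminal next: expand it
                to_process += [self.create_item(key, rule, 0)
                               for rule in self.grammar[key]]
        return closure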



DFA Start State (create_start)

The start item is similar to before. The main difference is that rather than returning multiple states, we return a single state containing multiple items.
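
A sketch; the start state is the closure of the start items:

class LR0DFA(LR0DFA):
    def create_start(self):
        items = [self.create_item(self.start, rule, 0)
                 for rule in self.grammar[self.start]]
        return self.create_state(self.compute_closure(items))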



Let us test this.



Advance the state of parse by the given token

Unlike the NFA, where each state had a single item and hence only one possible advance, here a state contains multiple items. Hence, we are given a token by which to advance, and we return all items that expect that token, advanced past it.
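
A sketch; advance_item moves the dot of a single item, and advance collects every item of the state that expects the given token.

class LR0DFA(LR0DFA):
    def advance_item(self, item):
        return self.create_item(item.name, item.expr, item.dot + 1)

    def advance(self, dfastate, token):
        return [self.advance_item(i) for i in dfastate.items
                if i.at_dot() == token]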



Let us test this.



DFA find_transitions

Next, we define the transitions. Unlike in the case of the NFA, where we had only a single item, we have multiple items. So, we look through each possible token (terminals and nonterminals).
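
A sketch; each token that advances at least one item yields a transition to the closure of the advanced items.

class LR0DFA(LR0DFA):
    def find_transitions(self, dfastate):
        transitions = []
        for token in self.terminals + self.nonterminals:
            items = self.advance(dfastate, token)
            if not items: continue
            transitions.append(
                (token, self.create_state(self.compute_closure(items))))
        return transitions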



Let us test this.



add_reduce

This is different from the NFA because we are iterating over items, but we need to add the reduction to the containing DFAState.
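
In LR(0) there is no lookahead, so a completed item adds a reduction for every terminal (a sketch):

class LR0DFA(LR0DFA):
    def add_reduce(self, item, dfastate):
        for token in self.terminals:
            dfastate.reductions[token] = (item.name, item.expr)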



Similarly, the graph-related utilities also change a bit.



LR0DFA build_dfa

Bringing all these together, let us build the DFA. (Compare to build_nfa).
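
A sketch; shift and goto edges are recorded in each state's shifts, and completed items get their reductions.

class LR0DFA(LR0DFA):
    def build_dfa(self):
        start = self.create_start()
        queue, seen = [start], set()
        while queue:
            dfastate, *queue = queue
            if dfastate.sid in seen: continue
            seen.add(dfastate.sid)
            for token, new_state in self.find_transitions(dfastate):
                dfastate.shifts[token] = new_state
                queue.append(new_state)
            for item in dfastate.items:
                if item.finished():
                    self.add_reduce(item, dfastate)
        return start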



Let us test building the DFA.



Let us try graphing it.



LR0Recognizer

We can now provide the complete parser that relies on this automata.
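
A sketch of a table-driven recognizer over the DFA above. It assumes the grammar has been augmented with add_start_state (so the start rule ends in $), and, this being LR(0), it expects conflict-free states (shift is preferred when both actions exist).

class LR0Recognizer:
    def __init__(self, dfa):
        self.dfa = dfa
        self.dfa_start = dfa.build_dfa()

    def recognize(self, tokens):
        stack, inp = [self.dfa_start], list(tokens) + ['$']
        while True:
            state, sym = stack[-1], (inp[0] if inp else None)
            # accept once the augmented start rule is fully parsed
            if any(i.finished() and i.name == self.dfa.start
                   for i in state.items):
                return True
            if sym in state.shifts:                     # shift
                stack.append(state.shifts[sym])
                inp.pop(0)
            elif sym in state.reductions:               # reduce
                name, expr = state.reductions[sym]
                del stack[len(stack) - len(expr):]      # pop the rule body
                stack.append(stack[-1].shifts[name])    # goto on the head
            else:
                return False

# Example usage (assuming g1a from earlier):
# LR0Recognizer(LR0DFA(g1a, g1a_start)).recognize(list('(1+1)'))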



Testing it.



LR0Parser

We’ll next implement the LR(0) parser, which includes parse tree extraction. Parse tree extraction involves building a tree structure that represents the syntactic structure of the input string according to the grammar rules.
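
A sketch extending the recognizer with a parallel stack of trees in the fuzzingbook (name, children) format: each shift pushes a leaf, and each reduction folds the children into a node, so that display_tree() can show the result.

class LR0Parser(LR0Recognizer):
    def parse(self, tokens):
        stack, trees = [self.dfa_start], []
        inp = list(tokens) + ['$']
        while True:
            state, sym = stack[-1], (inp[0] if inp else None)
            if any(i.finished() and i.name == self.dfa.start
                   for i in state.items):
                return trees[0]  # the tree under the original start symbol
            if sym in state.shifts:
                stack.append(state.shifts[sym])
                trees.append((inp.pop(0), []))          # leaf
            elif sym in state.reductions:
                name, expr = state.reductions[sym]
                children = trees[len(trees) - len(expr):]
                del trees[len(trees) - len(expr):]
                del stack[len(stack) - len(expr):]
                stack.append(stack[-1].shifts[name])
                trees.append((name, children))          # internal node
            else:
                raise SyntaxError('parse error at %r' % (sym,))

# Example usage:
# LR0Parser(LR0DFA(g1a, g1a_start)).parse(list('(1+1)'))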



Now, let us build parse trees.



Notice that we have used a quite simple grammar. For reference, this was our g1 grammar.

E -> ( D + E )
E -> D
D -> 1

The interesting fact about this grammar was that you could look at the current symbol, and decide which of these rules to apply. That is, if the current symbol was ( then rule 0 applies, and if the symbol was 1, then rule 1 applies. What if you have a grammar where that is impossible? Here is one such grammar

E -> D + E
E -> D
D -> 1

As you can see, it is no longer clear which rule of <E> to apply when we have a <D> parsed. To decide such cases, we need to go up one level in complexity.



Let us build the parse table.



As you can see, on State 2 we have two possible choices: s4 and r:1. This is called a shift/reduce conflict. The issue arises when we come to state 2, that is:

 <E>::= <D> | + <E>
 <E>::= <D> |

We have two possible choices. We can either reduce to <E> or shift +. To determine which one to act upon, we need a lookahead. If the next token is +, then we should shift it onto the stack. If not, we should reduce to <E>. This is what SLR parsers do.

SLR1 Automata

SLR(1) parsers, or Simple LR(1) parsers, are an improvement over LR(0) parsers. They use a lookahead to resolve some conflicts that occur in LR(0) parsers. A lookahead is the next token in the input that hasn't been processed yet. By considering this lookahead token, SLR(1) parsers can make more informed decisions about which production to use when reducing. For the SLR(1) automaton, we require the first and follow sets. These have been discussed previously; hence, we provide the code here directly.

First and follow



The definition is as follows.
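
A sketch of the standard fixpoint computation; the function name and the returned nullable set are assumptions (the nullable set is reused in the LR(1) closure later).

def first_and_follow(grammar, start):
    first = {k: set() for k in grammar}
    follow = {k: set() for k in grammar}
    nullable = set()
    follow[start] |= {'$'}
    def first_of(sym):
        return first[sym] if sym in grammar else {sym}
    changed = True
    while changed:
        changed = False
        for k, alternatives in grammar.items():
            for rule in alternatives:
                # FIRST: union of first() over the leading nullable prefix
                f, all_nullable = set(), True
                for sym in rule:
                    f |= first_of(sym)
                    if sym not in nullable:
                        all_nullable = False
                        break
                if all_nullable and k not in nullable:
                    nullable.add(k); changed = True
                if not f <= first[k]:
                    first[k] |= f; changed = True
                # FOLLOW: propagate right-to-left through the rule
                aux = set(follow[k])
                for sym in reversed(rule):
                    if sym in grammar:
                        if not aux <= follow[sym]:
                            follow[sym] |= aux; changed = True
                        aux = (aux | first[sym]) if sym in nullable \
                              else set(first[sym])
                    else:
                        aux = {sym}
    return first, follow, nullable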



Testing it.



Building the DFA

DFA Initialization routines



Next, we build the DFA. There is only one change (see CHANGED below): when reducing, we only reduce if the lookahead token is in the follow set.
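
A sketch; the initialization computes the follow sets, and only add_reduce changes: a completed item now reduces only on tokens in follow(item.name).

class SLR1DFA(LR0DFA):
    def __init__(self, g, start):
        super().__init__(g, start)
        self.first, self.follow, self.nullable = first_and_follow(g, start)

    def add_reduce(self, item, dfastate):
        for token in self.follow[item.name]:  # CHANGED: follow, not all
            dfastate.reductions[token] = (item.name, item.expr)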



Let us see if it works.



SLR1Parser

There is no difference in the parser.



Let us try parsing with it.



But is this enough? Can we parse all useful grammars this way? Consider this grammar, which adds a single rule to the previous one.

E -> D + E
E -> E * D
E -> D
D -> 1


Let us see if it works.



You will notice a conflict in State 7. The question is whether to shift * and go to State 6, or to reduce with rule r:0. To resolve this, we need the full LR(1) parser.

LR1 Automata

LR(1) parsers, or Canonical LR(1) parsers, are the most powerful in the LR parser family. They are needed when SLR(1) parsers are not sufficient to handle certain complex grammars. LR(1) parsers differ from SLR(1) parsers in that they incorporate lookahead information directly into the parser states, allowing for even more precise parsing decisions.

Building the DFA

LR1Item

The LR1 item is similar to the Item, except that it contains a lookahead. This is also the most important difference between LR(0) and SLR(1) on one hand and LR(1) on the other. SLR uses LR(0) items, which means exactly one item per production rule and parse dot. In the case of LR(1), however, you can have multiple items with the same LR(0) core (that is, the same production rule and parse point) but with different lookaheads. One may ask: what if we use LR(0) items but attach the set of possible lookaheads to each as extra information? That gets you LALR(1).
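
A sketch; the lookahead is a single terminal carried alongside the LR(0) core.

class LR1Item(Item):
    def __init__(self, name, expr, dot, lookahead):
        super().__init__(name, expr, dot)
        self.lookahead = lookahead

    def __repr__(self):
        return '[%s, %s]' % (super().__repr__(), self.lookahead)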



LR1DFA class

We also need to update create_item etc. to handle the lookahead. The advance_item method is called from advance, which does not require changes.
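
A sketch of those updates: create_item now interns by the (core, lookahead) pair, and advance_item preserves the lookahead.

class LR1DFA(SLR1DFA):
    def create_item(self, name, expr, dot, lookahead):
        key = (name, tuple(expr), dot, lookahead)
        if key not in self.my_items:
            self.my_items[key] = LR1Item(name, expr, dot, lookahead)
        return self.my_items[key]

    def advance_item(self, item):
        return self.create_item(item.name, item.expr,
                                item.dot + 1, item.lookahead)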



The compute_closure now incorporates the lookahead. This is the biggest procedural change from LR0DFA. The main difference is in computing the lookahead. The lines added and modified from LR0DFA are indicated in the procedure.

Here, say we are computing the closure of A -> Alpha . <B> <Beta>. Remember that when we create new items for the closure, we have to provide each with a lookahead.

So, to compute the closure at <B>, we create items whose lookaheads are the tokens that can follow <B>. This need not be just first(<Beta>): if <Beta> is nullable, it also includes whatever may follow <Beta>, which is the lookahead of the item A -> Alpha . <B> <Beta> that we already have; let us say this is l. So, we compute first(<Beta> l) for the lookahead.
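
A sketch; the CHANGED lines compute first(<Beta> l) using a small helper that walks the suffix while honoring nullable nonterminals.

class LR1DFA(LR1DFA):
    def first_of_seq(self, seq):
        # first() of a sequence of symbols, skipping nullable prefixes
        res = set()
        for sym in seq:
            if sym not in self.grammar: return res | {sym}
            res |= self.first[sym]
            if sym not in self.nullable: return res
        return res

    def compute_closure(self, items):
        to_process, seen, closure = list(items), set(), []
        while to_process:
            item, *to_process = to_process
            if item.sid in seen: continue
            seen.add(item.sid)
            closure.append(item)
            key = item.at_dot()
            if key in self.grammar:
                beta = list(item.expr[item.dot + 1:]) + [item.lookahead]
                lookaheads = self.first_of_seq(beta)               # CHANGED
                to_process += [self.create_item(key, rule, 0, la)  # CHANGED
                               for rule in self.grammar[key]
                               for la in lookaheads]
        return closure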



LR1DFA building DFA

The create_start method changes to include '$' as the lookahead.
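
A sketch; the only change from LR0DFA is the '$' lookahead (marked CHANGED).

class LR1DFA(LR1DFA):
    def create_start(self):
        items = [self.create_item(self.start, rule, 0, '$')  # CHANGED
                 for rule in self.grammar[self.start]]
        return self.create_state(self.compute_closure(items))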



Another major change: we no longer add a reduction for every token in follow(item.name); instead, we restrict it to just item.lookahead.

Note: a possible source of confusion here is the treatment of lookaheads in add_reduce. In LR0DFA.add_reduce, we add a reduction link for every terminal symbol to each completed item. In SLR1DFA.add_reduce, we add a reduction link only for the terminal symbols in follow(item.name). Here, one may hence expect the lookaheads to be a subset of follow(item.name), and to add the reduction for each lookahead of a given item.

The difference lies in the LR1Item compared to the Item: an LR1Item contains exactly one lookahead token per item. That is, there can be multiple items with the same parse point, one for each possible lookahead. Since there is one lookahead per LR1Item rather than multiple lookaheads per Item, we only need to add one reduction per item.
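
A sketch; each completed LR1Item contributes exactly one reduction entry, keyed by its own lookahead.

class LR1DFA(LR1DFA):
    def add_reduce(self, item, dfastate):
        dfastate.reductions[item.lookahead] = (item.name, item.expr)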



LR1Parser

The parser class does not change.



Parsing



Artifacts

The runnable Python source for this notebook is available here.