TLDR; This tutorial is a complete implementation of Leslie G Valiant’s general context-free Parser1 adapted from the paper “CFG Parsing and Boolean Matrix Multiplication” by Franziska Ebert 2, implemented in Python. The Python interpreter is embedded so that you can work through the implementation steps.

Note. This implementation is not optimized. It prioritizes clarity over performance.

Valiant’s parer is a general context-free parser, and like CYK and Earley, it operates on a chart. The claim to fame of Valiant’s parser is that it showed how to transform recognition of general-context free languages to an instance of Boolean Matrix Multiplication. That is, if you have a good matrix multiplication algorithm, you can use it to improve the speed of context-free language recognition. In fact, while previously known algorithms such as CYK and Earley were \(O(n^3)\) in the worst case, Valiant’s algorithm showed how recognition could be done in \(O(n^{2.81})\) time using Strassen’s matrix multiplication algorithm (This post uses the traditional multiplication algorithm, but improved algorithms can be substituted at the multiply_bool_matrices() function ). It uses the same chart as that of CYK, and similar to CYK, it requires the grammar to be in the Chomsky Normal Form, which allows at most two symbols on the right hand side of any production. In particular, all the rules have to conform to

\[A -> B C\] \[A -> a\] \[S -> \epsilon\]

Where A,B, and C are nonterminal symbols, a is any terminal symbol, S is the start symbol, and \(\epsilon\) is the empty string.

We previously discussed Earley parser which is a general context-free parser. CYK parser is another general context-free parser that is capable of parsing strings that conform to any given context-free grammar.

A peculiarity of Valiant’s parser that it shares with CYK parser is that unlike Earley, GLL, and GLR, it is not a left-to-right parser. Rather, it is a bottom-up parser similar to CYK that builds substrings of fixed length at each pass. That is, the parsing starts everywhere at once. Not just at a corner.

Synopsis

import valiantparser as P
my_grammar = {'<start>': [['1', '<A>'],
                          ['2']
                         ],
              '<A>'    : [['a']]}
my_parser = P.ValiantParser(my_grammar)
assert my_parser.recognize_on(text='1a', start_symbol='<start>'):
for tree in my_parser.parse_on(text='1a', start_symbol='<start>'):
    P.display_tree(tree)

Contents

  1. Synopsis
  2. Definitions
    1. Prerequisites
  3. ValiantRecognizer
    1. Initialize the table
    2. Matrix multiplication
    3. Matrix union
    4. Transitive relation
    5. Transitive closure
    6. Recognize
  4. ValiantParser
    1. Extracting trees
    2. Parse
  5. Artifacts

Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

Definitions

In the interests of having a self contained post, we repeat the definitions. We use the following terms:

  • The alphabet is the set all of symbols in the input language. For example, in this post, we use all ASCII characters as alphabet.

  • A terminal is a single alphabet symbol. Note that this is slightly different from usual definitions (done here for ease of parsing). (Usually a terminal is a contiguous sequence of symbols from the alphabet. However, both kinds of grammars have a one to one correspondence, and can be converted easily.)

    For example, x is a terminal symbol.

  • A nonterminal is a symbol outside the alphabet whose expansion is defined in the grammar using rules for expansion.

    For example, <term> is a nonterminal in the below grammar.

  • A rule is a finite sequence of terms (two types of terms: terminals and nonterminals) that describe an expansion of a given terminal. A rule is also called an alternative expansion.

    For example, [<term>+<expr>] is one of the expansion rules of the nonterminal <expr>.

  • A production is a combination of a nonterminal and one of its productions.

    For example <expr>: [<term>+<expr>] is a production.

  • A definition is a set of rules that describe the expansion of a given nonterminal.

    For example, [[<digit>,<digits>],[<digit>]] is the definition of the nonterminal <digits>

  • A context-free grammar is composed of a set of nonterminals and corresponding definitions that define the structure of the nonterminal.

    The grammar given below is an example context-free grammar.

  • A terminal derives a string if the string contains only the symbols in the terminal. A nonterminal derives a string if the corresponding definition derives the string. A definition derives the string if one of the rules in the definition derives the string. A rule derives a string if the sequence of terms that make up the rule can derive the string, deriving one substring after another contiguously (also called parsing).

  • A derivation tree is an ordered tree that describes how an input string is derived by the given start symbol. Also called a parse tree.
  • A derivation tree can be collapsed into its string equivalent. Such a string can be parsed again by the nonterminal at the root node of the derivation tree such that at least one of the resulting derivation trees would be the same as the one we started with.

  • The yield of a tree is the string resulting from collapsing that tree.

  • An epsilon rule matches an empty string.

Prerequisites

As before, we start with the prerequisite imports.

Available Packages These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.
  1. simplefuzzer-0.0.1-py2.py3-none-any.whl from "The simplest grammar fuzzer in the world".
  2. earleyparser-0.0.1-py2.py3-none-any.whl from "Earley Parser".
  3. cykparser-0.0.1-py2.py3-none-any.whl from "CYK Parser".

We need the fuzzer to generate inputs to parse and also to provide some utilities



We use the display_tree() method in earley parser for displaying trees.



We use the chart display from cykparser



We use the random choice to extract derivation trees from the parse forest.



As before, we use the fuzzingbook grammar style. A terminal symbol has exactly one character (Note that we disallow empty string ('') as a terminal symbol). Secondly, as per traditional implementations, there can only be one expansion rule for the start ymbol. Let us start with the following grammar.



ValiantRecognizer

We initialize our recognizer with the grammar, and identify the terminal and nonterminal productions separately. termiinal productions are those that are of the form <A> ::= a where a is a terminal symbol. Nonterminal productions are of the form <A> ::= <B><C> where <B> and <C> are nonterminal symbols.



Initialize the table

We first initialize the matrix that holds the results. The cell[i][j] represents the nonterminals that can parse the substring text[i..j] Note that this table is exactly the same as what CYK produces.



Using it



We reuse the base case from the CYKParser. The idea is that, we look at each character in the input, and for each i in the input we identify cell[i][i+1] and add the nonterminal symbol that derives the corresponding token.



Using it



Matrix multiplication

Next, we define the multi-token parse. We start with multiplication of matrices which is the core of this algorithm. The table which we defined allows sets of nonterminal symbols at each cell. So, we define the multiplication of individual cells.

Given two sets of nonterminal symbols \(N_1\), \(N_2\), we have

\[N_1 ∗ N_2 = \{ A_i | \exists A_j \in N_1, A_k \in N_2 : (A_i -> A_j A_k) \in P \}\]

where \(P\) is the set of production rules. The essential idea is, given a rule \(A_i -> A_j A_k\), and the parsing of \(A_j\) and \(A_k\) is available, then mark \(A_i\) as parsable.



Let us try testing it.



Next, we can use the cell multiplication to define table multiplication. Note that we do not do transformation into boolean matrices here.



Let us try testing the matrix multiplication.



If we want to convert to boolean matrices given matrices \(a\) and \(b\):, we start by computing \(h = |N|\) where \(N\) is the set of nonterminals. We start with matrix \(a\). Next, we generate \(h\) matrices one for each nonterminal \(k \in N\). Let us call such a matrix \(M_k\). We fill it in this way: If nonterminal \(k\) is present in the cell \((i,j)\) of \(a\), the corresponding cell in \(M_k\) will be true, and false otherwise. We do the same for matrix \(b\), getting \(2*h\) boolean matrices.



Let us call these \(M_k^a\) and \(M_k^b\) where \(k\) indicates the nonterminal.

Next, we pair each matrix from \(M_k^a\) and \(M_k^b\) for each \(k \in N\) obtaining \(h^2\) pairs, and compute the boolean matrix multiplication of each pairs. We address each as \(r(l,m)\) where \(l \in N\) and \(m \in N\).



multiply two boolean matrices



In the final matrix \(c = a * b\), for the cell \(c(i,j)\) it will contain the nonterminal \(p \in N\) iff there exist l,m such that a rule \(p -> l m\) exists, and the matrix \(r(l,m)\) contains \(1\) in cell \((i,j)\).



Let us try testing the matrix multiplication.



Matrix union

Next, we want to define how to make a union of two matrices



Let us try testing it.



Transitive relation

Valiant showed that we can compute the transitive relationparsable in i steps – can be computed using matrix multiplication. For a matrix \(a^{(i)}\), the relation is given by:

\(a^{(i)} = U_{j=1}^{i-1} a^{(j)} * a^{(i-j)}\) when \(i > 1\)

\(a^{(1)} = a\) when \(i = 1\)

The intuition here is that if we have a 4 letter input, it may be parsed by splitting into 1+3, 2+2, or 3+1. So, we compute \(a^{(1)}*a^{(3)} U a^{(2)}*a^{(2)} U a^{(3)}*a^{(1)}\).

At this point, we are ready to define the transitive relation.



We can now test it. step 2 b1 = A(1)



step 3.a [i=2] => b1= A(1) U b2= A(j=1)*A(i-j=1) – till j == i-1



step 3.b [i=3] => b1=A(1) U b2=A(j=1)*A(i-j=2) U b3=A(j=2)*A(i-j=1) – till j == i-1



Here is another grammar where we work through the same steps.



On parsing ababa, we get the following CYK table.

a b a b a
A, C B      
S, C A, S      
B S, C B    
B B A, C B  
S, C, A B S, C A, S A, C

Please note that the representation of the matrix is different from how we do it.

Next, let us run the steps



Transitive closure

Valiant further showed that the transitive closure of all these matrices, that is

\[a^{+} = a^{(1)} U a^{(2)} ...\]

is the parse matrix. That is, building the transitive closure builds the complete parse chart. We can now check if the input was parsed.



Testing.



Recognize

Let us hook it up.



Using it



ValiantParser

Note: The recognizer works well, but the tree extraction is naive.

At this point, we have the recognition matrix. To make this into a true parser, similar to CYK, we can add back pointers. However, Ruzzo3 showed that if we have the CYK or Valiant recognition matrix (both are same) we can extract a parse tree in at most \(O(log(n))\) slower than the recognizer. Here, we implement a naive algorithm that just shows how we can extract a single tree.

Extracting trees

Unlike GLL, GLR, and Earley, and like CYK, due to restricting epsilons to the start symbol, there are no infinite parse trees. Furthermore, we only pick the first available tree. This can be trivially extended if needed.

The basic idea here is that, given a rule \(S -> A B\) that parsed text \(W\), and we find \(S\) in the recognition matrix, the nonterminal \(B\) that helped in parsing \(W\) has to be found in the same column as that of \(S\). So, we look through the final column and generate a list of tuples where the tuple has the format (idx, nonterminal) where idx is the point where \(B\) started parsing from. At this point, if we look at the column idx-1, then at the top of the column (in 0th row) there has to be the nonterminal \(A\) that is on the other side of the rule.



Testing it



Incorporating the breaks in tree.



Parse

Adding the extract tree



Using it (uses random choice, click run multiple times to get other trees).



assign display_tree



Let us do a final test



Using



Artifacts

The runnable Python source for this notebook is available here.

  1. Leslie G. Valiant “General context-free recognition in less than cubic time” 1975 

  2. Franziska Ebert “CFG Parsing and Boolean Matrix Multiplication” 2006 

  3. Ruzzo, Walter L. “On the complexity of general context-free language parsing and recognition.” 1979