Valiant's Parser
TLDR; This tutorial is a complete implementation of Leslie G Valiant’s general context-free Parser1 adapted from the paper “CFG Parsing and Boolean Matrix Multiplication” by Franziska Ebert 2, implemented in Python. The Python interpreter is embedded so that you can work through the implementation steps.
Note. This implementation is not optimized. It prioritizes clarity over performance.
Valiant’s parer is a general context-free parser, and like CYK and Earley, it
operates on a chart. The claim to fame of Valiant’s parser is that it showed
how to transform recognition of general-context free languages to an
instance of Boolean Matrix Multiplication. That is, if you have a good matrix
multiplication algorithm, you can use it to improve the speed of context-free
language recognition. In fact, while previously known algorithms such as CYK
and Earley were \(O(n^3)\) in the worst case, Valiant’s algorithm showed how
recognition could be done in \(O(n^{2.81})\) time using Strassen’s matrix
multiplication algorithm (This post uses the traditional multiplication
algorithm, but improved algorithms can be substituted at the
multiply_bool_matrices()
function ).
It uses the same chart as that of CYK, and similar to CYK, it requires the
grammar to be in the Chomsky Normal Form, which
allows at most two symbols on the right hand side of any production.
In particular, all the rules have to conform to
Where A,B, and C are nonterminal symbols, a is any terminal symbol, S is the start symbol, and \(\epsilon\) is the empty string.
We previously discussed Earley parser which is a general context-free parser. CYK parser is another general context-free parser that is capable of parsing strings that conform to any given context-free grammar.
A peculiarity of Valiant’s parser that it shares with CYK parser is that unlike Earley, GLL, and GLR, it is not a left-to-right parser. Rather, it is a bottom-up parser similar to CYK that builds substrings of fixed length at each pass. That is, the parsing starts everywhere at once. Not just at a corner.
Synopsis
import valiantparser as P
my_grammar = {'<start>': [['1', '<A>'],
['2']
],
'<A>' : [['a']]}
my_parser = P.ValiantParser(my_grammar)
assert my_parser.recognize_on(text='1a', start_symbol='<start>'):
for tree in my_parser.parse_on(text='1a', start_symbol='<start>'):
P.display_tree(tree)
Contents
Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.
Definitions
In the interests of having a self contained post, we repeat the definitions. We use the following terms:
-
The alphabet is the set all of symbols in the input language. For example, in this post, we use all ASCII characters as alphabet.
-
A terminal is a single alphabet symbol. Note that this is slightly different from usual definitions (done here for ease of parsing). (Usually a terminal is a contiguous sequence of symbols from the alphabet. However, both kinds of grammars have a one to one correspondence, and can be converted easily.)
For example,
x
is a terminal symbol. -
A nonterminal is a symbol outside the alphabet whose expansion is defined in the grammar using rules for expansion.
For example,
<term>
is a nonterminal in the below grammar. -
A rule is a finite sequence of terms (two types of terms: terminals and nonterminals) that describe an expansion of a given terminal. A rule is also called an alternative expansion.
For example,
[<term>+<expr>]
is one of the expansion rules of the nonterminal<expr>
. -
A production is a combination of a nonterminal and one of its productions.
For example
<expr>: [<term>+<expr>]
is a production. -
A definition is a set of rules that describe the expansion of a given nonterminal.
For example,
[[<digit>,<digits>],[<digit>]]
is the definition of the nonterminal<digits>
-
A context-free grammar is composed of a set of nonterminals and corresponding definitions that define the structure of the nonterminal.
The grammar given below is an example context-free grammar.
-
A terminal derives a string if the string contains only the symbols in the terminal. A nonterminal derives a string if the corresponding definition derives the string. A definition derives the string if one of the rules in the definition derives the string. A rule derives a string if the sequence of terms that make up the rule can derive the string, deriving one substring after another contiguously (also called parsing).
- A derivation tree is an ordered tree that describes how an input string is derived by the given start symbol. Also called a parse tree.
-
A derivation tree can be collapsed into its string equivalent. Such a string can be parsed again by the nonterminal at the root node of the derivation tree such that at least one of the resulting derivation trees would be the same as the one we started with.
-
The yield of a tree is the string resulting from collapsing that tree.
- An epsilon rule matches an empty string.
Prerequisites
As before, we start with the prerequisite imports.
Available Packages
These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.We need the fuzzer to generate inputs to parse and also to provide some utilities
We use the display_tree()
method in earley parser for displaying trees.
We use the chart display from cykparser
We use the random choice to extract derivation trees from the parse forest.
As before, we use the fuzzingbook grammar style.
A terminal symbol has exactly one character
(Note that we disallow empty string (''
) as a terminal symbol).
Secondly, as per traditional implementations,
there can only be one expansion rule for the start ymbol.
Let us start with the following grammar.
ValiantRecognizer
We initialize our recognizer with the grammar, and identify the terminal and
nonterminal productions separately. termiinal productions are those that
are of the form <A> ::= a
where a is a terminal symbol. Nonterminal
productions are of the form <A> ::= <B><C>
where <B>
and <C>
are
nonterminal symbols.
Initialize the table
We first initialize the matrix that holds the results. The cell[i][j]
represents the nonterminals that can parse the substring text[i..j]
Note that this table is exactly the same as what CYK produces.
Using it
We reuse the base case from the CYKParser.
The idea is that, we look at each character in the input,
and for each i
in the input we identify cell[i][i+1]
and add the
nonterminal symbol that derives the corresponding token.
Using it
Matrix multiplication
Next, we define the multi-token parse. We start with multiplication of matrices which is the core of this algorithm. The table which we defined allows sets of nonterminal symbols at each cell. So, we define the multiplication of individual cells.
Given two sets of nonterminal symbols \(N_1\), \(N_2\), we have
\[N_1 ∗ N_2 = \{ A_i | \exists A_j \in N_1, A_k \in N_2 : (A_i -> A_j A_k) \in P \}\]where \(P\) is the set of production rules. The essential idea is, given a rule \(A_i -> A_j A_k\), and the parsing of \(A_j\) and \(A_k\) is available, then mark \(A_i\) as parsable.
Let us try testing it.
Next, we can use the cell multiplication to define table multiplication. Note that we do not do transformation into boolean matrices here.
Let us try testing the matrix multiplication.
If we want to convert to boolean matrices given matrices \(a\) and \(b\):,
we start by computing \(h = |N|\) where \(N\)
is the set of nonterminals. We start with matrix \(a\).
Next, we generate \(h\) matrices one for each nonterminal \(k \in N\).
Let us call such a matrix \(M_k\). We fill it in this way:
If nonterminal \(k\) is present in the cell \((i,j)\) of \(a\),
the corresponding cell in \(M_k\) will be true
, and false
otherwise.
We do the same for matrix \(b\), getting \(2*h\) boolean matrices.
Let us call these \(M_k^a\) and \(M_k^b\) where \(k\) indicates the nonterminal.
Next, we pair each matrix from \(M_k^a\) and \(M_k^b\) for each \(k \in N\) obtaining \(h^2\) pairs, and compute the boolean matrix multiplication of each pairs. We address each as \(r(l,m)\) where \(l \in N\) and \(m \in N\).
multiply two boolean matrices
In the final matrix \(c = a * b\), for the cell \(c(i,j)\) it will contain the nonterminal \(p \in N\) iff there exist l,m such that a rule \(p -> l m\) exists, and the matrix \(r(l,m)\) contains \(1\) in cell \((i,j)\).
Let us try testing the matrix multiplication.
Matrix union
Next, we want to define how to make a union of two matrices
Let us try testing it.
Transitive relation
Valiant showed that we can compute the transitive relation – parsable in i steps – can be computed using matrix multiplication. For a matrix \(a^{(i)}\), the relation is given by:
\(a^{(i)} = U_{j=1}^{i-1} a^{(j)} * a^{(i-j)}\) when \(i > 1\)
\(a^{(1)} = a\) when \(i = 1\)
The intuition here is that if we have a 4 letter input, it may be parsed by splitting into 1+3, 2+2, or 3+1. So, we compute \(a^{(1)}*a^{(3)} U a^{(2)}*a^{(2)} U a^{(3)}*a^{(1)}\).
At this point, we are ready to define the transitive relation.
We can now test it.
step 2 b1 = A(1)
step 3.a [i=2] => b1= A(1) U b2= A(j=1)*A(i-j=1)
– till j == i-1
step 3.b [i=3] => b1=A(1) U b2=A(j=1)*A(i-j=2) U b3=A(j=2)*A(i-j=1)
– till j == i-1
Here is another grammar where we work through the same steps.
On parsing ababa
, we get the following CYK table.
a | b | a | b | a |
---|---|---|---|---|
A, C | B | |||
S, C | A, S | |||
B | S, C | B | ||
B | B | A, C | B | |
S, C, A | B | S, C | A, S | A, C |
Please note that the representation of the matrix is different from how we do it.
Next, let us run the steps
Transitive closure
Valiant further showed that the transitive closure of all these matrices, that is
\[a^{+} = a^{(1)} U a^{(2)} ...\]is the parse matrix. That is, building the transitive closure builds the complete parse chart. We can now check if the input was parsed.
Testing.
Recognize
Let us hook it up.
Using it
ValiantParser
Note: The recognizer works well, but the tree extraction is naive.
At this point, we have the recognition matrix. To make this into a true parser, similar to CYK, we can add back pointers. However, Ruzzo3 showed that if we have the CYK or Valiant recognition matrix (both are same) we can extract a parse tree in at most \(O(log(n))\) slower than the recognizer. Here, we implement a naive algorithm that just shows how we can extract a single tree.
Extracting trees
Unlike GLL, GLR, and Earley, and like CYK, due to restricting epsilons to the start symbol, there are no infinite parse trees. Furthermore, we only pick the first available tree. This can be trivially extended if needed.
The basic idea here is that, given a rule \(S -> A B\) that parsed text
\(W\), and we find \(S\) in the recognition matrix, the nonterminal \(B\) that
helped in parsing \(W\) has to be found in the same column as that of \(S\).
So, we look through the final column and generate a list of tuples where the
tuple has the format (idx, nonterminal)
where idx
is the point where \(B\)
started parsing from. At this point, if we look at the column idx-1
, then at
the top of the column (in 0th row) there has to be the nonterminal \(A\) that
is on the other side of the rule.
Testing it
Incorporating the breaks in tree.
Parse
Adding the extract tree
Using it (uses random choice, click run multiple times to get other trees).
assign display_tree
Let us do a final test
Using
Artifacts
The runnable Python source for this notebook is available here.