TLDR; This tutorial is a complete implementation of CYK Parser in Python^{1}
The Python interpreter is embedded so that you can work through the
implementation steps.
A CYK parser is a general context-free parser.
It uses dynamic # programming to fill in a chart. Unlike Earley, GLL and GLR
parsers, it requires the grammar to be in the Chomsky normal form, which
allows at most two symbols on the right hand side of any production.
In particular, all the rules have to conform to
\[A -> BC\]
\[A -> a\]
\[S -> \epsilon\]
Where A,B, and C are nonterminal symbols, a is a terminal symbol, S is the
start symbol, and \(\epsilon\) is the empty string.
We previously discussed
Earley parser which is a general context-free parser. CYK
parser is another general context-free parser that is capable of parsing
strings that conform to any given context-free grammar.
Similar to Earley, GLR, GLL, and other general context-free parsers, the worst
case for CYK parsing is \(O(n^3)\) . However, unlike those parsers, the best
case is also \(O(n^3)\) for all grammars. A peculiarity of CYK parser is that
unlike Earley, GLL, and GLR, it is not a left-to-right parser. In this respect,
the best known other parser that is not left-to-right is Ungar’s parser.
Furthermore, it is a bottom-up parser that builds substrings of fixed length
from the bottom up.
Important:Pyodide takes time to initialize.
Initialization completion is indicated by a red border around Run all button.
Definitions
For this post, we use the following terms:
The alphabet is the set all of symbols in the input language. For example,
in this post, we use all ASCII characters as alphabet.
A terminal is a single alphabet symbol. Note that this is slightly different
from usual definitions (done here for ease of parsing). (Usually a terminal is
a contiguous sequence of symbols from the alphabet. However, both kinds of
grammars have a one to one correspondence, and can be converted easily.)
For example, x is a terminal symbol.
A nonterminal is a symbol outside the alphabet whose expansion is defined
in the grammar using rules for expansion.
For example, <term> is a nonterminal in the below grammar.
A rule is a finite sequence of terms (two types of terms: terminals and
nonterminals) that describe an expansion of a given terminal. A rule is
also called an alternative expansion.
For example, [<term>+<expr>] is one of the expansion rules of the nonterminal <expr>.
A production is a combination of a nonterminal and one of its productions.
For example <expr>: [<term>+<expr>] is a production.
A definition is a set of rules that describe the expansion of a given nonterminal.
For example, [[<digit>,<digits>],[<digit>]] is the definition of the nonterminal <digits>
A context-free grammar is composed of a set of nonterminals and
corresponding definitions that define the structure of the nonterminal.
The grammar given below is an example context-free grammar.
A terminal derives a string if the string contains only the symbols in the
terminal. A nonterminal derives a string if the corresponding definition
derives the string. A definition derives the string if one of the rules in
the definition derives the string. A rule derives a string if the sequence
of terms that make up the rule can derive the string, deriving one substring
after another contiguously (also called parsing).
A derivation tree is an ordered tree that describes how an input string is
derived by the given start symbol. Also called a parse tree.
A derivation tree can be collapsed into its string equivalent. Such a string
can be parsed again by the nonterminal at the root node of the derivation
tree such that at least one of the resulting derivation trees would be the
same as the one we started with.
The yield of a tree is the string resulting from collapsing that tree.
An epsilon rule matches an empty string.
Prerequisites
As before, we start with the prerequisite imports.
Available Packages
These are packages that refer either to my previous posts or to pure python
packages that I have compiled, and is available in the below locations. As
before, install them if you need to run the program directly on the machine.
To install, simply download the wheel file (`pkg.whl`) and install using
`pip install pkg.whl`.
We need the fuzzer to generate inputs to parse and also to provide some
utilities
We use the display_tree() method in earley parser for displaying trees.
We use the random choice to extract derivation trees from the parse forest.
As before, we use the fuzzingbook grammar style.
A terminal symbol has exactly one character
(Note that we disallow empty string ('') as a terminal symbol).
Let us start with the following grammar.
We initialize our parser with the grammar, and identify the terminal and
nonterminal productions separately. termiinal productions are those that
are of the form <A> ::= a where a is a terminal symbol. Nonterminal
productions are of the form <A> ::= <B><C> where <B> and <C> are
nonterminal symbols.
Next, we define the recognizer. The idea here is that CYK algorithm formulates
the recognition problem as a dynamic problem where the parse of a string of
length n using a nonterminal <A> which is defined as <A> ::= <B> <C> is
defined as a parse of the substring 0..x with the nonterminal <B> and the
parse of the substring x..n with nonterminal <C> where 0 < x < n. That
is, <A> parses the string if there exists such a parse with <B> and <C>
for some x.
We first initialize the matrix that holds the results. The cell[i][j]
represents the nonterminals that can parse the substring text[i..j]
Let us define a printing routine.
Using it
Next, we define the base case, which is when a single input token matches one
of the terminal symbols. In that case, we look at each character in the input,
and for each i in the input we identify cell[i][i+1] and add the
nonterminal symbol that derives the corresponding token.
Using it
Next, we define the multi-token parse. The idea is to build the table
incrementally. We have already indicated in the table which nonterminals parse
a single token. Next, we find all nonterminals that parse two tokens, then
using that, we find all nonterminals that can parse three tokens etc.
Using it for example, on substrings of length 2
For length 3
For length 4
We combine everything together. At the end, we check if the start_symbol can
parse the given tokens by checking (0, n) in the table.
Using it
At this point, we have a recognizer for a grammar in CNF form. However, this
is a bit unsatisfying because the CNF form is not very userfriendly. In
particluar it lacks two conveniences we are used to in context-free gramamrs
Having more than two tokens in a rule
The ability to add epsilon rules.
The first one is not very difficult to solve. Given an expansion rule for a
nonterminal
and so on. We can also recover the structure back from any parse tree by
combining the corresponding tokens. The second restriction is more difficult
Having to factor out epsilon can change the grammar completely. Turns out, it
is not very difficult to incorporate epsilons into this parser.
First, we extract all nullable nonterminals of our grammar. (See Earley parser)
A grammar. Note that we use <> to represent empty (epsilon) nonterminal
Testing
So at this point, all we have to do is, after each cell is computed fully, we
just have to extend the cell with the guaranteed parse.
Testing
Handling epsilons now allows us to get to the next step in supporting any
context free grammar. We require two more steps to support.
Replacing terminal symbols with nonterminal symbols representing tokens
Ensuring that there are exactly two nonterminal symbols in any nonterminal
rule. We will handle the first one now.
Test
Next, we want to replace any rule that contains more than two tokens with
it decomposition.
[] = []
[t1] = [t1]
[t1, t2] = [t1, t2]
[t1, t2, t3] = [t1, _t2], _t2 = [t2, t3]
test
decompose grammar
all that remains now is to ensure that each rule is exactly two
token nonterminal, or a single token terminal.
connecting everything together
A grammar
Test
Testing
CYKParser
Now, all we need to do is to add trees. Unlike GLL, GLR, and Earley, due to
restricting epsilons to the start symbol, there are no infinite parse trees.
The parse_1 for terminal symbols
The substring parse
Parsing
Using it (uses random choice, click run multiple times to get other trees).
assign display_tree
Artifacts
The runnable Python source for this notebook is available here.
The installable python wheel cykparser is available here.
Dick Grune and Ceriel J.H. Jacobs “Parsing Techniques A Practical Guide” 2008 ↩