We talked about Earley parsers previously.
One of the interesting things about Earley parsers is that they also form the
basis of the best known general context-free error-correcting parser. A parser is
error correcting if it is able to parse corrupt inputs that only partially
conform to a given grammar. For example, given a JSON input such as
'[{"abc":[]'
The error-correcting parser will be able to supply the text that is
necessary to make the input valid. In this case, it will supply }].
An algorithm is minimally error correcting if the correction provided is
minimal in some sense. For example, if the correction supplied above were
, "":[]}] instead, it would not be minimal.
The particular algorithm we will be examining is
the minimum distance error correcting parser by Aho et al.1.
There are two parts to this algorithm. The first is the idea of a
covering grammar that parses any corrupt input and the second is the
extraction of the best possible parse from the corresponding parse forest.
Aho et al. use the Earley parser for their error-correcting parser, so we will
follow in their footsteps.
Important: Pyodide takes time to initialize.
Initialization completion is indicated by a red border around the Run all button.
Available Packages
These are packages that refer either to my previous posts or to pure Python
packages that I have compiled, and they are available in the locations below. As
before, install them if you need to run the program directly on your machine.
The idea from Aho et al. is to first transform the given grammar into a
covering grammar. A grammar \(G_2\) covers another grammar \(G_1\) if
all productions in \(G_1\) have a one-to-one correspondence to some production
in \(G_2\), any string that is parsed by \(G_1\) is guaranteed to be parsed
by \(G_2\), and all the parses from \(G_1\) are guaranteed to exist in the set
of parses from \(G_2\) (under the given homomorphism of productions).
So, we first construct a covering grammar that can handle any corruption of
the input, with the additional property that there will be a parse of the
corrupt string containing the minimum number of modifications which, when
applied to the string, make it parseable by the original grammar.
First, we load the prerequisites
The following is our grammar and its start symbol.
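For concreteness, a small expression grammar in the dict representation used throughout these posts might look like the sketch below. The exact grammar used in this notebook may differ; the names `grammar` and `START` here are placeholders.

```python
# A sketch of the grammar representation: a dict mapping each nonterminal
# (written '<...>') to a list of rules, each rule a list of tokens.
grammar = {
    '<start>': [['<expr>']],
    '<expr>': [['<term>', '+', '<expr>'],
               ['<term>', '-', '<expr>'],
               ['<term>']],
    '<term>': [['x'], ['y'], ['1']],
}
START = '<start>'
```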
The grammar can be printed as follows.
For example,
Here is the (slightly simplified – not all space characters are terminals) JSON grammar
Now, constructing a covering grammar proceeds as follows.
First, we take each terminal symbol in the given grammar. For example, the
below contains all terminal symbols from our grammar
Next, we consider the following corruptions of the valid input:
1. The input symbol being considered may have been deleted.
2. The input symbol being considered may have some junk value in front of it.
3. The input symbol may have been mistakenly written as something else.
A moment’s reflection should convince you that a covering grammar only needs
to handle these three cases. (In fact, the first two cases alone are sufficient,
but we add the third because it is also a simple mutation.)
The main idea is that we replace the given terminal symbol with an equivalent
nonterminal that lets you make these mistakes. So, we first define that
nonterminal that corresponds to each terminal symbol.
We also define a convenience function that when given a rule, translates the
terminal symbols in that rule to the above nonterminal symbol.
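Here is a minimal sketch of how these two pieces might look, assuming the dict-based grammar representation above. The helper names are my own, not necessarily those used in the actual implementation.

```python
def is_nt(token):
    # nonterminals are written within angle brackets, e.g. '<expr>'
    return token.startswith('<') and token.endswith('>')

def corrupt_symbol(t):
    # the covering nonterminal that corresponds to the terminal symbol t
    return '<$ [%s]>' % t

def corrupt_rule(rule):
    # translate every terminal in a rule to its covering nonterminal
    return [t if is_nt(t) else corrupt_symbol(t) for t in rule]
```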
How are these nonterminals defined? Each nonterminal has the following
expansion rules
<$ [a]> -> a
| <$.+> a
| <$>
| <$![a]>
That is, each nonterminal that corresponds to a terminal symbol has the
following expansions (a sketch of these rules in code follows the list):
1. it matches the original terminal symbol
2. there is some junk before the terminal symbol, so match and discard
that junk before matching the terminal symbol. <$.+> matches any number of
arbitrary symbols; these are the corresponding nonterminal names
3. the terminal symbol was deleted, so skip it by matching
empty (<$>)
4. the input symbol was a mistake, that is, it matches any input symbol other
than the expected input symbol a (<$![a]>). We have to define as many such
nonterminals as there are terminal symbols, so we define a function.
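Put together, the expansion rules for the covering nonterminal of a single terminal symbol can be sketched as follows, reusing `corrupt_symbol()` from above; the rule layout mirrors the display, but the actual implementation may differ.

```python
def corrupt_rules_for(a):
    # the four expansions of the covering nonterminal for terminal `a`
    return {corrupt_symbol(a): [
        [a],                    # the terminal matched as-is
        ['<$.+>', a],           # junk symbols, then the terminal
        ['<$>'],                # the terminal was deleted
        ['<$![%s]>' % a],       # some other symbol was written instead
    ]}
```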
Note: it would be possible to completely remove Any_not by simply using
Any_one instead. The idea is that if Any_one matches a symbol, we apply a
single penalty; if it matches the current symbol (thus becoming a NOP),
the use of this rule would still be penalized, and hence not extracted. This
would reduce the grammar size overhead.
What happens if there is junk after parsing? We take care of that by wrapping
the start symbol as follows
<$corrupt_start> -> <start>
| <start> <$.+>
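In code, this wrapping might be sketched as below; the function name is hypothetical.

```python
def add_corrupt_start(grammar, start):
    # wrap the original start symbol so that trailing junk is also covered
    g = dict(grammar)
    g['<$corrupt_start>'] = [[start], [start, '<$.+>']]
    return g, '<$corrupt_start>'
```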
It is used as follows
Finally, we are ready to augment the original grammar so that what we
have is a covering grammar. We first extract the symbols used, then produce
the nonterminal Any_one that corresponds to a match of any single symbol. Next,
we use Any_not to produce a match of any symbol except a given one. We then have an
Empty to match the absence of a symbol.
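The following sketch shows how these steps might fit together, reusing the helpers sketched earlier (`is_nt()`, `corrupt_rule()`, `corrupt_rules_for()`). The function names are my own, and the real implementation may organize this differently.

```python
def terminal_symbols(grammar):
    # every terminal symbol used anywhere in the grammar
    return {t for rules in grammar.values() for rule in rules
            for t in rule if not is_nt(t)}

def augment_grammar(grammar, start):
    terminals = terminal_symbols(grammar)
    # translate the original rules so that they refer to covering nonterminals
    g = {k: [corrupt_rule(r) for r in rules] for k, rules in grammar.items()}
    g['<$.>'] = [[t] for t in terminals]          # Any_one: any single symbol
    g['<$.+>'] = [['<$.>', '<$.+>'], ['<$.>']]    # Any_plus: one or more symbols
    g['<$>'] = [[]]                               # Empty: matches nothing
    for a in terminals:
        # Any_not: any symbol except `a`
        g['<$![%s]>' % a] = [[t] for t in terminals if t != a]
        g.update(corrupt_rules_for(a))
    # wrap the start symbol to cover trailing junk
    g['<$corrupt_start>'] = [[start], [start, '<$.+>']]
    return g, '<$corrupt_start>'
```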
Here is the augmented grammar
Here is the augmented grammar for JSON
At this point, we are ready to check the covering properties of our grammar.
We define a `tree_to_str_fix()` that indicates the corrections produced.
We also define a `tree_to_str_delta()` that indicates the corrections produced.
As you can see, the covering grammar can recognize the input, but we have no
guarantee that it will only parse the input corresponding to the original
grammar.
What about an error?
As you can see, we can parse corrupt inputs, but the corrections we produce are
not necessarily the smallest. The next step is to extract
the minimally corrupt parse.
The minimally corrupt parse.
First, we need to modify the Earley parser so that it can keep track of the
penalties. We essentially assign a penalty if any of the following is used
(as sketched after this list):
1. Any use of <$.> : Any_one (note: Any_plus gets a penalty automatically, since it expands via Any_one)
2. Any use of <$> : Empty
3. Any use of <$![.]> : Any_not
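As a sketch, such a penalty test over the nonterminal names used above might look like this; the function name is mine.

```python
def symbol_penalty(symbol):
    # one unit of corruption for each use of a correction nonterminal
    if symbol == '<$.>': return 1            # Any_one: an inserted junk symbol
    if symbol == '<$>': return 1             # Empty: a deleted symbol
    if symbol.startswith('<$!['): return 1   # Any_not: a mistaken symbol
    # <$.+> needs no explicit penalty: it expands via <$.>,
    # so every junk symbol it consumes is already charged there
    return 0
```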
To get this to work, we have to reengineer our nullable-nonterminal computation,
keeping track of the corruptions introduced.
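The idea can be sketched as a fixed point computation: for each nonterminal, find the minimum corruption cost with which it can derive the empty string. This reuses `symbol_penalty()` from above; the actual implementation may differ.

```python
def nullable_penalties(grammar):
    # min_cost[k]: the smallest corruption cost with which nonterminal k
    # can derive the empty string; terminals never become nullable
    min_cost = {}
    changed = True
    while changed:
        changed = False
        for key, rules in grammar.items():
            for rule in rules:
                if all(t in min_cost for t in rule):
                    c = sum(min_cost[t] for t in rule) + symbol_penalty(key)
                    if key not in min_cost or c < min_cost[key]:
                        min_cost[key] = c
                        changed = True
    return min_cost
```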
This is how the nullable computation is used. We do not expect any nullable keys in this grammar.
covering grammar
JSON covering grammar
Now, we attach our nullable function to our parser.
We also need to keep track of the penalty. The first place is
in complete, where we propagate a penalty to the parent
if the state being completed resulted in a penalty.
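A sketch of the idea, assuming the EarleyParser and State interface (at_dot(), advance()) from the earlier Earley parser post, and a penalty attribute added to states; the real implementation may differ in detail.

```python
class CorrectingParser(EarleyParser):
    def complete(self, col, state):
        # find the parent states waiting for this nonterminal ...
        parent_states = [st for st in state.s_col.states
                         if st.at_dot() == state.name]
        for st in parent_states:
            new_state = st.advance()
            # ... and carry the completed state's penalty over to them
            new_state.penalty = st.penalty + state.penalty
            col.add(new_state)
```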
Next, we also need to account for the fact that some of our
corrections are empty, and these carry their own penalties.
So, we hook this up in predict.
So, how do we hook up the penalty for corrections? We do that in
the penalized states as below. We also make sure that the
penalties are propagated.
Now, we come to adding a state to the column. If you remember from
the Earley parser post, the uniqueness
check in add prevents unbounded left recursion for nonterminals that
allow empty expansion.
There is one complication here. We need to keep track of the minimum penalty
that a state incurred. In particular, any time we find a less corrupt parse,
we need to use that penalty. Now, simply updating the penalty of the state
will not work, because child states derived from the previous, higher-penalty
state would also need to be updated. So, to force the issue, we simply append
the new state to the column's state list. We can ignore the fact that the
column's state list now also contains a higher-penalty state, because at the end
the forest builder looks for states with the lowest penalty.
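A sketch of such an add method, assuming states carry a penalty attribute; the bookkeeping dictionary `_seen` and the class name are my own, and the real Column from the Earley post also tracks the column index and input letter.

```python
class PenaltyColumn:
    def __init__(self):
        self.states, self._seen = [], {}

    def add(self, state):
        key = (state.name, state.expr, state.dot, state.s_col.index)
        seen = self._seen.get(key)
        if seen is not None and seen.penalty <= state.penalty:
            # an equally cheap (or cheaper) copy already exists; reuse it
            return seen
        # either a new state or a lower-penalty copy of an existing one.
        # Append it even though a costlier copy may remain in the list;
        # the forest builder later picks the lowest-penalty states.
        self._seen[key] = state
        state.e_col = self
        self.states.append(state)
        return state
```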
We need to hook up our new column and state to the Earley parser.
Finally, we hook up our simple extractor to choose the lowest cost path.
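The selection itself can be sketched as picking, among the completed start states that span the whole input, the one with the smallest penalty (assuming the chart is a list of columns and states offer finished() as in the Earley post).

```python
def lowest_penalty_state(chart, start_symbol):
    # completed parses of the start symbol covering the entire input
    candidates = [st for st in chart[-1].states
                  if st.name == start_symbol and st.finished()
                  and st.s_col.index == 0]
    return min(candidates, key=lambda st: st.penalty)
```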
Here is how we use it.
Caution: this command can take time (about 10 seconds on a MacBook Pro).
Why is this so slow? One reason is that, for conceptual clarity and
generality, we opted to expand two terms from the original paper.
In particular, we chose to define Any_one: <$.> as well as
Any_not_str: <$![.]> as nonterminal symbols. This means that
to match <$.> (say we have \(T\) terminal symbols), we have to carry
an extra \(T\) symbols per symbol, essentially giving us \(T^2\)
extra matches to perform. For matching <$![.]>, the situation is worse:
we have to carry \(T^2\) symbols per terminal, giving \(T^3\)
matches per original terminal symbol.
Fortunately, there is an optimization possible here. We can make
Any_one: . and Any_not(a): !a terminal symbols, and fix the
terminal match so that we match any character on . and any character except
the given one (e.g. a) on !a. What we lose is generality. That is, the
augmented context-free grammar will no longer be usable by other parsers
(unless they are also modified to match such regular-expression-like terminals).
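A sketch of what the modified terminal match might look like, where '.' and '!a' are treated as special terminal symbols; encoding the "not" terminals as strings starting with '!' is my assumption.

```python
def terminal_match(terminal, char):
    if terminal == '.':            # Any_one: matches any single character
        return True
    if terminal.startswith('!'):   # Any_not: anything except the given character
        return char != terminal[1:]
    return char == terminal        # an ordinary terminal
```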
We modify our Earley parser to expect these. First, our strings.
Now our parser.
Our grammars are augmented this way.
Using it.
Testing x+y
Testing x+1
Now, why is this still slow? This is due to the following reasons:
1. The algorithm for recognition is \(O(n^3)\). This is a consequence
of the fact that our covering grammar is simply a context-free grammar, and,
as you can see, the size of the covering grammar is only \(O(|G| + |T|^3)\),
where \(|G|\) is the original grammar size and \(|T|\) is the number of
terminals (that is, a constant with respect to the input).
2. We are running the algorithm in Python implemented over
WASM. Python by itself is fairly slow. In our benchmark on EarleyParser, we
found that translating a Python parser to Java line by line resulted in an
improvement of over 300 times (t = original/300).
The runnable Python source for this notebook is available here.
Alfred V. Aho and Thomas G. Peterson, A Minimum Distance Error-Correcting Parser for Context-Free Languages, SIAM Journal on Computing, 1972 https://doi.org/10.1137/0201022↩