Important: Pyodide takes time to initialize. Initialization is complete when the Run all button shows a red border.

We talked about Earley parsers previously. One of the interesting things about the Earley parser is that it forms the basis of the best known general context-free error-correcting parser. A parser is error correcting if it is able to parse corrupt inputs that only partially conform to a given grammar. The particular algorithm we will be examining is the minimum distance error-correcting parser by Aho and Peterson [1].

Covering Grammar

The idea from Aho and Peterson is to first transform the given grammar into a covering grammar. A grammar \(G_2\) covers another grammar \(G_1\) if (in essence) all productions in \(G_1\) are also in \(G_2\), any string that is parsed by \(G_1\) is guaranteed to be parsed by \(G_2\), and all parses from \(G_1\) are guaranteed to exist in the set of parses from \(G_2\).

So, we first construct a covering grammar that can handle any corruption of the input, with the additional property that some parse of the corrupt string records the minimum number of modifications that, when applied to the string, make it parsable by the original grammar.

First, the prerequisites

Our Grammar

We can also print it.
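As a concrete illustration, here is a hypothetical grammar in the dictionary format this post uses (a nonterminal maps to a list of rules, each rule a list of symbols); the nonterminal names and rules here are made up for this sketch, along with a simple printer:

```python
# A hypothetical sample grammar (an assumption for this sketch): each key
# is a nonterminal, each rule is a list of terminals and nonterminals.
grammar = {
    '<start>': [['<expr>']],
    '<expr>': [['<expr>', '+', '<digit>'], ['<digit>']],
    '<digit>': [['0'], ['1']],
}

def print_grammar(g):
    # Print each nonterminal with its alternatives, one rule per line.
    for key, rules in g.items():
        print(key, '::=')
        for rule in rules:
            print('   |', ' '.join(rule))

print_grammar(grammar)
```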

Checking whether a term is nonterminal
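A minimal sketch of such a check, assuming nonterminals are written in angle brackets as in the grammar above:

```python
def is_nt(symbol):
    # A symbol is a nonterminal if it is wrapped in angle brackets,
    # e.g. '<expr>'; anything else is treated as a terminal.
    return len(symbol) > 2 and symbol[0] == '<' and symbol[-1] == '>'
```

For example, `is_nt('<expr>')` is `True` while `is_nt('+')` is `False`.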

Now, the covering grammar itself. The covering grammar constructed by Aho and Peterson is fairly straightforward. It handles three possible mutations of the input:

  • The replacement of a terminal symbol by another terminal symbol
  • The insertion of an extra terminal symbol
  • Deletion of a terminal symbol

Any number and combination of these mutations can accumulate in an input. In effect, any string can be considered a mutation of some parsable string, and hence we can expect the covering grammar to parse it.

Next, to make sure that any string is parsable, we need a nonterminal that is capable of parsing any string. As a first step, let us define a new terminal symbol that stands in for any terminal symbol. This stands for \(I\) in Aho's paper.

We should be able to parse any number of such symbols. So, we define a new nonterminal for that. This stands for \(H\) in Aho’s paper.
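The two definitions above can be sketched as follows; the symbol spellings {$.} and <$.+> are taken from the correction rules shown below, but the variable names are assumptions:

```python
# The special terminal that matches any single input symbol
# (this is I in Aho's paper).
Any_one = '{$.}'

# The nonterminal that matches one or more arbitrary symbols
# (this is H in Aho's paper):  <$.+> -> <$.+> {$.} | {$.}
Any_plus = '<$.+>'
any_plus_rules = {Any_plus: [[Any_plus, Any_one], [Any_one]]}
```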

In a similar fashion, we also need a terminal symbol that will match any terminal symbol except a given one. Since this is specific to a terminal symbol, let us make it a method.
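A sketch of Any_not, written here as a plain function rather than a method for self-containment:

```python
def Any_not(t):
    # A terminal that matches any input symbol except t, spelled '{!t}'.
    return '{!%s}' % t
```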

Note that both Any_one and Any_not can be turned into nonterminal symbols with corresponding definitions if required, so that the resulting grammar is fully in context-free format. We do not do that here because it is easier this way.

How do we check for match between a terminal symbol and a given input symbol?
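One way to sketch the match check, given the special symbols above (the function name tmatch is an assumption):

```python
Any_one = '{$.}'

def tmatch(t, inp):
    # Match a (possibly special) terminal t against one input symbol inp.
    if t == Any_one:                          # matches any symbol
        return True
    if t.startswith('{!') and t.endswith('}'):
        return inp != t[2:-1]                 # anything but the excluded symbol
    return t == inp                           # ordinary terminal: exact match
```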

Checking it

We need to transform our grammar. Essentially, the idea is that for each terminal symbol in the grammar, we add a nonterminal symbol that handles the following possibilities:

  • The terminal is matched exactly as provided with a symbol in input
  • The symbol that should match the terminal is replaced by some other symbol in the input
  • Some junk symbols are present before the symbol that matches the given terminal
  • The expected symbol was deleted from input string

That is, given a terminal symbol a, we add the following error productions, where <$ a> is the corresponding nonterminal.

  • <$ a> -> a
  • <$ a> -> <$.+> a
  • <$ a> -> \(\epsilon\)
  • <$ a> -> {!a}
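The four productions above can be generated mechanically; a sketch, where the helper names are assumptions:

```python
Any_one = '{$.}'
Any_plus = '<$.+>'

def nt_for_terminal(t):
    # The covering nonterminal '<$ t>' that stands for terminal t.
    return '<$ %s>' % t

def error_rules(t):
    # The four alternatives: exact match, junk before the match,
    # deletion (the empty rule), and replacement by any other symbol.
    return [[t], [Any_plus, t], [], ['{!%s}' % t]]

# e.g. the productions added for terminal 'a':
# error_rules('a') -> [['a'], ['<$.+>', 'a'], [], ['{!a}']]
```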

For each such correction, we add one penalty. In essence, each use of the following general correction rules incurs one penalty. That is, each any-character {$.} is a correction, and the count of such any-characters is the penalty in these rules.

  • <$.+> -> <$.+> {$.}
  • <$.+> -> {$.}

These terminal corrections also incur one penalty each. As above, the penalty here is for not matching the particular expected character.

  • <$ a> -> \(\epsilon\)
  • <$ a> -> {!a}

Notice that we do not have to add a penalty for junk insertion, because that penalty is already applied by the general correction rules.

These are added to the grammar as follows.

We modify the grammar to add this new nonterminal.
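A sketch of the whole transformation under the same assumptions as before (the function name covering_grammar is made up):

```python
def is_nt(symbol):
    # Nonterminals are wrapped in angle brackets, e.g. '<expr>'.
    return len(symbol) > 2 and symbol[0] == '<' and symbol[-1] == '>'

Any_one, Any_plus = '{$.}', '<$.+>'

def covering_grammar(g):
    # Replace every terminal t in every rule by its covering nonterminal
    # '<$ t>', then add the four error productions for each such terminal.
    new_g = {Any_plus: [[Any_plus, Any_one], [Any_one]]}
    terminals = set()
    for key, rules in g.items():
        new_rules = []
        for rule in rules:
            new_rule = []
            for t in rule:
                if is_nt(t):
                    new_rule.append(t)
                else:
                    terminals.add(t)
                    new_rule.append('<$ %s>' % t)
            new_rules.append(new_rule)
        new_g[key] = new_rules
    for t in terminals:
        new_g['<$ %s>' % t] = [[t], [Any_plus, t], [], ['{!%s}' % t]]
    return new_g
```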

Finally, we need to modify the start symbol to allow junk symbols after the parse. This is handled by adding a new start symbol as below.

  • <$start> -> <start>
  • <$start> -> <start> <$.+>
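A sketch of this last step (the helper name is an assumption):

```python
def add_start(g, old_start='<start>'):
    # Wrap the old start symbol so that trailing junk symbols after a
    # complete parse are also accepted.
    g = dict(g)
    g['<$start>'] = [[old_start], [old_start, '<$.+>']]
    return g, '<$start>'
```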

Next, we extract the nullable keys used in Earley parsing.
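Nullable nonterminals can be computed with the usual fixpoint iteration; a sketch:

```python
def nullable_keys(g):
    # Fixpoint: a nonterminal is nullable if some rule consists solely
    # of nullable nonterminals (the empty rule is trivially so).
    nullable = set()
    changed = True
    while changed:
        changed = False
        for key, rules in g.items():
            if key in nullable:
                continue
            if any(all(t in nullable for t in rule) for rule in rules):
                nullable.add(key)
                changed = True
    return nullable
```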

The column definition is exactly the same as before, but with one crucial difference: when we compare states for uniqueness in add(), we give priority to states with a lower penalty. That is, if the state being added has a lower penalty than an existing equivalent state, it replaces the existing one.
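The add() difference can be sketched with a stripped-down State and Column; the real classes carry more bookkeeping, so the names and fields here are assumptions:

```python
class State:
    # Minimal sketch of an Earley state carrying a penalty.
    def __init__(self, name, expr, dot, s_col, penalty=0):
        self.name, self.expr, self.dot = name, expr, dot
        self.s_col, self.penalty = s_col, penalty
    def _key(self):
        return (self.name, tuple(self.expr), self.dot, self.s_col)

class Column:
    def __init__(self, index, letter):
        self.index, self.letter = index, letter
        self.states, self._unique = [], {}
    def add(self, state):
        key = state._key()
        if key in self._unique:
            existing = self._unique[key]
            # Crucial difference: keep the lower of the two penalties.
            if state.penalty < existing.penalty:
                existing.penalty = state.penalty
            return existing
        self._unique[key] = state
        self.states.append(state)
        return state
```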

No changes in the definition of state.

Next, the parser itself. We need to make sure that the grammar we use is the updated grammar.

The remaining code is almost exactly the same as the original Earley parser, except for complete(), where we have to transfer the penalties to the parent parse.
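The penalty transfer can be illustrated in isolation (names are assumptions; the real complete() also re-adds the advanced state to the current column):

```python
class State:
    # Minimal sketch of an Earley state carrying a penalty.
    def __init__(self, name, expr, dot, s_col, penalty=0):
        self.name, self.expr, self.dot = name, expr, dot
        self.s_col, self.penalty = s_col, penalty
    def advance(self):
        return State(self.name, self.expr, self.dot + 1,
                     self.s_col, self.penalty)

def advance_with_penalty(parent, completed):
    # When a completed child state lets a parent advance past it, the
    # child's accumulated penalty is transferred onto the advanced parent.
    new = parent.advance()
    new.penalty = parent.penalty + completed.penalty
    return new
```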

The extraction of parse trees.

Now, we can use a simple extractor to extract one of the error-correcting parses with the lowest penalty.
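The selection itself is just a minimum over penalties; a trivial sketch, assuming the recovered parses are available as (penalty, tree) pairs:

```python
def best_parse(parses):
    # parses: list of (penalty, tree) pairs; pick the lowest penalty.
    return min(parses, key=lambda p: p[0])
```

For example, `best_parse([(2, 'tree_a'), (0, 'tree_b')])` returns `(0, 'tree_b')`.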

  1. Alfred V. Aho and Thomas G. Peterson, A Minimum Distance Error-Correcting Parser for Context-Free Languages, SIAM Journal on Computing, 1972 https://doi.org/10.1137/0201022