Important: Pyodide takes time to initialize. Initialization is complete when the Run all button shows a red border.

We talked about Earley parsers previously. One of the interesting things about the Earley parser is that it forms the basis of the best known general context-free error-correcting parser. A parser is error correcting if it is able to parse corrupt inputs that only partially conform to a given grammar. The particular algorithm we will be examining is the minimum distance error-correcting parser by Aho and Peterson [1].

Covering Grammar

The idea from Aho and Peterson is to first transform the given grammar into a covering grammar. A grammar \(G_2\) covers another grammar \(G_1\) if (in essence) all productions in \(G_1\) are also in \(G_2\), any string that is parsed by \(G_1\) is guaranteed to be parsed by \(G_2\), and all parses from \(G_1\) are guaranteed to exist in the set of parses from \(G_2\).

So, we first construct a covering grammar that can handle any corruption of the input, with the additional property that some parse of the corrupt string records the minimum number of modifications that, when applied to the string, make it parsable by the original grammar.

First, the prerequisites

Our Grammar

We can also print it.
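As a concrete illustration, here is a hypothetical grammar in the dictionary format this post uses (a nonterminal maps to a list of rules, each rule a list of symbols); the nonterminal names and rules here are made up for this sketch, along with a simple printer:

```python
# A hypothetical sample grammar (an assumption for this sketch): each key
# is a nonterminal, each rule is a list of terminals and nonterminals.
grammar = {
    '<start>': [['<expr>']],
    '<expr>': [['<expr>', '+', '<digit>'], ['<digit>']],
    '<digit>': [['0'], ['1']],
}

def print_grammar(g):
    # Print each nonterminal with its alternatives, one rule per line.
    for key, rules in g.items():
        print(key, '::=')
        for rule in rules:
            print('   |', ' '.join(rule))

print_grammar(grammar)
```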

Checking whether a term is nonterminal
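A minimal sketch of such a check, assuming nonterminals are written in angle brackets as in the grammar above:

```python
def is_nt(symbol):
    # A symbol is a nonterminal if it is wrapped in angle brackets,
    # e.g. '<expr>'; anything else is treated as a terminal.
    return len(symbol) > 2 and symbol[0] == '<' and symbol[-1] == '>'
```

For example, `is_nt('<expr>')` is `True` while `is_nt('+')` is `False`.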

Now, the covering grammar itself. The covering grammar constructed by Aho and Peterson is fairly straightforward. It handles three possible mutations of the input:

  • The replacement of a terminal symbol by another terminal symbol
  • The insertion of an extra terminal symbol
  • Deletion of a terminal symbol

Any number and combination of these mutations can accumulate in an input. In effect, any string can be considered a mutation of some parsable string, and hence we can expect the covering grammar to parse it.

Next, to make sure that any string is parsable, we need a nonterminal that is capable of parsing any string. As a first step, let us define a new terminal symbol that stands in for any terminal symbol. This stands for \(I\) in Aho's paper.

We should be able to parse any number of such symbols. So, we define a new nonterminal for that. This stands for \(H\) in Aho’s paper.
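The two definitions above can be sketched as follows; the symbol spellings {$.} and <$.+> are taken from the correction rules shown below, but the variable names are assumptions:

```python
# The special terminal that matches any single input symbol
# (this is I in Aho's paper).
Any_one = '{$.}'

# The nonterminal that matches one or more arbitrary symbols
# (this is H in Aho's paper):  <$.+> -> <$.+> {$.} | {$.}
Any_plus = '<$.+>'
any_plus_rules = {Any_plus: [[Any_plus, Any_one], [Any_one]]}
```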

In a similar fashion, we also need a terminal symbol that will match any terminal symbol except a given one. Since this is specific to a terminal symbol, let us make it a method.
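A sketch of Any_not, written here as a plain function rather than a method for self-containment:

```python
def Any_not(t):
    # A terminal that matches any input symbol except t, spelled '{!t}'.
    return '{!%s}' % t
```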

Note that both Any_one and Any_not can be turned into nonterminal symbols with corresponding definitions if required, so that the resulting grammar is fully in context-free format. We do not do that here because it is easier this way.

How do we check for match between a terminal symbol and a given input symbol?
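One way to sketch the match check, given the special symbols above (the function name tmatch is an assumption):

```python
Any_one = '{$.}'

def tmatch(t, inp):
    # Match a (possibly special) terminal t against one input symbol inp.
    if t == Any_one:                          # matches any symbol
        return True
    if t.startswith('{!') and t.endswith('}'):
        return inp != t[2:-1]                 # anything but the excluded symbol
    return t == inp                           # ordinary terminal: exact match
```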

Checking it

We need to transform our grammar. Essentially, the idea is that for each terminal symbol in the grammar, we add a nonterminal symbol that handles the following possibilities:

  • The terminal is matched exactly as provided with a symbol in input
  • The symbol that should match the terminal is replaced by some other symbol in the input
  • Some junk symbols are present before the symbol that matches the given terminal
  • The expected symbol was deleted from input string

That is, given a terminal symbol a, we add the following error productions, where <$ a> is the corresponding nonterminal.

  • <$ a> -> a
  • <$ a> -> <$.+> a
  • <$ a> -> \(\epsilon\)
  • <$ a> -> {!a}
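The four productions above can be generated mechanically; a sketch, where the helper names are assumptions:

```python
Any_one = '{$.}'
Any_plus = '<$.+>'

def nt_for_terminal(t):
    # The covering nonterminal '<$ t>' that stands for terminal t.
    return '<$ %s>' % t

def error_rules(t):
    # The four alternatives: exact match, junk before the match,
    # deletion (the empty rule), and replacement by any other symbol.
    return [[t], [Any_plus, t], [], ['{!%s}' % t]]

# e.g. the productions added for terminal 'a':
# error_rules('a') -> [['a'], ['<$.+>', 'a'], [], ['{!a}']]
```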

For each such correction, we add one penalty. In essence, each use of the following general correction rules incurs one penalty. That is, each any-character {$.} is a correction, and the count of such any-characters is the penalty in these rules.

  • <$.+> -> <$.+> {$.}
  • <$.+> -> {$.}

These terminal corrections also incur one penalty each. As above, the penalty here is for not matching the particular expected character.

  • <$ a> -> \(\epsilon\)
  • <$ a> -> {!a}

Notice that we do not have to add a penalty for junk insertion, because that penalty is already applied by the general correction rules.

These are added to the grammar as follows.

We modify the grammar to add this new nonterminal.
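A sketch of the whole transformation under the same assumptions as before (the function name covering_grammar is made up):

```python
def is_nt(symbol):
    # Nonterminals are wrapped in angle brackets, e.g. '<expr>'.
    return len(symbol) > 2 and symbol[0] == '<' and symbol[-1] == '>'

Any_one, Any_plus = '{$.}', '<$.+>'

def covering_grammar(g):
    # Replace every terminal t in every rule by its covering nonterminal
    # '<$ t>', then add the four error productions for each such terminal.
    new_g = {Any_plus: [[Any_plus, Any_one], [Any_one]]}
    terminals = set()
    for key, rules in g.items():
        new_rules = []
        for rule in rules:
            new_rule = []
            for t in rule:
                if is_nt(t):
                    new_rule.append(t)
                else:
                    terminals.add(t)
                    new_rule.append('<$ %s>' % t)
            new_rules.append(new_rule)
        new_g[key] = new_rules
    for t in terminals:
        new_g['<$ %s>' % t] = [[t], [Any_plus, t], [], ['{!%s}' % t]]
    return new_g
```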

Finally, we need to modify the start symbol to allow junk symbols after the parse. This is handled by adding a new start symbol as below.

  • <$start> -> <start>
  • <$start> -> <start> <$.+>
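A sketch of this last step (the helper name is an assumption):

```python
def add_start(g, old_start='<start>'):
    # Wrap the old start symbol so that trailing junk symbols after a
    # complete parse are also accepted.
    g = dict(g)
    g['<$start>'] = [[old_start], [old_start, '<$.+>']]
    return g, '<$start>'
```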

Next, we extract the nullable keys used in Earley parsing.
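Nullable nonterminals can be computed with the usual fixpoint iteration; a sketch:

```python
def nullable_keys(g):
    # Fixpoint: a nonterminal is nullable if some rule consists solely
    # of nullable nonterminals (the empty rule is trivially so).
    nullable = set()
    changed = True
    while changed:
        changed = False
        for key, rules in g.items():
            if key in nullable:
                continue
            if any(all(t in nullable for t in rule) for rule in rules):
                nullable.add(key)
                changed = True
    return nullable
```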

The column definition is exactly the same as before, but with one crucial difference: when we compare states for uniqueness in add(), we give priority to states with a lower penalty. That is, if the state being added has a lower penalty than an existing equivalent state, it replaces the existing one.
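The add() difference can be sketched with a stripped-down State and Column; the real classes carry more bookkeeping, so the names and fields here are assumptions:

```python
class State:
    # Minimal sketch of an Earley state carrying a penalty.
    def __init__(self, name, expr, dot, s_col, penalty=0):
        self.name, self.expr, self.dot = name, expr, dot
        self.s_col, self.penalty = s_col, penalty
    def _key(self):
        return (self.name, tuple(self.expr), self.dot, self.s_col)

class Column:
    def __init__(self, index, letter):
        self.index, self.letter = index, letter
        self.states, self._unique = [], {}
    def add(self, state):
        key = state._key()
        if key in self._unique:
            existing = self._unique[key]
            # Crucial difference: keep the lower of the two penalties.
            if state.penalty < existing.penalty:
                existing.penalty = state.penalty
            return existing
        self._unique[key] = state
        self.states.append(state)
        return state
```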

No changes in the definition of state.

Next, the parser itself. We need to make sure that the grammar we use is the updated grammar.

The remaining code is almost exactly the same as the original Earley parser, except for complete(), where we have to transfer the penalties to the parent parse.
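The penalty transfer can be illustrated in isolation (names are assumptions; the real complete() also re-adds the advanced state to the current column):

```python
class State:
    # Minimal sketch of an Earley state carrying a penalty.
    def __init__(self, name, expr, dot, s_col, penalty=0):
        self.name, self.expr, self.dot = name, expr, dot
        self.s_col, self.penalty = s_col, penalty
    def advance(self):
        return State(self.name, self.expr, self.dot + 1,
                     self.s_col, self.penalty)

def advance_with_penalty(parent, completed):
    # When a completed child state lets a parent advance past it, the
    # child's accumulated penalty is transferred onto the advanced parent.
    new = parent.advance()
    new.penalty = parent.penalty + completed.penalty
    return new
```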

The extraction of parse trees.

Now, we can use a simple extractor to extract one of the error-correcting parses with the lowest penalty.
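The selection itself is just a minimum over penalties; a trivial sketch, assuming the recovered parses are available as (penalty, tree) pairs:

```python
def best_parse(parses):
    # parses: list of (penalty, tree) pairs; pick the lowest penalty.
    return min(parses, key=lambda p: p[0])
```

For example, `best_parse([(2, 'tree_a'), (0, 'tree_b')])` returns `(0, 'tree_b')`.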

  1. Alfred V. Aho and Thomas G. Peterson, A Minimum Distance Error-Correcting Parser for Context-Free Languages, SIAM Journal on Computing, 1972 https://doi.org/10.1137/0201022