Contents

  1. Binary Normal Form
  2. Triplet rules
  3. Remove invalid terminal transitions
  4. Remove undefined nonterminals
  5. Construct the full grammar

Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

We previously saw how to produce a grammar that is an intersection of two regular grammars. One of the interesting things about regular grammars is that, you can produce an intersection between a regular grammar and a context-free grammar, and the result will be context-free. The traditional technique for intersecting between a CFL and an RL is to produce the PDA and the DFA equivalents of both, and produce a product PDA. However, there is an even better way called the Bar-Hiller construction1 that lets you compute an intersection between a CFG and RG directly.

The essential idea is to first recognize that a right linear grammar that encodes a regular language can be made to have a final nonterminal to expand. The below are patterns in right linear grammars.

<s> := a <a>
     | b <b>
<a> := c
     | {empty}
<b> := b

Given such a grammar, one can convert it as below so that the last nonterminal to be expanded is <f>

<s> := a <a>
     | b <b>
<a> := c <f>
     | <f>
<b> := b <f>
<f> := {empty}

Why do we do this? because this lets us represent the grammar as a non-deterministic finite automation with a single start state (<S>) and a single end state (<F>). This is required for the Bar-Hiller construction.

Next, we also convert the context-free grammar to Chomsky-Normal-Form (actually we do not need as much restrictions, as we will see later). The CNF looks like below.

<S> := <A><B>
<A> := a
     | {empty} 
<B> := b

The essential idea is to take all nonterminal symbols from the regular grammar, and all nonterminal symbols from the context-free grammar, and produce triplets which starts with a nonterminal from RG, and ends with another nonterminal from the RG, and has a nonterminal from CFG in the middle. E.g. <a,A,b>. The intersection grammar is represented by the start symbol <s,S,f> where <s> is the start symbol of the regular grammar, and <f> is the final symbol as we discussed above. <S> is the start symbol of the context-free grammar. The essential idea is that if we want to produce <s,S,f> then it can only be produced if the rules can be written such that they start with <s> and end with <f>. That is, the definition of <s,S,f> is as follows:

<s,S,f> := <s,A,x><x,B,f>

where <x> is a nonterminal symbol in the regular grammar such that it is reachable from <s>, and <f> is reachable from <x>. Further, it means that if we go from <s> to <x> by consuming a string, then that string must also be parsable by <A>. In our example, this could be one rule.

<s,S,f> := <s,A,a><a,B,f>

If one of the tokens in the context-free rule is a terminal symbol, then we get an opportunity to immediately verify our construction.

<s,A,a> := [<s>,a,<a>]

As you can see, <A> has one of the rules that contain a single terminal symbol โ€“ a. So, we can immediately see that the requirement <s,A,a> was satisfied. That is, <s> goes to <a> by consuming a, and this is witnessed by [<s>,a,<a>]. So, we will keep this rule in the intersection grammar as

<s,A,a> := a

What about the second rule?

<s,A,a> := [<s>,{empty},<a>]

This however, does not work because there is no epsilon transition from <s> to <a>. Hence, this rule is skipped in the resulting grammar. Let us see how to implement this technique.

We start by importing the prerequisites.

System Imports These are available from Pyodide, but you may wish to make sure that they are installed if you are attempting to run the program directly on the machine.
  1. sympy
Available Packages These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine.
  1. simplefuzzer-0.0.1-py2.py3-none-any.whl
  2. gatleastsinglefault-0.0.1-py2.py3-none-any.whl
  3. gfaultexpressions-0.0.1-py2.py3-none-any.whl
  4. gmultiplefaults-0.0.1-py2.py3-none-any.whl
  5. earleyparser-0.0.1-py2.py3-none-any.whl
  6. hdd-0.0.1-py2.py3-none-any.whl
  7. ddset-0.0.1-py2.py3-none-any.whl
  8. rxfuzzer-0.0.1-py2.py3-none-any.whl
  9. rxregular-0.0.1-py2.py3-none-any.whl
  10. rxcanonical-0.0.1-py2.py3-none-any.whl

The imported modules



Binary Normal Form

Next, we want to limit the number of forms of production rules that we want to handle. Hence, we convert the context-free grammar to a normal form such that it conforms exclusively to the following pattern.

  1. \[A \rightarrow a\]
  2. \[A \rightarrow B\]
  3. \[A \rightarrow aB\]
  4. \[A \rightarrow Bb\]
  5. \[A \rightarrow BC\]
  6. \[S \rightarrow \epsilon\]

That is, each production rule has at most two tokens. The new normal form is provided by binary_normal_form(). Note that this is not exactly the Chomsky Normal Form. We do not need the full restrictions of CNF. However, our algorithms will also work if a grammar in CNF form is given.



Let us define a simple expression grammar to use as an example.



We verify that our BNF converter works.



Here is another, the grammar for JSON



We again verify that our BNF converter works.



Triplet rules

As we discussed previously, we transform the grammar such that we produce every variations of triplets with nonterminals from regular grammar starting and ending, and nonterminal from context-free grammar in the middle. That is, given this,

<S> := <A><B>

and given <a>, <b>, <c>, <f> as the regular nonterminals, each reachable from previous, we produce

<a,S,a> := <a,A,a><a,B,a>
<a,S,b> := <a,A,a><a,B,b>
         | <a,A,b><b,B,b>
<a,S,c> := <a,A,a><a,B,c>
         | <a,A,a><c,B,c>
         | <a,A,c><c,B,c>
         | <a,A,b><b,B,c>
         | <a,A,b><c,B,c>

and so on.



We modify our grammar display so that it knows about our triples.



Remove invalid terminal transitions

Next, we remove invalid terminal transitions. That is, given

<s,A,a> := [<s>,a,<a>]

We check that <s> can reach <a> by consuming a. If not, this is an invalid rule, and we remove it from the production rules.



Remove undefined nonterminals

At this point we may have multiple nonterminals with no rules defining them. That is, such nonterminals canโ€™t be expanded. Hence, these can be removed.



Construct the full grammar

We are now ready to construct our full grammar.



Let us see if our construction works.



The runnable code for this post is available here

  1. Bar-Hiller, M. Perles, and E. Shamir. On formal properties of simple phrase structure grammars. Zeitschrift fur Phonetik Sprachwissenschaft und Kommunikationforshung, 14(2):143โ€“172, 1961.ย