A regular grammar can in theory have rules with any of the following forms

• $A \rightarrow a$
• $A \rightarrow a B$
• $A \rightarrow a b c B$
• $A \rightarrow B$
• $A \rightarrow \epsilon$

with no further restrictions. This is the direct equivalent grammar for a nondeterministic finite state automaton (NFA). However, for working with regular grammars, such freedom can be unwieldy. Hence, without loss of generality, we define a canonical format for regular grammars, to which any regular grammar can be converted to.

• $A \rightarrow a B$
• $S \rightarrow E$
• $E \rightarrow \epsilon$

where given a nonterminal $$A$$ and a terminal symbol $$a$$ at most one of its rules will start with a terminal symbol $$a$$. That is, if the original grammar had multiple rules that started with $$a$$, they will be collected into a new nonterminal symbol. Further, there will be at most one terminal symbol in a rule. That is, if there are more terminal symbols, then we bundle that to a new nonterminal symbol. Finally, the $$\epsilon$$ is attached to the $$E$$ symbol, which is the only termination point. If the language contains $$\epsilon$$, then the single degenerate rule, $$S \rightarrow E$$ is added to the rules. As you can see, each node has at most one transition for a given terminal symbol. Hence, this canonical form is equivalent to a deterministic finite state automation (DFA). We start with importing the prerequisites

## Contents

Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

System Imports These are available from Pyodide, but you may wish to make sure that they are installed if you are attempting to run the program directly on the machine.
1. sympy
Available Packages These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine.

The imported modules

We start with a few common rules. First, we define the empty rule.

<_>  :=


We also define our TERMINAL_SYMBOLS

Then, use it to define NT_ANY_STAR

<.*>  := . <.*>
| <NT_EMPTY>


Use it to also define NT_ANY_PLUS

<.+>  := . <.+>
| . <NT_EMPTY>


use any_plus where possible.

## Remove degenerate rules

A degenerate rule is a rule with a format $$A \rightarrow B$$ where $$A$$ and $$B$$ are nonterminals in the grammar. The way to eliminate such nonterminals is to recursively merge the rules of $$B$$ to the rules of $$A$$.

Using it

## Removing terminal sequences

A terminal sequence is a sequence of terminal symbols in a rule. For example, in the rule

<A> ::= a b c <B>


a b c is a terminal sequence. We want to replace such sequences by a new nonterminal. For example,

<A> ::= a <Aa>
<Aa> ::= b <Aab>
<Aab> ::= c <B>


Using it

## Fix empty rules

If there are any rules of the form $$A \rightarrow b$$, we replace it by $$A \rightarrow b E$$, $$E \rightarrow \epsilon$$. Next, if we have rules of the form $$A \rightarrow \epsilon$$, and $$B \rightarrow a A$$ then, we remove the $$A \rightarrow \epsilon$$ and add a new rule $$B \rightarrow a E$$ The reason for doing this is to make sure that we have a single termination point. If you are using NT_ANY_STAR, then make sure you run this after (or use NT_ANY_PLUS instead).

## Collapse similar starting rules

Here, the idea is to join any set of rules of the form $$A \rightarrow b B$$, $$A \rightarrow b C$$ to $$A \rightarrow b \,or(B,C)$$. First, we define how to join rules that all have the same terminal symbol as the starting token.

Using it.

Next, we split any given definition into rulesets that start wit the same terminal symbol.

Using it.

Given a list of keys, construct their or(.) from the given grammar.

Using it.

defining the rule collapse.

Using it.

## Display canonical grammar

We also need the ability to compactly display a canonical regular grammar and we define it as below.

Make sure it works

Now, all together.

Using it.

## A canonical regular grammar from a regular expression.

This function extracts the DFA equivalent grammar for the regular expression given.

Using it.

The runnable code for this post is available here.