A deterministic regular grammar is a DFA (Deterministic Finite Automaton) represented as a regular grammar. Such a grammar can have rules of the following format:

  • \[A \rightarrow a B\]
  • \[S \rightarrow E\]
  • \[E \rightarrow \epsilon\]

where given a nonterminal \(A\) and a terminal symbol \(a\) at most one of its rules will start with a terminal symbol \(a\). Secondly, the \(\epsilon\) is attached to the \(E\) symbol, which is the only termination point. If the language contains \(\epsilon\), then the single degenerate rule, \(S \rightarrow E\) is added to the rules. As you can see, each node has at most one transition for a given terminal symbol. Hence, this canonical form is equivalent to a deterministic finite state automation (DFA).

A deterministic regular grammar (also a DFA) is useful in many applications. However, in most applications where a such a grammar can be used (i.e., parsing and fuzzing) the performance of algorithms can be improved much further by minimizing the grammar such that it has the smallest size possible for the language it represents. This post tackles how to minimize a DFA using the classical algorithm 1. Note, it is referred to as Hopcroft’s erroneously in wikipedia 2, however Hopcroft has a different algorithm based on partitioning.

We start with importing the prerequisites

Contents

  1. Minimization of the Regular Grammar
  2. Artifacts

Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

Available Packages These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.
  1. simplefuzzer-0.0.1-py2.py3-none-any.whl from "The simplest grammar fuzzer in the world".
  2. gatleastsinglefault-0.0.1-py2.py3-none-any.whl from "Specializing Context-Free Grammars for Inducing Faults".
  3. hdd-0.0.1-py2.py3-none-any.whl from "Hierarchical Delta Debugging".
  4. rxfuzzer-0.0.1-py2.py3-none-any.whl from "Fuzzing With Regular Expressions".
  5. ddset-0.0.1-py2.py3-none-any.whl from "Simple DDSet".
  6. earleyparser-0.0.1-py2.py3-none-any.whl from "Earley Parser".

The imported modules



Minimization of the Regular Grammar

We assume that we have a DFA that is represented as a grammar where the states in the DFA are nonterminal symbols in the grammar and terminals are the transitions. However, this DFA is not necessarily minimal. That is, thre could be duplicate nonterminal symbols that could behave exactly alike. That is, if we take two such states as start symbols of the grammar, then the strings accepted by such an grammars would be exactly the same. So, the idea behind minimizing a regular grammar is to identify nonterminals that are duplicates of each other

Interestingly, Brzozowski 3 observed that if you reverse the arrows in the DFA, resulting in an NFA, and then convert the NFA to DFA, then do this again, the resulting DFA is minimal. However, this can have exponential worst case complexity (but can be much faster in common patterns). Hence, we look at a more common algorithm for minimization.

The idea is as follows:

  1. Mark the start state and all final states
  2. Collect all pairs of nonterminals from the grammar
  3. while there are pairs to be marked a. for each nonmarked pairs p,q do i. for each symbol a in alphabet do * if the pair delta(p, a) and delta(q, a) is marked, mark p,q
  4. Construct the reduced grammar by merging unmarked groups of pairs. We start with a matrix nonterminals, and iteratively refine them. We start by initializing our data structure.


Next, we want to update the markings. To improve the performance, we define a datastruture to keep the transitions from states.



We are now ready to update our markings.



Using it.



At this point, all that remains to be done is to merge the nonterminals in the pairs which are indistinguished. That is, the value is None.



Using it.



The runnable code for this post is available here.

Artifacts

The runnable Python source for this notebook is available here.

  1. Yingjie Xu “Describing an n log n algorithm for minimizing states in deterministic finite automaton” 2008 

  2. John Hopcroft “An n log n algorithm for minimizing states in a finite automaton” 1971 

  3. Janusz Brzozowski “Canonical regular expressions and minimal state graphs for definite events” 1963