A deterministic regular grammar is a DFA (Deterministic Finite Automaton) represented as a regular grammar. Such a grammar can have rules of the following format:
- \[A \rightarrow a B\]
- \[S \rightarrow E\]
- \[E \rightarrow \epsilon\]
where given a nonterminal \(A\) and a terminal symbol \(a\) at most one of its rules will start with a terminal symbol \(a\). Secondly, the \(\epsilon\) is attached to the \(E\) symbol, which is the only termination point. If the language contains \(\epsilon\), then the single degenerate rule, \(S \rightarrow E\) is added to the rules. As you can see, each node has at most one transition for a given terminal symbol. Hence, this canonical form is equivalent to a deterministic finite state automation (DFA).
A deterministic regular grammar (also a DFA) is useful in many applications. However, in most applications where a such a grammar can be used (i.e., parsing and fuzzing) the performance of algorithms can be improved much further by minimizing the grammar such that it has the smallest size possible for the language it represents. This post tackles how to minimize a DFA using the classical algorithm 1. Note, it is referred to as Hopcroft’s erroneously in wikipedia 2, however Hopcroft has a different algorithm based on partitioning.
We start with importing the prerequisites
Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.
Available PackagesThese are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.
- simplefuzzer-0.0.1-py2.py3-none-any.whl from "The simplest grammar fuzzer in the world".
- gatleastsinglefault-0.0.1-py2.py3-none-any.whl from "Specializing Context-Free Grammars for Inducing Faults".
- hdd-0.0.1-py2.py3-none-any.whl from "Hierarchical Delta Debugging".
- rxfuzzer-0.0.1-py2.py3-none-any.whl from "Fuzzing With Regular Expressions".
- ddset-0.0.1-py2.py3-none-any.whl from "Simple DDSet".
- earleyparser-0.0.1-py2.py3-none-any.whl from "Earley Parser".
The imported modules
Minimization of the Regular Grammar
We assume that we have a DFA that is represented as a grammar where the states in the DFA are nonterminal symbols in the grammar and terminals are the transitions. However, this DFA is not necessarily minimal. That is, thre could be duplicate nonterminal symbols that could behave exactly alike. That is, if we take two such states as start symbols of the grammar, then the strings accepted by such an grammars would be exactly the same. So, the idea behind minimizing a regular grammar is to identify nonterminals that are duplicates of each other
Interestingly, Brzozowski 3 observed that if you reverse the arrows in the DFA, resulting in an NFA, and then convert the NFA to DFA, then do this again, the resulting DFA is minimal. However, this can have exponential worst case complexity (but can be much faster in common patterns). Hence, we look at a more common algorithm for minimization.
The idea is as follows:
- Mark the start state and all final states
- Collect all pairs of nonterminals from the grammar
- while there are pairs to be marked a. for each nonmarked pairs p,q do i. for each symbol a in alphabet do * if the pair delta(p, a) and delta(q, a) is marked, mark p,q
- Construct the reduced grammar by merging unmarked groups of pairs. We start with a matrix nonterminals, and iteratively refine them. We start by initializing our data structure.
Next, we want to update the markings. To improve the performance, we define a datastruture to keep the transitions from states.
We are now ready to update our markings.
At this point, all that remains to be done is to merge the nonterminals in the
pairs which are indistinguished. That is, the value is
The runnable code for this post is available here.
The runnable Python source for this notebook is available here.