Regular Expression to Regular Grammar

Union of Regular Grammars
Concatenation of Regular Grammars
Kleene Plus of Regular Grammars
Kleene Star of Regular Grammars
Artifacts

Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

In the previous post, we discussed how to produce a grammar out of regular expressions. This is useful to make the regular expression a generator of matching inputs. However, one detail is unsatisfying. The grammar produced is a context-free grammar. Regular expressions actually correspond to regular grammars (which are strictly less powerful than context-free grammars). For reference, a context-free grammar is a grammar where the rules are of the form \(A \rightarrow \alpha\) where \(A\) is a single nonterminal symbol and \(\alpha\) is any sequence of terminal or nonterminal symbols including \(\epsilon\) (empty). A regular grammar on the other hand, is a grammar where the rules can take one of the following forms:

\[A \rightarrow a\]
\[A \rightarrow a B\]
\[A \rightarrow \epsilon\]

where \(A\) and \(B\) are nonterminal symbols, \(a\) is a terminal symbol, and \(\epsilon\) is the empty string. So, why is producing a context-free grammar instead of regular grammar unsatisfying? Because such regular grammars have more interesting properties such as being closed under intersection and complement. By using a context-free grammar, we miss out on such properties. Hence, it would be really good if we could translate the regular expression directly into a regular grammar. This is what we will do in this post. We start with importing the prerequisites

System Imports

These are available from Pyodide, but you may wish to make sure that they are installed if you are attempting to run the program directly on the machine.

sympy

Available Packages

These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.

The imported modules

We want to produce regular grammars directly from regular expressions.

Union of Regular Grammars

Given two regular grammars such that their nonterminals do not overlap, we need to produce a union grammar. The idea is that you only need to modify the start symbol such that the definition of the new start symbol is a combination of starts from both grammars.

Using it

Concatenation of Regular Grammars

Next, given two regular grammars \(G1\) and \(G2\) such that their nonterminals do not overlap, producing a concatenation grammar is as follows: We collect all terminating rules from \(G1\) which looks like \(A \rightarrow a\) where \(a\) is a terminal symbol. We then transform them to \(A \rightarrow a S2\) where \(S2\) is the start symbol of \(G2\). If \(\epsilon\) was present in one of the rules of \(G1\), then we simply produce \(A \rightarrow S2\).

We start with catenation of nonterminals.

Next, we define what happens when we catenate a nontrminal to a rule. It returns any new keys created, along with the new rule

Finally, we define our regular catenation of two grammars.

Using it

Kleene Plus of Regular Grammars

Given a nonterminal symbol and the grammar in which it is defined, the Kleene plus is simply a regular concatenation of the nontrerminal with itself (recursive), with a regular union of its nonterminal’s rules. The small difference here from regular concatenation is that, when we concatenate the nonterminal with itself, we do not need to check for disjointness of nonterminals, because the definitions of other nonterminals are exactly the same. Further, \(G2\) is never used in the algorithm except in the final grammar.

Using it

Kleene Star of Regular Grammars

For Kleene Star, add \(\epsilon\) to the language of Kleene Plus.

Using it

At this point, we have all operations necessary to convert a regular expression to a regular grammar directly. We first define the class

 <cex>   ::= <exp>
           | <exp> <cex> 

  <regex> ::= <cex>
            | <cex> `|` <regex>

   <regexplus> ::= <unitexp> `+`

Using it

At this point, the grammar may still contain degenerate rules of the form \(A \rightarrow B\). We need to clean that up so that rules follow one of \(A \rightarrow a B\) or \(A \rightarrow a\) or \(A \rightarrow \epsilon\).

class RegexToRGrammar(RegexToRGrammar):
    def is_degenerate_rule(self, rule):
        return len(rule) == 1 and fuzzer.is_nonterminal(rule[0])

def cleanup_regular_grammar(self, g, s):
        # assumes no cycle
        cont = True
        while cont:
            cont = False
            new_g = {}
            for k in g:
                new_rules = []
                new_g[k] = new_rules
                for r in g[k]:
                    if self.is_degenerate_rule(r):
                        new_r = g[r[0]]
                        if self.is_degenerate_rule(new_r): cont = True
                        new_rules.extend(new_r)
                    else:
                        new_rules.append(r)
            return new_g, s

def to_grammar(self, my_re):
        parsed = self.parse(my_re)
        key, children = parsed
        assert key == '<start>'
        assert len(children) == 1
        grammar, start = self.convert_regex(children[0])
        return self.cleanup_regular_grammar(grammar, start)

Using it

The runnable code for this post is available here

Artifacts

The runnable Python source for this notebook is available here.

The installable python wheel rxregular is available here.

Contents

Union of Regular Grammars

Concatenation of Regular Grammars

Kleene Plus of Regular Grammars

Kleene Star of Regular Grammars

Artifacts