Contents

  1. Regular expression to context-free grammar

Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

For fuzzing, we often need to generate inputs that have a particular pattern, and this pattern could be easier to specify using a regular expression than using a context free grammar. For example, the URL can be described as:

(https|http|ftp)://[a-zA-Z0-9.]+(:[0-9]+|)(/[a-zA-Z0-9-/]+|)

Can we use such regular expressions as producers? As before, we start with our prerequisites. We import the following modules

System Imports These are available from Pyodide, but you may wish to make sure that they are installed if you are attempting to run the program directly on the machine.
  1. sympy
Available Packages These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine.
  1. simplefuzzer-0.0.1-py2.py3-none-any.whl
  2. gatleastsinglefault-0.0.1-py2.py3-none-any.whl
  3. earleyparser-0.0.1-py2.py3-none-any.whl
  4. hdd-0.0.1-py2.py3-none-any.whl
  5. ddset-0.0.1-py2.py3-none-any.whl

The imported modules



Since we want to convert a regular expression to a generator, it is necessary to parse it first. The following describes the grammar of regular expressions. A regular expression is basically a set of alternatives separated by |

  <regex> ::= <cex>
            | <cex> `|` <regex>

Each alternative is an expression that is a sequence of more basic expressions

  <cex>   ::= <exp>
            | <exp> <cex>

The basic regular expression unit is a single character, standing for itself. One may also have a bracket expression [...] which matches the list of characters inside the brackets, or a single . which matches any character. However, one can also have more complex units such as a parenthesized regex (...), a basic expression followed by Kleene star * which stands for any number of matches including none, and a basic expression followed by + which stands for at least one match of the preceding basic expression.

  <exp>   ::=  <unitexp>
            |  <regexstar>
            |  <regexplus>
  <unitexp>::= <alpha>
            |  <bracket>
            |  <dot>
            |  <parenexp>


Let us see if we can parse a small regular expression.



While a regular expression parse tree is technically sufficient to produce a generator, there is a better solution. We know how to build very good fuzzers with grammars. Hence, it is better to convert the regular expression to a grammar first, and use one of our fuzzers.

Regular expression to context-free grammar



We first define our basic unit. The exp converter, which delegates to other converters

  <unitexp>::= <alpha>
            |  <bracket>
            |  <dot>
            |  <parenexp>


The most basic regular expression is the character itself. We convert it to a nonterminal that defines the single character. That is, a gets translated to

<X> ::= `a`


Using it



The next basic regular expression is the brackets, which matches any character inside the brackets [...]. We convert it to a nonterminal that defines the single character. The following is the regex grammar.

  <bracket> ::= `[` <singlechars> `]`
  <singlechars>::= <singlechar><singlechars>
                 | <singlechar>
  <singlechar> ::= <char>
                 | `\` <escbkt>
  <escbkt>     ::= `[`
                 | `]`
                 | `\`

Given the regex [abc], this regex is converted to the following grammar

<X> ::= `a` | `b` | `c`


Using it



Next, we define the <dot>. A dot matches any character. That is, terminal symbol.

  <dot>   ::=  a | b | ...


Using it



Next, we define the <exp>

  <exp>   ::=  <unitexp>
            |  <regexstar>
            |  <regexplus>


For <regexstar> the regular expression grammar is

   <regexstar> ::= <unitexp> `*`

Given a regular expression a*, this gets translated to

<X> ::= a <X>
      |

For <regexplus> the regular expression grammar is

   <regexplus> ::= <unitexp> `+`

Given a regular expression a+, this gets translated to

<X> ::= a <X>
      | a


Using it.



One basic operation of regular expressions is concatenation. It matches two patterns in sequence. We convert concatenation to a rule containing two corresponding nonterminals. The regular expression grammar is

  <cex>   ::= <exp>
            | <exp> <cex>

Given a regular expression ab, this gets converted into

<X> := a
<Y> := b
<Z> := <X> <Y>


Using it



Next, we define our top level converter.

  <regex> ::= <cex>
            | <cex> `|` <regex>

Given a regular expression a|b, this gets converted into

<X> := a
<Y> := b
<Z> := <X>
     | <Y>


Using it



Next, we define our A parenthesis is simply a grouping construct that groups all elements inside it as one.

<parenexp> ::= `(` <regex> `)`

Given a parenthesis expression (a), this gets translated to

<X> = a


Using it



At this point, we have our full grammar, and we can use it to generate inputs as below



The runnable code for this post is available here