Fuzzing With Regular Expressions

Regular expression to context-free grammar
unitexp
exp
cex
Artifacts

Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

For fuzzing, we often need to generate inputs that have a particular pattern, and this pattern could be easier to specify using a regular expression than using a context free grammar. For example, the URL can be described as:

(https|http|ftp)://[a-zA-Z0-9.]+(:[0-9]+|)(/[a-zA-Z0-9-/]+|)

Can we use such regular expressions as producers? As before, we start with our prerequisites. We import the following modules

System Imports

These are available from Pyodide, but you may wish to make sure that they are installed if you are attempting to run the program directly on the machine.

sympy

Available Packages

These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.

The imported modules

Since we want to convert a regular expression to a generator, it is necessary to parse it first. The following describes the grammar of regular expressions. A regular expression is basically a set of alternatives separated by |

  <regex> ::= <cex>
            | <cex> `|` <regex>

Each alternative is an expression that is a sequence of more basic expressions

  <cex>   ::= <exp>
            | <exp> <cex>

The basic regular expression unit is a single character, standing for itself. One may also have a bracket expression [...] which matches the list of characters inside the brackets, or a single . which matches any character. However, one can also have more complex units such as a parenthesized regex (...), a basic expression followed by Kleene star * which stands for any number of matches including none, and a basic expression followed by + which stands for at least one match of the preceding basic expression.

  <exp>   ::=  <unitexp>
            |  <regexstar>
            |  <regexplus>
  <unitexp>::= <alpha>
            |  <bracket>
            |  <dot>
            |  <parenexp>

import string

REGEX_GRAMMAR = {
    '<start>' : [
        ['<regex>']
    ],
    '<regex>' : [
        ['<cex>', '|', '<regex>'],
        ['<cex>', '|'],
        ['<cex>']
    ],
    '<cex>' : [
        ['<exp>', '<cex>'],
        ['<exp>']
    ],
    '<exp>': [
        ['<unitexp>'],
        ['<regexstar>'],
        ['<regexplus>'],
    ],
    '<unitexp>': [
        ['<alpha>'],
        ['<bracket>'],
        ['<dot>'],
        ['<parenexp>'],
    ],
    '<parenexp>': [
        ['(', '<regex>', ')'],
    ],
    '<regexstar>': [
        ['<unitexp>', '*'],
    ],
    '<regexplus>': [
        ['<unitexp>', '+'],
    ],
    '<bracket>' : [
        ['[','<singlechars>', ']'],
    ],
    '<singlechars>': [
        ['<singlechar>', '<singlechars>'],
        ['<singlechar>'],
    ],
    '<singlechar>': [
        ['<char>'],
        ['\\','<escbkt>'],
    ],
    '<escbkt>' : [['['], [']'], ['\\']],
    '<dot>': [
        ['.'],
    ],
    '<alpha>' : [[c] for c in string.printable if c not in '[]()*+.|'],
    '<char>' : [[c] for c in string.printable if c not in '[]\\'],
}
REGEX_START = '<start>'

Let us see if we can parse a small regular expression.

While a regular expression parse tree is technically sufficient to produce a generator, there is a better solution. We know how to build very good fuzzers with grammars. Hence, it is better to convert the regular expression to a grammar first, and use one of our fuzzers.

Regular expression to context-free grammar

First the terminal symbols

Next, the class.

unitexp

We first define our basic unit. The exp converter, which delegates to other converters

  <unitexp>::= <alpha>
            |  <bracket>
            |  <dot>
            |  <parenexp>

The most basic regular expression is the character itself. We convert it to a nonterminal that defines the single character. That is, a gets translated to

<X> ::= `a`

Using it

The next basic regular expression is the brackets, which matches any character inside the brackets [...]. We convert it to a nonterminal that defines the single character. The following is the regex grammar.

  <bracket> ::= `[` <singlechars> `]`
  <singlechars>::= <singlechar><singlechars>
                 | <singlechar>
  <singlechar> ::= <char>
                 | `\` <escbkt>
  <escbkt>     ::= `[`
                 | `]`
                 | `\`

Given the regex [abc], this regex is converted to the following grammar

<X> ::= `a` | `b` | `c`

Using it

Next, we define the <dot>. A dot matches any character. That is, terminal symbol.

  <dot>   ::=  a | b | ...

Using it

exp

Next, we define the <exp>

  <exp>   ::=  <unitexp>
            |  <regexstar>
            |  <regexplus>

For <regexstar> the regular expression grammar is

   <regexstar> ::= <unitexp> `*`

Given a regular expression a*, this gets translated to

<X> ::= a <X>
      |

For <regexplus> the regular expression grammar is

   <regexplus> ::= <unitexp> `+`

Given a regular expression a+, this gets translated to

<X> ::= a <X>
      | a

Using it.

cex

One basic operation of regular expressions is concatenation. It matches two patterns in sequence. We convert concatenation to a rule containing two corresponding nonterminals. The regular expression grammar is

  <cex>   ::= <exp>
            | <exp> <cex>

Given a regular expression ab, this gets converted into

<X> := a
<Y> := b
<Z> := <X> <Y>

Using it

Next, we define our top level converter.

  <regex> ::= <cex>
            | <cex> `|` <regex>

Given a regular expression a|b, this gets converted into

<X> := a
<Y> := b
<Z> := <X>
     | <Y>

Using it

Next, we define our A parenthesis is simply a grouping construct that groups all elements inside it as one.

<parenexp> ::= `(` <regex> `)`

Given a parenthesis expression (a), this gets translated to

<X> = a

Using it

At this point, we have our full grammar, and we can use it to generate inputs as below

The runnable code for this post is available here

Artifacts

The runnable Python source for this notebook is available here.

The installable python wheel rxfuzzer is available here.

Contents

Regular expression to context-free grammar

unitexp

exp

cex

Artifacts