A simple grammar miner

Tracing
Parse Tree
Grammar Extraction
Fuzzing
Artifacts

Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

Note: This post is based on the string inclusion grammar miner in the fuzzingbook, but reduced to bare essentials.

The concept of grammar mining from programs is as follows: Given a program that uses recursive descent parsing on its input, the call graph along with the arguments passed corresponds to the parse tree of the input string.

Once we have the parse tree from a sufficient number of input strings (or an input string with sufficiently diverse features), one can extract the grammar directly.

In this post, I will explain how to recover the input grammar from urlparse from the Python standard library.

First, we have to import our function, and dependencies:

We also need the fuzzer package.

Available Packages

These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.

simplefuzzer-0.0.1-py2.py3-none-any.whl from "The simplest grammar fuzzer in the world".

Next, we define our input values.

The basic procedure of our grammar mining algorithm is to hook into a debugger, and inspect the environment variables as the execution traverses the call graph. Then, identify the variables that represent fragments of the input that was passed in.

Hence, we need to hook into the Python debugger. This is accomplished using the trace functionality.

Tracing

First, we define traceit() the function that actually gets called on each trace event. For now, we are only interested in arguments to a function, and hence, ignore all events other than function call. This allows us to ignore reassignments.

Next, we define the trace_function() which hooks into the trace functionality, and registers a given function to be called on each trace event.

We can now inspect the fragments produced

Parse Tree

We will next use these fragments to construct a derivation tree of the given input string. The idea is to start with the input string that corresponds to the <START> symbol. Then take one fragment at a time, and replace the matching substring by a reference to the fragment, and add the fragment variable name as a non terminal symbol in the grammar.

The to_tree() iterates through the fragments, and refines the defined tree.

Next, we define the refine_tree() which takes one single fragment (a key value pair), and recursively searches and update the tree.

We use the to_tree() in the following fashion.

Grammar Extraction

Once we have this tree, extracting the grammar is as simple as recursively traversing the tree, and collecting the alternative expansions of the rules. This is accomplished by the to_grammar() function.

It is used as follows:

This represents the input grammar of the function urlparse(). All together:

The function can be used as follows:

Fuzzing

Let us see if our simple fuzzer is able to work with this grammar. Using it to fuzz:

Artifacts

The runnable Python source for this notebook is available here.

Contents

Tracing

Parse Tree

Grammar Extraction

Fuzzing

Artifacts