A simple grammar miner
Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.
Note: This post is based on the string inclusion grammar miner in the fuzzingbook, but reduced to bare essentials.
The concept of grammar mining from programs is as follows: Given a program that uses recursive descent parsing on its input, the call graph along with the arguments passed corresponds to the parse tree of the input string.
Once we have the parse tree from a sufficient number of input strings (or an input string with sufficiently diverse features), one can extract the grammar directly.
In this post, I will explain how to recover the input grammar from
from the Python standard library.
First, we have to import our function, and dependencies:
We also need the fuzzer package.
Available PackagesThese are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.
Next, we define our input values.
The basic procedure of our grammar mining algorithm is to hook into a debugger, and inspect the environment variables as the execution traverses the call graph. Then, identify the variables that represent fragments of the input that was passed in.
Hence, we need to hook into the Python debugger. This is accomplished using the trace functionality.
First, we define
traceit() the function that actually gets called on each
trace event. For now, we are only interested in arguments to a function, and
hence, ignore all events other than function
call. This allows us to ignore
Next, we define the
trace_function() which hooks into the trace functionality,
and registers a given function to be called on each trace event.
We can now inspect the fragments produced
We will next use these fragments to construct a derivation tree of the given
input string. The idea is to start with the input string that corresponds to the
<START> symbol. Then take one fragment at a time, and replace the matching
substring by a reference to the fragment, and add the fragment variable name as
a non terminal symbol in the grammar.
to_tree() iterates through the fragments, and refines the defined tree.
Next, we define the
refine_tree() which takes one single fragment (a key value
pair), and recursively searches and update the tree.
We use the
to_tree() in the following fashion.
Once we have this tree, extracting the grammar is as simple as recursively
traversing the tree, and collecting the alternative expansions of the rules.
This is accomplished by the
It is used as follows:
This represents the input grammar of the function
The function can be used as follows:
Let us see if our simple fuzzer is able to work with this grammar. Using it to fuzz:
The runnable Python source for this notebook is available here.