Search Based Fuzzing -- Computing Approach Level
Fuzzing is one of the more easy to use and efficient means of testing software systems. The idea is that we produce random inputs that are then executed by the system. If the system does something unexpected (such as crashing) then we know that the path taken by the execution was not considered by the programmer, and that such a path may be exploited to make the program do something that was unintended by the developer.
However, simply throwing random inputs at the program does not work well beyond a point. The issue is that the vast majority of such inputs will be rejected by common validations present in the program, and hence will not penetrate deep into the program. If there are say ten validations in the input processing section of the program, and there is 0.5 probability at each branch point for failing the validation, only one out of thousand inputs will successfully traverse the input processing stage. This ratio can worsen quickly when loops and recursion is present, leaving a majority of statements in the program uncovered.
One possible solution to enable better coverage of the program is to guide new inputs toward statements branches and paths that were not taken previously. This is the basic intuition behind search based testing. The field of search based software testing starts from Korel 1 in 1990.
Directed fuzzing leverages intuitions behind search based testing for generating inputs. The idea is that given a set of inputs, and a set of uncovered program elements we compute how far away each input execution is from the uncovered program elements. Then, if we want to new inputs to cover these uncovered elements, we choose those inputs that produced executions that are closest to the targeted element, mutate them, generating new inputs, and execute these inputs, and choose those that produce executions that are closest to the required program element. The intuition here is that mutating those inputs that produced an execution closest to the target program element has a better chance of producing inputs that are even more closer than mutating inputs that produced executions that were more farther.
In this post, I will be covering approach level (also called approximation level) a metric that can be used for computing the execution distance of inputs. Approach level was first proposed by Wegener et al.2
As before, we start by importing the prerequisites
Contents
- Visualization prerequisites
- Extract control-flow-graph in JSON
- Approach Level
- Approach Level of given path
- Artifacts
Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.
Available Packages
These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.- simplefuzzer-0.0.1-py2.py3-none-any.whl from "The simplest grammar fuzzer in the world".
- earleyparser-0.0.1-py2.py3-none-any.whl from "Earley Parser".
- pycfg-0.0.1-py2.py3-none-any.whl from "The Python Control Flow Graph".
- pydot-1.4.1-py2.py3-none-any.whl
- metacircularinterpreter-0.0.1-py2.py3-none-any.whl from "Python Meta Circular Interpreter".
Visualization prerequisites
If you are unfamiliar with control-flow, please see the post on control flow to understand what control flow is about.
From the same post, we need a few visualization functions, which we redefine, as they are slightly different.
Use CLIGraphics if you are running from the command line
Change WebGraphics to CLIGraphics here if you want to run from the command line
More helper functions for visualization from the control flow post.
Finally the to_graph stitches all these together.
Let us consider the triangle function. This is a simple function that given three sides of a triangle, tells you what kind of a triangle it is.
Another example, which is the GCD of two numbers.
Extract control-flow-graph in JSON
Next, we need a helper function to extract the control flow graph. Note, function definitions do not have a parent.
Let us consider our triangle.
The GCD
Now, we are ready for the main content.
Approach Level
How do you compute the distance between an execution and a program element? The approach level says that given an execution path, the distance is given by how many critical branches there are between the program element and the execution. A critical branch is a branch that, once taken, removes the target from reachable nodes. More intuitively, it is the number of potential problem nodes that lay on the shortest path from the closest node that diverted control flow away from the target goal node. Stated otherwise, approach level gives the minimum number of control dependencies between the goal and the execution path of the current input.
A is control dependent on B if and only if the following are true:
- B has two successors
- B dominates A
- B is not post-dominated by A
- There is a successor of B that is post dominated by A.
So, let us start defining these. The simplest is the successors
Successors
Using it.
Same with triangle.
Next, we define the dominator.
Dominator
From the wiki, a node $A$ on a control-flow graph dominates a node \(B\) if every path from the entry node to \(B\) must go through \(A\). It is typically written as \(A >> B\). A post dominates B if every path from \(B\) must go through \(A\).
Let us see how to compute the dominator set.
Using it for GCD
For Triangle
Control Dependent
We can now define control dependence directly.
Using it. Note how node 8 is control dependent on 6 from the main figure.
For triangle Note how node 22 is control dependent on 6 from the main figure.
Approach Level of given path
Next, we compute the approach level of a given path. The idea is that we may have reached the head of the path, and we have a potential path from the head node to the tail node. We now want to find how far away the head is from the tail. We first define a small class to hold the path.
Next, the approach level itself. We need to go from the tail to the head. Remember that our target is the last node. That is, if we reverse the path, then the first node (hd).
Using it for triangle
Using it for GCD
So, how do you use this for fuzzing? Given a program, and an uncovered program element such as a node (the target), we find all nodesin the program from which we can reach the uncovered node. Next, we compute the shortest path from such nodes to the target node. Then we compute the approach level to the target for each of these nodes. Given an input, we compute all the nodes that it covers, then identify all the nodes from which it can reach the target node, and finally, we find the node with the minimum of the approach levels that we computed previously. This would be the score of the corresponding input. We choose all inputs (based on our population size) that have the least approach level, and use that for evolution.
Artifacts
The runnable Python source for this notebook is available here.