Search Based Fuzzing -- Computing Branch Distance

Search Based Fuzzing involves generating various candidate inputs for a given program, identifying the inputs with the best score in some metric of effectiveness and choosing them for the next iteration so that one can iteratively improve the fitness of the given population of inputs. In the previous post I discussed how you can score inputs using approach level.

Approach level (or approximation level) is reasonable to compute the distance we have to traverse to reach a given node. However, in fuzzing, we also find that just reaching a node is insufficient. In several cases, the branch taken by an input execution determines if we can make progress. That is given several inputs, each of which reach a given decision node, we need to prioritize those inputs that are closest to switching to an uncovered branch. To do this, we use what is called the Branch Distance. This was first proposed by Bogdan Korel in 1990 [^korel1990]. In this post, I will discuss how to compute branch distance for computing the fitness score of an input in flipping a branch condition.

As before, we start by importing the prerequisites

DistributeNot
Artifacts

Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

Available Packages

These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.

Given a set of variables (corresponding to an input) and a conditional, the branch distance measures the distance to the critical branch being true or false, where the critical branch is the branch where control flow diverged from reaching the target. That is, if an input diverged at a critical branch, then the distance to the conditional being true or false. It is given by the following translation:

Element	Value
Boolean	if True then 0 else K
a = b	if abs(a-b) = 0 then 0 else abs(a-b) + K
a != b	if abs(a-b) != 0 then 0 else K
a < b	if a-b < 0 then 0 else (a-b) + K
a <= b	if a-b <= 0 then 0 else (a-b) + K
a > b	if b-a > 0 then 0 else (b-a) + K
a > b	if b-a >= 0 then 0 else (b-a) + K
a or b	min(bdistance(a), bdistance(b))
a and b	bdistance(a) + bdistance(b)
not a	Negation is moved inward and propagated over a

K is a penalty constant which lets you fine tune the penalty of being wrong. Typically K=1

Once you have the branch distance, you need to normalize it to make it less than 1. A common formula used is

\[normalize(branchD) = 1 − 1.001^{(−branchD)}\]

Another [^arcuri2011] is

\[\frac{branchD}{branchD + \beta}\]

where \(branchD\) is the value to be normalized and \(\beta\) is a constant value you choose.

Finally, \(fitness = approach level + normalized branch distance\)

Let us consider a few examples.

say we have the following inputs

As per the above formula, the bdistance is

Can we move it forward? Let us consider a few neighbours.

That is, as per this computation, 0, 0 is closer to flipping the branch. let us explore the neighbours again

again

at this point, we have a zero

Can we automate this approach? Interestingly, this is quite easy. We can reuse the approach in metacircular interpreter and change the semantics to conform to our requirement.

Here is a quick check to show that the meta-circular interpreter works expected.

Next, let us redefine the interesting bits according to the table we provided. Provided below again for easy reference.

Element	Value
Boolean	if True then 0 else K
a = b	if abs(a-b) = 0 then 0 else abs(a-b) + K
a != b	if abs(a-b) != 0 then 0 else K
a < b	if a-b < 0 then 0 else (a-b) + K
a <= b	if a-b <= 0 then 0 else (a-b) + K
a > b	if b-a > 0 then 0 else (b-a) + K
a > b	if b-a >= 0 then 0 else (b-a) + K
a or b	min(bdistance(a), bdistance(b))
a and b	bdistance(a) + bdistance(b)
not a	Negation is moved inward and propagated over a

For ease of discourse, let us consider the last one first. The idea is that if one encounters a not unary, then it should be moved inward to the outermost comparison, which gets flipped. Any and or or that is encountered gets switched.

The same also gets applied when we want to take the false branch of a conditional. So, let us create a new class, that given an expression e, transforms it as equivalent to not e, but without the not in the expression, and normalizes it. So, we need two classes that correspond to both distributing any internal not and negating a given expression.

DistributeNot

This class normalizes any Not by distributing it inside. First the infrastructure.

When we find module, and expr there is no change, because they are just wrapper classes

We need two classes, the DistributeNot which is responsible for non-negated and NegateDistributeNot which is responsible for carrying a negated expression.

Simple things like names and constants should get translated directly by the DistributeNot, but should be negated by NegateDistributeNot.

Check that it works.

What should happen for not a? It should get pushed into a if possible. That is, DistributeNot should then switch to NegateDistributeNot. However, if we are starting with NegateDistributeNot, then it is already carrying a negation, so it should switch to DistributeNot.

Check that it works

What should happen for a and b? It should get turned into not (a and b) which is (not a) or (not b), but only on NegateDistributeNot. For DistributeNot, there is no change.

Check that it works

The on_compare method is simply itself in DistributeNot because we do not expect a not inside the compare. The NegateDistributeNot switches to its anti operation. We also do not have to walk inside the comparators because we do not expect either boolean operators or other comparators inside comparators.

class DistributeNot(DistributeNot):
    def on_compare(self, node):
        return node

class NegateDistributeNot(NegateDistributeNot):
    def on_compare(self, node):
        assert len(node.ops) == 1
        op = node.ops[0]
        if isinstance(op, ast.Eq):
            v = ast.Compare(node.left, [ast.NotEq()], node.comparators)
        elif isinstance(op, ast.NotEq):
            v = ast.Compare(node.left, [ast.Eq()], node.comparators)
        elif isinstance(op, ast.Lt):
            v =  ast.Compare(node.left, [ast.GtE()], node.comparators)
        elif isinstance(op, ast.Gt):
            v =  ast.Compare(node.left, [ast.LtE()], node.comparators)
        elif isinstance(op, ast.GtE):
            v = ast.Compare(node.left, [ast.Lt()], node.comparators)
        elif isinstance(op, ast.LtE):
            v = ast.Compare(node.left, [ast.Gt()], node.comparators)
        elif isinstance(op, ast.In):
            v = ast.Compare(node.left, [ast.NotIn()], node.comparators)
        elif isinstance(op, ast.NotIn):
            v = ast.Compare(node.left, [ast.In()], node.comparators)
        else:
            assert False
        return self.fix(v)

Check that it works

We can now define branch distance conversions in BDInterpreter class. we want the comparator to have access to K. So we pass in self.

from functools import reduce
class BDInterpreter(BDInterpreter):
    def on_unaryop(self, node):
        v = self.walk(node.operand)
        UnaryOP = {
        ast.Invert: lambda self, a: self.K,
        ast.Not: lambda self, a: self.K,
        ast.UAdd: lambda self, a: self.K,
        ast.USub: lambda self, a: self.K
        }
        return UnaryOP[type(node.op)](v)

def on_compare(self, node):
        hd = self.walk(node.left)
        op = node.ops[0]
        tl = self.walk(node.comparators[0])
        CmpOP = {
        ast.Eq: lambda self, a, b: 0 if a == b else math.abs(a - b) + self.K,
        ast.NotEq: lambda self, a, b: 0 if a != b else math.abs(a - b) + self.K,
        ast.Lt: lambda self, a, b: 0 if a < b else (a - b) + self.K,
        ast.LtE: lambda self, a, b:  0 if a <= b else (a - b) + self.K,
        ast.Gt: lambda self, a, b: 0 if a > b else (b - a) + self.K,
        ast.GtE: lambda self, a, b:  0 if a >= b else (b - a) + self.K,
        # The following are not in traditional branch distance,
        # but we can make an informed guess.
        ast.Is: lambda self, a, b: 0 if a is b else self.K,
        ast.IsNot: lambda self, a, b:  0 if a is not b else self.K,
        ast.In: lambda self, a, b: 0 if a in b else self.K,
        ast.NotIn: lambda self, a, b: 0 if a not in b else self.K,
        }
        return CmpOP[type(op)](self, hd, tl)

def on_boolop(self, node):
        vl = [self.walk(n) for n in node.values]
        BoolOP = {
        ast.And: lambda a, b: a + b,
        ast.Or: lambda a, b: min(a, b)
        }
        return reduce(BoolOP[type(node.op)], vl)

We need one more step. That is, we need to run the evaluator. In the below, we assume that we need to take the True branch. Hence, we use the normal_ast to find how to flip from the False branch. If on the other hand, you want to flip from the True branch to False branch of the conditional, then you need the negated_ast.

Let us try to make it run. Note that the examples would typically be present in code as

if a>b:
   # target branch
   print('Hello') 
else:
   # this branch was taken.
   print('Hi') 

How would you use the branch distance metric? Currently (2024) it is used to further fine-tune the fitness of a given input with respect to a given condition. That is, the approach distance is used as the whole number component of the fitness, and the branch distance is used as the fractional component, and the branch distance is only computed at the particular branch where the control flow diverged from the desired path. However, this is not the only way it can be used. Korel [^korel1990] initially suggested using the branch distance directly, computing the fitness based on the entire set of boolean conditions encountered on a desired path.

Note: A particularly good resource on SBST is Simon Marcus Poulding’s Ph.D. thesis “The Use of Automated Search in Deriving Software Testing Strategies”. [^korel1990]: Bogdan Korel “Automated software test data generation.” IEEE Transactions on software engineering, 1990 [^arcuri2011]: Andrea Arcuri “It really does matter how you normalize the branch distance in search-based software testing” 2011

Artifacts

The runnable Python source for this notebook is available here.

Contents

DistributeNot

Artifacts