Note: This is based on the ddmin implementation in the fuzzingbook.

Pyodide takes a little time to initialize. Please wait until the Run all button has a red border.

About Delta Debugging

Delta Debugging is a method to reduce failure-inducing inputs to the smallest size that still induces the same failure. It was first formally introduced in the paper Simplifying and Isolating Failure-Inducing Input by Zeller and Hildebrandt.

The idea of delta debugging is fairly simple. We start by partitioning the given input string, initially into two partitions of a given partition length. Then, we check whether any of these parts can be removed without losing the observed failure. If any can be removed, we remove all such parts of the given length. Once no parts of the given length can be removed, we halve the partition length and repeat the process. This yields a 1-minimal failure-inducing string, where removing even a single character makes the observed failure disappear.

Given a causal function as below, which returns true exactly when its input reproduces the failure of interest:
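What follows is a minimal sketch of such a function. It assumes the failure of interest is triggered by any input containing a complete <SELECT ...> tag; this particular predicate is only an assumption, chosen to be consistent with the example run below.

import re

def causal(inp):
    # Assumed predicate: the failure is reproduced whenever the input
    # contains '<SELECT' eventually followed by a closing '>'.
    return re.search(r'<SELECT[^>]*>', inp) is not None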



Here is an example run:

$ python ddmin.py '<SELECT NAME="priority" MULTIPLE SIZE=7>'
.  ty" MULTIPLE SIZE=7> 20
.  <SELECT NAME="priori 20
.  ME="priority" MULTIPLE SIZE=7> 30
+  <SELECT NAty" MULTIPLE SIZE=7> 30
+  <SELECT NALE SIZE=7> 20
.  <SELECT NA 10
.  CT NALE SIZE=7> 15
.  <SELELE SIZE=7> 15
+  <SELECT NAZE=7> 15
.  <SELECT NA 10
.  ELECT NAZE=7> 13
.  <SECT NAZE=7> 13
.  <SELT NAZE=7> 13
.  <SELECNAZE=7> 13
+  <SELECT ZE=7> 13
+  <SELECT =7> 11
+  <SELECT > 9
.  <SELECT  8
.  SELECT > 8
.  <ELECT > 8
.  <SLECT > 8
.  <SEECT > 8
.  <SELCT > 8
.  <SELET > 8
.  <SELEC > 8
+  <SELECT> 8
.  <SELECT 7
<SELECT>

Implementation

How do we implement this?

First, the prerequisites:



remove_check_each_fragment()

Given a partition length, we want to split the string into partitions of that length, remove each partition one at a time from the string, and check whether causal() still succeeds without it. If it does, we can drop that section of the string.
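Here is a sketch of this step. It assumes causal() returns True exactly when the failure is reproduced, and follows the description above rather than any particular published implementation. Note that the string it returns always satisfies causal(), which is the point made in the next paragraph.

def remove_check_each_fragment(instr, part_len, causal):
    # Walk the string in chunks of part_len. If dropping a chunk still
    # reproduces the failure, leave it out; otherwise keep it in pre.
    pre = ''
    for i in range(0, len(instr), part_len):
        without_fragment = pre + instr[i + part_len:]
        if causal(without_fragment):
            continue  # this fragment is not needed for the failure
        pre = pre + instr[i:i + part_len]
    return pre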



There is a reason this function is split out from the main function, unlike in the original implementation of ddmin: remove_check_each_fragment obeys the invariant that any string it returns satisfies the contract represented by the causal function. This means that any test case produced by remove_check_each_fragment will reproduce the specified behavior, and can be used for other computations. For example, one may use it for evaluating test reduction slippage, or for finding other reductions.

ddmin()

The main function. We start with the smallest number of partitions: two. Then, we check each fragment to see whether it can be removed. If removing a fragment succeeds, we set the current string to the string without that fragment; in this way, we remove all fragments that can be removed at the current partition size. If none of the fragments could be removed, we halve the partition length. If the partition length cannot be halved again (i.e., the last partition length was one) or the string has become empty, we stop the iteration.
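A sketch that follows this description, reusing the remove_check_each_fragment() sketched above:

def ddmin(cur_str, causal):
    # Start with two partitions, i.e. a partition length of half the string.
    part_len = len(cur_str) // 2
    while part_len and cur_str:
        reduced = remove_check_each_fragment(cur_str, part_len, causal)
        if reduced == cur_str:
            # Nothing could be removed at this granularity; halve it.
            part_len = part_len // 2
        cur_str = reduced
    return cur_str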



The driver.
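A possible driver, assuming the code above is saved as ddmin.py as in the example run. It wraps causal() so that every candidate is logged with a '+' or '.' marker and its length; the exact logging format is an assumption chosen to match the output shown earlier.

import sys

def logged_causal(inp):
    # '+' marks candidates that still reproduce the failure, '.' those that do not.
    res = causal(inp)
    print('%s  %s %d' % ('+' if res else '.', inp, len(inp)))
    return res

if __name__ == '__main__':
    print(ddmin(sys.argv[1], logged_causal))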







The nice thing is that, if you invoke the driver, you can see the reduction in input length in action. Note that our driver is essentially a best-case scenario. In the worst case, the complexity is \(O(n^2)\).

Recursive

That was of course illuminating. However, is that the only way to implement this? Delta debugging, at its heart, is a divide-and-conquer algorithm. Can we implement it recursively?

The basic idea is that, given a string, we can split it into two parts and check whether either part reproduces the failure on its own. If either one does, we call ddrmin() on the part that reproduced the failure.

If neither one does, then there is some part in the first partition that is required for the failure, and some part in the second partition that is required as well. All that we need to do now is to isolate these parts. How should we do that?

Call ddrmin(), but with an updated check. For example, for the first part, rather than checking whether some portion of the first part alone produces the failure, check whether some part of the first, when combined with the second, causes the failure.

All we have left to do is to define the base case. In our case, a string of length one cannot be partitioned into strictly smaller parts. Further, we already know that any string passed into ddrmin() was required for reproducing the failure, so we do not have to worry about the empty string. Hence, we can return it as is.

Here is the implementation.

ddrmin()
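Below is a recursive sketch along the lines just described. The precondition is that the string passed in reproduces the failure under the given check; it is invoked the same way as ddmin(), e.g. ddrmin(failing_input, causal).

def ddrmin(s, causal):
    # Base case: a single character cannot be split into smaller parts, and
    # we already know it is required for the failure, so return it as is.
    if len(s) <= 1:
        return s
    mid = len(s) // 2
    first, second = s[:mid], s[mid:]
    # If either half reproduces the failure on its own, reduce just that half.
    if causal(first):
        return ddrmin(first, causal)
    if causal(second):
        return ddrmin(second, causal)
    # Both halves are needed: reduce each half with a check that keeps the
    # (possibly already reduced) other half attached.
    first_min = ddrmin(first, lambda part: causal(part + second))
    second_min = ddrmin(second, lambda part: causal(first_min + part))
    return first_min + second_min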



Given that it is a recursive procedure, one may worry about stack exhaustion, especially in languages such as Python, which sets a fairly low recursion limit by default. The nice thing here is that, since we split the string in half again and again, the maximum recursion depth required is \(O(\log N)\) in the input size. So there is no danger of exhaustion.

The recursive algorithm is given in Yesterday, my program worked. Today, it does not. Why? by Zeller in 1999.

Is this Enough? (Syntax)

One of the problems with the unstructured version of ddmin() above is that it assumes that parts of the input can be cut away while the input still remains valid enough to reach the testing function. This, however, may not be a reasonable assumption, especially in the case of structured inputs. The problem is that if you have a JSON input [{"a": null}] that produces an error because the key value is null, ddmin() will try to partition it as [{"a": followed by null}], neither of which is valid. Further, any chunk removed from either part is also likely to result in an invalid input. Hence, ddmin() will not be able to proceed.

The solution here is to go for a variant called Hierarchical Delta Debugging, described in HDD: hierarchical delta debugging by Misherghi et al. in 2006. The basic idea is to first extract the parse tree (either by parsing the input using a grammar, by relying on a generator to produce the parse tree along with the input, or by extracting the parse tree in flight from the target program if it supports that), and then to apply ddmin() on each level of the derivation tree. Another notable improvement is Automatically Reducing Tree-Structured Test Inputs by Herfert et al. in 2017. In this paper, the authors describe a simple strategy: try to replace a given node by one of its child nodes. This works reasonably well for inputs that are context sensitive, such as programs that can trigger bugs in compilers. Another is Perses: Syntax-Guided Program Reduction by Sun et al. in 2018, which uses the program grammar directly to avoid creating invalid nodes.

The problem with the above approach is that it assumes there exists a child node of the same type as the parent node. This need not be the case. Hence, an even better idea might be to do a breadth-first search for the first descendant node that has the same symbol as the current node, try replacing the current node with it, and continue with the next node of the same kind if that fails.

Perses

Below is a variant of hierarchical ddmin() from “Perses: syntax-guided program reduction” by Sun et al. in 2018.



A few helpers

Check if a given token is a nonterminal
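The sketches in this section assume fuzzingbook-style conventions: nonterminals are written as <name>, a grammar maps each nonterminal to a list of rules (each rule being a list of tokens), and a derivation tree is a (symbol, children) tuple whose children form a list. Under those assumptions:

def is_nt(symbol):
    # A token is a nonterminal if it is written in angle brackets, e.g. '<expr>'.
    return len(symbol) > 2 and symbol[0] == '<' and symbol[-1] == '>'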



We prioritize the smallest trees.
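As the measure of size, the sketch below simply counts the nodes of a derivation tree; this count is what we use as the priority.

def count_nodes(tree):
    # Size of a derivation tree: this node plus all of its descendants.
    symbol, children = tree
    return 1 + sum(count_nodes(c) for c in children)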



A priority queue
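A small sketch of a priority queue built on heapq. The tie-breaking counter is there only because trees themselves are not orderable.

import heapq
import itertools

class PriorityQueue:
    def __init__(self):
        self.heap = []
        self.counter = itertools.count()

    def push(self, priority, item):
        # The counter breaks ties so heapq never compares the items themselves.
        heapq.heappush(self.heap, (priority, next(self.counter), item))

    def pop(self):
        return heapq.heappop(self.heap)[-1]

    def empty(self):
        return not self.heap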



A way to temporarily replace a node in a tree.
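Since our trees are immutable tuples, the replacement can be done functionally: rebuild only the spine along a path of child indices, and the original tree stays untouched.

def replace_path(tree, path, new_subtree):
    # Return a copy of tree with the node at path (a sequence of child
    # indices) replaced by new_subtree; the original tree is unchanged.
    if not path:
        return new_subtree
    symbol, children = tree
    i = path[0]
    return (symbol, children[:i]
            + [replace_path(children[i], path[1:], new_subtree)]
            + children[i + 1:])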



Find all subtrees with the given symbol
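A straightforward recursive collection:

def subtrees_with_symbol(tree, symbol):
    # Return every subtree (including tree itself) whose root symbol matches.
    found = []
    sym, children = tree
    if sym == symbol:
        found.append(tree)
    for child in children:
        found.extend(subtrees_with_symbol(child, symbol))
    return found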



The Perses algorithm itself.
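The sketch below captures only the subtree-replacement part of Perses: keep a queue of candidate trees ordered by size, and for each nonterminal node try to replace it with a strictly smaller subtree of the same symbol, so that the result stays within the grammar. The full algorithm also exploits grammar quantifiers to delete optional nodes. The helpers tree_to_string and all_nodes are additional assumptions of this sketch, and the precondition is that the input tree itself satisfies the predicate.

def tree_to_string(tree):
    # Concatenate the terminal leaves of a derivation tree.
    symbol, children = tree
    if not children:
        return '' if is_nt(symbol) else symbol
    return ''.join(tree_to_string(c) for c in children)

def all_nodes(tree, path=()):
    # Yield (path, node) pairs for every node in the tree.
    yield path, tree
    for i, child in enumerate(tree[1]):
        yield from all_nodes(child, path + (i,))

def perses_delta_debug(tree, predicate):
    # predicate takes the string form of a tree and returns True when the
    # behavior of interest (here, the failure) is still reproduced.
    q = PriorityQueue()
    q.push(count_nodes(tree), tree)
    best = tree
    seen = {tree_to_string(tree)}
    while not q.empty():
        current = q.pop()
        if count_nodes(current) < count_nodes(best):
            best = current
        for path, node in all_nodes(current):
            if not is_nt(node[0]):
                continue
            # Try replacing this node with any smaller subtree of the same symbol.
            for repl in subtrees_with_symbol(node, node[0]):
                if count_nodes(repl) >= count_nodes(node):
                    continue
                candidate = replace_path(current, path, repl)
                s = tree_to_string(candidate)
                if s in seen:
                    continue
                seen.add(s)
                if predicate(s):
                    q.push(count_nodes(candidate), candidate)
    return best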



The driver

First, we define the predicate. Our predicate is simple: we check whether the expression has doubled parentheses.
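A possible version of that predicate; the exact check is an assumption of this sketch.

import re

def predicate(s):
    # 'Interesting' inputs are expressions containing doubled parentheses,
    # e.g. '((2 * 3))'.
    return re.search(r'\(\(.*\)\)', s) is not None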



We have the following grammar
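A plausible arithmetic-expression grammar in the dictionary format assumed earlier; the grammar actually used may differ in detail.

EXPR_GRAMMAR = {
    '<start>': [['<expr>']],
    '<expr>': [['<term>', ' + ', '<expr>'],
               ['<term>', ' - ', '<expr>'],
               ['<term>']],
    '<term>': [['<factor>', ' * ', '<term>'],
               ['<factor>', ' / ', '<term>'],
               ['<factor>']],
    '<factor>': [['(', '<expr>', ')'],
                 ['<digit>']],
    '<digit>': [[str(i)] for i in range(10)],
}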



And the following input
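Any expression with doubled parentheses will do; for instance:

my_input = '1 + ((2 * 3))'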



We parse and generate a derivation tree as follows

Loading the prerequisite:





The derivation tree looks like this





Now, we are ready to use the perses_delta_debug()
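Assuming derivation_tree is the tree produced by the parsing step above, in the (symbol, children) format, the call looks like this:

reduced_tree = perses_delta_debug(derivation_tree, predicate)
print(tree_to_string(reduced_tree))

With the sketches above, the reduced string would be something like ((3)): a smallest expression that still contains doubled parentheses.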



Is this Enough? (Semantics)

Note that at this point, we can generate syntactically valid inputs while reducing, but there is no guarantee that they will be semantically valid (variable def-use relations and so on). So, we still suffer a large rate of unresolved (invalid) test cases. Is there a better solution? One option, notably followed by creduce, is to rely on custom reduction passes that maintain the validity of the string being operated on; that is, to incorporate domain knowledge into the reduction.

A rather different approach is taken by Hypothesis. Hypothesis again performs generator-based reduction.

The Hypothesis reducer works as follows: given any generator that relies on randomness, one can envision an array of random bits that is passed to the generator. For example, consider the simple grammar generator below. It is passed bits, an array of random bits with a method next(), which returns an integer and advances the index.

def unify_key(bits, grammar, key):
    # For a nonterminal, consume one choice to pick which rule to expand;
    # for a terminal, return it as is.
    return unify_rule(bits, grammar, grammar[key][bits.next() % len(grammar[key])]) if key in grammar else [key]

def unify_rule(bits, grammar, rule):
    # Expand every token in the rule and concatenate the resulting terminals.
    return sum([unify_key(bits, grammar, token) for token in rule], [])

When the generator wants to make a random choice, it consumes a number of bits from this array, and advances the index to denote that the bits were consumed. Now, at the end of generation, the bits consumed so far are essentially the trail of the generation.
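To actually run the generator above, we need something standing in for bits. Here is a minimal sketch; the class name Bits, and the choice to raise an IndexError when the recorded choices run out (rather than filling in further random values), are assumptions of this sketch, not how Hypothesis itself behaves.

class Bits:
    # A recorded trail of choices. next() returns the current integer and
    # advances the index; running past the recorded choices raises IndexError.
    def __init__(self, choices):
        self.choices = choices
        self.index = 0

    def next(self):
        v = self.choices[self.index]
        self.index += 1
        return v

# With the EXPR_GRAMMAR sketched earlier, a fixed trail deterministically
# produces a fixed string:
# ''.join(unify_key(Bits([0, 2, 2, 1, 5]), EXPR_GRAMMAR, '<start>')) == '5'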

Hypothesis tries to reduce the length of this trail (which it calls the choice sequence). It tries to identify shorter choice sequences (in short-lex order) that can reproduce the failure. It does this through certain surgeries on the choice sequence obtained from generating the string. The particular surgeries are simply removals of chunks that seem to work well in practice, with no specific reasoning behind them. If it finds that more bits are required than in the original choice sequence, it discards the edit. Otherwise, the idea is that short-lex order almost always leads to shorter strings by prioritizing strings with a smaller number of decisions, and consequently a smaller number of recursions and hence terminal symbols.
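Here is a crude sketch of such a reduction in the spirit described above, reusing the Bits, unify_key(), EXPR_GRAMMAR and predicate sketches from earlier: try deleting chunks of the recorded choice sequence, regenerate, and keep an edit only if generation does not need more choices than remain and the predicate still holds. Hypothesis's actual shrinker has many more passes and orders candidates in short-lex order; this is only an illustration.

def reduce_choices(choices, grammar, predicate):
    chunk = len(choices) // 2
    while chunk > 0:
        i = 0
        while i < len(choices):
            candidate = choices[:i] + choices[i + chunk:]
            try:
                s = ''.join(unify_key(Bits(candidate), grammar, '<start>'))
            except IndexError:
                s = None  # the edit needs more choices than we have: discard it
            if s is not None and predicate(s):
                choices = candidate  # keep the shorter choice sequence
            else:
                i += chunk
        chunk = chunk // 2
    return choices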

Note that unlike HDD variants, any edit on the choice sequence is guaranteed to produce a valid string.

One question here is: why do these almost arbitrary edits to choice sequences work? The surprise is that arbitrary edits work even though any edit can cause the remaining bits to be passed to different choices than before. The reason seems to be that bugs are usually locally concentrated. That is, once one is able to reproduce the failure pattern, one can fill in the remaining parts of the input with randomly generated (but valid) values. Note that the skeleton of the input is generated by the initial choices. So the Hypothesis strategy seems to be to find edits that reach the failure pattern quickly, and then fill in the remaining blanks, which seems to work well in practice.