Hierarchical Delta Debugging
Previously, we discussed how delta debugging (`ddmin`) worked. However, pure `ddmin` is limited when it comes to structured inputs. One of the problems with the unstructured version of `ddmin()` is that it assumes parts of the input can be cut away while still retaining validity, in the sense that the reduced input still reaches the testing function. This, however, may not be a reasonable assumption, especially in the case of structured inputs. The problem is that if you have a JSON input `[{"a": null}]` that produces an error because the key value is `null`, `ddmin()` will try to partition it into `[{"a":` followed by `null}]`, neither of which is valid. Further, any chunk removed from either part is also likely to be invalid. Hence, `ddmin()` will not be able to proceed.
The solution here is to go for a variant called Hierarchical Delta Debugging [1]. The basic idea is to first extract the parse tree (either by parsing the input using a grammar, by relying on a generator to produce the parse tree along with the input, or by extracting the parse tree in-flight from the target program if it supports it), and then try to apply `ddmin()` on each level of the derivation tree. Another notable improvement was invented by Herfert et al. [2]. In this paper, the authors describe a simple strategy: try to replace a given node by one of its child nodes. This works reasonably well for inputs that are context sensitive, such as programs that can trigger bugs in compilers. A final algorithm is Perses [3], which uses the program grammar directly to avoid creating invalid nodes. In this post, I will explain the main algorithm of Perses. I note that a similar idea was explored by Pike in 2014 [4], albeit for data structures.
The basic idea of Perses is to try to replace a parent node either by an empty expansion (so that the corresponding string fragment can be deleted) or by a child node of the same nonterminal. To accomplish the first (that is, to be able to use empty expansions), Perses first transforms the given grammar to the Perses normal form. We can, however, skip that portion if we are careful and use a grammar where we explicitly use nonterminals that can have an empty expansion. In this post, I only explain the Perses reduction algorithm. First, some bookkeeping.
Available Packages
These are packages that refer either to my previous posts or to pure Python packages that I have compiled, and they are available at the locations below. As before, install them if you need to run the program directly on your machine. To install, simply download the wheel file (`pkg.whl`) and install it using `pip install pkg.whl`.
The imported modules
Predicate
For the delta debugging algorithm to work, we need a predicate. However, unlike `ddmin`, which works with a simple pass or fail status, we need a bit more sophistication here. The reason is that in many cases, the newly constructed strings we generate may themselves be invalid. For example, a particular program may produce a core dump from the C compiler; the core dump is then the *pass* for the predicate. On the other hand, deleting the declaration of a variable may result in a compilation error, which is different from a simple *fail*, i.e., no core dump on a semantically valid program. To capture these different conditions, we define new return values for the predicate.
There can be four outcomes when an input is executed:
- Success (failure condition reproduced)
- Failed (failure condition not reproduced)
- Invalid (did not reach the failure condition; possibly semantically invalid)
- Timeout (equivalent to Failed)
We now define our predicate
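Here is a minimal sketch of what such a predicate could look like. The `PRes` enum names and the toy `expr_double_paren()` check (which treats a doubled parenthesis in an arithmetic expression as the "failure") are purely illustrative, not the exact definitions used in the accompanying code.

```python
import re
import enum

class PRes(str, enum.Enum):
    success = 'SUCCESS'   # the failure condition was reproduced
    failed = 'FAILED'     # a valid input, but no failure
    invalid = 'INVALID'   # the input did not reach the failure condition
    timeout = 'TIMEOUT'   # treated the same as failed

def expr_double_paren(inp):
    # toy predicate: the "bug" we chase is a doubled parenthesis
    if re.match(r'.*[(][(].*[)][)].*', inp):
        return PRes.success
    return PRes.failed
```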
Checking
We now define a grammar
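For illustration, here is a small arithmetic expression grammar in the dictionary format used in my earlier posts (nonterminals are written as `<name>`). The exact grammar is an assumption; any context-free grammar in this format will do, and a nonterminal that should be deletable would additionally carry an explicit empty expansion `[]`.

```python
EXPR_GRAMMAR = {
    '<start>': [['<expr>']],
    '<expr>': [
        ['<term>', ' + ', '<expr>'],
        ['<term>', ' - ', '<expr>'],
        ['<term>']],
    '<term>': [
        ['<fact>', ' * ', '<term>'],
        ['<fact>', ' / ', '<term>'],
        ['<fact>']],
    '<fact>': [
        ['<digits>'],
        ['(', '<expr>', ')']],
    '<digits>': [
        ['<digit>', '<digits>'],
        ['<digit>']],
    '<digit>': [[str(i)] for i in range(10)]
}
```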
Next, we define a `display_tree()` function.
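A minimal sketch of such a function, assuming the derivation tree node format `(name, children)` used throughout this post:

```python
def display_tree(node, indent=0):
    # print each node of the derivation tree, indented by its depth
    name, children, *rest = node
    print(' ' * indent + name)
    for c in children:
        display_tree(c, indent + 1)
```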
Now, let us define an input and parse it.
The Perses algorithm is as follows: We start at the root and recursively go down the child nodes. For each node, we check whether that node can be replaced by a subtree with the same nonterminal that still reproduces the failure, and we find the smallest such tree (size determined by the number of leaves).
Since this procedure can result in multiple trees, the tree to work on is chosen based on a priority queue where the priority is given to the smallest tree. The particular node chosen to replace the current node is determined first by its number of leaf nodes, and then by its rank in a priority queue, where the priority is determined by the depth of the subtree from the current node. That is, a child gets priority over a grandchild. We first have to define a way to address a specific node.
For the path, we simply use a list of numbers, each indicating which child to descend into. For example, in the tree above, the path would be `[0, 2, 0]`. Given a path, `get_child()` simply returns the node at that path.
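A minimal sketch, again assuming the `(name, children)` node format:

```python
def get_child(node, path):
    # follow the sequence of child indices down the tree
    if not path:
        return node
    name, children, *rest = node
    return get_child(children[path[0]], path[1:])
```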
Using it.
We also need a way to replace one node with another. This is done by `replace_path()`.
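One possible implementation, as a sketch: replacing a node with an empty value removes the corresponding fragment, while replacing it with another node grafts that node in place.

```python
import copy

def replace_path(node, path, new_node=None):
    # return a copy of node in which the subtree at path is replaced by
    # new_node; an empty new_node removes that subtree entirely
    if new_node is None:
        new_node = []
    if not path:
        return copy.deepcopy(new_node)
    name, children, *rest = node
    new_children = []
    for i, c in enumerate(children):
        nc = replace_path(c, path[1:], new_node) if i == path[0] else c
        if nc:
            new_children.append(nc)
    return (name, new_children, *rest)
```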
Using it.
Priority queue
For Perses reduction, one needs a way to count the number of leaf nodes of a subtree, which determines the priority of a node. This is done by `count_leaves()`.
We also define a helper that simply counts the internal nodes.
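Sketches of these two helpers, under the same node format assumption:

```python
def count_leaves(node):
    # number of leaf (terminal) nodes in the subtree
    name, children, *rest = node
    if not children:
        return 1
    return sum(count_leaves(c) for c in children)

def count_nodes(node):
    # number of internal nodes in the subtree
    name, children, *rest = node
    if not children:
        return 0
    return 1 + sum(count_nodes(c) for c in children)
```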
Next, we need to maintain a priority queue of (tree, path) items. The essential idea is to prioritize the items first by the number of leaves in the full tree (that is, the smallest tree that we currently have gets priority), then by the number of leaves in the node pointed to by the path, and finally, to break ties by the insertion order (ecount).
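With Python's `heapq`, the priority tuple might look like the sketch below. The global `ecount` counter is there only to break ties and to keep the heap from ever comparing the trees themselves.

```python
import heapq

ecount = 0  # global insertion counter, used only to break ties

def add_to_pq(tree, path, q):
    # priority: smallest tree first, then smallest node at path, then FIFO
    global ecount
    heapq.heappush(q, (count_leaves(tree),
                       count_leaves(get_child(tree, path)),
                       ecount,
                       (tree, path)))
    ecount += 1
```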
We define another helper function `nt_group()` that groups all nonterminal nodes that have the same name. These groups are used to determine the nodes that can be used to replace a given node.
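A sketch of such a grouping: it simply walks the tree and buckets subtrees by their nonterminal name.

```python
def nt_group(tree, all_nodes=None):
    # group every subtree in the tree by its nonterminal name
    if all_nodes is None:
        all_nodes = {}
    name, children, *rest = tree
    if name.startswith('<'):
        all_nodes.setdefault(name, []).append(tree)
    for c in children:
        nt_group(c, all_nodes)
    return all_nodes
```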
Using it
What are the compatible nodes? These are all the nodes that have the same nonterminal name and are descendants of the current node. Further, if the nonterminal allows an empty expansion, then the empty expansion comes first in the list. This is defined by `compatible_nodes()`.
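One way to write it, under the grammar format assumed above, where an empty rule `[]` marks a nonterminal that may expand to nothing:

```python
def compatible_nodes(node, grammar):
    # all proper descendants of node that share its nonterminal name;
    # the empty expansion, if the grammar allows one, comes first
    name, children, *rest = node
    candidates = [n for n in nt_group(node).get(name, []) if n is not node]
    if [] in grammar.get(name, []):
        candidates = [(name, [])] + candidates
    return candidates
```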
Using it.
Some programming languages have tokens, which are first-level lexical elements. The parser is often defined on top of these lexer tokens. We do not want to try to reduce tokens further, so we define a way to identify them (something we have to keep in mind when we produce grammars). For now, we assume the ANTLR convention of identifying tokens by uppercase letters.
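A sketch of that check, assuming nonterminals are written as `<name>` and lexer token names are fully uppercase:

```python
def is_token(name):
    # lexer tokens follow the ANTLR convention of all-uppercase names
    assert (name[0], name[-1]) == ('<', '>')
    return name.upper() == name
```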
Perses reduction
We finally define the reduction algorithm. The implementation of Perses is given in `reduction()`. The essential idea is as follows (a sketch of the loop, written in terms of the helpers above, follows this list):
- We have a priority queue of (tree, path_to_node) structures, where the path addresses a node within the tree.
- The highest priority is given to the smallest tree.
- Within the same tree, priority is given to nodes with the smallest number of leaves.
- In case of a tie, the shallowest subnode gets the highest priority (i.e., a child has higher priority than a grandchild, and the empty expansion has the highest priority since it is a peer of the current node).
- We pick each node, and find compatible descendant nodes that still reproduce the failure.
- Each compatible node, along with the corresponding new tree, is put back into the priority queue.
- If no replacement was found for the current node, then we add each of its children (with the current tree) into the priority queue. (If we had to recurse into the child nodes, then the next tree that gets picked may well be a different tree.)
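Here is a minimal sketch of that loop, written against the helper sketches above. The `tree_to_string()` helper, the bookkeeping details, and the stopping behaviour are my assumptions; the actual `reduction()` in the accompanying code may differ.

```python
import heapq

def tree_to_string(node):
    # hypothetical helper: concatenate the terminal symbols of the tree
    name, children, *rest = node
    if not children:
        return '' if name.startswith('<') else name
    return ''.join(tree_to_string(c) for c in children)

def reduction(tree, grammar, predicate):
    # sketch of the Perses loop: pop the best (tree, path), try compatible
    # replacements, and requeue whatever still reproduces the failure
    p_q = []
    add_to_pq(tree, [], p_q)
    smallest = tree
    while p_q:
        _, _, _, (t, path) = heapq.heappop(p_q)
        if count_leaves(t) < count_leaves(smallest):
            smallest = t
        node = get_child(t, path)
        name, children, *rest = node
        if not children or is_token(name):
            continue                  # terminals and tokens are not reduced
        found = False
        for repl in compatible_nodes(node, grammar):
            candidate = replace_path(t, path, repl)
            if predicate(tree_to_string(candidate)) == PRes.success:
                found = True
                add_to_pq(candidate, path, p_q)
        if not found:
            # no compatible replacement worked: descend into the children
            for i, _ in enumerate(children):
                add_to_pq(t, path + [i], p_q)
    return smallest
```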
Using it
Is this Enough? (Semantics)
Note that at this point, we can generate syntactically valid inputs to check during reduction, but there is no guarantee that they will be semantically valid (i.e., respect variable def-use and similar constraints). So, we still suffer a large unresolved rate. Is there a better solution? One option, notably followed by creduce, is to rely on custom reduction passes that maintain the validity of the string being operated on. That is, incorporate domain knowledge into the reduction.
A rather different approach is taken by Hypothesis. Hypothesis is, again, a generator-based reducer. The Hypothesis reducer works as follows: given any generator that relies on randomness, one can envision an array of random bits that is passed to the generator. For example, consider the simple grammar generator below. It is passed `bits`, which is an array of random bits wrapped with a function `next()` that returns an integer and advances the index.
```python
def unify_key(bits, grammar, key):
    # pick an expansion for key using the next value; non-keys are terminals
    return unify_rule(bits, grammar, grammar[key][bits.next() % len(grammar[key])]) if key in grammar else [key]

def unify_rule(bits, grammar, rule):
    # expand each token of the rule and concatenate the resulting terminals
    return sum([unify_key(bits, grammar, token) for token in rule], [])
```
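For concreteness, here is a hypothetical `Bits` wrapper and a usage example with the illustrative grammar from above. Note that purely random choices can recurse deeply on a recursive grammar, so the example guards against that.

```python
import random

class Bits:
    # a hypothetical choice sequence: a list of random integers with an
    # index that advances every time next() is called
    def __init__(self, lst=None):
        self.lst, self.idx = (lst if lst is not None else []), 0
    def next(self):
        if self.idx >= len(self.lst):
            self.lst.append(random.randrange(256))
        v = self.lst[self.idx]
        self.idx += 1
        return v

bits = Bits()
try:
    print(''.join(unify_key(bits, EXPR_GRAMMAR, '<start>')))
    print(bits.lst[:bits.idx])   # the trail of choices consumed so far
except RecursionError:
    print('generation recursed too deeply; retry with another choice sequence')
```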
When the generator wants to make a random choice, it consumes a number of bits from this array and advances the index to denote that the bits were consumed. At the end of generation, the bits consumed so far are essentially the trail of the generation.
Hypothesis tries to reduce the length of this trail (which it calls the choice sequence). That is, it tries to identify shorter choice sequences (in short-lex order) that can reproduce the failure. It does this by performing certain surgeries on the choice sequence obtained from generating the string. The particular surgeries are simply removals of chunks that seem to work well in practice, with no specific reasoning behind them. If it finds that more bits are required than in the original choice sequence, then it discards the edit. Otherwise, the idea is that short-lex order almost always leads to shorter strings by prioritizing strings with a smaller number of decisions, and consequently fewer recursions and hence fewer terminal symbols.
Note that unlike HDD variants, any edit on the choice sequence is guaranteed to produce a valid string.
One question here is: why do these almost arbitrary edits to the choice sequence work? The surprise is that arbitrary edits work even though any edit can end up passing the remaining bits to different choices than in the original. The reason seems to be that bugs are usually locally concentrated. That is, once one is able to reproduce the failure pattern, one can fill in the remaining parts of the input with randomly generated (but valid) values. Note that the skeleton of the input is generated by the initial choices themselves. So, the Hypothesis strategy seems to be to find edits that reach the failure pattern quickly and then fill in the remaining blanks, which seems to work in practice.
References
[1] HDD: Hierarchical Delta Debugging, Ghassan Misherghi, Zhendong Su, ICSE 2006
[2] Automatically Reducing Tree-Structured Test Inputs, Satia Herfert, Jibesh Patra, Michael Pradel, ASE 2017
[3] Perses: Syntax-Guided Program Reduction, Chengnian Sun, Yuanbo Li, Qirun Zhang, Tianxiao Gu, Zhendong Su, ICSE 2018
[4] SmartCheck: Automatic and Efficient Counterexample Reduction and Generalization, Lee Pike, Haskell 2014
Artifacts
The runnable Python source for this notebook is available here.
The installable Python wheel `hdd` is available here.