We previously discussed how DDSET is implemented. DDSET, however, is a fairly complex algorithm: it tries to handle diverse concerns that arise in the real world, such as the difference between syntactic and semantic validity, the impact of tokens, variables, and so on. This complexity is not innate. One can produce a much simpler version of DDSET if one is only interested in abstracting inputs, and the predicate has no semantic validation phase. The algorithm is as follows:

(We use a number of library functions from that post.) Unlike previous posts, this post takes a top-down approach, since we have already defined a number of the required functions previously.
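For reference, the library helpers this post relies on can be sketched as follows. These are simplified stand-ins written for this walkthrough; the versions in the earlier posts may differ in detail.

```python
def is_nt(name):
    # A nonterminal is wrapped in angle brackets, e.g. '<expr>'.
    return name.startswith('<') and name.endswith('>')

def get_child(tree, path):
    # Follow a path of child indices from the root down to a node.
    for i in path:
        tree = tree[1][i]
    return tree

def replace_tree_node(tree, path, new_node):
    # Return a copy of the tree with the node at the given path replaced.
    if not path:
        return new_node
    name, children = tree[0], tree[1]
    new_children = [replace_tree_node(c, path[1:], new_node) if j == path[0] else c
                    for j, c in enumerate(children)]
    return (name, new_children, *tree[2:])

def tree_to_str(tree):
    # Concatenate the terminal leaves of the tree in order.
    name, children = tree[0], tree[1]
    if not is_nt(name):
        return name
    return ''.join(tree_to_str(c) for c in children)
```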

We first define our predicate. It returns PRes.success when the input contains doubled parentheses.

import re
from enum import Enum

class PRes(str, Enum):
    success = 'SUCCESS'
    failed = 'FAILED'
    invalid = 'INVALID'
    timeout = 'TIMEOUT'

def expr_double_paren(inp):
    if re.match(r'.*[(][(].*[)][)].*', inp):
        return PRes.success
    return PRes.failed

We define and use the example input as follows.

my_input = '1 + ((2 * 3 / 4))'
assert expr_double_paren(my_input) == PRes.success

We use the parser from the fuzzingbook and pass the result to the perses reducer from here.

expr_parser = Parser(EXPR_GRAMMAR, start_symbol='<start>', canonical=True)
parsed_expr_tree = list(expr_parser.parse(my_input))[0]
reduced_expr_tree = reduction(parsed_expr_tree, EXPR_GRAMMAR, expr_double_paren)

Now we are ready to call our generalizer, which takes the expression tree, the grammar, and the predicate, and returns the generalized pattern.

pattern = ddset_simple(reduced_expr_tree, EXPR_GRAMMAR, expr_double_paren)
print(pattern)

ddset_simple() is implemented as follows:

def ddset_simple(reduced_tree, grammar, predicate):
    vals = generalize(reduced_tree, [], [], grammar, predicate)
    ta = get_abstract_tree(reduced_tree, vals)
    return tree_to_str_a(ta)

The generalize() procedure tries to generalize a given tree recursively. It starts at the root node and replaces the node with a randomly generated tree rooted at the same nonterminal. It tries this a configurable number of times, and if the node can be replaced each time without the predicate failing, the path is marked abstract. If not, we descend into its children and try the same. While generating a new tree, any nodes previously marked as abstract are also replaced by randomly generated values.

def generalize(tree, path, known_paths, grammar, predicate):
    node = get_child(tree, path)
    if not is_nt(node[0]): return known_paths
    if can_abstract(tree, path, known_paths, grammar, predicate):
        known_paths.append(path)
        return known_paths
    for i, child in enumerate(node[1]):
        generalize(tree, path + [i], known_paths, grammar, predicate)
    return known_paths

The can_abstract() procedure checks whether the node at the given path can be abstracted. It is implemented as follows.

def can_abstract(tree, path, known_paths, grammar, predicate):
    i = 0
    while i < MAX_TRIES_FOR_ABSTRACTION:
        t = replace_all_paths_with_generated_values(tree, known_paths + [path], grammar)
        s = tree_to_str(t)
        res = predicate(s)
        if res == PRes.failed:
            return False
        elif res == PRes.invalid:
            continue  # retry; invalid inputs do not count toward the limit
        i += 1
    return True

The can_abstract() procedure tries MAX_TRIES_FOR_ABSTRACTION times to generate inputs that still satisfy the predicate; inputs that turn out invalid are retried without being counted. For this, it relies on replace_all_paths_with_generated_values(), which is implemented as follows.

def replace_all_paths_with_generated_values(tree, paths, grammar):
    my_tree = tree
    for p in paths:
        my_tree = replace_path_with_generated_value(my_tree, p, grammar)
    return my_tree

Here, the major work is done by replace_path_with_generated_value() which replaces a single given path with a generated node of the same kind.

def replace_path_with_generated_value(tree, path, grammar):
    node = get_child(tree, path)
    s, gnode = generate_random_value(grammar, node[0])
    t = replace_tree_node(tree, path, gnode)
    return t

Given a key, generate_random_value() produces a random value for that key using the grammar, returning both the string and the corresponding derivation tree.

def generate_random_value(grammar, key):
    fuzzer = LimitFuzzer(grammar)
    s = fuzzer.fuzz(key)
    return (s, fuzzer._s)
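The LimitFuzzer here comes from the earlier post. If you do not have it at hand, a minimal stand-in that returns both the fuzzed string and its derivation tree might look like this. This is a sketch: the name generate_tree and the max_depth cutoff are my own, and the real LimitFuzzer computes per-rule costs to guarantee termination.

```python
import random

def generate_tree(grammar, key, depth=0, max_depth=10):
    # Terminals are not keys in the grammar; return them as leaves.
    if key not in grammar:
        return (key, [])
    rules = grammar[key]
    # Past max_depth, pick the shortest rule, which usually (though
    # not always) steers the derivation toward termination.
    rule = min(rules, key=len) if depth > max_depth else random.choice(rules)
    return (key, [generate_tree(grammar, token, depth + 1, max_depth)
                  for token in rule])

def tree_to_str(tree):
    # Concatenate the terminal leaves of the tree in order.
    name, children = tree[0], tree[1]
    if not children:
        return name
    return ''.join(tree_to_str(c) for c in children)

def generate_random_value(grammar, key):
    # Return (string, derivation tree) for a random expansion of key.
    tree = generate_tree(grammar, key)
    return (tree_to_str(tree), tree)
```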

Finally, the converter from an abstract tree to a string pattern: abstract nodes are rendered as their nonterminal names, while concrete subtrees are serialized as usual.

def tree_to_str_a(tree):
    name, children, *general_ = tree
    if not is_nt(name): return name
    if is_node_abstract(tree):
        return name
    return ''.join([tree_to_str_a(c) for c in children])
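For example, applied to a small hand-built tree in which the inner <expr> node is marked abstract, tree_to_str_a() keeps the concrete leaves and emits the nonterminal name for the abstract node. The helper definitions are repeated here so that the snippet runs standalone.

```python
def is_nt(name):
    return name.startswith('<') and name.endswith('>')

def is_node_abstract(node):
    name, children, *abstract_a = node
    return True if not abstract_a else abstract_a[0]['abstract']

def tree_to_str_a(tree):
    name, children, *general_ = tree
    if not is_nt(name): return name
    if is_node_abstract(tree): return name
    return ''.join([tree_to_str_a(c) for c in children])

# '(' '(' <expr> ')' ')' with the innermost <expr> left abstract.
tree = ('<expr>',
        [('(', [], {'abstract': False}),
         ('(', [], {'abstract': False}),
         ('<expr>', [], {'abstract': True}),
         (')', [], {'abstract': False}),
         (')', [], {'abstract': False})],
        {'abstract': False})
print(tree_to_str_a(tree))  # ((<expr>))
```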

We also need a few library functions for marking some nodes concrete and some abstract.

def mark_concrete_r(tree):
    name, children, *abstract_a = tree
    abstract = {'abstract': False} if not abstract_a else abstract_a[0]
    return (name, [mark_concrete_r(c) for c in children], abstract)

def mark_path_abstract(tree, path):
    name, children = get_child(tree, path)
    new_tree = replace_tree_node(tree, path, (name, children, {'abstract': True}))
    return new_tree

def get_abstract_tree(tree, paths):
    for path in paths:
        tree = mark_path_abstract(tree, path)
    return mark_concrete_r(tree)

def is_node_abstract(node):
    name, children, *abstract_a = node
    if not abstract_a:
        return True
    else:
        return abstract_a[0]['abstract']

With this, we are ready to extract our pattern.

pattern = ddset_simple(reduced_expr_tree, EXPR_GRAMMAR, expr_double_paren)
print(pattern)

This prints:

$ python ddset_simple.py
((<expr>))

So, given that this algorithm is much simpler than the original, why should we use the original algorithm? The problem is that when the input is a file in a programming language, one also needs to take the semantics into account. That is, the generated input needs to be valid both syntactically (which holds by construction) and semantically. It is hard enough to fill a single hole in the parse tree (an abstract node) with a semantically valid subtree. Now imagine that you have identified one abstraction and are evaluating a second node: you need to generate random subtrees both for the previously identified abstract node and for the node currently under evaluation. Once three abstract nodes have been identified, every further candidate requires filling all three with randomly generated, semantically valid values at the same time. This blows up quickly, and becomes infeasible as more nodes are added. Hence, in the original DDSET, we evaluate each node independently, and once we have collected most of these nodes, we make a second pass to verify.

How much difference does it make? For Rhino bug 385, abstracting the minimal string var {baz: baz => {}} = baz => {}; took 15,643 executions for ddset_simple, compared to 10,340 executions for the DDSET from the paper (discounting covarying fragments).

The full code is available here.