The imported modules
A grammar that produces at least one evocative fragment per input generated.

Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

This post is the implementation of my paper Input Algebras

In my previous post on DDSet I explained how one can abstract failure inducing inputs. The idea is that given an input such as '1+((2*3/4))' which induces a failure in the program, one can extract a minimal failure inducing input such as say '((4))', which can again be abstracted to produce the expression ((<expr>)) which precisely defines the fault inducing input. Such an input is quite useful for the developers to understand why a failure occurred.

However, such inputs are insufficient from a testing perspective. For example, such a pattern fails to capture the contexts in which the doubled parenthesis can appear, and hence induce the failure. That is, how do we use the pattern ((<expr>)) to produce inputs such as '1+((2*3/4))'? This is what this post will discuss.

Note that the guarantee of inducing failures is statistical. That is, the abstract input is mined by evaluating produced inputs for failure a fixed (configurable) number of times. It is possible that some rare inputs may still fail to induce the failure. Since abstract possibly failure inducing inputs is a mouthful, let us call these abstract failure inducing inputs evocative patterns for short. In this post, we will see how to transform such an evocative pattern to an evocative grammar that is guaranteed to produce evocative inputs (that is, each input contains an evocative fragment and the derivation tree of the input contains an evocative subtree) in all contexts that are guaranteed (statistically) to induce failures.

As before, let us start with importing our required modules.

Available Packages

These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.

earleyparser-0.0.1-py2.py3-none-any.whl from "Earley Parser".
hdd-0.0.1-py2.py3-none-any.whl from "Hierarchical Delta Debugging".
simplefuzzer-0.0.1-py2.py3-none-any.whl from "The simplest grammar fuzzer in the world".
ddset-0.0.1-py2.py3-none-any.whl from "Simple DDSet".

The imported modules

We import the following modules

We have our input that causes the failure.

We first parse the input

Then reduce input

Finally, extract the abstract pattern.

However, it does not actually tell you in what all circumstances the failure can be reproduced. The problem is that it represents a complete abstract input. The only general part here is <expr> between the doubled parenthesis. So, it tells us that we can produce ((1)), ((2 + 3)) etc. but these are not the only possible errors. Indeed, our original error: 1+((2*3/4)) does not fit this template. So, how can we rectify this limitation?

A grammar that produces at least one evocative fragment per input generated.

The basic idea is that if we can find an abstract subtree of the evocative pattern derivation tree, such that the presence of this abstract subtree in the input guarantees the failure, then we can modify the grammar such that this abstract subtree is always present (we call the abstract subtree an evocative subtree). That is, for any input from such a grammar, at least one instance of the evocative subtree will be present. This is fairly easy to do if the generated tree contains a nonterminal of the same kind as that of the top node of the evocative subtree. Simply replace that node with the evocative subtree, and fill in the abstract nonterminals in the evocative subtree with concrete expansions.

Reachable Grammar

On the other hand, this gives us an idea. What if we modify the grammar such that at least one instance of such a nonterminal is present? Such a grammar is called the reachable_grammar. To produce a reaching grammar, first we need to find what nonterminals are reachable from the expansion of a given nonterminal. A nonterminal <A> is reachable from another nonterminal <B> if and only if one of the expansion rules for <B> contains the nonterminal <A> or it is reachable from one of the nonterminals in the expansion rules of <B> Note that it is not enough to be the same nonterminal. That is, (for e.g.) <A> is not reachable from <A> if expansion rules of <A> contain only terminal symbols.

We can now collect it in a data structure for easier access

That is, if we want to know the nonterminals that are reachable from <integer>, we can simply do

That is, only <digit> and <integer> are reachable from the expansion of nonterminal <integer>

Reachable positions.

Next, given an evocative subtree, we want to find what tokens of the grammar can actually embed such a subtree. We call the top node of an evocative subtree its characterizing node, and the nonterminal the characterizing nonterminal of the evocative subtree.

Say we assume that <factor> is the characterizing nonterminal. Here are the locations in grammar where one can embed the evocative subtree.

Insertion Grammar

Given the insertion locations, can we produce a grammar such that we can guarantee that at least one instance of evocative subtree can be inserted? To do that, all we need to guarantee is that the start node in any derivation tree can reach the characterizing nonterminal. For now, let us call our fault F1. Let us indicate that our start symbol guarantees reachability of characterizing nonterminal by specializing it. So, our new start symbol is <start F1>

Next, for the start symbol to guarantee reachability to characterizing nonterminal, all that we need to ensure is that all the expansion rules of start can reach the characterizing nonterminal. On the other hand, for a guarantee that an expansion rule can reach the characterizing nonterminal, all that is required is that one of the nonterminals in that rule guarantees reachability. We start with a few helper functions

Defining the reachable_key()

It is used as follows

Pattern Grammar

Next, we need to ensure that our evocative subtree can form a unique subtree. For that, all we need to do is that all nodes in the subtree are named uniquely. Not all nodes in the evocative subtree needs unique names however. DDSet produces trees such that some nodes in the tree are left abstract. We leave these with the original node names.

Using

We can extract a unique pattern grammar from this tree.

We can convert this to grammar, but first, to display the grammar properly, we define display_grammar()

class DisplayGrammar:
    def __init__(self, grammar, verbose=0):
        self.grammar = grammar
        self.verbose = verbose

def is_nonterminal(self, key):
        return fuzzer.is_nonterminal(key)

def display_token(self, t):
        return t if self.is_nonterminal(t) else repr(t)

def display_rule(self, rule, pre):
        if self.verbose > -2:
            v = (' '.join([self.display_token(t) for t in rule]))
            s = '%s|   %s' % (pre, v)
            print(s)

def display_definition(self, key, rule_count):
        if self.verbose > -2: print(key,'::=')
        for rule in self.grammar[key]:
            rule_count += 1
            if self.verbose > 1:
                pre = rule_count
            else:
                pre = ''
            self.display_rule(rule, pre)
        return rule_count

def recurse_grammar(self, grammar, key, order, undefined=None):
        undefined = undefined or {}
        rules = sorted(grammar[key])
        old_len = len(order)
        for rule in rules:
            for token in rule:
                if not self.is_nonterminal(token): continue
                if token not in grammar:
                    if token in undefined:
                        undefined[token].append(key)
                    else:
                        undefined[token] = [key]
                    continue
                if token not in order:
                    order.append(token)
        new = order[old_len:]
        for ckey in new:
            self.recurse_grammar(grammar, ckey, order, undefined)
        return undefined

def sort_grammar(self, grammar, start_symbol):
        order = [start_symbol]
        undefined = self.recurse_grammar(grammar, start_symbol, order)
        return order, [k for k in self.grammar if k not in order], undefined

def display_unused(self, not_used, r):
        if not_used and self.verbose > -1:
            print('[not_used]')
            for key in not_used:
                r = self.display_definition(key, r)
                if self.verbose > 0:
                    print(k, r)

def display_undefined(self, undefined):
        if undefined and self.verbose > -1:
            print('[undefined keys]')
            for key in undefined:
                if self.verbose == 0:
                    print(key)
                else:
                    print(key, 'defined in')
                    for k in undefined[key]: print(' ', k)

def display_summary(self, k, r):
        if self.verbose > -1:
            print('keys:', k, 'rules:', r)

def display(self, start):
        rule_count, key_count = 0, 0
        order, not_used, undefined = self.sort_grammar(self.grammar, start)
        print('[start]:', start)
        for key in order:
            key_count += 1
            rule_count = self.display_definition(key, rule_count)
            if self.verbose > 0:
                print(key_count, rule_count)

self.display_unused(not_used, rule_count)
        self.display_undefined(undefined)
        self.display_summary(key_count, rule_count)

def display_grammar(grammar, start, verbose=0):
    DisplayGrammar(grammar, verbose).display(start)

Using it.

We define pattern_grammar() that wraps both calls.

Using it.

We define a procedure to reverse our pattern grammar to ensure we can get back our tree.

The tree can be recovered thus.

Given the reaching grammar and the pattern grammar, we can combine them to produce the complete grammar

Using it

Here, you will notice a problem: There are a few nonterminals such as <integer F1> that do not have a definition. That is, it has no expansion (not even empty expansion). So, any rule that uses it will also by definition have no possible expansions. This has consequences during generation, forcing us to abandon partially constructed trees. Hence, we define a grammar_gc()

Cleanup of the grammar

Let us make sure that we are operating on a copy of the grammar.

Next, we find the empty keys in the grammar that do not have a definition.

Now, we need to remove such empty definitions, which also means any rules that refer to the corresponding nonterminals also have to be removed. The remove_nonterminal function takes a nonterminal, and removes its references from the given grammar.

When removing a rule, it can also happen that the corresponding nonterminal may be left with no further rules. Hence, we need to define finding and removing empty nonterminals from the grammar recursively.

These gives us the grammar_gc()

At this point we are ready to define our atleast_one_fault_grammar()

The new grammar is as follows

This grammar is now guaranteed to produce at least one instance of the characterizing node.

This seems to work. However, there is a wrinkle. Note that we set the evocative subtree as this subtree pattern[1][0][1][0][1][0], which is a <factor> node. But why not any of the other subtrees? for example, why not evocative_pattern itself? Indeed, the effectiveness of our specialized grammar is dependent on finding as small a evocative subtree as possible that fully captures the fault. So, how do we automate it? The idea is fairly simple. We start from the root of the evocative pattern. We know that this pattern fully captures the predicate. That is, any valid input generated from this pattern by filling in the abstract nodes is (statistically) guaranteed to produce the failure. Next, we move to the children of this node. If the node is abstract, we return immediately. If however, the child is not abstract, we let the child be the evocative subtree, and try to reproduce the failure. If the failure is not reproduced, we move to the next child. If the failure is reproduced we recurse deeper into the current child.

Usage

That is, we found the correct evocative subtree. We can confirm this by comparing it against the subtree we identified manually.

We save this subtree for later use.

ETREE_DPAREN = ('<factor>',
        [('(', [], {'abstract': False}),
         ('<expr>',
             [('<term>',
                 [('<factor>',
                     [('(', [], {'abstract': False}),
                      ('<expr>', [('<term>',
                          [('<factor>',
                              [('<integer>',
                                  [('<digit>',
                                      [('4', [],
                                          {'abstract': False})],
                                      {'abstract': False})],
                                  {'abstract': False})],
                              {'abstract': False})],
                          {'abstract': False})],
                          {'abstract': True}),
                      (')', [],
                          {'abstract': False})],
                      {'abstract': False})],
                 {'abstract': False})],
             {'abstract': False}),
         (')', [], {'abstract': False})], {'abstract': False})

Here is another evocative pattern, but we define a different predicate.

Check the assertion

We first parse the input as before

Then reduce input

Finally, extract the evocative pattern.

Then extract the evocative subtree

prettyprinting

using

We save this evocative pattern for later use.

The new grammar is as follows

This grammar is now guaranteed to produce at least one instance of a divide by zero.

At this point, we have the ability to guarantee that a evocative fragment is present in any inputs produced. In later posts I will discuss how combine multiple such fragments together using and, or or negation. I will also discuss how to ensure at most one fragment in the input and exactly one fragment or exactly n fragments. As before, the runnable source of this notebook can be found here

Artifacts

The runnable Python source for this notebook is available here.

The installable python wheel gatleastsinglefault is available here.

Specializing Context-Free Grammars for Inducing Faults

Contents