Grammar Coverage with K-Paths for Syntax Testing

TL;DR This tutorial explains how to design test cases that effectively cover features of a given context-free grammar using the k-path strategy.

Note: The k-path strategy discussed here is unrelated to the k-path cover from graph theory.

Definitions

For this post, we use the following terms as we have defiend previously:

The alphabet is the set all of symbols in the input language. For example, in this post, we use all ASCII characters as alphabet.
A terminal is a single alphabet symbol. For example, x is a terminal symbol.
A nonterminal is a symbol outside the alphabet whose expansion is defined in the grammar using rules for expansion. For example, <term> is a nonterminal symbol.
A rule is a finite sequence of terms (two types of terms: terminals and nonterminals) that describe an expansion of a given terminal. For example, [<term>+<expr>] is one of the expansion rules of the nonterminal <expr>.
A definition is a set of rules that describe the expansion of a given nonterminal. For example, [[<digit>,<digits>],[<digit>]] is the definition of the nonterminal <digits>
A context-free grammar is composed of a set of nonterminals and corresponding definitions that define the structure of the nonterminal. The grammar given below is an example context-free grammar.
A terminal derives a string if the string contains only the symbols in the terminal. A nonterminal derives a string if the corresponding definition derives the string. A definition derives the string if one of the rules in the definition derives the string. A rule derives a string if the sequence of terms that make up the rule can derive the string, deriving one substring after another contiguously (also called parsing).
A derivation tree is an ordered tree that describes how an input string is derived by the given start symbol. Also called a parse tree.
A derivation tree can be collapsed into its string equivalent. Such a string can be parsed again by the nonterminal at the root node of the derivation tree such that at least one of the resulting derivation trees would be the same as the one we started with.

Definitions
Syntax Based Testing
Artifacts

Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.

Available Packages

These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (`pkg.whl`) and install using `pip install pkg.whl`.

Syntax Based Testing

When the abstract model of a program is a graph, to test such programs effectively, one uses the syntax based testing. Similarly, if the input domain of a program is a context-free language, to test such programs effectively, one needs to rely on syntax based testing.

As Havricov et al. ¹ discusses, a high variation of inputs is desirable because it can induce a high variation in how programs behave, which is highly desirable. In particular, if some feature of the input domain is not present in the test cases, the corresponding code that processes that element may not get exercised. Finally, some elements may be processed only by the combination of input elements in specific order. In all such cases, producing inputs that cover all combinations is desirable. The traditional test design objectives ² for syntax based testing are as follows:

Covering all terminal symbols in the context-free grammar
Covering all definitions in the context-free grammar
Covering all rules in the context-free grammar (subsumes the above two).

However, note that none of them actually produce inputs that are a combination of input features. Hence, to ensure that we can produce such inputs, Havricov et al. ¹ introduces a new measure of test obligations called k-path.

K-Paths

A K-path is a sequence of expansions with K nonterminal symbols. A 1-path is a path where there is only a single nonterminal symbol is involved. So, a test suite designed with all 1-path criteria is same as one defined with all definitions obligation. A 2-path is a path with exactly two nonterminal symbols that are expanded consecutively. For example, say you have a definition such as

<digits>:[
               [<digit><digits>],
               [<digit>],
],

<digit>: [[1],[0]]

and a derivation tree that looks like

('<digits>', [
          ('<digit>', [('1', [])])
          ])

Such a tree is an instance of a 2-path, which is [<digits>, <digit>]. In a tree such as

('<digits>', [
          ('<digit>', [('1', [])]),
          ('<digits>',
              [('<digit>', [('1', [])])])
          ])

we have one 3-path, which is [<digits>,<digits>, <digit>]. We also have two 2-paths [<digits>, <digit>], [<digits>,<digits>] and [<digits>, <digit>] , but there are only two unique 2-paths.

So, given an input, ho do we compute the k-paths in that input?

Computing k-paths

Let us start with an input

let us visualize it

First, we write a function to get all paths in a given tree from root to leaf.

Let us test it

Given a single path, we can compute all the k-sequences in it

def get_k_sequences(path): if not path: return []

sequences = []
n = len(path)

for length in range(1, n + 1):
    for start in range(n - length + 1):
        sequences.append(path[start:start + length])

return sequences Let us test it again.

Now, computing k-paths is simply a matter of getting all paths, and generating all sequences from it.

Let us test it

Possible k-paths from grammar

So, the question is, how to compute the number of k-paths from grammar? The basic idea is to first find the 1-paths from a grammar, which is the set of all nonterminal symbols. Then, find all possible nonterminals which contain a given nonterminal symbol in one of the rules in the definition. Let us call it the parent set of a given nonterminal The set of all possible parent along with the child gives us the 2-paths in the grammar. We can extend a n-path to n+1 path by adding the parents to the n-path.

We stat by defining a function parents() which takes in a grammar, and identifies possible parent nodes for a given nonterminal symbol.

That is, given the expression grammar

The parents are

Next, we define a function to compute all k-paths in a grammar.

Using it:

We can now tie both together, allowing us to return all k_paths of a particular k.

Using

So, how many k-paths are in the grammar? We now have to eliminate duplicates

Using

At this point, we have demonstrated both how to obtain the k-paths from a given set of inputs, and finding the total number of k-paths from a grammar for a particular k.

Obtaining test cases for coverage.

Now that we have the k-paths, how do we obtain the test cases? For that, we first define a procedure that given a a node of the tree, and the parent nonterminal symbol (key), looks through the rules of the parent key, and identifies one of the rule which contains the root nonterminal of the given node as a token. Given such a rule, we can embed the current node, forming a partial tree.

using it

Next, given a k-path, we want to make it into a tree.

Using it.

Let us define a method to display such trees

Using it

Filling the partial tree

Using it

Let us make it into a test case

Using it

Artifacts

The runnable Python source for this notebook is available here.

Havrikov, Nikolas, and Andreas Zeller. “Systematically covering input structure.” 2019 IEEE/ACM international conference on Automated Software Engineering (ASE). IEEE, 2019. ↩ ↩²
Paul Purdom. “A Sentence Generator for Testing Parsers.” BIT Numerical Mathematics, 1972 ↩

Definitions

Contents

Syntax Based Testing

K-Paths

Computing k-paths

Possible k-paths from grammar

Obtaining test cases for coverage.

Artifacts