Weak, Firm, and Strong Mutation

In general, methods for evaluating a test suite have focused on two measures, the extent of testing (commonly called coverage), and the strength of test cases in enforcing specifications.

For the first measure, a number of coverage measures have been proposed in the past, including statement coverage (and weaker coverage measures like function, class etc.), branch coverage, path coverage, condition coverage, data flow coverage, etc. The main idea behind these coverage measures is to determine how much of the program graph is traversed. While it is easy to show that for adequate test suites (adequate test suite with respect to a particular coverage measure is a test suite that is able to achieve 100% coverage in that measure.), path coverage > branch coverage > statement coverage (that is, a test suite with 100% branch coverage is guaranteed to provide 100% statement coverage, but not necessarily 100% path coverage.), the situation for non-adequate test suites are more complex as we showed earlier (Gopinath 2014).

However, few measures have targeted the second part. That is actually evaluating the quality of oracles (test assertions). It is for this reason that mutation analysis is important. The strong mutation analysis can effectively tell you whether your oracles are good or bad. (You can see the difference in manual oracles and automatic oracles in our previous research on code coverage – even though test suites generated by Randoop scored high on even the hardest coverage measure of all – path coverage – its mutation scores are low compared to manual test suites of similar coverage.)

To recap, strong mutation analysis (also called traditional mutation analysis) generates all first order variants of the program (programs that are different by a single token from the original), runs the test cases against each, and reports the percentage of such variants that were detected by the test suite. That is, any injected fault needs to be encountered by a test case, its effect propagated, and finally detected by the assertion as a difference from the expected behavior.

Weak mutation analysis was proposed by Howden (Howden 1982) as a low cost alternative to traditional mutation analysis. What weak mutation analysis does is different. For a weak killing of a mutant, the last condition of actual detection by an assertion is not required. If the mutant is able to change the internal state of the program at all, it is deemed to have been weakly detected ¹. Firm mutation is in between strong and weak mutation. For firm mutation, It is not expected that an assertion catches the difference in behavior. However, unlike weak mutation, the change needs to propagate some distance from its place of origin (typically checked at lexical boundaries).

The difference that such a change can make is enormous. That is we no longer need to execute each mutant in full. Rather, all we need to do is, at the site of the mutation, note which of the mutants would have resulted in a change to the internal state, mark them as killed, and proceed with the execution. That is, a single test run is sufficient for evaluation of the weak mutation score. (Papadakis and Malevris suggests using concolic tracing to implement weak mutation testing. The essential idea is, after collecting the trace, which gives you the symbolic expressions along the full execution path – this should be satisfiable, replace the expression corresponding to the original with the mutated expression, and check if the path is still satisfiable. If not satisfiable, the mutation was weakly killed.) Unfortunately, it also means that weak mutation analysis no longer belongs under the second category. Rather, it is just another coverage measure, albeit a rather strong one (similarly firm mutation is also just another coverage measure).

If you are confused between the two different kinds of measures, there is an easy test. Say you have a test suite with assertions, and you have some value computed by the measure under computation for its effectiveness. Say you change all assertions such as assert(expression) in the test suite to assert(expression || true). Does the value of the effectiveness computed by the measure change? If it changes, then the measure under consideration is a measure of specification strength.

A coverage measure that actually manages to check assertions to some extent is checked coverage (Schuler 2013), which was proposed by Schuler in his PhD. thesis. Checked coverage, as defined by Schuler is a dynamic slice of statements that actually flow into any of the assertions in the test suite. Unfortunately while it is more effective than other methods, it still does not measure the strength of specifications, and hence still a coverage measure. A more recent research by Murugesan et al. (Murugesan 2015) provides a stronger checked coverage that actually checks the results of assertions. One problem with this approach is that it only ensures that the covered line has an impact on the result of assertions. It provides no guarantees that all changes to the line will be caught by the assertion (which as far as I know is still only provided by the mutation analysis).

In particular, say P is a program, C is a simple component of P, and C' is the mutated component. Then, weak mutation killing of C' by a test case t requires that t executes C, and in at least one such execution, C' computes a different value than C. The components are not precisely defined, but are provided by examples: (1) variable reference, (2) variable assignment, (3) arithmetic expressions (4) arithmetic relations and (5) Boolean expressions. ↩