One of the key challenges of developers testing code is determining a test suite’s quality – its ability to find faults. The most common approach is to use code coverage as a measure for test suite quality, and diminishing returns in coverage or high absolute coverage as a stopping rule. In testing research, suite quality is often evaluated by a suite’s ability to kill mutants (artificially seeded potential faults). Determining which criteria best predict mutation kills is critical to practical estimation of test suite quality. Previous work has only used small sets of programs, and usually compares multiple suites for a single program. Practitioners, however, seldom compare suites — they evaluate one suite. Using suites (both manual and automatically generated) from a large set of real-world open-source projects shows that evaluation results differ from those for suite-comparison: statement (not block, branch, or path) coverage predicts mutation kills best.

Updates:

Laura Inozemtseva and Reid Holmes published the paper Coverage is not strongly correlated with test suite effectiveness at the same ICSE 2014 conference. Their title seems to be counter to the findings in this paper. Why is it so? Here are my observations:

  1. Inozemtseva et al.’s central thesis is that normalized mutation score is not highly correlated with coverage.
  2. The normalized mutation score is computed by Inozemtseva et al. as #mutants killed / #mutants covered.
  3. However, note that #covered mutants is directly determined by statement coverage. That is, it depends only on the average number of mutants per line along with statement coverage.

Normalization is done according to Inozemtseva et al. to remove the effect of coverage (Sec 3.6 2ed para). So, removing the effect of coverage, if that attempt was successful, should result in a number that is not correlated with coverage (by construction).

So, this means that:

  1. A finding that there is no correlation between the normalized mutation score and coverage would be a tautology. This is the expected result.
  2. The finding that there is a moderate correlation between the two just means that the effect of coverage was not completely removed by division. That is, there was some non-linear component involved.