Can testedness be effectively measured?

Among the major questions that a practicing tester faces are deciding where to focus additional testing effort, and deciding when to stop testing. Test the least-tested code, and stop when all code is well-tested, is a reasonable answer. Many measures of “testedness” have been proposed; unfortunately, we do not know whether these are truly effective.

In this paper we propose a novel evaluation of two of the most important and widely-used measures of test suite quality. The first measure is statement coverage, the simplest and best-known code coverage measure. The second measure is mutation score, a supposedly more powerful, though expensive, measure.

We evaluate these measures using the actual criteria of interest: if a program element is (by these measures) well tested at a given point in time, it should require fewer future bug-fixes than a “poorly tested” element. If not, then it seems likely that we are not effectively measuring testedness. Using a large number of open source Java programs from Github and Apache, we show that both statement coverage and mutation score have only a weak negative correlation with bug-fixes. Despite the lack of strong correlation, there are statistically and practically significant differences between program elements for various binary criteria. Program elements (other than classes) covered by any test case see about half as many bug-fixes as those not covered, and a similar line can be drawn for mutation score thresholds. Our results have important implications for both software engineering practice and research evaluation.

Updates:

In this paper we found that coverage and mutation score of a program element has only a weak negative correlation with future bug fixes per line in that element. In retrospect, comparing the future bug fixes per line was a mistake. We should have compared the coverage and mutation score of the element with the total future bugfixes of that element.

Crucially, our research is the first to provide evidence that the mutation score is related to the residual defect density of the program. That is, the number of live mutants remaining is related to the actual bugs remaining in the program.