How hard does mutation analysis have to be, anyway?

Mutation analysis is considered the best method for measuring the adequacy of test suites. However, the number of test runs required for a full mutation analysis grows faster than project size, which is not feasible for real-world software projects, which often have more than a million lines of code. It is for projects of this size, however, that developers most need a method for evaluating the efficacy of a test suite. Various strategies have been proposed to deal with the explosion of mutants. However, these strategies at best reduce the number of mutants required to a fraction of overall mutants, which still grows with program size. Running, e.g., 5% of all mutants of a 2MLOC program usually requires analyzing over 100,000 mutants. Similarly, while various approaches have been proposed to tackle equivalent mutants, none completely eliminate the problem, and the fraction of equivalent mutants remaining is hard to estimate, often requiring manual analysis of equivalence.

In this paper, we provide both theoretical analysis and empirical evidence that a small constant sample of mutants yields statistically similar results to running a full mutation analysis, regardless of the size of the program or similarity between mutants. We show that a similar approach, using a constant sample of inputs can estimate the degree of stubbornness in mutants remaining to a high degree of statistical confidence, and provide a mutation analysis framework for Python that incorporates the analysis of stubbornness of mutants.

Updates:

One can simplify, and reach our conclusions in this paper by noting that, theory of random sampling only requires randomness in the sample selection, and not in the population. That is, even if the population contains strongly correlated variables, so long as the sampling procedure is random, one can expect the sample to obey statistical laws. In particular, the correlatedness of the mutants results in same mean, but lesser variance than if we were sampling from a set of mutants that were independent of each other.
Further, we recommend that one should sample at least 9,604 mutants for 99% precision 95% of the time, as suggested by theory for independent mutants simply because it is the pessimistic upper bound, and it should be used for research.