Mutation analysis is the primary means of evaluating the quality of test suites, though it suffers from inadequate standardization. Mutation analysis tools vary based on language, when mutants are generated (phase of compilation), and target audience. Mutation tools rarely implement the complete set of operators proposed in the literature, and most implement at least a few domain-specific mutation operators. Thus different tools may not always agree on the mutant kills of a test suite, and few criteria exist to guide a practitioner in choosing a tool, or a researcher in comparing previous results. We investigate an ensemble of measures such as traditional difficulty of detection, strength of minimal sets, diversity of mutants, as well as the information carried by the mutants produced , to evaluate the efficacy of mutant sets. By these measures, mutation tools rarely agree, often with large differences, and the variation due to project, even after accounting for difference due to test suites, is significant. However, the mean difference between tools is very small indicating that no single tool consistently skews mutation scores high or low for all projects. These results suggest that research using a single tool, a small number of projects, or small increments in mutation score may not yield reliable results. There is a clear need for greater standardization of mutation analysis; we propose one approach for such a standardization.
- The surface mutants in this paper is actually the minimal mutants from Ammann et al. (Ammann 2014), and the disjoint mutants from Kintis et al. (Kintis 2010). The minimal mutants in this paper starts by minimizing the test suite, and hence different from minimal mutants from Ammann et al.
- The definition of mutation subsumption in the paper is flipped. That is, a mutant dynamically subsumes another if all test cases that kills the former is guaranteed to kill the later, and the mutant is killed by the test suite.