Mutation Testing

What is mutation testing?

Mutation testing is a white-box testing technique for evaluating the quality of a test suite. It works by creating deliberately incorrect versions of an implementation, called mutants, and running the test suite against each one. If at least one test fails for every mutant, that is, if the suite kills all mutants, then it is considered a good test suite.

Several tools automate the process of mutation testing. In this course, we will use PIT, a tool that automatically performs mutation testing on Java programs with JUnit tests.

Understanding the PIT tool report

Instructions to set up and run the tool are part of Lab1 (see Pawtograder). This section explains how to interpret the report generated by the tool so that we can improve our test cases.

Recall that a mutation testing tool works by generating several incorrect versions of your original implementation. Each incorrect version, called a mutant, is produced by making a single small change to the code (e.g., negating a condition). A test suite is considered good if, for every mutant, at least one test fails.
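To make the idea of a mutant concrete, here is a minimal sketch in plain Java. The class MutantDemo, the method isEven, and its hand-written mutant are hypothetical names made up for this illustration. PIT generates its mutants automatically and runs real JUnit tests, so this only imitates the idea by hand.

```java
import java.util.function.IntPredicate;

class MutantDemo {
    // Original implementation (hypothetical example).
    static boolean isEven(int n) {
        return n % 2 == 0;
    }

    // Hand-written mutant: the condition is negated (== became !=).
    static boolean isEvenMutant(int n) {
        return n % 2 != 0;
    }

    // Stand-in for a unit test: returns true if the test passes
    // against the implementation it is given.
    static boolean testFourIsEven(IntPredicate impl) {
        return impl.test(4);
    }

    public static void main(String[] args) {
        System.out.println("passes on original: " + testFourIsEven(MutantDemo::isEven));
        System.out.println("passes on mutant:   " + testFourIsEven(MutantDemo::isEvenMutant));
    }
}
```

The test passes on the original and fails on the mutant, so this mutant would count as killed.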

To understand the report, we first need to understand the following terms for a mutant:

Mutant Killed: At least one test case failed for the mutant.
Mutant Survived: At least one test case executed the mutated code, but no test case failed.
No Coverage: No test case executed the mutated code.

In the report, we should focus on the following metrics:

Test Strength: This is the ratio of the number of mutants killed to the number of mutants that were executed by at least one test (killed plus survived).
Mutation Coverage: This is the ratio of the number of mutants killed to the total number of mutants generated.
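As a worked example of the two ratios, the sketch below computes them from made-up counts. It assumes every mutant falls into exactly one of the three statuses described above. The class and method names are hypothetical and are not part of PIT.

```java
class MutationMetrics {
    // Test strength: killed mutants over mutants that some test executed
    // (killed + survived), as a percentage.
    static double testStrength(int killed, int survived) {
        return 100.0 * killed / (killed + survived);
    }

    // Mutation coverage: killed mutants over all generated mutants,
    // as a percentage.
    static double mutationCoverage(int killed, int survived, int noCoverage) {
        return 100.0 * killed / (killed + survived + noCoverage);
    }

    public static void main(String[] args) {
        // Made-up counts for illustration: 10 mutants generated in total.
        int killed = 6, survived = 2, noCoverage = 2;
        System.out.printf("Test strength: %.0f%%%n", testStrength(killed, survived));
        System.out.printf("Mutation coverage: %.0f%%%n",
                mutationCoverage(killed, survived, noCoverage));
    }
}
```

With these counts, test strength is 6/8 = 75% and mutation coverage is 6/10 = 60%. Mutation coverage can never be higher than test strength, because its denominator also includes the mutants with no coverage.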

What do these metrics indicate? A low test strength implies that all tests in the current test suite pass even when parts of our implementation are changed incorrectly. For example, if we incorrectly change an if condition in one of our methods and all our tests still pass, something is wrong: at least one test should have failed as a result of this change. If all our tests pass despite incorrect changes to our code, it means one of two things: (1) we need to write stronger tests for those parts, or (2) those parts of the code are redundant and should be removed. On the other hand, a low mutation coverage implies that there are parts of our code for which we haven't written any tests at all. Hence, we should write more tests.
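The sketch below illustrates a mutant that survives a weak test and is killed once a stronger test is added. The grading rule, class name, and test names are hypothetical. The tests are plain methods that return true when they pass, standing in for JUnit tests.

```java
import java.util.function.IntPredicate;

class SurvivingMutantDemo {
    // Original: a score of 60 or above passes (hypothetical rule).
    static boolean passed(int score) {
        return score >= 60;
    }

    // Mutant: the boundary is changed from >= to >.
    static boolean passedMutant(int score) {
        return score > 60;
    }

    // Existing test: only checks values far from the boundary.
    static boolean testClearCases(IntPredicate impl) {
        return impl.test(90) && !impl.test(10);
    }

    // Added test: checks the boundary value itself.
    static boolean testBoundary(IntPredicate impl) {
        return impl.test(60);
    }

    public static void main(String[] args) {
        // The existing test executes the mutated line but still passes,
        // so the mutant survives.
        System.out.println("clear cases pass on mutant: "
                + testClearCases(SurvivingMutantDemo::passedMutant));
        // The added test fails on the mutant, so the mutant is now killed.
        System.out.println("boundary test passes on mutant: "
                + testBoundary(SurvivingMutantDemo::passedMutant));
    }
}
```

Both tests pass on the original method. Only the boundary test tells the original and the mutant apart, which is why adding it raises test strength.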

Ideally, we should have 100% for both mutation coverage and test strength. However, we want to focus on test strength first to make sure our existing tests are comprehensive. Once we have good test strength, we can turn to mutation coverage to see whether we need to write additional tests.