Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, I'm Rafael, and my talk is titled Mutation Testing with PIT. But first, a few words about myself. I am a cloud-native team lead at Hazelcast; before that, I worked at Google and CERN. I'm also the author of the book Continuous Delivery with Docker and Jenkins, and from time to time I do conference speaking and trainings, but on a daily basis I'm a developer, and I live in Krakow, Poland. A few words about Hazelcast. Hazelcast is a distributed company, and it is distributed in two senses. The first sense is that we are a distributed company because we produce distributed software. But the second sense is that we are a distributed company because we all work remotely; we have always worked remotely. So I'm from Krakow, Poland, but I'm the only one from Krakow; we have people all over the world. Our products are Hazelcast IMDG, an in-memory data grid, which is a library for distributed computation and caching; Hazelcast Jet, a library for stream processing; and Hazelcast Cloud, which is Hazelcast as a service.
So do you know why a NASA spacecraft burned in the atmosphere of Mars in 1999? And do you know that it is somehow related to the fact that even though mutation testing was discovered in 1971, it was not really used until 2012? The answers to these questions, as well as the whole idea and implementation of mutation testing, you will find in this presentation.
But first, imagine the following scenario. Two engineers are talking. "I wrote code for the spacecraft." "How do you know that it works?" "I just know. Or I feel it." That is actually what happened at NASA: the code of the NASA spacecraft was not tested. In December 1998, NASA launched the Mars Climate Orbiter spacecraft into space. This device weighed almost half a ton and was sent to Mars. It takes around half a year, some six months, to go from Earth to Mars. And everything was fine until September 1999, when all of a sudden contact with the device was lost. What happened? There was a bug in the code. The ground computer calculated everything in non-metric units; it used pounds. However, the orbiter used the metric system, and there was no conversion in between. So the speed of the orbiter was too high, the orbiter passed too quickly through the atmosphere of Mars, and it burned. There were no people on board, so we could say,
okay, shit happens, it's just money. But you know what? These NASA devices are super expensive. This one cost more than 300 million dollars; that's around 234 Americans working all their lives to pay for this orbiter. In the case of Poland, it's even worse: you would need around 700 Polish people working all their lives to earn this money. And all that just because of a bug in the code. So how should this conversation have looked? "I wrote code for the spacecraft." "How do you know that it works?" "I wrote unit tests." "But how do you know that your tests work?"
But really: we write code, and then we write tests to test the code. And then we write tests to test the tests of the code, and then tests to test the tests of the tests of the code. And it just doesn't make sense. So if we are not sure that our tests are good, does testing make any sense at all? And actually, yes, there is a method of testing your tests without writing more tests.
So imagine we have the following code, just the simplest possible code: return a + b. This could be the production code of, I don't know, some calculator. So we write a unit test. How do we check that our unit test works? We could run test coverage. But what does code coverage actually check? According to common sense, or according to the guru of common sense, Uncle Bob: coverage does not prove that you have tested every line. All it proves is that you have executed every line. And that is a big difference. If you think about it, this return statement would be perfectly covered by a test without any assertion. Just imagine such a test: with a test like this, we have 100% coverage, but we haven't tested anything.
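As a sketch of what was on the slide (the class and test names here are my own illustration, not taken from the talk), the production code and the assertion-free test could look like this:

    // Calculator.java -- production code: the simplest possible method
    public class Calculator {
        public static int sum(int a, int b) {
            return a + b;
        }
    }

    // CalculatorTest.java -- executes the line but asserts nothing,
    // so it yields 100% line coverage while verifying nothing
    import org.junit.jupiter.api.Test;

    class CalculatorTest {
        @Test
        void testSum() {
            Calculator.sum(1, 2); // no assertion
        }
    }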
We need something better than code coverage. And actually, there is something better. It was discovered by Richard Lipton in 1971. Richard Lipton asked a more fundamental question: why do we write tests? And he came to the conclusion that we write tests to detect bugs. Think about it for a moment: if you are sure that your code does not have any bugs, then there is no reason to write tests. So a test is good when it catches bugs. We could therefore reverse this strategy: introduce artificial bugs, and check whether our tests detect these artificial bugs. And that's what Richard Lipton suggested in his paper. He mentioned that if you want to know whether a test suite has properly checked some code, introduce a bug. How do we do that in practice?
So let's go back to our example, the simplest example possible. We have our statement, a + b. How can we introduce a bug here? We could reverse the operation and change the plus to a minus. That is clearly a bug, because we wanted a sum, but we actually subtract the values. So what I did here is create a mutation of the production code with an artificial bug. And now we can check whether our test suite fails on this bug. If yes, our tests are good. If no, our tests are useless. You may ask: is that the whole idea behind mutation testing? And actually, yes, that's the whole idea of mutation testing.
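Written out in source (PIT actually applies such changes to the bytecode at run time; this hand-written version of the hypothetical Calculator above is only for illustration), the mutant would be:

    public class Calculator {
        public static int sum(int a, int b) {
            return a - b; // mutation: '+' replaced with '-'
        }
    }
    // The assertion-free test shown earlier still passes against this
    // mutant, so the mutant survives and exposes the useless test.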
But before we go any further, let's set up the terminology we will use. The artificial bug is called a mutation operation. Code with the artificial bug is called a mutant. When a test fails on a mutant, we say it killed the mutant. However, if the tests succeed even though we introduced this bug, even though we created a mutant, we say that the mutant survived. Coming back to our example: if our mutant, a - b, is killed, then our tests are good. If it survives, our tests are bad. So in this case, killing is good: if all the mutants are killed, our test suite is perfectly fine.
Now, the first thing you may think is: okay, but my code is much more complex than just adding two numbers. And that is why there are a lot of different mutation operations, so many that we even put them into categories. The first way you can mutate your production code is with math changes: we change a plus to a minus, a multiply to a divide, a minus to a plus; we change all the math operations. Second, we change conditional boundaries: a < b becomes a <= b. We can also negate conditions. And there are more complex mutation operations, like removing if statements, removing method calls, modifying return statements, modifying constants, and even more, as the fragments below illustrate.
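A few illustrative fragments (the identifiers are made up; PIT's actual operator set is documented at pitest.org):

    class MutationExamples {
        int total(int price, int quantity) {
            return price * quantity; // math mutation: '*' becomes '/'
        }

        boolean isMinor(int age) {
            return age < 18;         // boundary mutation: '<' becomes '<='
                                     // negation mutation: becomes '!(age < 18)'
        }

        int balance(int amount) {
            return amount;           // return-value mutation: e.g. returns 0 instead
        }
    }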
So now, what do we do? We have our production code base, the whole code base. We use mutation operations to generate mutants, and we actually create a lot of them, because there are a lot of mutation operations and our code is big. Now, if for every mutant at least one test fails, we killed all the mutants, so it's all good. However, if there is at least one mutant that survived all our test scenarios, it means that we didn't cover that code with any test, so that is bad.
The next thing you may ask is: do I need to change the code on my own? Luckily, no. For Java, there is a very good library called PIT. And apart from being a good library, it has a great logo, one of the best logos in my personal ranking of logos; it goes right after Docker's. The logo with this bird is great. You can use this PIT mutation testing tool from the command line, you can integrate it with Maven or Gradle, or you can use it as a plugin for IntelliJ. I actually always use it with my IntelliJ: just click OK and check whether my tests are good.
So let's see our example again. We have our calculator method, and we have our test, which provides 100% coverage but actually tests nothing. So how does it look in practice? Let's see a short demo of how to run PIT mutation testing on this simple example. We have just one class with the calculator, exactly what we've seen on the slide, the simplest possible code. And we have a unit test for it; our unit test provides 100% coverage, but it tests nothing. We can run the tests; obviously they pass. They would pass even if we had no production code at all. We can even check the test coverage, and yes, it's 100%: class, method, line. Everything is perfect.
So now we need to improve this process of checking our tests, and we will use PIT. I already have the PIT plugin installed in my IntelliJ, so I can open Edit Configurations and add a configuration for a PIT mutation testing runner. When I edit it, I need to specify a few parameters. I can give it a name, but that doesn't matter much. I can specify the target classes, the source directory, and the report directory, which, if you look closer, just specifies where the PIT report should be generated. And that's basically it. If I click OK, I can run this mutation testing framework and see the results. Here, it created two mutants automatically, and you can see they were generated but not killed, which is obviously bad. So, coming back to the slides, what we've seen is the result of our PIT mutation testing: the mutant was generated, because the plus was changed to a minus, but it survived, meaning our tests are useless. How can we improve this? Obviously, by writing better unit tests. In our case, we can change it to a proper unit test: given some a and b values, when we sum them, then we assert that the result equals three. This looks like a valid unit test. And if we run PIT again, it will produce an output saying that the mutant was generated but killed. Perfect, our unit tests are great.
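The corrected test could look roughly like this (JUnit 5 syntax; the names and values are my reading of the demo, not a verbatim copy of it), replacing the assertion-free test from before:

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    class CalculatorTest {
        @Test
        void sumAddsTwoNumbers() {
            // given
            int a = 1;
            int b = 2;
            // when
            int result = Calculator.sum(a, b);
            // then
            assertEquals(3, result); // fails on the 'a - b' mutant, killing it
        }
    }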
Now, do I need to read the console? I mean, reading the console is not a great way of presenting test results. Luckily, no: PIT provides a very nicely generated HTML report, and let's see how we can use it. So let's continue with the demo, with our corrected test. If we go to our IntelliJ, we have corrected our unit test, and if we run our PIT test runner again, we should see that, okay, the mutants were killed. But apart from the fact that they were killed, at the end of the log we can see "open report in a browser". So we can open the generated report directly from IntelliJ, and in the browser we have a very nice report with the code coverage according to mutation testing. It's a far better code coverage than the standard code coverage. We can browse it by packages and by classes, and we can see what happened with our code. This is perfectly well-covered code according to mutation testing. We can even see that two mutants were created out of this line, and all of them were killed. So this is the output you are looking for.
Okay, but the next thing: when I first heard about mutation testing, I thought the idea is great, I buy it. However, it will never work for a bigger project, because my project is way bigger than just one calculator class; how is it possible that it works? What I tried here, what we actually tried at Hazelcast, was to use it on one of our plugins, the Hazelcast Kubernetes plugin. It has around 5,000 lines of code, so it's still small, but a reasonable size. It has twelve classes, so not a very big code base, but already something useful.
So let's see in the demo how to run the same PIT framework on the Hazelcast Kubernetes plugin. If you go to the Hazelcast Kubernetes page, you can see what we did: if you would like to run PIT from Gradle or from Maven, you need to add the PIT dependencies. In Maven, we added them in a profile. So we have a pit-test profile, and it's enough to add this part, as sketched below; that's everything you need to change in your project to automate it.
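That Maven part could look along these lines (a sketch from memory of the pitest-maven documentation, not copied from the repository; the profile id and version are illustrative, and JUnit 5 projects additionally need the pitest-junit5-plugin dependency):

    <profiles>
      <profile>
        <id>pit-tests</id>
        <build>
          <plugins>
            <plugin>
              <groupId>org.pitest</groupId>
              <artifactId>pitest-maven</artifactId>
              <version>1.4.10</version>
            </plugin>
          </plugins>
        </build>
      </profile>
    </profiles>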
So with this in place, we can see how it works from the command line: we can clone the project, open the project directory, and run the Maven command with our pit-test profile and the PIT mutation coverage goal. This command will generate the report for us. It actually takes some time for PIT to generate it, so maybe I will not show the full run here.
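For reference, the command-line steps look roughly like this (the repository URL and profile name are my sketch of what the demo shows; mutationCoverage is pitest-maven's documented goal):

    git clone https://github.com/hazelcast/hazelcast-kubernetes.git
    cd hazelcast-kubernetes
    mvn -Ppit-tests test org.pitest:pitest-maven:mutationCoverage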
But let's see how we did it. What we did with this Kubernetes plugin is add a GitHub Action that runs this: every time you push to master, this GitHub Action runs, and it executes exactly the same PIT mutation coverage command. Apart from that, we also publish the result to GitHub Pages, so we always have the current result of the mutation tests on our GitHub Pages. That is actually great: you can always go to this GitHub page and see the results. And in this GitHub Action, we can see how it was run; a sketch of such a workflow follows.
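Such a workflow might look like this (a hedged sketch, not the repository's actual file; the action names, versions, and report path are illustrative):

    name: PIT mutation tests
    on:
      push:
        branches: [master]
    jobs:
      pit:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v2
          - uses: actions/setup-java@v1
            with:
              java-version: 11
          # Run the same PIT goal as on the command line
          - run: mvn -Ppit-tests test org.pitest:pitest-maven:mutationCoverage
          # Publish the generated HTML report to GitHub Pages
          - uses: peaceiris/actions-gh-pages@v3
            with:
              github_token: ${{ secrets.GITHUB_TOKEN }}
              publish_dir: target/pit-reports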
It's actually great, because after every push to master we have PIT results. We can have a look at how it runs: it took about two and a half minutes for this GitHub Action to execute all the tests, the mutations, everything. You can see that some mutants were killed, and the report is automatically published to GitHub Pages, so we always have the current PIT test coverage report there. We can open it and see where the things are that are not well covered by our tests. If we look at, for example, this class, it looks poorly covered. We can see: okay, this line looks not covered, right? It looks like this whole method is actually not covered. We can see what happens: if we change it to return null, no test catches it. So that's really bad, and so on and so on. So we have code coverage, but done far better than with the standard code coverage tool.
So I guess I've already convinced you to use it; I mean, it looks great. So the next question is: why didn't the NASA engineers use it in 1998? Richard Lipton discovered this in 1971. The first Java mutation testing framework was called, I guess, Jester, and it was created around 2000, and PIT came in 2012. So why did it actually take so many years for the idea to turn into something that you can use on real code as a developer? It happens that there were two reasons why it was not widely used
for such a long time. And the first reason was the problem with equivalent
mutants. So let's look at this code. This is good code, I mean, it makes sense. But now think about it: if we mutate the second line, the code stays the same; the semantics of the mutated code are identical. So we create a mutant, but no test will ever kill this mutant. We have a false negative, and that is a problem; see the sketch below.
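The slide's code is not visible in the transcript, so here is the classic textbook example of an equivalent mutant (doSomething is a made-up placeholder):

    int index = 0;
    while (true) {
        doSomething(index);
        index++;
        if (index == 10) { // mutant: 'index >= 10'
            break;
        }
    }
    // Because index climbs by exactly 1 from 0, it reaches 10 before it
    // can ever exceed it, so 'index == 10' and 'index >= 10' behave
    // identically: no test can ever kill this mutant.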
Actually, PIT didn't solve this problem, because it is not easy to solve; how do you eliminate these equivalent mutants? However, what PIT did is this: the mutations that are highly likely to generate equivalent mutants are disabled by default. You can still enable them, but we don't want to have false negatives here.
But there was also a second reason why mutation testing was not widely used, and that is slow performance. Because think about it: from our code base, we create a lot of mutants; we can change a lot of things, so it results in a lot of mutants. And we also have a lot of tests. So you can already guess the problem: we would need to check every combination of test and mutant, and that is super time-consuming. What PIT did here was quite smart: before running the mutation testing, it runs normal code coverage to see which tests cover which part of the code. If we know which tests cover a given piece of code, then we don't need to run all the combinations; it's enough to run those tests against the mutants of that code. This sped up the whole process very, very well: we don't run all tests on all mutants, we run only the tests that may kill the given mutant.
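Conceptually, the targeting works something like this (a compilable sketch with made-up types of my own, not PIT's actual implementation):

    import java.util.*;
    import java.util.function.BiPredicate;

    class CoverageTargeting {
        record TestCase(String name) {}
        record Mutant(String mutatedLine) {}

        // Only tests that execute the mutated line can possibly kill the
        // mutant, so every other test is skipped. A mutant on a line that
        // no test executes trivially survives.
        static boolean isKilled(Mutant mutant,
                                Map<String, Set<TestCase>> lineToCoveringTests,
                                BiPredicate<TestCase, Mutant> failsAgainst) {
            return lineToCoveringTests
                    .getOrDefault(mutant.mutatedLine(), Set.of())
                    .stream()
                    .anyMatch(test -> failsAgainst.test(test, mutant));
        }
    }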
Okay, that's cool. Actually, I guess I've already convinced you to use mutation testing. But you may ask: what about my team? They are lazy, they will not use it. Let me tell you a story here. I worked in the banking industry some time ago, and you can imagine, the banking industry is like NASA: super important, big money. Our team was distributed across three locations: one in the US, another in Krakow, Poland, and the last one in Hyderabad, India. Each team was quite independent, so each team developed its own part of the system, its own modules. And we wanted to have 100% code coverage, because we wanted to have high quality. That was the most important thing, because in the banking industry, for compliance, you need to have the code well covered.
The system was constructed in a way that each module had some main class. And the main class, I call it the main-class paradox, because a main class is difficult to test: you cannot create a simple unit test for it, because if you create a unit test that runs the main class, you're basically testing the whole module, not only the main class. So no matter how I tried to test it with unit tests, I could always get to 95% code coverage, never 100%. And when you don't know how to do something, when you see a problem, you think: okay, let's see how the other teams do it. So I checked the code from the team in India. I looked at the code, checked all of it, checked all their tests. And what I found is that for each test there was a given and a when, but there was no assertion. They had started to create tests with no assertions, because there was a requirement for 100% code coverage and they needed to fulfill it somehow. And I don't know what to do with people who write tests without assertions; should we laugh at them or should we get angry? But the moral of the story is that a test coverage threshold doesn't make any sense, because if you set one for your project, people will try to work around it if they need to. But you still want to have some code coverage. So luckily, there
is a better way. What you should do is put the PIT framework into your continuous integration or continuous delivery pipeline and generate these code coverage reports. And once a week, or once every other week, have a meeting and look at the report. Don't set a threshold like "you need to have 100% code coverage"; rather, look at this HTML report, because that is the way to have less technical debt and improve the quality of your code.
So the last thing: do people really use mutation testing? Is it widely used in industry? And actually, yes, there are a lot of companies. It is used at CERN, and it was used in the Norwegian voting system. And more and more companies are introducing mutation testing into their pipelines, because it just improves the code quality; it's simply a better measurement than code coverage. So test your tests, because it gives you freedom: you can more easily refactor the tests and refactor the code, and you can trust your test suite. Otherwise, you know, some shit can happen: you can lose a lot of money, or you can feel shame, or face laughter or anger.
Thank you for listening to this presentation.