Abstract
Under-tested financial code is a very bad idea - just ask Knight Capital how they lost $460 million in less than an hour. More often, bugs expose you to a little more risk or a little less value than you expected… but what can we do differently?
We often think of manual testing as slower and less effective than automated testing, but most test suites haven’t automated that much!
Computers can execute all our pre-defined tests very quickly - and this is definitely a good thing, especially for regression tests - but the tricky parts are still done by humans. We select test cases (inputs) and check that the corresponding outputs make sense; we write functions that “arrange, act, and assert” for our tests; and we decide - or script via CI systems - which tests to execute and when.
So let’s explore some next-generation tools that we could use to automate these remaining parts of a testing workflow!
Property-based testing helps you to write more powerful tests by automating selection of test cases: instead of listing input-output pairs, you describe the kind of data you want and write a test that passes for all X… We’ll see a live demo, and learn something about the Python builtins in the process!
Code Introspection can help write tests for you. Do you need to know any more than which code to test, and what properties should hold?
Adaptive Fuzzing takes CI to its logical conclusion: instead of running a fixed set of tests on each push, the fuzzer sits on a server and runs tests full-time… fine-tuning itself to find bugs in your project and pulling each new commit as it lands!
By the end of this talk, you’ll know what these three kinds of tools can do - and how to get started with automating the rest of your testing tomorrow.
__
Is it really automated testing when you still have to write all the tests? What if your tools:
- wrote test code for you (‘ghostwriting’)
- chose example inputs (property-based testing)
- decided which tests to run (adaptive fuzzing)
Now that’s automated - and it really does work!
Transcript
This transcript was autogenerated. To make changes, submit a PR.
My name is Zac, Zac Hatfield-Dodds, and I'm giving a talk called Stop Writing Tests for Conf42. Now, this might be provocative,
given that I do in fact want you to continue testing your code,
but it's provocative for a reason. Before we get into it, I want
to start with an australian tradition that we call an acknowledgment of country.
And in particular, that means that I want to acknowledge that the land I'm giving
this talk from in Canberra, Australia, was originally, and still is, land of the
Ngunnawal people, who have been living here for tens of thousands of years, working these lands
and learning, and to acknowledge that the land I'm living on was
never actually ceded. It was settled and colonized.
But back to testing. I'm giving a talk about testing, and as part of
that, I should probably tell you what I mean by testing. I mean the
activity where you run your code to see what it does.
And importantly, this excludes a number of other useful techniques to
make sure your code does the right thing, like linting or auto formatting,
like code review or getting enough sleep, or perhaps even coffee.
And specifically, the activity that is testing usually means
we choose some inputs to run our code on. We run the
code, we check that it did the right thing, and then we repeat
as needed. And in the very old days that might all have been done by hand, and for
some problems it still is. Now, we do at least automate the "repeat as needed" part.
So you do it all manually the first time, and then you record that in
a script, and you can run it with something like unittest or pytest.
But all of the other parts of this process are usually
totally manual. We choose inputs by hand, we decide what
to run by hand. We write assertions that a particular input gives a particular output.
So let's see what else we could automate. And we're going to use an example,
thanks to my friend David: not reversing a list twice, but sorting a list. Sorting is
a classic algorithm, and you've probably all sorted things a few times yourself.
So our classic sorting tests might look something like this. We say
that if we sort a list of integers, one, two, three,
we get one, two, three. So we're checking that sorted things stay sorted.
Or if we sort a list of floating-point numbers, we get the
same sorted list, but the elements are floats this time, because we haven't actually changed
the elements. And we'll check that we can sort things that aren't numeric as
well. In order to avoid repeating ourselves, we might use a parameterized test.
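A rough sketch of what those hand-written tests might look like (the exact code from the talk isn't in the transcript, and my_sort here is just a placeholder for the function under test):

```python
import pytest

def my_sort(items):
    # Placeholder for the sorting function under test.
    return sorted(items)

def test_sort_integers():
    assert my_sort([1, 2, 3]) == [1, 2, 3]

def test_sort_floats():
    assert my_sort([1.0, 2.0, 3.0]) == [1.0, 2.0, 3.0]

@pytest.mark.parametrize("items, expected", [
    ([1, 2, 3], [1, 2, 3]),
    ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
    (["b", "a", "c"], ["a", "b", "c"]),
])
def test_sort_parametrized(items, expected):
    assert my_sort(items) == expected
```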
This makes it much easier to add more examples later as they come up in
our regression testing, or if bugs are reported by customers. It's a
little uglier, but it does help scale out our test suite to
more examples. My real question though is, is this actually automated?
We've had to think of every input and every output,
and in particular we've had to come up with the outputs pretty
much by hand. We've just written down what we know the right answer should be.
But what if we don't know what the right answer should be? Well, one option
would be we can compare our sort function to a trusted
sort function. Maybe we have the one from before the
refactoring, or the single threaded version, or a very
simple bubble sort, for example, that we're confident is correct,
but is too slow to use in production.
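A differential test against a slow-but-trusted reference might look roughly like this sketch (bubble_sort is an illustrative stand-in for whatever trusted implementation you have):

```python
def bubble_sort(items):
    # Slow but obviously-correct reference implementation (illustrative only).
    result = list(items)
    for i in range(len(result)):
        for j in range(len(result) - 1 - i):
            if result[j] > result[j + 1]:
                result[j], result[j + 1] = result[j + 1], result[j]
    return result

def test_matches_trusted_sort():
    data = [5, 3, 9, 1, 1, 7]
    assert my_sort(data) == bubble_sort(data)
```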
If we don't even have that, though, all is not lost. We don't have the known-good version, but we
can still check for particular errors, and this test will
raise an exception if we ever return a list which is not sorted.
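A minimal sketch of that check:

```python
def test_output_is_in_order():
    result = my_sort([5, 3, 9, 1, 1, 7])
    # Every element must be <= its successor, i.e. the output is non-descending.
    assert all(a <= b for a, b in zip(result, result[1:]))
```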
The problem is that it just checks that the output
is sorted, not that it's the correct sorted list. And as an example,
I would point out that the empty list is always in order.
So if we don't want to allow the empty list as a super fast performance
optimization, we might want to check that we have the same size of output
as we had of the input. And additionally we'll check that we have the right
elements by checking that we have the same set of elements in the output as
we did in the input. Now this isn't quite perfect.
First, it only works for lists where the arguments
are hashable. That is, we can put them in a set. That's basically fine for
now, but it also permits an evil implementation where
if I had the list one two one, I could sort it by
replacing it with the list one two two. So I've actually changed one of the
elements, but because it was a duplicate of one element, and it's now a duplicate
of a different element, the test would still pass. To deal with that,
we could check that by the mathematical definition,
the output is a permutation of the input.
Now this is a complete test. The only problem is it's super
slow for large lists, and so our final enhancement is
to use the collections counter class. So we're not just checking that we have
the same number and the same set of elements, but that we have
the same number of each element in the output as in the
input. And so we've just invented what's called property-based testing.
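Putting all of that together, the full check might look something like this sketch:

```python
from collections import Counter

def test_sort_properties():
    data = [5, 3, 9, 1, 1, 7]
    result = my_sort(data)
    # Property 1: the output is in non-descending order.
    assert all(a <= b for a, b in zip(result, result[1:]))
    # Property 2: the output has the same elements, with the same multiplicities,
    # as the input - Counter makes this check cheap even for large lists.
    assert Counter(result) == Counter(data)
```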
The two properties of the function that we want to test are that when you
sort a thing, the output is in order, and that the output
has the same elements as the input list. And so these are
the two properties of the sorting function, and if we test them, this actually
is the complete definition of sorting. If we take an input list and we return
an output with the same elements in ascending, or at least non-descending, order,
then we've sorted it. I don't want to go too far though, like this
is a fine test, and it's actually pretty rare to have a complete specification
where you can list out and test every single property. And unless someone is like
deliberately trying to sneak something past your test suite, which code review should catch,
this kind of test is going to do really well too. But in
this example we've still got kind of one last problem,
which is we still have to come up with the arguments,
the inputs to our test somehow. And that means that
however carefully we think of our inputs, we're not going
to think of anything for our tests that we didn't think of when we wrote
the code in the first place. So what we need is some way
to have the computer or a random number generator come
up with examples for us, and then we can use our existing
property based tests. And that's exactly what my library hypothesis
is for. It lets you specify what kind of inputs the test
function should have, and then you write the test function that should pass for every
input. So using that exact same test body that we've had here, we can say
that if our argument, that is our input, is either
a list of some mix of integers and floating point numbers,
or a list of strings. We can't sort a list of mixed strings
and numbers, because we can't compare those in Python, but we can sort
either kind of list, and then run the same test.
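With Hypothesis, that might look roughly like this (my_sort is still the placeholder for the function under test, and this is the version that, as described next, will fail once NaN shows up):

```python
from collections import Counter
from hypothesis import given, strategies as st

@given(
    st.lists(st.integers() | st.floats())  # a mix of ints and floats...
    | st.lists(st.text())                  # ...or a list of strings
)
def test_sort_properties(data):
    result = my_sort(data)
    assert all(a <= b for a, b in zip(result, result[1:]))
    assert Counter(result) == Counter(data)
```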
If you do run this, though, the test will actually fail. And it will fail
because not-a-number compares unequal to itself:
comparison with NaN is always false, no matter what the order should be.
And in fact, if you try sorting lists with NaNs in them, you'll discover that
things get very complicated very quickly. But for
this kind of demo, it's perfectly fine just to say, actually, we don't care
about NaN; that's just not part of the property of sorted that we're testing.
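Opting out is a one-argument change to the strategy, assuming the sketch above:

```python
@given(
    st.lists(st.integers() | st.floats(allow_nan=False))  # never generate NaN
    | st.lists(st.text())
)
def test_sort_properties(data):
    result = my_sort(data)
    assert all(a <= b for a, b in zip(result, result[1:]))
    assert Counter(result) == Counter(data)
```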
So I think property-based testing is great, and I want you to get
started. And in order to do that, I've got a foolproof three
point plan for you. The first is just to pip install
hypothesis. It works on any supported version of Python 3 and it's super
stable these days. The second is to have a skim of the documentation,
and the third is to find lots of bugs and
hopefully profit. To make it easier to get started,
though, I've actually developed a tool I call the Hypothesis Ghostwriter,
where you can get hypothesis to write your tests for you.
Let's have a look at that. First of all, of course you can see the
help text if you ask for it. We've got various options and flags that
you can see, as well as a few suggested things, so let's start by getting
the Ghostwriter to produce a test for sort for us. Of course there's
no sort builtin, so let's look at sorted instead. The first
thing you can see here is that Hypothesis has already noticed two arguments
that we forgot to test in our earlier demo. That is the key function,
and whether or not to sort in reverse order. But the
other thing to note is that it's just said sorted,
so it's just called the function without any assertions in the body of the test.
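What that looks like, roughly (simplified; the exact strategies the Ghostwriter picks vary by version):

```python
# $ hypothesis write sorted
from hypothesis import given, strategies as st

@given(
    iterable=st.lists(st.integers()),
    key=st.none(),
    reverse=st.booleans(),
)
def test_fuzz_sorted(iterable, key, reverse):
    sorted(iterable, key=key, reverse=reverse)
```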
This is surprisingly useful, but we can do better.
Hypothesis knows about idempotence. That is,
if you sort a thing a second time, it shouldn't change anything beyond
sorting it the first time. And if we ask Hypothesis
to test that, you can see it does indeed check that the result
and the repeated result are equal. That's not the only test
we can write, though. We could check that two functions are equivalent,
and this one is actually pretty useful. If, for example,
you have a multithreaded version compared to a single threaded version before
and after refactoring, or a simple slow version
like perhaps bubble sort compared to a more complicated but faster version in production.
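Those two kinds of test are extra flags on the same command; sketched here with mymodule.my_sort as a placeholder for your own importable function:

```python
# $ hypothesis write --idempotent sorted
# $ hypothesis write --equivalent sorted mymodule.my_sort
# The idempotence test it writes is roughly:
from hypothesis import given, strategies as st

@given(iterable=st.lists(st.integers()), key=st.none(), reverse=st.booleans())
def test_idempotent_sorted(iterable, key, reverse):
    result = sorted(iterable, key=key, reverse=reverse)
    repeat = sorted(result, key=key, reverse=reverse)
    assert result == repeat
```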
The classic properties also work too,
so if you have commutative or associative properties, you can write
tests for those. I'll admit, though, these don't tend to come up as often
as what we call round trip properties, which just about everyone has.
If you save data and then load it, and you want the original data back,
you can write a test like this that asserts that if you compress the
data and then decompress it, or save it and load it,
you should get back exactly the same data you started with.
This one's crucial because input, output, and data persistence
tend to cross many abstraction layers, so they're surprisingly error
prone. But also they're surprisingly easy to write really
powerful tests for. So for pretty much everyone I would recommend writing
these round-trip tests. Let's look at a more complicated example with JSON encoding.
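The Ghostwriter command for that pair, plus the bare essence of a hand-written round-trip test, might look like this sketch (the st.text() strategy here is just for illustration):

```python
# $ hypothesis write --roundtrip json.dumps json.loads
import json
from hypothesis import given, strategies as st

@given(value=st.text())
def test_json_roundtrip_strings(value):
    # Whatever we encode, decoding should give back exactly the same value.
    assert json.loads(json.dumps(value)) == value
```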
With JSON, the input is more
complicated because it's recursive, and frankly
the encoding options are also kind of scarily complicated.
Just look at how many arguments there are here.
Fortunately, we don't actually need to look at all of these, so I've
just trimmed it down and that's going to look like this. So we say,
well, given our object is recursive: we have None,
or booleans or floats or strings, that's JSON;
or we have lists of JSON, including nested lists, or
dictionaries of string keys to JSON values,
including maybe nested lists and dictionaries.
But we've still preserved these other things, so we may or may not disallow
NaN. We might or might not check whether we have circular objects.
We might or might not ensure that the output is ASCII instead of UTF-8,
and we might or might not sort the keys in all of our objects.
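The trimmed-down version might look roughly like this sketch, with a recursive strategy for JSON values and booleans for the remaining encoder options:

```python
import json
from hypothesis import given, strategies as st

# JSON values: None/bools/floats/strings at the leaves, extended by
# lists of JSON and string-keyed dictionaries of JSON.
json_values = st.recursive(
    st.none() | st.booleans() | st.floats() | st.text(),
    lambda children: st.lists(children) | st.dictionaries(st.text(), children),
)

@given(
    obj=json_values,
    allow_nan=st.booleans(),
    check_circular=st.booleans(),
    ensure_ascii=st.booleans(),
    sort_keys=st.booleans(),
)
def test_json_roundtrip(obj, allow_nan, check_circular, ensure_ascii, sort_keys):
    encoded = json.dumps(
        obj,
        allow_nan=allow_nan,
        check_circular=check_circular,
        ensure_ascii=ensure_ascii,
        sort_keys=sort_keys,
    )
    assert json.loads(encoded) == obj
```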
So these are nice just to let vary, because we're
claiming these should have no impact on the actual body
of the test. Let's see if Pytest agrees with us. This is
pretty simple. We have a test function, we just run it, and we've
been given two distinct failures by hypothesis.
In the first one we've discovered that of course the floating-point
not a number value is unequal to itself. Yay for
not a number. We'll see more of it later. And as our second
distinct failure, we've discovered that if allow_nan is false
and we pass infinity, then encoding actually raises an error,
because non-finite values are a violation of the strict JSON spec. So I'm going
to fix that. In this case, we'll just say, well, we will always
allow non-finite values just for the purpose of this test,
and we'll assume, that is, we'll tell Hypothesis to reject the input,
if it's not equal to itself. That's like an extra-powerful assert.
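That first attempt at a fix, continuing the sketch above, looks something like this:

```python
from hypothesis import assume, given, strategies as st

@given(
    obj=json_values,  # the recursive strategy from the earlier sketch
    check_circular=st.booleans(),
    ensure_ascii=st.booleans(),
    sort_keys=st.booleans(),
)
def test_json_roundtrip(obj, check_circular, ensure_ascii, sort_keys):
    assume(obj == obj)  # reject inputs containing NaN... or so we hope
    encoded = json.dumps(
        obj,
        allow_nan=True,  # always permit non-finite values for this test
        check_circular=check_circular,
        ensure_ascii=ensure_ascii,
        sort_keys=sort_keys,
    )
    assert json.loads(encoded) == obj
```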
And then if we run this version, what do you think we're going to see in
this case? We see that hypothesis finds another failing
example. If you have a list containing NaN,
then it actually compares equal to itself.
This, it turns out, is thanks to a performance optimization in CPython
for list equality: it compares elements by identity
first, which allows you to skip the cost of doing deep
comparisons when you don't need to. The only problem
is that this kind of breaks the object model. So I'll
instead do the correct fix, which is to just pass allow_nan=False
to our input generator. And so this
ensures that we'll just never generate NaN. And with allow_nan=True
for the encoder, we'll still allow non-finite
examples, and this test finally passes.
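So the passing version, still a sketch, just moves the NaN handling into the strategy:

```python
json_values = st.recursive(
    st.none() | st.booleans() | st.floats(allow_nan=False) | st.text(),
    lambda children: st.lists(children) | st.dictionaries(st.text(), children),
)

@given(
    obj=json_values,
    check_circular=st.booleans(),
    ensure_ascii=st.booleans(),
    sort_keys=st.booleans(),
)
def test_json_roundtrip(obj, check_circular, ensure_ascii, sort_keys):
    # allow_nan=True still lets infinities through, and they round-trip fine.
    encoded = json.dumps(
        obj,
        allow_nan=True,
        check_circular=check_circular,
        ensure_ascii=ensure_ascii,
        sort_keys=sort_keys,
    )
    assert json.loads(encoded) == obj
```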
All right, back to the talk. If you can't ghostwrite your tests,
because, for example, you already have a test suite that you don't just want
to throw out and start over with, then of course we could migrate some
of our tests incrementally. I'm going to walk you through migrating a test
for something that looks a lot like git. And we say if
we create an empty repository and check out a new branch,
then that new branch is the name of our active branch.
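The actual library from the talk isn't in the transcript, so here's a sketch against a hypothetical, minimal in-memory Repo that the later refactorings below reuse:

```python
class Repo:
    # Hypothetical stand-in for the git-like library discussed in the talk.
    def __init__(self):
        self.branches = {"main"}
        self.active_branch = "main"

    @classmethod
    def init_empty(cls):
        return cls()

    def checkout(self, name, create=False):
        if create:
            self.branches.add(name)
        if name not in self.branches:
            raise ValueError(f"no such branch: {name}")
        self.active_branch = name


def test_checkout_new_branch():
    repo = Repo.init_empty()
    repo.checkout("new-branch", create=True)
    assert repo.active_branch == "new-branch"
```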
The idea here is that I want to show you that you can do this
for kind of business-logicky things. And I'm going to say that
the details of how git works are pretty much like business logic, rather than
pure algorithmic stuff. But this test kind
of also leaves me with a bunch of questions like what exactly
are valid names for branches? And does it work for non
empty repositories? So the first thing we can
do is just refactor our test a little to pass in the name of the
branch as an argument to the function. And this just
says semantically it should work for any branch name, not just for new
branch. And then we can refactor that again to use Hypothesis
and say that, well, for any branch name,
and it happens that we'll just generate "new-branch" for now, this test should
pass. And then we could share that logic between multiple tests.
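In sketch form, using the hypothetical Repo above, that looks like:

```python
from hypothesis import given, strategies as st

# Shared strategy for branch names - for now it only ever generates the one
# name we were already testing.
branch_names = st.just("new-branch")

@given(name=branch_names)
def test_checkout_new_branch(name):
    repo = Repo.init_empty()
    repo.checkout(name, create=True)
    assert repo.active_branch == name
```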
Again, so far we've made no semantic changes at
all to this test function, but the meaning is already a little
clearer to me. Given any valid branch name, this test should
pass. And now it's time to try to improve our branch name strategy.
And as it turns out, git has a pretty complicated
spec for branch names. And then the various hosting
services also put length limits on them. There are certain rules about
printable characters: you can't start or end with a dash,
you can't contain whitespace, except maybe sometimes you can.
But we're just going to say, for simplicity: actually, if your branch
name consists of ASCII letters only, and it's of a reasonable length,
then the test should pass. And we'll come back and refactor that later
if we decide it's worth it.
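That simplification is a one-line strategy:

```python
import string

# "Valid branch name", simplified: ASCII letters only, of reasonable length.
branch_names = st.text(alphabet=string.ascii_letters, min_size=1, max_size=40)
```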
And now, looking at the body of the test, this is a decent test, but if we want to clarify that it works for
nonempty repositories as well, we might end up with something like
this. We say, given any valid branch name and any git repository,
so long as the branch name isn't already a branch, when we check out
that branch name and create the branch, it becomes the
active branch. So there we are. That's how I'd refactor it.
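Sketched out with the hypothetical Repo from before (the repositories strategy here is invented for illustration):

```python
from hypothesis import assume, given, strategies as st

def repo_with_branches(names):
    # Build a non-empty repository containing the given branches.
    repo = Repo.init_empty()
    for name in names:
        repo.checkout(name, create=True)
    return repo

# Repositories with zero or more pre-existing branches.
repositories = st.lists(branch_names, unique=True).map(repo_with_branches)

@given(name=branch_names, repo=repositories)
def test_checkout_new_branch(name, repo):
    assume(name not in repo.branches)  # only names that aren't already branches
    repo.checkout(name, create=True)
    assert repo.active_branch == name
```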
You can run these, of course, in your CI suite or
locally, just as you would for unit tests. But that's not the only thing you
can do with property based testing. You can also use coverage
guided fuzzing as a way to save you from having
to decide what test to run and let the computer work out how
to search for things for a much longer time. Google has this tool called
Atheris, which is a wrapper around libFuzzer, and it's
designed to run a single function for hours or even days.
This is super powerful. If you have C extensions, for example, it's a great way
to find memory leaks or address errors or undefined behavior
using the sanitizers, and Hypothesis integrates with that
really well. So you can generate really complex inputs or behavior using
Hypothesis and then drive that with a fuzzer.
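A minimal sketch of that integration, based on Hypothesis's documented fuzz_one_input hook for external fuzzers (the property here is just the sorting example from earlier):

```python
import sys
import atheris
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sort_is_ordered(data):
    result = sorted(data)
    assert all(a <= b for a, b in zip(result, result[1:]))

if __name__ == "__main__":
    atheris.instrument_all()  # enable coverage instrumentation
    # Hand the Hypothesis test to Atheris as a libFuzzer-style target.
    atheris.Setup(sys.argv, test_sort_is_ordered.hypothesis.fuzz_one_input)
    atheris.Fuzz()
```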
Or if you want to do that for an entire test suite, I have a tool called
HypoFuzz, which you can find at hypofuzz.com. It's pure Python,
so it works on any operating system, not just on Linux, and it runs
all of your tests simultaneously, trying to work out adaptively
which ones are making the fastest progress. Let's have a look at that.
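For reference, getting a run like this going is roughly the following (command per the HypoFuzz documentation; details may differ by version):

```
pip install hypofuzz
hypothesis fuzz    # adaptively runs the collected property-based tests, with a live dashboard
```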
Now, I started this running just before the talk,
and so you can see I've got pretty much the whole test suite
here for a tool of mine called hypothesmith for generating python source
code. And we can see the number of branches, or the
coverage generated by each separate test. And you can also see that
they've been running different numbers of examples based on which ones are
fastest, and discovering new inputs or new coverage the quickest.
If we go down here, we can also see that we've actually
discovered a couple of bugs. So this one,
testing ast.unparse failed just because ast.unparse
is a new function and it doesn't exist on this version of Python.
But if we skip that and we go to the test of the Black auto-formatter,
it seems to have raised an invalid-input error on this particular,
admittedly pretty weird, input. This is genuinely a new bug to me,
and so I'm going to have to go report that after the talk. You can
also see about how long it took, in both the number of inputs and
the time, as well as a sort of diverse sample
of the kinds of inputs that HypoFuzz fed to your function.
All right, so that's pretty much my talk.
I want you to stop writing tests by hand and instead use
Hypothesis and property-based testing to come up with the inputs for you, to
automatically explore your code, to write your test code, and ultimately
even to decide which tests to run. These tools together can
make testing both easier and more powerful, and I hope you enjoy using them as
much as I have.