Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, I'm Xe Iaso and today I'm going to talk about
static analysis and how it helps you engineer more reliable systems.
This will help you make it harder for incorrect code to blow up
production at 3:00 a.m. There are a lot of tools out
there that can do this for a variety of languages. However,
I'm going to focus on Go because that's what I am an expert in. In
this talk, I'll cover the problem space, some solutions you
can apply today, and how you can work with people to engineer more reliable
systems. As I said, I'm Xe.
I'm the archmage of infrastructure at Tailscale. I've been an
SRE for several years and I'm moving over into developer
relations. As a disclaimer, this talk may
contain opinions. None of these opinions are my employer's.
I'll have a recording of this talk, slides, speaker notes,
and a transcript of it up in a day or two after the conference.
The QR code in the corner of the screen will take you to my blog.
When starting to think about a problem, I find it helps to start thinking about
the problem space. This usually means thinking about the problem
at an incredibly high level and all of its related parts.
So let's think about the problem space of compilers. At
the highest possible level, a compiler can take literally anything
as input and maybe produce an output.
A compiler's job is to take this anything,
see if it matches a set of rules, and then produce an output
of some kind. In this case,
with the Go compiler, this means that the input needs to
match the rules that the Go language has defined in its specification.
This human readable specification outlines
core rules of the Go language. These include
things like every Go file needs to be in a package,
the need to declare variables before using them, what core types
are in the language, how to deal with slices, and more.
However, the specification doesn't define what correct Go
code is; it only defines what valid
Go code is. This is normal for specifications
of this kind. Ensuring correctness is an active field of research
in computer science that small, scrappy startups like
Google, Microsoft, and Apple struggle with.
As a result of this, though, you can't rely on the compiler
to stop all incorrect code from being deployed into production.
There's a wide range of errors that will be stopped in the process,
but there are more subtle errors that can squeak by.
This is an example of the kind of error that the Go compiler can
catch by itself. If you declare a value as
a string, you can't go put an integer in it.
They are different types, and the compiler will reject it.
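For readers without the slides, a minimal sketch of that kind of error might look like this. The function name greeting is made up for illustration, and the rejected assignment is left commented out so that the rest of the file still compiles.

```go
package main

import "fmt"

// greeting is a hypothetical helper used only for this illustration.
func greeting() string {
	var name string = "gopher"

	// Uncommenting the next line makes the build fail with a type
	// mismatch, because 42 is an integer constant and name is a string.
	// name = 42

	return "hello, " + name
}

func main() {
	fmt.Println(greeting())
}
```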
I know one of you is out there probably thinking something like, "What
about Rust? What about Haskell? Don't those compilers
have a reputation for making very correct code?"
And you know what? That's a good point. There's other languages
that have more strict rules like linear types or explicitly
marking when you poke the outside world. However, the kinds
of errors that are brought up in this talk can still happen in those languages,
even if it's more difficult to do it by accident.
Static analysis on top of your existing compiler lets
you step closer to correctness without going the
maximalist route, like when you port everything to Rust.
It's a balance between pragmatism and correctness.
The pragmatic solution and the correct solution are always in conflict,
so you need to find a compromise down the middle.
This is because in general, proving everything is correct with
static analysis is literally impossible.
It takes a theoretically infinite amount of time to tell if
absolutely every facet of the code is correct in every single
way. But we don't
have to be perfect, we have to be good.
And perfect is the enemy of the good. And static analysis
is more about moving you towards perfect while being good.
So here are some patterns for things that can be solved with
static analysis in Go: not releasing
resources that you acquire; making typos
that the compiler can't catch at compile time
(usually this happens with struct tags);
invalid constants such as time format strings,
URLs, and regular expressions;
and a wide range of predictable crashes or
very unintended behavior.
These kinds of things are easy to prove, and checks for them are enabled
by default in go vet and Staticcheck.
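Two of these patterns can be sketched in a few lines. This is an illustration rather than anything from the slides: the user type and both helper functions are hypothetical. The struct tag has a stray space after the colon, so encoding/json silently ignores it, and the regular expression is invalid but only fails once the program runs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// user has a malformed struct tag: the space after "json:" makes the tag
// unparseable, so encoding/json falls back to the Go field name.
type user struct {
	Name string `json: "name"`
}

// encodeUser returns the JSON encoding of u as a string.
func encodeUser(u user) string {
	b, err := json.Marshal(u)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

// compileBad compiles a constant pattern with an unclosed group. The
// compiler accepts it; the mistake only shows up as a runtime error.
func compileBad() error {
	_, err := regexp.Compile(`(unclosed`)
	return err
}

func main() {
	fmt.Println(encodeUser(user{Name: "Xe"})) // prints {"Name":"Xe"}
	fmt.Println(compileBad() != nil)          // prints true
}
```

Both mistakes compile cleanly, which is why a separate analysis pass is needed to flag them before they ship.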
Also, for the record, incorrect code won't explode
instantly upon it being run. The devil is in the
details of how it is incorrect and how those things can pile up to create
issues downstream. Incorrect code can also
confuse you while trying to debug it, which can make you waste
time you could spend doing anything else.
This is an example of Go code that will compile.
It'll likely do what you want, but it is incorrect.
It is incorrect because the HTTP response
body is read from, but it's never closed.
In Go, when you don't close the response body, you will leak
the resources associated with that HTTP connection.
When you close the response body, it will release the connection so that you
can use it for other HTTP actions. If you
don't do this, you can easily run into a state where your server
application will run out of available sockets at 3:00 a.m.
You may be tempted to fix it like this.
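Since the slides are not part of this transcript, here is a rough sketch of what the two versions discussed so far could look like. The function names and the panics helper are assumptions made for illustration; only http.Get and the response body handling come from the talk. The demo uses a malformed URL so that http.Get fails without touching the network.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// fetchLeaky reads the response body but never closes it, so the
// underlying connection is never released.
func fetchLeaky(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	body, err := io.ReadAll(resp.Body) // resp.Body is never closed
	return string(body), err
}

// fetchDeferTooEarly is the tempting fix: it defers Close before it
// checks err. When http.Get fails, resp is nil and this panics.
func fetchDeferTooEarly(url string) (string, error) {
	resp, err := http.Get(url)
	defer resp.Body.Close() // dereferences resp before the error check
	if err != nil {
		return "", err
	}
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// panics reports whether calling f panics.
func panics(f func()) (didPanic bool) {
	defer func() {
		if recover() != nil {
			didPanic = true
		}
	}()
	f()
	return false
}

func main() {
	const badURL = "://not-a-url" // fails while parsing, before any request
	_, err := fetchLeaky(badURL)
	fmt.Println("leaky returned an error:", err != nil)
	fmt.Println("early defer panicked:", panics(func() { fetchDeferTooEarly(badURL) }))
}
```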
However, this is incorrect too. Look at
where the defer is called. Let's think about how
the program flow will work. I'm going to translate this into a
diagram of how the computer is going to execute this code.
This flowchart is another way to think about how this program is being executed.
It starts with the http.Get call on the left side
and flows to either crashing or the
code finishing on the right.
In this case, we start with the http.Get call and then
defer closing the response body to the end of the function.
Then we check to see if there was an error or not. If there
was no error, we can use the response and do something useful,
and then the response body gets closed automatically due
to the deferred close. Everything works like you'd expect.
However, if there was an error, something different happens.
The error is returned and then the scheduled close call runs.
The close call assumes that the response is valid, but it's not.
This results in the program panicking, which can be a crash at 3:00
a.m. This is the kind of place where static analysis comes
in to save you. Let's take a look at what go vet
says about this code.
httpget.go, line 16: using response
before checking for errors. It caught the error.
To fix this, we need to move the defer call to after the error check
like this.
This way the response body is closed after we know that it's usable.
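A minimal sketch of the corrected ordering follows. To keep it runnable without a network, it takes an *http.Client and is exercised with an in-memory transport; fetch, trackedBody, fakeTransport, and newFake are all hypothetical names for this illustration and are not from the talk.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// fetch defers Close only after the error check, so resp is known to be
// non-nil by the time the deferred call is registered.
func fetch(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// trackedBody records whether Close was called on it.
type trackedBody struct {
	io.Reader
	closed bool
}

func (b *trackedBody) Close() error {
	b.closed = true
	return nil
}

// fakeTransport answers every request from memory instead of the network.
type fakeTransport struct{ body *trackedBody }

func (t fakeTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return &http.Response{StatusCode: http.StatusOK, Body: t.body, Request: req}, nil
}

// newFake builds a client whose responses always carry content.
func newFake(content string) (*http.Client, *trackedBody) {
	body := &trackedBody{Reader: strings.NewReader(content)}
	return &http.Client{Transport: fakeTransport{body: body}}, body
}

func main() {
	client, body := newFake("hello")
	got, err := fetch(client, "http://example.invalid/")
	fmt.Println(got, err, body.closed)
}
```

Passing the client in is only a testing convenience here; the ordering of the error check and the defer is the part that matters.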
This will work as we expect in production. This is an example
of how trivial errors can be fixed with a little extra tooling without
having to rewrite everything in Rust.
If you use go test, then a large number of these go vet
checks are run by default. This covers a wide
variety of common issues that have trivial fixes that
help move your code towards the corresponding Go idioms. It's limited
to the subset of checks that aren't known to have false positives.
So if you want more assurance, you will need to run go vet or other tools
in your continuous integration step.
And some of you might be thinking, well, if these are so
easy to detect, why doesn't go build do this?
This is a good question. I'm personally on the side that the
compiler should aggressively reject code as much as possible,
but the reason why this isn't done in Go is because
it's a matter of philosophy. Go is
not a language that wants to make it impossible to write buggy code.
Go just wants to give you tools to make your life easier.
In the Go team's view, they would rather you be able
to compile buggy code than have the compiler reject your code on
accident. This is a result of a
philosophy of trusting that there are gaps between the programmer and production.
During those gaps, there is testing. There are tools like Staticcheck
and go vet, but most importantly, there's also human review
to catch other trivial errors.
In addition to using go vet checks, you can also use
Staticcheck with this GitHub Action. This will automatically
download, install, and run Staticcheck on your code.
Staticcheck catches a wide variety of errors that go vet
considers out of scope. Here's an example of a
more complicated problem that Staticcheck can catch but go vet can't.
The reason why there's a problem here is that Go lets you make variables
that are scoped to if statements. This lets you write
code like this.
This is shorthand for writing out something like this.
This does the same thing, but it looks a bit more ugly.
Either way, the error value isn't in scope at
the end of it, so it'll be dropped by the garbage collector.
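The two forms can be sketched side by side. This is an illustration under assumptions: doSomething, its fail parameter, and the short and long wrappers are hypothetical stand-ins for whatever the slide showed.

```go
package main

import (
	"errors"
	"fmt"
)

// doSomething is a hypothetical operation that can fail on demand.
func doSomething(fail bool) error {
	if fail {
		return errors.New("boom")
	}
	return nil
}

// short uses the if-with-initializer form: err exists only inside the
// if statement.
func short(fail bool) string {
	if err := doSomething(fail); err != nil {
		return "failed: " + err.Error()
	}
	// err is out of scope here.
	return "ok"
}

// long spells out the same scoping with an explicit block.
func long(fail bool) string {
	{
		err := doSomething(fail)
		if err != nil {
			return "failed: " + err.Error()
		}
	}
	// err is out of scope here too.
	return "ok"
}

func main() {
	fmt.Println(short(true), "|", long(true))
	fmt.Println(short(false), "|", long(false))
}
```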
However, let's also consider the other important part of this snippet:
variable shadowing. We have two different
variables named x, and they are different types declared at different places.
To help you tell them apart, I've colored the inner one yellow and
the outer one red. In a type assertion like this,
the red variable is not an int, but the yellow variable
is an int that might have failed to assert down.
If it fails to assert down, then the yellow x variable
will always be an int that will have the value zero.
This is probably not what you want, given that the log call with
the %T format specifier would let you know
what type the red x variable was.
As a result, when you run this code
you will get an error message that looks like this: unexpected
type int. This will confuse the living hell out of you.
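A minimal sketch of the shadowing problem, assuming a function shaped roughly like the one on the slide (the name describe and the message text are made up for illustration):

```go
package main

import "fmt"

// describe shadows its parameter: the x declared in the if statement is
// an int and hides the outer x for the whole if/else.
func describe(x interface{}) string {
	if x, ok := x.(int); ok {
		return fmt.Sprintf("got the int %d", x)
	} else {
		// This x is still the inner int (zero after a failed assertion),
		// so %T reports int no matter what the caller passed in.
		return fmt.Sprintf("unexpected type %T", x)
	}
}

func main() {
	fmt.Println(describe("not an int")) // prints "unexpected type int"
}
```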
The correct fix here is to rename the int version of
x. You could do this in a few ways, but here's a
valid approach. Change the name of
the yellow x to xInt.
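With the rename applied, a sketch of the same hypothetical describe function looks like this. The outer x stays visible, so the %T verb reports the type the caller actually passed in.

```go
package main

import "fmt"

// describe keeps the asserted value in xInt, so x still refers to the
// original interface value in the failure path.
func describe(x interface{}) string {
	if xInt, ok := x.(int); ok {
		return fmt.Sprintf("got the int %d", xInt)
	}
	return fmt.Sprintf("unexpected type %T", x)
}

func main() {
	fmt.Println(describe("not an int")) // prints "unexpected type string"
}
```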
This will get you the correct result. You would also need to
change the ok branch of the if statement to
use xInt instead of x, but this is a fairly easy
thing to fix. There are a bunch of other checks that Staticcheck
runs by default. I could easily talk about them for a few hours,
but I'm going to focus on one of the more interestingly subtle checks.
In Go, sometimes you need to write your own error types. With
Go interfaces and their duck typing, anything that
matches the definition of the error interface is able to be used as
an error value.
I put the definition of the error interface type over to the
side and gave you a link to the Go documentation for
it. In this case, our type
failure has an Error method, which means that the Go compiler can
treat it as an error. Given that the Error method
returns a string, that means that our failure type is an
error.
However, something else to keep in mind is that the receiver of the
function is a pointer value. Normally this means a few things,
function is a pointer value. Normally this means a few things,
but in this case it means that the receiver may be nil and as a
result the reason may not exist.
Because of this, we can return a nil value of failure
and then when you try to use it from Go, it will explode at
runtime: panic: runtime error:
invalid memory address or nil pointer dereference. Boom,
it crashed. Segfault. This
happens because under the hood each interface value is a box.
The box contains the type of the value in the box and a pointer
to the actual value itself. But this box
will always exist even if the underlying value is nil.
This means that the if err != nil check will always return
true. So you will always try to read from the value,
which will always explode because the underlying value is nil.
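The boxing behaviour can be reproduced with a short sketch. The failure type and doWork are named in the talk, but their bodies here are assumptions, and the panics helper is added only so the example can report the crash instead of dying.

```go
package main

import "fmt"

// failure is a custom error type with a pointer receiver.
type failure struct{ Reason string }

// Error dereferences f, so it panics when f is a nil pointer.
func (f *failure) Error() string { return f.Reason }

// doWork returns a nil *failure. Converting it to the error interface
// boxes it, so the caller never sees a nil interface value.
func doWork() error {
	var f *failure // nil pointer, no failure happened
	return f
}

// panics reports whether calling f panics.
func panics(f func()) (didPanic bool) {
	defer func() {
		if recover() != nil {
			didPanic = true
		}
	}()
	f()
	return false
}

func main() {
	err := doWork()
	fmt.Println("err != nil:", err != nil) // prints "err != nil: true"
	fmt.Println("Error() panicked:", panics(func() { _ = err.Error() }))
}
```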
This is always frustrating when you run into it, but let's see
what static check says when we run it against this code.
errorbomb.go, line 11: doWork never
returns a nil interface value.
Huzzah, Staticcheck rejects it. If this
code was checked into source control and Staticcheck was run in CI,
tests would fail and this would never be allowed to be deployed to
production. The correct version of doWork
should look something like this. Note how I changed the failure
case to use an untyped nil. This prevents
the nil value from being boxed into an interface.
This will do the right thing.
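One way the corrected function might look is sketched below. The fail parameter and the Reason text are hypothetical and exist only so that both paths can be exercised.

```go
package main

import "fmt"

// failure is a custom error type with a pointer receiver.
type failure struct{ Reason string }

func (f *failure) Error() string { return f.Reason }

// doWork returns an untyped nil when nothing went wrong, so no *failure
// is boxed into the error interface and err == nil behaves as expected.
func doWork(fail bool) error {
	if fail {
		return &failure{Reason: "the work failed"}
	}
	return nil
}

func main() {
	fmt.Println(doWork(false) == nil) // prints true
	fmt.Println(doWork(true))         // prints "the work failed"
}
```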
This will help you ensure that this kind of code never enters production so
it cannot fail at untold hours of the night while you are sleeping.
As SREs, we tend to sleep very little as is. Statistically,
we have higher rates of burnout, mind fog, fatigue,
and likelihood of turning into angry, sad people as we do this
job longer and longer, especially if the culture of a company
is broken enough that you end up being on call during sleeping hours.
This is not healthy. It is not sustainable for
us to be woken up at obscene hours of the night because of trivial and
preventable errors. If we get woken up in the middle of
the night, it should be things that are measurably novel and not caused
by errors that should have never been allowed to be deployed in the first place.
I don't think I've heard my pager sound in years by this point,
but the last time I heard it, I almost had a full-blown panic attack.
I have been in the kind of place where burnout from my pager
severely affected my health. I'm still
recovering from the after effects of that tour of SRE duty
and this has resulted in me making permanent career changes so that I am never
put in that kind of position again. I don't
wish the hell that I've experienced on anyone.
Normally when you're an SRE put into the line of pager fire,
it kind of feels like both options suck.
Fixing production seems like it'll be impossible. Being able
to get more sleep during on call hours seems impossible because
things aren't getting fixed, and with an
SLA for responding to the pager within half an
hour,
it just feels impossible.
Adding static analysis to your continuous integration step
can allow you to walk down a new middle path between these two extremes.
It is not going to be perfect.
However, gradually things will get better.
Trivial errors will be blocked from going into production and
you will be able to sleep easier.
The benefits of being able to rest easier like this are numerous and
difficult to summarize. It could save your relationship with your loved
ones. It could prevent people near you from resenting you.
It could be the difference between a long and happy career, or having to
drop out of tech at 25, burnt out to a crisp and unable to
do much of anything. It could
be the difference between life and an early, untimely death
from a preventable heart attack.
In talks like these, it's easy to ignore the fact that the people that
are responsible for making sure services are reliable are
human. Company culture may
get in the way and there may be a lack of people that are
willing or able to take the pager rotation.
However, when the machines come to take our jobs,
I hope that this is one of the first that they take.
In the meantime, all we can do is get towards a
more sustainable future. The best thing we can do is to
make sure people sleep well without having to worry about being woken up
because of preventable errors that tools like static check can block
from getting into production. If you
use Go in production, I highly suggest using Staticcheck.
If you find it useful, sponsor Dominik on GitHub.
Software like this is complicated to develop and the best way to
ensure Dominik can keep developing it is to pay him for his efforts.
The better he sleeps, the better you sleep as an SRE.
As for other languages, I'm going to be totally honest.
I don't know what the best practices are. You will have to do research
on this. You may have to work together with other coworkers to
find out what would be the best option for your team.
I will say though, it is worth the effort.
This helps you make a better product for everyone and it's worth
the teething pains at first. You can do it.
I'm almost at the end, but I wanted to give a special shout out to
all these people who helped make this talk a reality. I also
want to give a special shout out to my coworkers at Tailscale that let me
load shed super hard so that I could focus on making this talk shine.
Thanks for watching. I'll stick around in the chat for
questions, but if I miss your question and you really want an
answer to it, please email it to
conf42sre2022@xeserv.us.
I'm happy to answer questions and I enjoy writing up responses.
Have a good rest of the conference, everyone. Be well.