Transcript
Hi, thanks so much for listening to my talk, 'How to stop breaking other people's things'. I really hope you enjoy it. I'm Lisa Karlin Curtis, born and bred in London, and I'm a software engineer at GoCardless.
I currently work in our core banking team and I'm going to be talking about
how we can help stop breaking other people's things. So we're
going to start with a sad story. A developer notices that
they have an endpoint that has a really high latency compared to what they'd
expect. They find a performance issue in the code, which is essentially an
exacerbated N+1 problem, and they deploy a fix. The latency
on the endpoint goes down by a half, and the developer stares at
the beautiful graph with the lovely cliff shape and feels good about themselves and
then moves on. Somewhere else in the world, another developer gets
paged. Their database CPU usage has spiked and it's struggling
to handle the load. They start investigating, but there's
no obvious cause, no recent changes.
Request volume is pretty much the same. They start scaling down
their queues to relieve the pressure. And that solves the immediate issue. The database
seems to have recovered. Then they notice something strange.
They've suddenly started processing webhooks much more quickly than they
used to. So what actually happened here? It turns out
that our integrator had a webhook handler, which would receive
a webhook from us and then make a request back to find
the status of that resource. And this was the endpoint that we'd actually
fixed earlier that day. By the way, I'm going to use the word integrations
a lot. And what I mean is people who are integrating against the API
that you are maintaining, so that might be inside your company or that might
be like a customer that you're serving your API to. So back to the
story. That webhook handler, it turns out, spent most of
its time waiting for our response, and when it got the response
from our endpoint, it would then go and update its own database. And it's
worth noting here that our webhooks are often a result of batch processes,
so they're really spiky. We tend to send lots of them in a very short
space of time, a couple of times a day. So as the endpoint got faster
during those spikes, the webhook handler started to apply more and more
load to the database to such an extent that an engineer actually
got paged to resolve the service degradation.
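To make the shape of the problem concrete, the handler on the integrator's side looked something like this sketch (the names, endpoint, and payload shape here are invented for illustration, not the real API):

```python
import requests  # integrator-side sketch; names, URL and payload shape are invented

API_BASE = "https://api.example.com"
local_db = {}  # stand-in for the integrator's own database


def handle_webhook(event: dict) -> None:
    """Receive a webhook, call back to the API for the resource's current
    status, then update the local record."""
    resource_id = event["links"]["payment"]

    # Most of the handler's wall-clock time was spent waiting here, so the
    # API's latency was quietly acting as a throttle. When the endpoint got
    # twice as fast, each worker started writing to the database roughly
    # twice as often during webhook spikes.
    resp = requests.get(f"{API_BASE}/payments/{resource_id}", timeout=10)
    resp.raise_for_status()

    local_db[resource_id] = resp.json()["status"]
```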
The fix here is fairly simple. Scale down the webhook handlers.
so they process fewer webhooks and the database usage returns to normal,
or alternatively beef up your database. But this shows
us just how easy it is to accidentally break someone else's thing,
even if you're trying to do right by your integrators. Lots of us deliver
software in different ways. We might deliver a package like a node module
or a ruby gem, which is included at build time, or potentially
software, whether that's like on-prem software or a SaaS product.
And then we have the Internet, which is a little bit hard to define.
But what I really mean here is like a public API or HTTP
endpoint that your integrators or your consumers hit
in order to use your product. We're going
to be mainly talking in terms of that third use case, partly because it's what
I do in my day to day, but also because I think it's the most
challenging of the three. As an integrator you have no control over when
a change is applied, and as an API maintainer it's really natural to roll
out changes to everyone at once. But many of the principles remain the same across
the board, particularly when we start talking about how to understand
whether something might be breaking or not. So to set the scene,
here are some examples of changes that have broken code in the past.
Traditional API changes. So adding a mandatory field,
removing an endpoint entirely, changing your validation logic.
I think we're all kind of comfortable with this, and this is quite well
established in the industry about what this looks like, but then we
can move on to things like introducing a rate limit. Docker did
this recently, and I think they communicated really clearly, but it obviously
impacted a lot of their integrators. And the same if you're going to change
your rate limiting logic. Changing an error string: so I was unlucky enough to discover that some software I maintained was regexing to determine what error message to display in the UX. And when the error string changed, we started displaying the wrong information to the user.
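As a rough sketch of why that kind of integration is fragile, compare regexing the human-readable message against keying off a machine-readable error code (the field names and copy here are invented):

```python
import re


def message_for_brittle(body: dict) -> str:
    # Brittle: pattern-matching on the human-readable error string means any
    # copy change on the API side silently breaks this mapping.
    if re.search(r"insufficient funds", body.get("message", ""), re.IGNORECASE):
        return "Your balance is too low."
    return "Something went wrong."


ERROR_COPY = {
    "insufficient_funds": "Your balance is too low.",
    "mandate_cancelled": "This mandate is no longer active.",
}


def message_for(body: dict) -> str:
    # Safer: key off a documented, machine-readable error code, with a
    # generic fallback for anything unrecognised.
    return ERROR_COPY.get(body.get("code"), "Something went wrong.")


print(message_for({"code": "insufficient_funds", "message": "Insufficient funds."}))
```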
And then similarly, at GoCardless we actually found a bug where we weren't respecting the Accept-Language header on a few of our endpoints. So we were returning English errors when there was an Accept-Language header of fr, for example. And so we dutifully went and fixed
the bug, and then one of our integrators raised a ticket saying that we'd broken
their software. And it turned out they were relying on that incorrect behavior,
i.e. they were relying on us to not translate that particular error on
that particular endpoint because they knew, because they'd observed that
we always replied in English. Breaking apart a database transaction.
This might seem obvious in some ways when we think about our own systems,
we know that internal consistency is really important, but it can
get a bit more complicated when you have certain models that you expose to integrators.
So the example we have for this at GoCardless is that we have resources,
and then those resources have effectively a state machine. So they transition between states,
and we create an event when we transition them between states.
And that lets our integrators kind of get a bit more detail about what's
happened and the history of a particular resource. And it's quite easy for
us internally to distinguish between those two concepts,
right? We do the state transition and then separately we create that event.
Historically, we've always done that in a database transaction,
and that's now important for us to maintain,
because our integrations could be relying on the fact that if a resource
is in a failed state, I can find an event that tells
me why it failed. And so it's just worth thinking, when you start to split up your monolith and distribute stuff into microservices, or start throwing everything through some sort of magic cloud platform, that it's very difficult to maintain those transactional guarantees. And so you just have to think quite carefully about whether that's going to be surfaced to integrators, and then what you can do about it.
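As a minimal sketch of the guarantee in question (SQLite, with an invented schema), the state transition and the explanatory event are written in one transaction, so an integrator can never observe one without the other; split those two writes across services and that property quietly disappears:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id TEXT PRIMARY KEY, state TEXT)")
conn.execute("CREATE TABLE events (payment_id TEXT, action TEXT, details TEXT)")
conn.execute("INSERT INTO payments VALUES ('PM123', 'submitted')")


def fail_payment(payment_id: str, reason: str) -> None:
    """Move a payment to 'failed' and record why, atomically."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "UPDATE payments SET state = 'failed' WHERE id = ?", (payment_id,)
        )
        conn.execute(
            "INSERT INTO events VALUES (?, 'failed', ?)", (payment_id, reason)
        )


fail_payment("PM123", "insufficient_funds")
print(conn.execute("SELECT state FROM payments").fetchone(),
      conn.execute("SELECT details FROM events").fetchone())
```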
Changing the timing of your batch processing: so we can see from our logs that certain integrators create lots of their payments just in time, i.e. just before our daily payment
run. So we have a cutoff once a day. And if you don't create them
in time, then we have to roll forward to the next day. So we know
that if we change our timings without communicating with them, it would cause
significant issues because our system would stop behaving the way that they expect.
And then the last one here is reducing the latency on an API call,
which is the story that we told right at the beginning. I'm going to define
a breaking change as something where I, as the API developer, do a
thing and someone's integration breaks. And that happens because an
assumption that's been made by that integrator is no longer correct.
When this happens, it's easy to criticize the engineer who made
the assumption, but there are a couple of things that we should bear in mind.
First of all, assumptions are inevitable. You literally cannot write
any code ever without making hundreds of assumptions wherever
you look. And then the second point is, it may indeed be their fault,
but it's often your problem. Maybe if you're Google or you're AWS,
you can get away with it. But for most companies, if your integrators are feeling
pain, then you'll feel it too, particularly if it's one of your most important
customers. Obviously, if it's AWS and it's Slack that you've broken,
you're still probably going to care. There are a few different ways
that assumptions can develop, and some of them are explicit. So an
integrator is asking a question, getting an answer, and then building
their system based on that answer. So if we start with documentation,
that's obviously your first step. When you're building against somebody else's API,
you look at the API reference. It's worth noting that people often skip
to the examples and don't actually read all of the text that you've slaved over.
So you do need to be a little bit careful about the way that you
give them that information. And then we can talk about support articles
and blog posts, or some people might call them cookbooks or guides.
And lots of people like these guides, particularly if they're trying to set something up
quite quickly in a prototyping phase, or if there's quite a lot of boilerplate,
maybe this is something that you've published, or maybe it's something that a third party
has published on something like Medium, and then we have ad hoc communication.
And what I mean by this is basically the random other back and forth that an integrator might have with somebody in your organization,
whether that's a presales team, whether that's solutions engineers,
or potentially it's like a support ticket that they've raised. If you
get really unlucky, they might have emailed the friend they have that still works there,
or indeed used to work there. But these are all different ways in which
an integrator can explicitly ask a question, and then there
are other assumptions that an integrator makes that are a bit more implicit.
So the first thing to talk about here is industry standards. If you send
me a JSON response, you're going to give me an application/json header,
and I'm going to assume that won't change, even though in your docs you never
specifically tell me that that's going to be true. Another example
of this is I assume that if you tell me something is secret, you will
keep my secret safe. And that means that I can assume that I am the
only person who has my secret. Generally speaking, this is okay,
but in some cases you can find yourself in trouble,
particularly if these standards change over time. So we had quite
a bad incident at GoCardless where we upgraded our HAProxy version. And it turned out the new version was observing a new industry standard, which proceeded to downcase all of our outgoing HTTP headers. So that meant our HTTP response headers, rather than having capital letters at the beginning of each word, were now totally lowercase. By the book, HTTP response headers should not be treated as case sensitive. But a couple of key integrators had been relying on the previous behavior and had a significant outage. And that outage was made a lot worse by the fact that their requests were being successfully processed, but they actually couldn't process our response.
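On the integrator side, the defensive version is to never assume a casing, something like this sketch (the requests library's response headers already behave this way):

```python
# Integrator-side sketch: HTTP header names are case-insensitive by the spec,
# so never assume a particular casing when reading a response.

def get_header(headers: dict, name: str):
    """Look up a response header without caring about case."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get(name.lower())


# Works whether the server sends 'Content-Type' or 'content-type'.
print(get_header({"content-type": "application/json"}, "Content-Type"))
print(get_header({"Content-Type": "application/json"}, "content-type"))
```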
And then we move on to observed behavior, which is probably the most interesting of all on this slide.
So as an integrator you want the engineers who run
the services you use to be constantly improving it and adding new
features, but in a way you also want them to not touch it at
all so that you can be sure its behavior won't change. As soon as a developer sees something, whether that's an undocumented header on an HTTP
response, a batch process that happens at the same time every day,
or even a particular API latency, they assume it's reliable
and they build their systems accordingly. And humans pattern match
really, really aggressively. We find it very easy to convince ourselves that correlation
equals causation. And that means, particularly if we can come up with
an explanation of why A always means B, however convoluted it might seem to somebody else, we're quick to
accept and rely on it. When you start to think about it, this is quite
bizarre. Given that we are all engineers and we're all
employed to be making changes to our own systems, we should
understand that they are constantly in flux. We also all encounter
interesting edge cases every day. And we know that just because something is true
99% of the time in our system, it won't always be.
And yet we assume that everybody else's stuff is going to stay exactly the same
forever. Now, as much as it would be great if our integrators didn't behave like
this, I think the reality is that this is something that's shared by all of
us as a community, so we kind of have to face up to it.
None of this stuff is new. So a great example of this is MS-DOS. So Microsoft released MS-DOS and they
released some documentation alongside it, which basically listed
some interrupts and calls and hooks that you could use to make the operating system
do interesting things. But early application developers quickly found
that they weren't able to achieve everything that they wanted. And this was made worse
because Microsoft would actually use those undocumented calls
in their own software. So it was impossible to compete using what was
only in the documentation. So like all good engineers, they started
decompiling the operating system and writing lists of undocumented information.
So one of the famous ones was called Ralf Brown's Interrupt List, and this
information was shared really widely. And so using those
undocumented features became totally widespread. And it got to
a point where Microsoft couldn't really change anything about the internals of
their system without breaking all of these applications that people used
every day. Now obviously the value proposition of an operating system is
completely tied up with all of the applications that run on it. And so they
just got to this point where they couldn't really develop the operating system in the
way that they wanted. We can think of that interrupt list being analogous to somebody
writing a blog on Medium called 'Ten things you didn't know someone's API could do'. Some of these assumptions are also unconscious.
Once something's stable for a while, we sort of just assume it will
never break. We also usually make resourcing choices
based on previous data, as napkin math is always a bit haphazard.
So we sort of give it a bit of memory and give it a bit
of CPU and start it and kind of hope
that it's fine. And some examples where this can go a bit wrong.
One thing that I've seen was we had a service that we used actually,
so we were the integrator and they started suddenly adding lots
more events to each webhook. And so our workers
started trying to load all this data and basically they started getting OOM killed very, very frequently. And we can also think about our first story
here, right, where speeding up an API increased the load on the database.
So we need to be careful about things where we're changing the behavior of our
system, even if the behavior is maybe kind of a bit
like what would be called non-functional requirements, as opposed to just the
exact data that we're returning or the business logic that we apply.
So if we want to stop breaking other people's things, we need to help our
integrators stop making bad assumptions. So when it comes to docs,
we want to document edge cases. When someone raises a support ticket
about an edge case, always be asking, can we get this into the docs,
how do we make this discoverable? FAQs can be useful, but it's really all about discoverability. So you want to think about both search within your doc site, but also SEO,
right? We're all developers, we all Google things like 400 times a day.
If your docs don't show up when they get googled, they're going to be
using some third party's docs, and they're not going to be right. And then the other
important thing here is: do not ever deliberately not document something. And I've heard this argument a lot of times, which says, oh well, we're not sure that we're going to keep it, so we're just going to not tell the integrations. If it's subject to
change, you really want to call it out so there's no ambiguity,
particularly if it's something that is visible to the integrator.
Because if you don't do that, then you end up in this position where the
integrator, all they get is the observed behavior and there's no associated documentation.
And what they're going to do, as we've already discussed, is assume that
it's not going to change. So if you don't document something, you can end
up just as locked in as if you do. So the best thing is to call it out very, very clearly: this is going to change, please don't rely on this, thank you very much. When it comes to
support articles and blog posts, you obviously want to keep your own
religiously up to date and again searchable.
Make sure that if a guide's written against a particular API version,
that that's called out really clearly. And if you do have third party
blogs that are incorrect, try contacting the author or commenting with
a fix needed to make it work. Or alternatively, point them at an equivalent
page that you've written on your own site. If you get unlucky, that third party
content can become the equivalent of Ralf Brown's Interrupt List.
Ad hoc communication is one of the hardest things to solve. In my
experience, many B2B software companies end up emailing random PDFs around all over the place or even creating shared Slack channels.
Now this might sound great, right? We're talking to our integrators,
we're giving them loads of information. But if all of this stuff isn't
accessible to engineers, that means that engineers have no chance
of working out what assumptions might have been made as a result.
So in order to combat that, you want to make sure that everyone has access
to all of that information that you send out, ideally in a really searchable
format, and try and avoid having large materials that aren't centrally
controlled. They don't all have to be public. But if everybody
is using the same docs and singing from the same hymn sheet, that means that your integrators' behavior will be more consistent and it will be easier to make sure that you're not going to break their stuff. When it comes to industry standards, just follow them where you can, flag really loudly where you can't, or particularly where the industry hasn't yet settled.
And then there's quite a lot to think about with observed behavior.
So naming is really important, particularly given that developers don't
read the docs and they just look at the examples. So one
instance of this is we have a field on our bank accounts endpoint
which is account number ending. And it turns out in Australia account
numbers occasionally have letters in them. We do send it as a
string, but that does still surprise people and you end up with the odd
ticket asking us what's going on. We do document it as clearly as we can,
but because we've called it account number ending, people do kind
of reasonably assume that it's a number. Another example is numbers
that begin with zeros, so those often get truncated.
So things like company registration number. If you have a
number type in your database, then those zeros are going to go in once and
they're never coming back. If stuff is already named badly and you can't
change it, try to draw attention to it in the docs as much as possible.
You can even include an example. You can include the edge case in
your kind of normal example, just as a super clear flag that
this looks like a number, but it isn't. So let's say you had an API
that returns a company registration number. Just make sure that that starts
with a zero. And that's a really easy way of signposting the slightly strange behavior.
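So a documented example response might deliberately look something like this (the field names and values are invented), with the awkward cases baked in and everything sent as strings:

```python
import json

# Deliberately show the edge cases in the canonical example: a registration
# "number" with a leading zero, and an account_number_ending containing a
# letter, both serialised as strings so nothing gets truncated or coerced.
example_response = {
    "company_registration_number": "01234567",  # leading zero preserved
    "account_number_ending": "4X",              # not always digits
}
print(json.dumps(example_response, indent=2))
```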
You want to use documentation and communication to combat pattern matching.
So we've already talked about making sure that you document things that might change.
So if you know that you could change your batch timings, call that out
in the docs, like: 'We currently run it once a day at 11:00 a.m., but this is likely to change.' And then expose information on your API that you might want to change. It's a really good flag, even if under the hood it's just pointing at a constant and nothing ever happens to it. It means that an integrator is at least going to think about what they're going to do if that doesn't return what they expect.
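Here's a minimal sketch of what exposing that kind of information could look like, assuming Flask and invented endpoint and field names; today it just returns a constant, but it turns the cutoff into something integrators read rather than assume:

```python
from flask import Flask, jsonify  # assumes Flask; endpoint and fields are invented

app = Flask(__name__)

# Effectively a constant today, but exposing it signals to integrators that
# the schedule is something to query, not something to hard-code.
DAILY_PAYMENT_CUTOFF = "11:00"


@app.get("/batch_schedule")
def batch_schedule():
    return jsonify(
        daily_payment_cutoff=DAILY_PAYMENT_CUTOFF,
        timezone="Europe/London",
    )


if __name__ == "__main__":
    app.run()
```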
And then the last thing here is to restrict your own behavior and then document those
restrictions. So as an example here, we were
talking about the number of events in a webhook, right? That's not something that
should just happen by accident, because if it does, then what that means is
in the future a developer might come back and find
some performance optimization, and now all of a sudden, you've got 550
times as many events per webhook. So instead, what you want
to do is document the limit, even if that limit seems kind of unreasonable to you, like you'll never hit it. And then
make sure that you actually restrict the behavior in the code to match the documented
limit. And any external behavior should have clearly defined limits.
So that's things like payload size, but also potentially the rate at which you send requests to integrations.
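As a sketch of enforcing a documented cap in code (the limit and payload shape here are invented), pending events get chunked so that a future internal optimisation can't silently change how many arrive per webhook:

```python
MAX_EVENTS_PER_WEBHOOK = 50  # the documented limit


def build_webhook_payloads(events: list) -> list:
    """Split pending events into payloads that never exceed the documented
    per-webhook limit, however many events are waiting."""
    return [
        {"events": events[i:i + MAX_EVENTS_PER_WEBHOOK]}
        for i in range(0, len(events), MAX_EVENTS_PER_WEBHOOK)
    ]


# 120 pending events become webhooks of 50, 50 and 20.
payloads = build_webhook_payloads([{"id": n} for n in range(120)])
print([len(p["events"]) for p in payloads])
```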
For complex products, it's very unlikely that all your integrators will have avoided bad assumptions.
So we also need to find strategies to mitigate the impact of our changes.
The first thing to remember is that a change isn't just either breaking
or not. If an integrator has done something strange enough, and believe me,
they will, almost anything can be breaking. This binary
was historically used to assign blame. If it's not breaking,
then it's the integrator's fault. But as we discussed earlier, it may not be technically
your fault, but it's probably still your problem if your
biggest customer's integration breaks. The fact that you didn't break the official rules
will be little consolation to the engineers who are up all night trying to resolve
it. And the attitude just isn't productive. You can't always blame the developer
at the other end, as it's not possible for them to write code without making
assumptions. And lots of this stuff is really easy to get wrong.
So instead of thinking about it as a yes no question, we should think about
it in terms of probabilities. How likely is it that someone is relying
on this behavior? Not all breaking changes are equal, right? So some
changes are 100% breaking. So if you kill an endpoint, it's going
to break everybody's stuff, but many are neither 0% nor 100%. Try to empathize with your integrators about what
assumptions they might have made, and particularly try and use people in your organization who are less familiar with the specifics than you are, as a rubber duck if
possible. Also, obviously try and talk to your integrations, as that will really
help you empathize and understand which bits of your system they understand and
which bits they perhaps don't really have the same mental model as you do.
If you can, find ways to dogfood your APIs to find tripwires.
I think a really effective tool here is to have it as part of an
onboarding process for new engineers to try and integrate against your API.
I think it's a really good way of both introducing your engineers to the
idea of how to empathize with integrations, introducing them to your product and the
core concept from an integrator point of view, and also
trying to make sure that you keep your docs in line and you can ask
them to raise anything that they find surprising or unusual.
And then you can edit the docs accordingly. And then finally,
sometimes you can even measure it. So add observability to
help you look for people who are relying on some undocumented behavior.
So for us, we can see this big spike in payment create requests every day just before our payment run. So we're really confident that we're going to break loads of stuff if we change that payment run time. And you'd
be surprised that if you really think about it, there are lots of different ways
in which you can kind of observe and measure that, particularly after
the fact as well, right? If you're
looking at the webhooks that you send out, you can monitor the error response rate
and then potentially you might be able to see that something's gone wrong in someone
else's integration because they're just 500ing back to you. And that's new.
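One cheap version of that observability, sketched here with invented shapes, is to bucket the response codes integrators return when you deliver webhooks and watch for a jump in 5xx right after one of your changes:

```python
from collections import Counter

# Delivery results bucketed per integrator; a sudden rise in 5xx responses
# after a change is a strong hint you've broken someone's handler, even
# though nothing on your own side is erroring.
delivery_results = Counter()


def record_delivery(integrator_id: str, status_code: int) -> None:
    bucket = f"{status_code // 100}xx"  # e.g. 200 -> "2xx", 503 -> "5xx"
    delivery_results[(integrator_id, bucket)] += 1


record_delivery("org_123", 200)
record_delivery("org_456", 500)
print(delivery_results.most_common())
```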
You want to be able to scale your release approach depending on how many integrators
have made the bad assumption. So we need to have different strategies that we
can employ at different levels. If we over communicate, we get into a
boy who cried wolf situation where no one reads anything that you send
them and then their stuff ends up breaking anyway. And the fact that you emailed
them doesn't seem to make them feel any better. We all receive loads of
emails and we all ignore lots and lots of emails. So we
really do have to be quite thoughtful here. So start with pull comms, whether that's just updating your docs; hopefully you've got like a change log.
And this is really useful to help integrators recover after they've found an
issue, right? So they see some problem, they then turn
up, they go to your API docs, they see the change log: okay, I now understand that A and B happened, and I know I need to make this change because GoCardless have done this; everybody's good. And you can then upgrade to push comms, so whether that's like a newsletter or an email that you send to integrations,
and it is really difficult to get this right, because you really want to make sure that the only people you're contacting are people who care, because the more you tell people about things they don't care about, the less they're going to read anything you send them. So if you can, try and filter your
comms to only the integrators that you believe will be affected, particularly if
a change only affects a particular group. And it can
be really, really tempting to use that email for kind of marketing content.
And I think it's really important to keep those as separate as you can,
because as soon as people see marketing content, they kind of switch off and they
think it's not important. And then the last one, if you're really worried, is explicitly
acknowledged comms and it's unlikely you'd want to do this for all your integrators,
but potentially for like a few key integrators, this can be
a really useful approach before rolling out a change. This is particularly good if you've kind of observed and measured and found a couple of integrations that you specifically think are going to have problems with this change. And then I'd
also say, make breaking changes often. All of this comms is like a muscle
that you need to practice, and if you don't do it for a very long
time, you get scared and you forget how and you also lose the infrastructure
to do it. So as an example, it's really important that you have
an up to date list of emails, or at least a way of
contacting your integrators. And if you don't contact them ever, then what you discover is
that list very quickly gets out of date. We can also mitigate the impact of
a breaking change by releasing it in different ways. Making changes incrementally
is the best approach. It helps give early warning signs to your integrators.
So that might be applying the change to like a percentage of requests or
perhaps slowly increasing the number of events per webhook. This will help integrators avoid performance cliffs, and it could turn a potential outage
into a minor service degradation. Many integrators will have near-miss
alerting to help them identify problems before they cause any significant
damage. Alternatively, if you've got like a test or
a sandbox environment that can also be a great candidate for this stuff.
Making the change there, obviously, assuming that integrators are actively
using it, can act as the canary in the coal mine to help alert you
to who you need to talk to before rolling it out to production.
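As a sketch of what that incremental dial could look like (the names are invented): hash the integrator ID so each integrator gets a consistent experience, ramp the percentage up slowly, and note that setting it back to zero doubles as the kill switch mentioned in a moment:

```python
import hashlib

NEW_BEHAVIOUR_PERCENT = 5  # ramp this up gradually; 0 acts as a kill switch


def use_new_behaviour(integrator_id: str) -> bool:
    """Deterministically bucket an integrator into the rollout percentage."""
    digest = hashlib.sha256(integrator_id.encode()).hexdigest()
    return int(digest, 16) % 100 < NEW_BEHAVIOUR_PERCENT


print(use_new_behaviour("org_123"), use_new_behaviour("org_456"))
```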
And then the final point is about rolling back. If your biggest integrator
phones you and tells you that you've broken their stuff, it's really nice to have
a kill switch in your back pocket. You also want to be able to identify
the point of no return of particular changes so you can really quickly and
effectively communicate when you do get into those kind of high pressure incident scenarios.
The only way to truly avoid breaking other people's things is to
not change anything at all, and often even that is not possible.
So instead we need to think about managing risk.
We've talked about ways of preventing these issues by helping your integrators
make good assumptions in the first place and how important it is to build and
maintain a capability to communicate when you're making these kind of changes,
which massively helps mitigate the impact. But you aren't a
mind reader, and integrators are sometimes careless just like you.
So be cautious. Assume that your integrators didn't read the
docs perfectly and may have cut corners or been under pressure. They may
not have the observability of their systems that you might hope or expect. So you
need to find the balance between caution and product delivery that's right for your organization.
For all the modern talk of move fast and break things, it's still
painful when stuff breaks, and it can take a lot of time and energy to
recover. Building trust with your integrations is critical to the success of
a product, but so is delivering features. We may not be able to completely
stop breaking other people's things, but we can definitely make it much less
likely if we put the effort in. I really hope you've enjoyed the talk.
Thanks so much for listening. Please find me on Twitter at @paprikati_eng if you'd like to chat about anything that we've covered today,
and I hope you have a great day.