Conf42 Chaos Engineering 2021 - Online

How to avoid breaking other people's things


Abstract

Unless you know all the assumptions the consumers of your API have made, it’s impossible to reliably avoid breaking their software. Assuming you, like me, can’t read minds, what can we do to try and keep the number of sad developers to a minimum?

Breaking changes are sad. We’ve all been there; someone else changes their API in a way you weren’t expecting, and now you have a live-ops incident you need to fix urgently to get your software working again. Of course, many of us are on the other side too: we build APIs that other people’s software relies on. There are many strategies to mitigate breaking changes: think SemVer or API versioning, but all rely on the API developer identifying which changes are breaking.

That’s the bit we’ll focus on in this talk: how do you (a) get really good at identifying what changes might break someone’s integration, (b) help your API consumers to build integrations that are resilient to these kinds of changes, and (c) release potentially breaking changes as safely as possible?

Summary

  • Lisa Karlin Curtis is a software engineer at GoCardless. She talks about how we can help stop breaking other people's things. She explains how easy it is to accidentally break someone else's thing.
  • A breaking change is something where I, as the API developer, do a thing and someone's integration breaks. This happens because an assumption that's been made by that integrator is no longer correct. It may indeed be their fault, but it's often your problem.
  • As an integrator you want the engineers who run the services you use to be constantly improving them. But in a way you also want them to not touch them at all, so that you can be sure their behavior won't change. If we want to stop breaking other people's things, we need to help our integrators stop making bad assumptions.
  • A change isn't just either breaking or not; almost anything can be breaking. Try to empathize with your integrators about what assumptions they might have made. You want to be able to scale your release approach depending on how many integrators have made the bad assumption.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, thanks so much for listening to my talk, How to Stop Breaking Other People's Things. I really hope you enjoy it. I'm Lisa Karlin Curtis, born and bred in London, and I'm a software engineer at GoCardless. I currently work in our core banking team and I'm going to be talking about how we can help stop breaking other people's things. So we're going to start with a sad story. A developer notices that they have an endpoint that has a really high latency compared to what they'd expect. They find a performance issue in the code, which is essentially an exacerbated N plus one problem, and they deploy a fix. The latency on the endpoint goes down by half, and the developer stares at the beautiful graph with the lovely cliff shape, feels good about themselves and then moves on. Somewhere else in the world, another developer gets paged. Their database CPU usage has spiked and it's struggling to handle the load. They start investigating, but there's no obvious cause: no recent changes, and request volume is pretty much the same. They start scaling down their queues to relieve the pressure, and that solves the immediate issue; the database seems to have recovered. Then they notice something strange. They've suddenly started processing webhooks much more quickly than they used to. So what actually happened here? It turns out that our integrator had a webhook handler which would receive a webhook from us and then make a request back to find the status of that resource. And this was the endpoint that we'd actually fixed earlier that day. By the way, I'm going to use the word integrators a lot, and what I mean is people who are integrating against the API that you are maintaining, so that might be inside your company or that might be a customer that you're serving your API to. So back to the story. That webhook handler, it turns out, spent most of its time waiting for our response, and when it got the response from our endpoint, it would then go and update its own database. And it's worth noting here that our webhooks are often a result of batch processes, so they're really spiky. We tend to send lots of them in a very short space of time, a couple of times a day. So as the endpoint got faster during those spikes, the webhook handler started to apply more and more load to the database, to such an extent that an engineer actually got paged to resolve the service degradation. The fix here is fairly simple: scale down the webhook handlers so they process fewer webhooks and the database usage returns to normal, or alternatively beef up your database. But this shows us just how easy it is to accidentally break someone else's thing, even if you're trying to do right by your integrators.
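To make the opening story concrete, here is a minimal sketch (with hypothetical names and a placeholder URL, not the actual GoCardless API) of the kind of webhook handler described above. The handler spends most of its time waiting on the API call, so any latency improvement on that endpoint directly increases the rate of database writes.

```python
# Minimal sketch of the webhook handler pattern from the opening story.
# All names are hypothetical; "db" is assumed to be a DB-API-style cursor.
import requests

API_BASE = "https://api.example.com"  # placeholder, not a real API


def handle_webhook(event: dict, db) -> None:
    resource_id = event["resource_id"]
    # The handler spends most of its wall-clock time waiting here, so the
    # API's latency acts as an accidental rate limiter on the work below.
    resp = requests.get(f"{API_BASE}/payments/{resource_id}", timeout=10)
    resp.raise_for_status()
    status = resp.json()["status"]
    # When the endpoint gets faster, these writes arrive faster too,
    # which is what pushed the integrator's database over the edge.
    db.execute(
        "UPDATE payments SET status = %s WHERE id = %s",
        (status, resource_id),
    )
```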
Lots of us deliver software in different ways. We might deliver a package like a Node module or a Ruby gem, which is included at build time, or potentially software, whether that's on-prem software or a SaaS product. And then we have the internet, which is a little bit hard to define, but what I really mean here is a public API or HTTP endpoint that your integrators or your consumers hit in order to use your product. We're going to be mainly talking in terms of that third use case, partly because it's what I do in my day to day, but also because I think it's the most challenging of the three. As an integrator you have no control over when a change is applied, and as an API maintainer it's really natural to roll out changes to everyone at once. But many of the principles remain the same across the board, particularly when we start talking about how to understand whether something might be breaking or not. So to set the scene, here are some examples of changes that have broken code in the past. Traditional API changes: adding a mandatory field, removing an endpoint entirely, changing your validation logic. I think we're all kind of comfortable with this, and it's quite well established in the industry what this looks like. But then we can move on to things like introducing a rate limit. Docker did this recently, and I think they communicated really clearly, but it obviously impacted a lot of their integrators. And the same if you're going to change your rate limiting logic. Changing an error string: I was unlucky enough to discover that some software I maintained was regexing to determine what error message to display in the UX, and when the error string changed, we started displaying the wrong information to the user. And then similarly, at GoCardless we actually found a bug where we weren't respecting the Accept-Language header on a few of our endpoints. So we were returning English errors when there was an Accept-Language header of fr, for example. And so we dutifully went and fixed the bug, and then one of our integrators raised a ticket saying that we'd broken their software. And it turned out they were relying on that incorrect behavior, i.e. they were relying on us to not translate that particular error on that particular endpoint, because they knew, because they'd observed, that we always replied in English. Breaking apart a database transaction: this might seem obvious in some ways; when we think about our own systems, we know that internal consistency is really important, but it can get a bit more complicated when you have certain models that you expose to integrators. So the example we have for this at GoCardless is that we have resources, and then those resources have effectively a state machine. So they transition between states, and we create an event when we transition them between states, and that lets our integrators get a bit more detail about what's happened and the history of a particular resource. And it's quite easy for us internally to distinguish between those two concepts, right? We do the state transition and then separately we create that event. Historically, we've always done that in a database transaction, and that's now important for us to maintain, because our integrators could be relying on the fact that if a resource is in a failed state, I can find an event that tells me why it failed. And so it's just worth thinking, when you do start to split up your monolith and distribute stuff into microservices, or start throwing everything through some sort of magic cloud platform, it's very difficult to maintain those transactional guarantees. And so you just have to think quite carefully about whether that's going to be surfaced to integrators, and then what you can do about it. Changing the timing of your batch processing: we can see from our logs that certain integrators create lots of their payments just in time, i.e. just before our daily payment run. So we have a cutoff once a day, and if you don't create them in time, then we have to roll forward to the next day. So we know that if we changed our timings without communicating with them, it would cause significant issues, because our system would stop behaving the way that they expect. And then the last one here is reducing the latency on an API call, which is the story that we told right at the beginning.
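As an illustration of the error-string example above, here is a hedged sketch of the brittle pattern versus a safer one. The message wording, error code and function names are made up for illustration, not GoCardless's actual error format.

```python
# Sketch of regexing an error message (fragile, observed behaviour) versus
# matching a documented error code (an actual contract). Names are illustrative.
import re


def fragile_message_for_user(error: dict) -> str:
    # Relies on the exact wording (and language!) of the error string.
    if re.search(r"bank account .* not found", error.get("message", "")):
        return "Please check the bank details you entered."
    return "Something went wrong. Please try again."


def robust_message_for_user(error: dict) -> str:
    # Relies only on a documented, machine-readable error code.
    if error.get("code") == "bank_account_not_found":
        return "Please check the bank details you entered."
    return "Something went wrong. Please try again."
```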
I'm going to define a breaking change as something where I, as the API developer, do a thing and someone's integration breaks. And that happens because an assumption that's been made by that integrator is no longer correct. When this happens, it's easy to criticize the engineer who made the assumption, but there are a couple of things that we should bear in mind. First of all, assumptions are inevitable. You literally cannot write any code ever without making hundreds of assumptions wherever you look. And then the second point is, it may indeed be their fault, but it's often your problem. Maybe if you're Google or you're AWS, you can get away with it. But for most companies, if your integrators are feeling pain, then you'll feel it too, particularly if it's one of your most important customers. Obviously, if it's AWS and it's Slack that you've broken, you're still probably going to care. There are a few different ways that assumptions can develop, and some of them are explicit. So an integrator is asking a question, getting an answer, and then building their system based on that answer. So if we start with documentation, that's obviously your first step. When you're building against somebody else's API, you look at the API reference. It's worth noting that people often skip to the examples and don't actually read all of the text that you've slaved over, so you do need to be a little bit careful about the way that you give them that information. And then we can talk about support articles and blog posts, or some people might call them cookbooks or guides. And lots of people like these guides, particularly if they're trying to set something up quite quickly in a prototyping phase, or if there's quite a lot of boilerplate. Maybe this is something that you've published, or maybe it's something that a third party has published on something like Medium. And then we have ad hoc communication, and what I mean by this is basically the random other back and forth that an integrator might have with somebody in your organization, whether that's a presales team, whether that's solutions engineers, or potentially it's a support ticket that they've raised. If you get really unlucky, they might have emailed the friend they have that still works there, or indeed used to work there. But these are all different ways in which an integrator can explicitly ask a question, and then there are other assumptions that an integrator makes that are a bit more implicit. So the first thing to talk about here is industry standards. If you send me a JSON response, you're going to give me an application/json header, and I'm going to assume that won't change, even though in your docs you never specifically tell me that that's going to be true. Another example of this is I assume that if you tell me something is secret, you will keep my secret safe. And that means that I can assume that I am the only person who has my secret. Generally speaking, this is okay, but in some cases you can find yourself in trouble, particularly if these standards change over time. So we had quite a bad incident at GoCardless where we upgraded our HAProxy version. And it turned out the new version was observing a new industry standard, which proceeded to downcase all of our outgoing HTTP headers.
So that meant our HTTP response headers, rather than having capital letters at the beginning of each word, were now totally lowercase. By the book, HTTP response headers should not be treated as case sensitive, but a couple of key integrators had been relying on the previous behavior and had a significant outage. And that outage was made a lot worse by the fact that their requests were being successfully processed, but they actually couldn't process our response. And then we move on to observed behavior, which is probably the most interesting item on this slide. So as an integrator you want the engineers who run the services you use to be constantly improving them and adding new features, but in a way you also want them to not touch them at all so that you can be sure the behavior won't change. As soon as a developer sees something, whether that's an undocumented header on an HTTP response, a batch process that happens at the same time every day, or even a particular API latency, they assume it's reliable and they build their systems accordingly. And humans pattern match really, really aggressively. We find it very easy to convince ourselves that correlation equals causation. And that means, particularly if we can come up with an explanation of why A always means B, however convoluted it might seem to somebody else, we're quick to accept and rely on it. When you start to think about it, this is quite bizarre. Given that we are all engineers and we're all employed to be making changes to our own systems, we should understand that they are constantly in flux. We also all encounter interesting edge cases every day, and we know that just because something is true 99% of the time in our system, it won't always be. And yet we assume that everybody else's stuff is going to stay exactly the same forever. Now, as much as it would be great if our integrators didn't behave like this, I think the reality is that this is something that's shared by all of us as a community, so we kind of have to face up to it. None of this stuff is new. A great example of this is MS-DOS. So Microsoft released MS-DOS and they released some documentation alongside it, which basically listed some interrupts and calls and hooks that you could use to make the operating system do interesting things. But early application developers quickly found that they weren't able to achieve everything that they wanted, and this was made worse because Microsoft would actually use those undocumented calls in their own software. So it was impossible to compete using what was only in the documentation. So, like all good engineers, they started decompiling the operating system and writing lists of undocumented information. One of the famous ones was called Ralf Brown's Interrupt List, and this information was shared really widely. And so using those undocumented features became totally widespread. And it got to a point where Microsoft couldn't really change anything about the internals of their system without breaking all of these applications that people used every day. Now obviously the value proposition of an operating system is completely tied up with all of the applications that run on it. And so they just got to this point where they couldn't really develop the operating system in the way that they wanted. We can think of that interrupt list as being analogous to somebody writing a blog on Medium called "Ten things you didn't know someone's API could do".
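Coming back to the HAProxy header-casing incident for a moment, here is a small sketch of the defensive pattern that avoids that class of problem: treat HTTP header names as case-insensitive rather than relying on the casing you happen to have observed. The header name used here is just an example.

```python
# HTTP header names are case-insensitive by the spec, so normalise them
# instead of indexing a dict with the exact casing you saw last time.
def webhook_signature_from(headers: dict) -> str | None:
    # Fragile: headers.get("Webhook-Signature") breaks the day a proxy
    # starts emitting lowercase header names.
    # Robust: normalise once, then look up the lowercase name.
    normalised = {name.lower(): value for name, value in headers.items()}
    return normalised.get("webhook-signature")
```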
Some of these assumptions are also unconscious. Once something's stable for a while, we sort of just assume it will never break. We also usually make resourcing choices based on previous data, as napkin math is always a bit haphazard. So we sort of give it a bit of memory and a bit of CPU, start it, and kind of hope that it's fine. And some examples where this can go a bit wrong: one thing that I've seen was with a service that we used, actually, so we were the integrator, and they suddenly started adding lots more events to each webhook. And so our workers started trying to load all this data, and basically they started getting OOM-killed very, very frequently. And we can also think about our first story here, right, where speeding up an API increased the load on the database. So we need to be careful about things where we're changing the behavior of our system, even if the behavior is maybe kind of what would be called non-functional requirements, as opposed to just the exact data that we're returning or the business logic that we apply. So if we want to stop breaking other people's things, we need to help our integrators stop making bad assumptions. When it comes to docs, we want to document edge cases. When someone raises a support ticket about an edge case, always be asking: can we get this into the docs, how do we make this discoverable? FAQs can be useful, but it's really all about discoverability. So you want to think about both search within your doc site, but also SEO, right? We're all developers, we all Google things like 400 times a day. If your docs don't show up when they get googled, people are going to be using some third-party source, and it's not going to be right. And then the other important thing here is: do not deliberately leave something undocumented. I've heard this argument a lot of times which says, oh well, we're not sure that we're going to keep it, so we're just going to not tell the integrators. If it's subject to change, you really want to call it out so there's no ambiguity, particularly if it's something that is visible to the integrator. Because if you don't do that, then you end up in this position where all the integrator gets is the observed behavior and there's no associated documentation. And what they're going to do, as we've already discussed, is assume that it's not going to change. So if you don't document something, you can end up just as locked in as if you do. So the best thing is to call it out very, very clearly: this is going to change, please don't rely on this, thank you very much. When it comes to support articles and blog posts, you obviously want to keep your own religiously up to date and, again, searchable. Make sure that if a guide is written against a particular API version, that's called out really clearly. And if you do have third-party blogs that are incorrect, try contacting the author or commenting with the fix needed to make it work, or alternatively point them at an equivalent page that you've written on your own site. If you get unlucky, that third-party content can become the equivalent of Ralf Brown's Interrupt List. Ad hoc communication is one of the hardest things to solve. In my experience, many B2B software companies end up emailing random PDFs around all over the place, or even creating shared Slack channels. Now this might sound great, right? We're talking to our integrators, we're giving them loads of information.
But if all of this stuff isn't accessible to engineers, that means that engineers have no chance of working out what assumptions might have been made as a result. So in order to combat that, you want to make sure that everyone has access to all of that information that you send out, ideally in a really searchable format, and try and avoid having large materials that aren't centrally controlled. They don't all have to be public, but if everybody is using the same docs and singing from the same hymn sheet, that means that your integrators' behavior will be more consistent and it will be easier to make sure that you're not going to break their stuff. When it comes to industry standards, just follow them where you can, and flag really loudly where you can't, or particularly where the industry hasn't yet settled. And then there's quite a lot to think about with observed behavior. So naming is really important, particularly given that developers don't read the docs and they just look at the examples. So one instance of this is we have a field on our bank accounts endpoint which is account number ending. And it turns out in Australia account numbers occasionally have letters in them. We do send it as a string, but that does still surprise people, and you end up with the odd ticket asking us what's going on. We do document it as clearly as we can, but because we've called it account number ending, people do kind of reasonably assume that it's a number. Another example is numbers that begin with zeros, so those often get truncated. So things like company registration numbers: if you have a number type in your database, then those zeros are going to go in once and they're never coming back. If stuff is already named badly and you can't change it, try to draw attention to it in the docs as much as possible. You can even include an example. You can include the edge case in your kind of normal example, just as a super clear flag that this looks like a number, but it isn't. So let's say you had an API that returns a company registration number. Just make sure that the example starts with a zero, and that's a really easy way of signposting the slightly strange behavior. You want to use documentation and communication to combat pattern matching. So we've already talked about making sure that you document things that might change. So if you know that you could change your batch timings, call that out in the docs, like: we currently run it once a day at 11:00 a.m., but this is likely to change. And then expose information on your API that you might want to change. It's a really good flag, and even if under the hood it's just pointing at a constant and nothing ever happens to it, it means that an integrator is at least going to think about what they're going to do if that doesn't return what they expect. And then the last thing here is to restrict your own behavior and then document those restrictions. So as an example here, we were talking about the number of events in a webhook, right? That's not something that should just happen by accident, because if it does, then what that means is in the future a developer might come back and find some performance optimization, and now all of a sudden, you've got 550 times as many events per webhook. So instead, what you want to do is document the limit, even if that limit seems kind of unreasonable to you, in the sense that you think you'll never hit it, and then make sure that you actually restrict the behavior in the code to match the documented limit.
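As a sketch of "restrict your own behaviour and document the restriction", the snippet below caps the number of events per webhook in code so that a future performance improvement can't silently multiply payload sizes. The limit of 50 is illustrative, not GoCardless's actual limit.

```python
# Enforce the documented cap in code so observed behaviour and documented
# behaviour can't drift apart. The value is illustrative.
MAX_EVENTS_PER_WEBHOOK = 50  # published in the API docs


def build_webhook_payloads(pending_events: list[dict]) -> list[dict]:
    """Split pending events into payloads that never exceed the documented cap."""
    return [
        {"events": pending_events[i : i + MAX_EVENTS_PER_WEBHOOK]}
        for i in range(0, len(pending_events), MAX_EVENTS_PER_WEBHOOK)
    ]
```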
And any external behavior should have clearly defined limits. So that's things like payload size, but also potentially the rate at which you send requests to integrators. For complex products, it's very unlikely that all your integrators will have avoided bad assumptions, so we also need to find strategies to mitigate the impact of our changes. The first thing to remember is that a change isn't just either breaking or not. If an integrator has done something strange enough, and believe me, they will have, almost anything can be breaking. This binary was historically used to assign blame: if it's not breaking, then it's the integrator's fault. But as we discussed earlier, it may not be technically your fault, but it's probably still your problem if your biggest customer's integration breaks. The fact that you didn't break the official rules will be little consolation to the engineers who are up all night trying to resolve it. And that attitude just isn't productive. You can't always blame the developer at the other end, as it's not possible for them to write code without making assumptions, and lots of this stuff is really easy to get wrong. So instead of thinking about it as a yes/no question, we should think about it in terms of probabilities. How likely is it that someone is relying on this behavior? Not all breaking changes are equal, right? So some changes are 100% breaking: if you kill an endpoint, it's going to break everybody's stuff. But many are neither 0% nor 100%. Try to empathize with your integrators about what assumptions they might have made, and particularly try and use people in your organization who are less familiar with the specifics than you are to rubber-duck with, if possible. Also, obviously, try and talk to your integrators, as that will really help you empathize and understand which bits of your system they understand, and which bits where they perhaps don't really have the same mental model as you do. If you can, find ways to dogfood your APIs to find tripwires. I think a really effective tool here is to have it as part of an onboarding process for new engineers to try and integrate against your API. It's a really good way of both introducing your engineers to the idea of how to empathize with integrators, introducing them to your product and the core concepts from an integrator point of view, and also trying to make sure that you keep your docs in line; you can ask them to raise anything that they find surprising or unusual, and then you can edit the docs accordingly. And then finally, sometimes you can even measure it. So add observability to help you look for people who are relying on some undocumented behavior. For us, we can see this big spike in payment create requests every day just before our payment run, so we're really confident that we're going to break loads of stuff if we change that payment run time. And you'd be surprised, if you really think about it, there are lots of different ways in which you can kind of observe and measure that, particularly after the fact as well, right? If you're looking at the webhooks that you send out, you can monitor the error response rate, and then potentially you might be able to see that something's gone wrong in someone else's integration because they're just 500ing back to you, and that's new.
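One way to act on that "measure it" idea is to record webhook delivery outcomes per integrator, so a sudden rise in 5xx responses after a release flags who may have been broken. This is a hedged sketch assuming a Prometheus-style metrics client; the metric and label names are made up.

```python
# Count webhook delivery attempts by integrator and response class so that
# a new run of 5xx responses after a change stands out on a dashboard.
from prometheus_client import Counter

webhook_deliveries = Counter(
    "webhook_deliveries_total",
    "Webhook delivery attempts by integrator and response class",
    ["integrator", "response_class"],
)


def record_delivery(integrator_id: str, status_code: int) -> None:
    response_class = f"{status_code // 100}xx"  # e.g. "2xx", "5xx"
    webhook_deliveries.labels(integrator_id, response_class).inc()
```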
You want to be able to scale your release approach depending on how many integrators have made the bad assumption, so we need to have different strategies that we can employ at different levels. If we over-communicate, we get into a boy-who-cried-wolf situation where no one reads anything that you send them, and then their stuff ends up breaking anyway, and the fact that you emailed them doesn't seem to make them feel any better. We all receive loads of emails and we all ignore lots and lots of emails. So we really do have to be quite thoughtful here. So start with pull comms, whether that's just updating your docs; hopefully you've got a changelog. And this is really useful to help integrators recover after they've found an issue, right? So they see some problem, they then turn up, they go to your API docs, they see the changelog: okay, I now understand A and B, and now I need to make this change because GoCardless have done this. Everybody's good. And you can then upgrade to push comms, whether that's a newsletter or an email that you send to integrators. And it is really difficult to get this right, because you really want to make sure that the only people you're contacting are the ones who care, because the more you tell people about things they don't care about, the less they're going to read everything you send them. So if you can, try and filter your comms to only the integrators that you believe will be affected, particularly if a change only affects a particular group. And it can be really, really tempting to use that email for kind of marketing content, and I think it's really important to keep those as separate as you can, because as soon as people see marketing content, they kind of switch off and they think it's not important. And then the last one, if you're really worried, is explicitly acknowledged comms. It's unlikely you'd want to do this for all your integrators, but potentially for a few key integrators this can be a really useful approach before rolling out a change. This is particularly good if you've kind of observed and measured and found a couple of integrators that you specifically think are going to have problems with the change. And then I'd also say: make breaking changes often. All of this comms is like a muscle that you need to practice, and if you don't do it for a very long time, you get scared and you forget how, and you also lose the infrastructure to do it. So as an example, it's really important that you have an up to date list of emails, or at least a way of contacting your integrators. And if you don't contact them ever, then what you discover is that list very quickly gets out of date. We can also mitigate the impact of a breaking change by releasing it in different ways. Making changes incrementally is the best approach; it helps give early warning signs to your integrators. So that might be applying the change to a percentage of requests, or perhaps slowly increasing the number of events per webhook. This will help integrators avoid performance cliffs, and it could turn a potential outage into a minor service degradation. Many integrators will have some kind of alerting to help them identify problems before they cause any significant damage. Alternatively, if you've got a test or a sandbox environment, that can also be a great candidate for this stuff. Making the change there, obviously assuming that integrators are actively using it, can act as the canary in the coal mine to help alert you of who you need to talk to before rolling it out to production.
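To show what an incremental rollout might look like in practice, here is a minimal sketch that applies a new behaviour to a percentage of integrators, chosen deterministically so each integrator gets a consistent experience. The IDs and the mechanism are illustrative, and setting the percentage back to zero doubles as a crude kill switch.

```python
# Deterministic percentage rollout: hash the integrator ID into a bucket
# from 0-99 and enable the new behaviour for buckets below the threshold.
import hashlib


def new_behaviour_enabled(integrator_id: str, rollout_percent: int) -> bool:
    digest = hashlib.sha256(integrator_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent


# Start small, watch error rates and support tickets, then ramp up.
# Setting rollout_percent to 0 turns the change off everywhere at once.
for integrator in ["org_123", "org_456", "org_789"]:  # illustrative IDs
    behaviour = "new" if new_behaviour_enabled(integrator, rollout_percent=5) else "old"
    print(f"{integrator}: {behaviour} behaviour")
```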
And then the final point is about rolling back. If your biggest integrator phones you and tells you that you've broken their stuff, it's really nice to have a kill switch in your back pocket. You also want to be able to identify the point of no return of particular changes, so you can really quickly and effectively communicate when you do get into those kinds of high-pressure incident scenarios. The only way to truly avoid breaking other people's things is to not change anything at all, and often even that is not possible. So instead we need to think about managing risk. We've talked about ways of preventing these issues by helping your integrators make good assumptions in the first place, and how important it is to build and maintain a capability to communicate when you're making these kinds of changes, which massively helps mitigate the impact. But you aren't a mind reader, and integrators are sometimes careless, just like you. So be cautious. Assume that your integrators didn't read the docs perfectly and may have cut corners or been under pressure. They may not have the observability of their systems that you might hope or expect. So you need to find the balance between caution and product delivery that's right for your organization. For all the modern talk of move fast and break things, it's still painful when stuff breaks, and it can take a lot of time and energy to recover. Building trust with your integrators is critical to the success of a product, but so is delivering features. We may not be able to completely stop breaking other people's things, but we can definitely make it much less likely if we put the effort in. I really hope you've enjoyed the talk. Thanks so much for listening. Please find me on Twitter at paprikati_eng if you'd like to chat about anything that we've covered today, and I hope you have a great day.

Lisa Karlin Curtis

Tech Lead @ GoCardless



