Transcript
This transcript was autogenerated. To make changes, submit a PR.
Everyone, my name is Adnan and I'm from Tracetest.
Today I'll be talking about observability-driven development with OpenTelemetry.
Super happy to be here at Conf42: Observability, and yeah,
let's just jump right in. We can start with the
slide deck right away. As you can see, the title is pretty
obvious: observability-driven development with OpenTelemetry.
My name is Adnan. As I already said, I'm currently doing all things DevRel
at Tracetest, an open source project coming out of the Kubeshop accelerator.
Let me tell you a bit about myself, so you trust that I know what I'm
talking about and stick around. First and foremost,
I'm a failed startup founder and an ex-freeCodeCamp leader, and that
was basically how I transitioned from being a software engineer into
being a developer relations engineer. For the last five or so years
I've been building open source dev tools, and I absolutely love it,
so going into education was a natural transition for me.
Yeah, super exciting.
Let me give you a quick rundown of today's agenda. For the
next 20 minutes or so I will talk about four main things,
and these are the four things to remember throughout this talk.
First, the pain of testing microservices. It's absolutely horrible,
and I want to show you a much, much simpler way of doing it.
Number two: integration testing and TDD are hard. We all know that
integration testing is hard; there's a lot of mocking, a lot of setup,
and I want to show you a solution to that. Three: how observability-driven
development (ODD for short) can help your TDD process. That's a very
important thing that I want to explain as well. And then finally, we'll
go into an in-practice session, where I'll show you hands-on how
observability-driven development works in practice.
Now let's jump right into the pain of testing microservices.
Here's a problem that I keep facing, and at least for me it's a big one:
I don't have a way of knowing precisely at which point of my
complex network of microservice-to-microservice connections an
HTTP transaction goes wrong. I don't know where
a transaction fails, and I can't track the communication
in between the microservices. One more thing that's horrible:
it's really hard to mock different microservices when they're
communicating with each other. The only real way I can handle all
of this is with tracing, because I can store tons of trace data,
actually get value from it, and see what's happening.
But how do we use that? How do we solve that problem
with tracing? We use something called observability-driven development,
often called ODD for short. It emphasizes using the tracing
instrumentation in your back end code as assertions for tests.
We now have this culture of trace-based testing as well,
where we use distributed traces as the assertions themselves.
And it's really, really cool, because it enforces not just quality
in your traces, but also an easier way to run integration tests.
You get much more velocity for your dev teams, and it's much safer
for your platform teams, because you know exactly what your
production system is doing when you're running tests.
To give a quick intro to distributed tracing: first and foremost,
distributed tracing refers to methods of observing requests as
they propagate through distributed systems. That's a really
nice definition by Lightstep, and I agree with it 100%.
A visual representation, which I think is a better way of explaining
it: a distributed trace records the path an HTTP request takes as it
goes through your system, as it propagates through APIs,
microservices, et cetera. Each step of this transaction is called
a span, and every span contains information about the executed
operation. For an HTTP request, that means the status codes, the
timestamps, different timings, and database statements as well.
All of these things are contained within the spans of the distributed trace.
Here's the system I'm going to be showing throughout this talk.
It's a very simple system: one database, one service for fetching books,
and one service where you check the availability of said books.
Super, super simple; it's just a simulation of what you
would see in production.
Now, what's happening here is a code example of how the
availability API checks whether a book is available.
As you can see, I've added these spans in, basically showing you
what a distributed trace span would look like: I initialize the span,
I set an attribute with the real value of the book ID, and I make sure
to check whether that book is available. Then I can use this data
further down and validate against it within the distributed trace.
In the UI, it would look something like this, where I can actually see
the is-available attribute and validate whether it's true or not.
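As a reference, here's a minimal sketch of what that availability handler could look like with the OpenTelemetry Node.js API. The route, span name, attribute keys, and stock lookup are my illustrative assumptions rather than the exact code from the slides, and it assumes the OpenTelemetry Node SDK is registered at startup:

```javascript
// availability.js -- minimal sketch of manual span instrumentation.
// Route, span name, attribute keys, and the stock lookup are
// illustrative assumptions; assumes the OTel Node SDK is registered.
const express = require('express');
const { trace } = require('@opentelemetry/api');

const app = express();
const tracer = trace.getTracer('availability-api');

// Hypothetical stock lookup standing in for the real database call.
function isBookAvailable(bookId) {
  const stock = { 1: 4, 2: 0, 3: 7 }[bookId] ?? 0;
  return stock > 0;
}

app.get('/availability/:bookId', (req, res) => {
  // Start a span for this check so it shows up in the distributed trace.
  const span = tracer.startSpan('check book availability');
  const available = isBookAvailable(req.params.bookId);

  // Attach real values as span attributes; a trace-based test can
  // assert against these later.
  span.setAttribute('book.id', req.params.bookId);
  span.setAttribute('is_available', available);
  span.end();

  res.json({ id: req.params.bookId, available });
});

app.listen(8080);
```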
And this is where we move into the question of whether integration
testing and TDD actually need help. I'm 100% sure that they do.
Look at the TDD red-green feedback loop: you create a test case before
writing any code, you run the test and see it fail, you write code to
make the test pass, and then you run the test again and see it passing.
It's a very, very nice feedback loop, a process that we're all used to
and that we like. But here are the pain points we have to work on.
First and foremost, integration tests require access to your services
and your infrastructure.
Running back end integration tests requires insight into your entire
infrastructure. Unlike front end tests, where you're only operating
within the browser, when running an integration test on the back end
you need to design the trigger, figure out how to access the database,
and handle authentication, and you need to write all of that in as well.
If you have a message bus, how do you test it? It's very complicated
and very hard to mock. And then you also need to configure the
monitoring: how to gather the logs from these services. It's just a
headache. It's also a problem because you can't really track
which part of a microservice chain failed. Say you have
serverless functions, API gateways, or other types of ephemeral
infrastructure. How do you test that?
It's a headache. I like saying that integration testing is 90%
writing the code that makes the test work, and only 10% the testing
itself. Writing the assertions and all that is the simple part;
the problem is all of the piping you have to write to actually
get to the assertions.
So here, let me show you what I mean. If you look at a traditional
integration test, you have a ton of different modules and imports
to add to your code. From there, you first need to write the mock,
which means figuring out what to mock and what its structure is,
and if the structure changes, you have to rewrite it. Then you
have to figure out how to trigger a request, and whether that
request needs authentication, et cetera, et cetera. And then you
have this tiny little speck of code where you say the response
should have a status of 200, and you expect the body to be equal
to what you're mocking. So you have two lines of actual assertions
and a ton of piping to figure out.
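To make the contrast concrete, here's a rough sketch of such a traditional mocked integration test in Node.js, using Jest and Supertest against a hypothetical books app; the module paths and the mocked data shape are my assumptions:

```javascript
// books.test.js -- sketch of a traditional, mock-heavy integration
// test. Module paths and the mocked data shape are assumptions.
const request = require('supertest');

// Mock the database layer before importing the app; if the real
// module's structure changes, this mock must be rewritten too.
jest.mock('../src/db', () => ({
  getBooks: jest.fn().mockResolvedValue([
    { id: 1, title: 'Book One' },
    { id: 2, title: 'Book Two' },
    { id: 3, title: 'Book Three' },
  ]),
}));

const app = require('../src/app');

test('GET /books returns the mocked list', async () => {
  const res = await request(app).get('/books');

  // The only two lines that actually assert anything:
  expect(res.status).toBe(200);
  expect(res.body).toHaveLength(3);
});
```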
You're also tied down to the programming language: you basically have
to know it inside and out, and you need to know how to write the tests
themselves. So it's a lot of complicated things happening at once.
Compare that to running a trace-based test, where you're basically
only pointing to the URL you want to trigger the test against, and
then selecting assertions based on the trace spans. So here, in my
distributed trace, I want to hit the books API, I want to make sure
the status code is 200, and I want to make sure the list of books
is three. And I'm done. There's no mocking.
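Expressed as a Tracetest definition, that test could look roughly like this sketch. I'm following the YAML format from the Tracetest docs, so treat the exact selectors and attribute names as approximate:

```yaml
# books-test.yaml -- sketch of a Tracetest test definition.
# Selector syntax and attribute names are approximate; check the
# Tracetest docs for your version.
type: Test
spec:
  name: Books list count
  trigger:
    type: http
    httpRequest:
      url: http://app:8080/books
      method: GET
  specs:
    # Assert against the HTTP span of the triggering request.
    - selector: span[tracetest.span.type="http" name="GET /books"]
      assertions:
        - attr:http.status_code = 200
    # Assert against the custom attribute set in the books handler.
    - selector: span[name="books list"]
      assertions:
        - attr:books.list.count = 3
```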
This is actually what's happening in my system, and it's really
beautiful how simple it is. It's also language agnostic:
I don't need to know which programming language the microservice
I'm testing is written in, and I don't need to do anything
language-specific. I don't need to learn modules like Chai or
whatever I'd otherwise use for my Node.js tests. I'm running this
totally agnostic of the programming language itself.
From here I'd like to transition into how observability-driven
development can help in this process. First I need to define it,
so you understand exactly what it is.
First and foremost, ODD means writing code and observability
in parallel: you're instrumenting your code with OpenTelemetry
as you write it. So you're not testing mocks, and you don't have
any artificial tests; it's all real data that your system is
generating. We all know how long it takes to write mocks;
if we cut that out, you can see how much time we're saving.
From there, let's move on to the important part, which is that
you're actually testing data from traces in real environments.
And what I think is important here is that this works with
any existing OpenTelemetry-based distributed tracing. If you have
distributed tracing enabled in your system, if you have
the OTel SDK installed, this will just work.
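For context, "having the OTel SDK installed" in a Node.js service boils down to something like this sketch; the service name and collector endpoint are placeholders, and option names can differ slightly between SDK versions:

```javascript
// tracing.js -- sketch of registering the OpenTelemetry Node SDK so
// the service emits traces. Endpoint and service name are placeholders.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  serviceName: 'books-api',
  // Export traces to the OpenTelemetry Collector over OTLP/HTTP.
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  // Auto-instrument common libraries (HTTP, Express, database drivers).
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```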
From here I'd like to segue into something called trace-based testing,
which is a very similar concept; the two overlap like a Venn diagram.
Trace-based testing means adding assertions against span values:
against the individual spans of a distributed trace. Based on those
values, you can determine whether a test has passed or failed.
And unlike traditional API tools, trace-based testing asserts against
both the system response and the trace result, so you get a more
complete picture of your system. You know exactly what's happening
in your system, not just what the response says: you can get a 200
response back while something is failing asynchronously after the
initial response.
Let me now show you what this looks like in practice. One way of
doing it is with Tracetest, an open source tool in the CNCF landscape
that uses OpenTelemetry trace spans as assertions. Basically,
everything I was explaining about observability-driven development,
you can do with Tracetest. Why? The answer is very simple: because it
works with all of the OpenTelemetry tracing solutions you have right
now. All of the tools you're already using, from the OpenTelemetry
Collector to Jaeger, Lightstep, New Relic, Elastic, OpenSearch, and
Tempo: it just works and integrates seamlessly. You can also run tests
via the web UI or the CLI, so it's very simple that way as well.
But what I think is important here is that you're not creating
artificial tests; you're testing against real data. You can use
transactions to chain tests into test suites, with inputs and outputs
passed between the tests, and you can save these into environments
and generate test suites that way as well. So it's very flexible.
I like saying "no mocks" a lot, because I don't like mocking at all;
whenever I have the opportunity to not write any mocks, I want to
take it. Another thing that's very powerful: if I have an async
message queue like RabbitMQ or Kafka, how do I know that the values
that get pulled off of Kafka are actually correct? That's a big
headache, and with Tracetest it's something you can do, because you
get access to the actual trace span that says: yep, the value I
pulled off of Kafka is this one. You can also do assertions based
on timing, and wildcard assertions for common things, like checking
that all of your database requests take less than 100 milliseconds.
You can do that as well.
A diagram is the perfect way to explain this, and it's what I like
showing. The test executor, which is just an API or gRPC request,
triggers your system; your system generates traces; Tracetest picks
up those traces and feeds them into the assertion engine, where your
test specs and assertions run; and from there you get the test
result back.
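If you're driving this from the CLI, running the earlier definition file looked roughly like this at the time of the talk; flag names have changed across versions, so treat this as approximate:

```bash
# Run the test definition and wait for the trace-based result.
# Flags are approximate for the CLI version of that era.
tracetest test run --definition books-test.yaml --wait-for-result
```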
Now from this I want to jump into some hands-on code so we can see
how it works. First and foremost, you need to select the API you
want to test; here you see app:8080/books, so I just want to ping
the books API that I had. I specify that I expect the status code
200 back and that my books count is equal to three. This is standard
TDD: I'm writing my test first, and then I need to implement it in
my books handler, where I'm getting the books back; this is just a
placeholder for the books, which you can see at the bottom. I haven't
defined any spans for my traces yet, though, so if I do run the test
itself, I'm going to get an error. The status code is 200, that's
fine, but I'm not seeing any traces for my books list count; I don't
have any span that correlates with it. So this test will fail.
We're still keeping the TDD process.
However, if I jump back into the code, initialize my tracer, set my
books list count to the books I'm getting, and add that to my trace,
then running the test again passes just fine. This is the red-green
process I was talking about.
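Here's a minimal sketch of that change to the books handler; the span name and the books.list.count attribute mirror the earlier examples, and the data is a placeholder:

```javascript
// books.js -- sketch of the books handler after adding the span, so
// the trace-based test has an attribute to assert against. Assumes
// a registered OTel Node SDK, as in the earlier sketches.
const express = require('express');
const { trace } = require('@opentelemetry/api');

const app = express();
const tracer = trace.getTracer('books-api');

app.get('/books', (req, res) => {
  const span = tracer.startSpan('books list');

  // Placeholder data standing in for the real database call.
  const books = [
    { id: 1, title: 'Book One' },
    { id: 2, title: 'Book Two' },
    { id: 3, title: 'Book Three' },
  ];

  // The attribute the previously failing assertion was looking for.
  span.setAttribute('books.list.count', books.length);
  span.end();

  res.json(books);
});

app.listen(8080);
```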
But one more thing that I think is immensely powerful: say you're
doing some performance testing and you want to assert on timing.
As you can see here, you have the span duration, and you want the
duration of the span to be less than 500 milliseconds; let's say you
want your initial HTTP request to return in less than 500 milliseconds.
You can do that as well. If I run this test and the request is taking
more than 1 second, it's obviously going to fail.
But if I go ahead and change the code to make sure my API request
executes faster, I can check the same response in the UI, as you can
see here, and the test passes in less than 500 milliseconds.
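For reference, those timing assertions might be added as extra specs in the same definition; tracetest.span.duration is the duration attribute the Tracetest docs describe, while the selectors here are my assumptions:

```yaml
# Sketch: extra specs for timing assertions in the test definition.
specs:
  - selector: span[tracetest.span.type="http" name="GET /books"]
    assertions:
      # The initial HTTP request must finish in under 500ms.
      - attr:tracetest.span.duration < 500ms
  # Wildcard-style check: every database span must be under 100ms.
  - selector: span[tracetest.span.type="database"]
    assertions:
      - attr:tracetest.span.duration < 100ms
```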
Now, these are all very powerful things you can do, but the most
powerful thing trace-based testing allows you to do, I think, is
asserting on every part of an HTTP transaction. Here's a perfect
way of explaining it. We have two services communicating with each
other via one API call. My one API call triggers the books list,
so I obviously want to get back a list of books. But I also need
to make sure those books are available, so right here I'm actually
calling another API on another service, called the availability
service, and from that availability service I'm checking whether
each book is available. So with one API call, the first service
triggers an external service and then does some validation there.
Now, if we check the other service, you can see here what's actually
going on: I'm hitting an endpoint, passing in the book ID, and
checking whether it's available or not. This is the external API,
and in traditional testing I really don't know what's happening here;
there's no real way of figuring out whether this entire transaction
is correct or not. Then, inside the availability service, I add my
tracer and my spans, and I make sure that the book's availability
is added to a span of this distributed trace. Now I'll know exactly
what's happening in the external service that I'm not even triggering
myself; it gets triggered from inside the books service itself.
Now, the way the is-book-available check works: I just get some
stock, and if the stock is zero, the test is going to fail. So the
assertions look like this: I have the assertions from the previous
example, my span duration and my books list count, but I also have
the availability check at the bottom. That's going to be three
checks, because I have three books, and all of these checks need
to be equal to true. So if one of my books is not in stock, I want
the test to fail.
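Sketched in the same YAML format, that availability assertion could look like this; the span name and attribute key mirror my earlier availability-handler sketch rather than the exact slide:

```yaml
# Sketch: assert on every availability-check span the transaction
# produced -- one per book, so all three must pass.
specs:
  - selector: span[name="check book availability"]
    assertions:
      - attr:is_available = "true"
```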
And the key point here is at the bottom, where you see the
is-available attribute asserted to equal true. If I run this test,
it is going to fail, because, as you saw, one of the books wasn't
in stock; its stock was zero. So here I'm validating the entire
transaction of an HTTP request, even though in a traditional test
this would have returned 200 and everything would have looked fine.
You'll see it visually in the UI as well: it's triggering the
availability API just fine, but when it checks the book itself,
inside the span inside that service, it says nope, the value it
got back was false. This particular span returns false, meaning
this particular book is not in stock. It's a very, very powerful
thing to be able to test every single part of the transaction.
And what's cool here is that this works with any distributed system,
as long as you have OpenTelemetry instrumentation in your services.
Now, this is what the traditional setup would look like: you have
your app with its OpenTelemetry instrumentation, it sends traces to
your OpenTelemetry Collector, and from the Collector you send them
to your trace data store, whether that's Jaeger, OpenSearch, Tempo,
or whatever trace data store you're using. The way it functions with
Tracetest is pretty similar: Tracetest hooks into your data store
and triggers your app with HTTP or gRPC requests. So it just triggers
the API, fetches the response, gets the trace data, and then runs
assertions based on that trace data. It's just another service
alongside your existing OpenTelemetry and observability setup.
To install it, you can use the CLI, and from there you install the
server, which is just a container that runs inside your
infrastructure. It's super, super simple: one line to install the
CLI, one line to install the server, and you're set up and running,
with Docker Compose and Kubernetes supported out of the box. And what
I think is incredibly cool is the way you connect the data store:
you can either connect directly through OpenTelemetry, funneling all
of the traces from your OpenTelemetry Collector into Tracetest,
or you can use a trace data store like Jaeger.
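For reference, the install flow looked roughly like this at the time; the script URL and command names are approximate and have changed between versions, so check the Tracetest install docs:

```bash
# Install the CLI (one line; script URL approximate).
curl -L https://raw.githubusercontent.com/kubeshop/tracetest/main/install-cli.sh | bash

# Install the server (one line); the installer walks you through
# Docker Compose or Kubernetes setup.
tracetest server install
```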
Now, to wrap everything up, let's run through what we learned.
First and foremost, observability-driven development is awesome.
Why is it awesome? Because you don't have any mocking, you can test
against real data, and you have no more black boxes. You know exactly
what's happening in your tests and exactly what your system looks
like; you don't have to ask anybody on your team, "so, what was
happening with that one service?" You have the entire layout, and
you can run tests from it. You know exactly what's happening, and
that's a big, big deal. And because you know what's happening,
you can assert on every step of that transaction.
Cool. Let's do a quick recap. There are three things I really want
you to take away. Testing on the back end is hard, very hard.
Testing distributed systems is even harder. And that's why I think
the best way to do it is to elevate your TDD with distributed
tracing and use ODD as well.
And that's it. Thank you for listening. If you have any questions,
you can reach me pretty much anywhere on Twitter or LinkedIn.
If you want to check out what we're doing, jump over to GitHub
and leave a star if you like it; if not, I'm not going to force you.
If you want to try it out, go to the download page. Or, if you want
to read the entire blog post that I wrote as a tutorial for this
talk, you can check that out as well. I'm just going to leave this
short slide up for you to join our community, if you want to.
And yeah, find me on Twitter or GitHub; that's my handle. You can
send an email as well if you want to reach out directly.
And that's it. Super happy to have been with you today at
Conf42: Observability, and see you next time.