Transcript
This transcript was autogenerated. To make changes, submit a PR.
Everyone, my name is Adnan and I'm from Tracetest.
Today I'll be talking about observability-driven development with OpenTelemetry.
Super happy to be here at Conf42: Observability, and yeah,
let's just jump right in. We can start with the
slide deck right away. As you can see, the title is pretty
obvious: observability-driven development with OpenTelemetry.
My name is Adnan. As I already said, I'm currently doing all things DevRel
at Tracetest, an open source project coming out of the Kubeshop accelerator.
Let me tell you a bit about myself, so you trust that I know what I'm
talking about and stick around. First and foremost,
I'm a failed startup founder and an ex-freeCodeCamp leader, and that
was basically how I transitioned from being a software engineer into
being a developer relations engineer. For the last five or so years
I've been building open source dev tools, and I absolutely love it,
so going into education was a natural transition for me.
Yeah, super exciting.
Let me give you a quick rundown of today's agenda. For the
next 20 minutes or so I will talk about four main things,
and these are the four things to remember throughout this talk.
First, the pain of testing microservices. It's absolutely horrible,
and I want to show you a much, much simpler way of doing it.
Number two: integration testing and TDD are hard. We all know that
integration testing is hard; there's a lot of mocking, a lot of setup,
and I want to show you a solution to that. Three: how observability-driven
development (ODD for short) can help your TDD process. That's a very
important thing that I want to explain as well. And then finally, we'll
go into an in-practice session, where I'll show you hands-on how
observability-driven development works in practice.
Now let's jump right into the pain of testing microservices.
Here's a problem that I keep facing, and at least for me it's a big one:
I don't have a way of knowing precisely at which point of my
complex network of microservice-to-microservice connections an
HTTP transaction goes wrong. I don't know where
a transaction fails, and I can't track the communication
in between the microservices. One more thing that's horrible:
it's really hard to mock different microservices when they're
communicating with each other. The only real way I can handle all
of this is with tracing, because I can store tons of trace data,
actually get value from it, and see what's happening.
But how do we use that? How do we solve that problem
with tracing? We use something called observability-driven development,
often called ODD for short. It emphasizes using the tracing
instrumentation in your back end code as assertions for tests.
We now have this culture of trace-based testing as well,
where we use distributed traces as the assertions themselves.
And it's really, really cool, because it enforces not just quality
in your traces, but also an easier way to run integration tests.
You get much more velocity for your dev teams, and it's much safer
for your platform teams, because you know exactly what your
production system is doing when you're running tests.
To give a quick intro to distributed tracing: first and foremost,
distributed tracing refers to methods of observing requests as
they propagate through distributed systems. That's a really
nice definition by Lightstep, and I agree with it 100%.
A visual representation, which I think is a better way of explaining
it: a distributed trace records the path an HTTP request takes as it
goes through your system, as it propagates through APIs,
microservices, et cetera. Each step of this transaction is called
a span, and every span contains information about the executed
operation. For an HTTP request, that means the status codes, the
timestamps, different timings, and database statements as well.
All of these things are contained within the spans of the distributed trace.
Here's the system I'm going to be showing throughout this talk.
It's a very simple system: one database, one service for fetching books,
and one service where you check the availability of said books.
Super, super simple; it's just a simulation of what you
would see in production.
Now, what's happening here is a code example of how the
availability API checks whether a book is available.
As you can see, I've added these spans in, basically showing you
what a distributed trace span would look like: I initialize the span,
I set an attribute with the real value of the book ID, and I make sure
to check whether that book is available. Then I can use this data
further down and validate against it within the distributed trace.
In the UI, it would look something like this, where I can actually see
the is-available attribute and validate whether it's true or not.
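As a reference, here's a minimal sketch of what that availability handler could look like with the OpenTelemetry Node.js API. The route, span name, attribute keys, and stock lookup are my illustrative assumptions rather than the exact code from the slides, and it assumes the OpenTelemetry Node SDK is registered at startup:

```javascript
// availability.js -- minimal sketch of manual span instrumentation.
// Route, span name, attribute keys, and the stock lookup are
// illustrative assumptions; assumes the OTel Node SDK is registered.
const express = require('express');
const { trace } = require('@opentelemetry/api');

const app = express();
const tracer = trace.getTracer('availability-api');

// Hypothetical stock lookup standing in for the real database call.
function isBookAvailable(bookId) {
  const stock = { 1: 4, 2: 0, 3: 7 }[bookId] ?? 0;
  return stock > 0;
}

app.get('/availability/:bookId', (req, res) => {
  // Start a span for this check so it shows up in the distributed trace.
  const span = tracer.startSpan('check book availability');
  const available = isBookAvailable(req.params.bookId);

  // Attach real values as span attributes; a trace-based test can
  // assert against these later.
  span.setAttribute('book.id', req.params.bookId);
  span.setAttribute('is_available', available);
  span.end();

  res.json({ id: req.params.bookId, available });
});

app.listen(8080);
```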
And this is where we move into the question of whether integration
testing and TDD actually need help. I'm 100% sure that they do.
Look at the TDD red-green feedback loop: you create a test case before
writing any code, you run the test and see it fail, you write code to
make the test pass, and then you run the test again and see it passing.
It's a very, very nice feedback loop, a process that we're all used to
and that we like. But here are the pain points we have to work on.
First and foremost, integration tests require access to your services
and your infrastructure.
Running back end integration tests requires insight into your entire
infrastructure. Unlike front end tests, where you're only operating
within the browser, when running an integration test on the back end
you need to design the trigger, figure out how to access the database,
and handle authentication, and you need to write all of that in as well.
If you have a message bus, how do you test it? It's very complicated
and very hard to mock. And then you also need to configure the
monitoring: how to gather the logs from these services. It's just a
headache. It's also a problem because you can't really track
which part of a microservice chain failed. Say you have
serverless functions, API gateways, or other types of ephemeral
infrastructure. How do you test that?
It's a headache. I like saying that integration testing is 90%
writing the code that makes the test work, and only 10% the testing
itself. Writing the assertions and all that is the simple part;
the problem is all of the piping you have to write to actually
get to the assertions.
So here, let me show you what I mean. If you look at a traditional
integration test, you have a ton of different modules and imports
to add to your code. From there, you first need to write the mock,
which means figuring out what to mock and what its structure is,
and if the structure changes, you have to rewrite it. Then you
have to figure out how to trigger a request, and whether that
request needs authentication, et cetera, et cetera. And then you
have this tiny little speck of code where you say the response
should have a status of 200, and you expect the body to be equal
to what you're mocking. So you have two lines of actual assertions
and a ton of piping to figure out.
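To make the contrast concrete, here's a rough sketch of such a traditional mocked integration test in Node.js, using Jest and Supertest against a hypothetical books app; the module paths and the mocked data shape are my assumptions:

```javascript
// books.test.js -- sketch of a traditional, mock-heavy integration
// test. Module paths and the mocked data shape are assumptions.
const request = require('supertest');

// Mock the database layer before importing the app; if the real
// module's structure changes, this mock must be rewritten too.
jest.mock('../src/db', () => ({
  getBooks: jest.fn().mockResolvedValue([
    { id: 1, title: 'Book One' },
    { id: 2, title: 'Book Two' },
    { id: 3, title: 'Book Three' },
  ]),
}));

const app = require('../src/app');

test('GET /books returns the mocked list', async () => {
  const res = await request(app).get('/books');

  // The only two lines that actually assert anything:
  expect(res.status).toBe(200);
  expect(res.body).toHaveLength(3);
});
```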
You're also tied down to the programming language: you basically have
to know it inside and out, and you need to know how to write the tests
themselves. So it's a lot of complicated things happening at once.
Compare that to running a trace-based test, where you're basically
only pointing to the URL you want to trigger the test against, and
then selecting assertions based on the trace spans. So here, in my
distributed trace, I want to hit the books API, I want to make sure
the status code is 200, and I want to make sure the list of books
is three. And I'm done. There's no mocking.
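Expressed as a Tracetest definition, that test could look roughly like this sketch. I'm following the YAML format from the Tracetest docs, so treat the exact selectors and attribute names as approximate:

```yaml
# books-test.yaml -- sketch of a Tracetest test definition.
# Selector syntax and attribute names are approximate; check the
# Tracetest docs for your version.
type: Test
spec:
  name: Books list count
  trigger:
    type: http
    httpRequest:
      url: http://app:8080/books
      method: GET
  specs:
    # Assert against the HTTP span of the triggering request.
    - selector: span[tracetest.span.type="http" name="GET /books"]
      assertions:
        - attr:http.status_code = 200
    # Assert against the custom attribute set in the books handler.
    - selector: span[name="books list"]
      assertions:
        - attr:books.list.count = 3
```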
This is actually what's happening in my system, and it's really
beautiful how simple it is. It's also language agnostic:
I don't need to know which programming language the microservice
I'm testing is written in, and I don't need to do anything
language-specific. I don't need to learn modules like Chai or
whatever I'd otherwise use for my Node.js tests. I'm running this
totally agnostic of the programming language itself.
From here I'd like to transition into how observability-driven
development can help in this process. First I need to define it,
so you understand exactly what it is.
First and foremost, ODD means writing code and observability
in parallel: you're instrumenting your code with OpenTelemetry
as you write it. So you're not testing mocks, and you don't have
any artificial tests; it's all real data that your system is
generating. We all know how long it takes to write mocks;
if we cut that out, you can see how much time we're saving.
From there, let's move on to the important part, which is that
you're actually testing data from traces in real environments.
And what I think is important here is that this works with
any existing OpenTelemetry-based distributed tracing. If you have
distributed tracing enabled in your system, if you have
the OTel SDK installed, this will just work.
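For context, "having the OTel SDK installed" in a Node.js service boils down to something like this sketch; the service name and collector endpoint are placeholders, and option names can differ slightly between SDK versions:

```javascript
// tracing.js -- sketch of registering the OpenTelemetry Node SDK so
// the service emits traces. Endpoint and service name are placeholders.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  serviceName: 'books-api',
  // Export traces to the OpenTelemetry Collector over OTLP/HTTP.
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  // Auto-instrument common libraries (HTTP, Express, database drivers).
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```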
From here I'd like to segue into something called trace-based testing,
which is a very similar concept; the two overlap like a Venn diagram.
Trace-based testing means adding assertions against span values:
against the individual spans of a distributed trace. Based on those
values, you can determine whether a test has passed or failed.
And unlike traditional API tools, trace-based testing asserts against
both the system response and the trace result, so you get a more
complete picture of your system. You know exactly what's happening
in your system, not just what the response says: you can get a 200
response back while something is failing asynchronously after the
initial response.
Let me now show you what this looks like in practice. One way of
doing it is with Tracetest, an open source tool in the CNCF landscape
that uses OpenTelemetry trace spans as assertions. Basically,
everything I was explaining about observability-driven development,
you can do with Tracetest. Why? The answer is very simple: because it
works with all of the OpenTelemetry tracing solutions you have right
now. All of the tools you're already using, from the OpenTelemetry
Collector to Jaeger, Lightstep, New Relic, Elastic, OpenSearch, and
Tempo: it just works and integrates seamlessly. You can also run tests
via the web UI or the CLI, so it's very simple that way as well.
But what I think is important here is that you're not creating
artificial tests; you're testing against real data. You can use
transactions to chain tests into test suites, with inputs and outputs
passed between the tests, and you can save these into environments
and generate test suites that way as well. So it's very flexible.
I like saying "no mocks" a lot, because I don't like mocking at all;
whenever I have the opportunity to not write any mocks, I want to
take it. Another thing that's very powerful: if I have an async
message queue like RabbitMQ or Kafka, how do I know that the values
that get pulled off of Kafka are actually correct? That's a big
headache, and with Tracetest it's something you can do, because you
get access to the actual trace span that says: yep, the value I
pulled off of Kafka is this one. You can also do assertions based
on timing, and wildcard assertions for common things, like checking
that all of your database requests take less than 100 milliseconds.
You can do that as well.
A diagram is the perfect way to explain this, and it's what I like
showing. The test executor, which is just an API or gRPC request,
triggers your system; your system generates traces; Tracetest picks
up those traces and feeds them into the assertion engine, where your
test specs and assertions run; and from there you get the test
result back.
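If you're driving this from the CLI, running the earlier definition file looked roughly like this at the time of the talk; flag names have changed across versions, so treat this as approximate:

```bash
# Run the test definition and wait for the trace-based result.
# Flags are approximate for the CLI version of that era.
tracetest test run --definition books-test.yaml --wait-for-result
```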
Now from this I want to jump into some hands-on code so we can see
how it works. First and foremost, you need to select the API you
want to test; here you see app:8080/books, so I just want to ping
the books API that I had. I specify that I expect the status code
200 back and that my books count is equal to three. This is standard
TDD: I'm writing my test first, and then I need to implement it in
my books handler, where I'm getting the books back; this is just a
placeholder for the books, which you can see at the bottom. I haven't
defined any spans for my traces yet, though, so if I do run the test
itself, I'm going to get an error. The status code is 200, that's
fine, but I'm not seeing any traces for my books list count; I don't
have any span that correlates with it. So this test will fail.
We're still keeping the TDD process.
However, if I jump back into the code, initialize my tracer, set my
books list count to the books I'm getting, and add that to my trace,
then running the test again passes just fine. This is the red-green
process I was talking about.
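Here's a minimal sketch of that change to the books handler; the span name and the books.list.count attribute mirror the earlier examples, and the data is a placeholder:

```javascript
// books.js -- sketch of the books handler after adding the span, so
// the trace-based test has an attribute to assert against. Assumes
// a registered OTel Node SDK, as in the earlier sketches.
const express = require('express');
const { trace } = require('@opentelemetry/api');

const app = express();
const tracer = trace.getTracer('books-api');

app.get('/books', (req, res) => {
  const span = tracer.startSpan('books list');

  // Placeholder data standing in for the real database call.
  const books = [
    { id: 1, title: 'Book One' },
    { id: 2, title: 'Book Two' },
    { id: 3, title: 'Book Three' },
  ];

  // The attribute the previously failing assertion was looking for.
  span.setAttribute('books.list.count', books.length);
  span.end();

  res.json(books);
});

app.listen(8080);
```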
But one more thing that I think is immensely powerful: say you're
doing some performance testing and you want to assert on timing.
As you can see here, you have the span duration, and you want the
duration of the span to be less than 500 milliseconds; let's say you
want your initial HTTP request to return in less than 500 milliseconds.
You can do that as well. If I run this test and the request is taking
more than 1 second, it's obviously going to fail.
But if I go ahead and change the code to make sure my API request
executes faster, I can check the same response in the UI, as you can
see here, and the test passes in less than 500 milliseconds.
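For reference, those timing assertions might be added as extra specs in the same definition; tracetest.span.duration is the duration attribute the Tracetest docs describe, while the selectors here are my assumptions:

```yaml
# Sketch: extra specs for timing assertions in the test definition.
specs:
  - selector: span[tracetest.span.type="http" name="GET /books"]
    assertions:
      # The initial HTTP request must finish in under 500ms.
      - attr:tracetest.span.duration < 500ms
  # Wildcard-style check: every database span must be under 100ms.
  - selector: span[tracetest.span.type="database"]
    assertions:
      - attr:tracetest.span.duration < 100ms
```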
Now, these are all very powerful things you can do, but the most
powerful thing trace-based testing allows you to do, I think, is
asserting on every part of an HTTP transaction. Here's a perfect
way of explaining it. We have two services communicating with each
other via one API call. My one API call triggers the books list,
so I obviously want to get back a list of books. But I also need
to make sure those books are available, so right here I'm actually
calling another API on another service, called the availability
service, and from that availability service I'm checking whether
each book is available. So with one API call, the first service
triggers an external service and then does some validation there.
Now, if we check the other service, you can see here what's actually
going on: I'm hitting an endpoint, passing in the book ID, and
checking whether it's available or not. This is the external API,
and in traditional testing I really don't know what's happening here;
there's no real way of figuring out whether this entire transaction
is correct or not. Then, inside the availability service, I add my
tracer and my spans, and I make sure that the book's availability
is added to a span of this distributed trace. Now I'll know exactly
what's happening in the external service that I'm not even triggering
myself; it gets triggered from inside the books service itself.
Now, the way the is-book-available check works: I just get some
stock, and if the stock is zero, the test is going to fail. So the
assertions look like this: I have the assertions from the previous
example, my span duration and my books list count, but I also have
the availability check at the bottom. That's going to be three
checks, because I have three books, and all of these checks need
to be equal to true. So if one of my books is not in stock, I want
the test to fail.
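Sketched in the same YAML format, that availability assertion could look like this; the span name and attribute key mirror my earlier availability-handler sketch rather than the exact slide:

```yaml
# Sketch: assert on every availability-check span the transaction
# produced -- one per book, so all three must pass.
specs:
  - selector: span[name="check book availability"]
    assertions:
      - attr:is_available = "true"
```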
And the key point here is at the bottom, where you see the
is-available attribute asserted to equal true. If I run this test,
it is going to fail, because, as you saw, one of the books wasn't
in stock; its stock was zero. So here I'm validating the entire
transaction of an HTTP request, even though in a traditional test
this would have returned 200 and everything would have looked fine.
You'll see it visually in the UI as well: it's triggering the
availability API just fine, but when it checks the book itself,
inside the span inside that service, it says nope, the value it
got back was false. This particular span returns false, meaning
this particular book is not in stock. It's a very, very powerful
thing to be able to test every single part of the transaction.
And what's cool here is that this works with any distributed system,
as long as you have OpenTelemetry instrumentation in your services.
Now, this is what the traditional setup would look like: you have
your app with its OpenTelemetry instrumentation, it sends traces to
your OpenTelemetry Collector, and from the Collector you send them
to your trace data store, whether that's Jaeger, OpenSearch, Tempo,
or whatever trace data store you're using. The way it functions with
Tracetest is pretty similar: Tracetest hooks into your data store
and triggers your app with HTTP or gRPC requests. So it just triggers
the API, fetches the response, gets the trace data, and then runs
assertions based on that trace data. It's just another service
alongside your existing OpenTelemetry and observability setup.
To install it, you can use the CLI, and from there you install the
server, which is just a container that runs inside your
infrastructure. It's super, super simple: one line to install the
CLI, one line to install the server, and you're set up and running,
with Docker Compose and Kubernetes supported out of the box. And what
I think is incredibly cool is the way you connect the data store:
you can either connect directly through OpenTelemetry, funneling all
of the traces from your OpenTelemetry Collector into Tracetest,
or you can use a trace data store like Jaeger.
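For reference, the install flow looked roughly like this at the time; the script URL and command names are approximate and have changed between versions, so check the Tracetest install docs:

```bash
# Install the CLI (one line; script URL approximate).
curl -L https://raw.githubusercontent.com/kubeshop/tracetest/main/install-cli.sh | bash

# Install the server (one line); the installer walks you through
# Docker Compose or Kubernetes setup.
tracetest server install
```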
Now, to wrap everything up, let's run through what we learned.
First and foremost, observability-driven development is awesome.
Why is it awesome? Because you don't have any mocking, you can test
against real data, and you have no more black boxes. You know exactly
what's happening in your tests and exactly what your system looks
like; you don't have to ask anybody on your team, "so, what was
happening with that one service?" You have the entire layout, and
you can run tests from it. You know exactly what's happening, and
that's a big, big deal. And because you know what's happening,
you can assert on every step of that transaction.
Cool. Let's do a quick recap. There are three things I really want
you to take away. Testing on the back end is hard, very hard.
Testing distributed systems is even harder. And that's why I think
the best way to do it is to elevate your TDD with distributed
tracing and use ODD as well.
And that's it. Thank you for listening. If you have any questions,
you can reach me pretty much anywhere on Twitter or LinkedIn.
If you want to check out what we're doing, jump over to GitHub
and leave a star if you like it; if not, I'm not going to force you.
If you want to try it out, go to the download page. Or, if you want
to read the entire blog post that I wrote as a tutorial for this
talk, you can check that out as well. I'm just going to leave this
short slide up for you to join our community, if you want to.
And yeah, find me on Twitter or GitHub; that's my handle. You can
send an email as well if you want to reach out directly.
And that's it. Super happy to have been with you today at
Conf42: Observability, and see you next time.