Transcript
Hello and welcome to my Conf 42 talk on continuous feedback and OpenTelemetry. If you're not familiar with the term continuous feedback, I promise you I didn't make it up, and I hope that by the end of this talk you will have a firm grasp of what it is and why I think it's absolutely essential for developers to master it.
Now, just a few words about me.
Most of my background is in development. I've been a developer for
over 25 years. Throughout that time
I've been pretty obsessed with how to create the right
development processes, and I've
seen a lot of situations where we're doing things pretty wrong.
And one of the things I did notice was that even when
we were following best practices and kind of developing according to the latest GitOps processes, doing continuous deployment, we were
missing something in that loop. And that particular something
is what I want to discuss today.
Now, I've been a developer for 25
years, and one thing I can say is that coding
practices really changed throughout that time.
So for those of you who are old enough to remember, this was our GitOps process back in the day. Like, we basically took whatever release we managed to stabilize, we burned it onto a CD and we gave it to the folks across the hall called the QA department, and they would test it, sometimes
give us back some errors to fix, and eventually this would get
deployed. Now, obviously this
isn't the case anymore, and as of yesterday at least,
developers have much more involvement in how to test
and validate their code. So your job as a developer
doesn't end when you finish the coding part.
Usually you at least write some unit tests, integration tests perhaps. And then many developers are also very involved in how their code gets deployed. So that means that to be a kind of a full-stack, whole-product developer, you need to worry about Helm files and Terraform as well.
But the question remains, and I'm going to
leave that out there for a second and we'll come back to that point.
And this is a really pivotal point, which is what happens
next. So what happens after you deploy
to production? And I mentioned before that I, or we, adopted the latest and greatest dev practices: we're doing continuous deployment, we're releasing pretty fast.
But if there is no kind of continuation to this
process, it seems that all we were doing, in fact, is actually just
throwing features over the fence, maybe at a higher rate
or a higher velocity. Because what
really strikes me about this particular diagram
is that it's pretty linear.
And if you recall from all of what we've been taught about how we learned our craft, development is supposed to be nonlinear. It's supposed to be kind of like
a loop, right? So where is the loop in this straight line
if all you're doing is just following these
stages and moving on? But we'll get back to that.
However, all of this was happening yesterday, and I think today there are a lot of new tools that we can leverage to actually make us even more productive. So definitely code generation tools are helping me, helping us, create code faster. Oftentimes we create pieces of code with Copilot, and if we're lucky enough, then we don't just delete them because they're not good enough and rewrite them. But sometimes there are actually things we can use, or at least we
get a working model. But at the same
time, it looks like we're just beginning to scratch the
surface about what these same AI tools can do
to make information about our code more accessible.
So if you think about it this way, it's kind of input and output.
So the output is the code that you write. That's fine, we're doing that much
faster. But how are we improving on
actually analyzing what the code does?
How are we improving on actually learning from our
code? And the question is, what do we know about
our code, really? Now this
may seem to be a very dumb question. Like, what do you mean,
what do I know about my code? I wrote it, I compiled
it. If it's a language that requires compilation,
I debugged it, I ran my tests.
Is that enough? That's a really good question.
Right? Because in the spectrum of what is
enough to know about your code, you can
kind of put knowing that your code
works or makes the world better or makes your software better,
or doesn't cause any regressions to be something on one
end of the spectrum, which may be requiring a little bit more analysis.
But at a very basic level, it can also mean, do I know
if my code even runs?
So I'll give you an example. I was working on one
project where one of the developers wrote this meticulous
piece of code that is a complete refactoring of the data access layer.
And he completed this piece of code. He did
a code review, everybody looked at it. He wrote
the test that validated that it's working as expected.
Then he pushed that into production.
And eventually, and when I say eventually, I mean like a month later,
we find out that it never ran due to some bad if statement.
So on the spectrum of what do I know about my code, sometimes it doesn't even go beyond: do I know whether my code runs? And I actually confronted that developer and asked him, why didn't you check? You actually wrote this great piece of code, you rolled it into production. Why didn't you check that your code is running? And he gave me an answer that was very hard for me to challenge. And his answer was,
well, how? And I
think this is something that is in the very
center of this talk about continuous feedback,
because the how is really
a missing piece here. So if the developer asks me
how do I actually test that my code is running
and I don't have a good answer for him,
then that's a problem.
And what adds to that problem is the fact that what that
developer did right after he rolled that feature into
production is just to take the next feature from
the backlog and continue on.
And it's no fault of his own, because we as software organizations are pretty forward-leaning. I was a product manager at one stage, and as a product manager I kind of contributed to this problem, because I put a lot of pressure on engineers to move forward. Like, I don't think any product manager that I've met in the world has a roadmap that says, okay, roll out the user management feature, and then another box that says get technical feedback about whether it's scalable. No, nobody has that, right?
So when a developer tells me he's done, I'm already putting pressure
on how do we get to the next stage?
And this creates a gap both in
the code itself and in what
we know about it. And when we don't understand our
code, when we start pushing features into production
and only discovering later that they're
lacking, or that we didn't quite get things right,
then we end up troubleshooting. When we troubleshoot a problem in production in the middle of the night, it's because we don't know how our code works and performs.
Now this is a problem.
And in some teams this escalates to the
point where it's actually interfering with regular work.
I worked in one team where we were actually using the
phrase BDD, and I don't mean behavior
driven design, I mean bug driven development, because all we were doing was actually chasing the bugs that we were introducing by just quickly
throwing features over the fence, or in this case rolling in
features without getting any feedback about how
that actually works, about whether our code
is good or bad. And this
feedback doesn't need to exist in production.
We can actually get a lot of feedback from our code much earlier. So I will give you an example
of that. And this is, well, I'm the last person to believe a ten X engineer exists. I actually think engineers are very versatile, and usually they have different traits that mesh together to create maybe a ten X team when you have the right combination of skills. But in this instance, I was working with a team of about ten people, and one of the developers really stood out to me as a ten X engineer, and he did something amazing.
He also did something terrible that I wouldn't actually recommend anybody
to do. And here's the story.
He was working on a feature, he was working on
some kind of a refactoring
and adding features to a batch job that was
doing data processing. And as he was working
on that, he decided to do something strange.
And I witnessed that, or I was kind of exposed to that, when I was looking at the logs and I started seeing rows upon rows upon rows of numbers appear. It was completely spamming the application output, and it was something like 30, 40, 60, 40; it kind of looked like the screen from The Matrix. And obviously I confronted him about it and asked, like, why are you generating so many numbers to the output? And he said two things. First of all, that he was sorry, that he had checked it, and that he was working locally on some customer database, or customer-like database. And here is what he was
doing. He was actually,
as he was working, keeping track of
how much time each iteration of
this batch job took. And somehow in
his peripheral vision, in his kind of spider sense
that he had developed, he was able to discern whether the numbers were going a little crazy, whether there were too many seventies and eighties or nineties and too few of the values he expected, in milliseconds. And whenever that happened,
he knew that the code that he was adding all the
time into the code base and adding into
the loop was actually degrading the application
performance. Now, first of all,
I do not recommend this practice, but second,
I think that this is a great thing
that he did, because unlike the other ten developers
in the team, he actually cared enough. He had
that sense of ownership in order
to check, to validate, to continuously
try to get feedback about what
he was trying to do. And in
fact, if we kind of examined where we
have feedback and where we don't have feedback, so sure when
we develop, we get some feedback because we can debug our code and
we can see it working. But even that is kind of limited
because if you've ever worked with a debugger, then you introduce a breakpoint, and it's kind of like an instant in time and space where you've captured a specific state of the system. It doesn't tell you anything about how it will behave the minute after you resume from your breakpoint. And the same goes for tests. Tests offer very limited pass/fail results,
but they don't really tell you anything about how the system
behaved during the test, which is maybe the more interesting part.
Now, if you talk about production feedback, then that's where developers
are clueless. Like, there's so many observability systems and so many
instances where I've seen developers reaping none of those benefits at all. And it's through no fault of their own. I think for generations, software generations, APM software kind of evolved towards IT, SREs, DevOps, and less towards developers and their code. I was actually visiting one company in Israel where I was talking to the VP of engineering, and he was showing me these amazing observability dashboards they had, and I was actually awestruck by that. I told him, look,
you guys are way ahead of the curve, I need to learn from you.
And he said, well, I'm not so sure because I
just had an interview or a review with one
of the engineers on the team, and what came out was that the engineer actually believed these dashboards were just screenshots.
So it seemed that for some organizations, for some engineers,
the point of observability is to
create dashboards or to support production monitoring
rather than integrate them as feedback towards the developers.
And I was kind of surprised by that and
kind of not, because at the same time I've been feeling the same pain again
and again in all of those examples: from continuous deployment lacking that one piece of feedback, through developers not knowing if their code was even running, to developers resorting to very strange solutions just to be able to capture what their code is doing even while in dev.
So I went back to kind of the source, the model we always like to go back to. And I have to be frank, by now,
this analogy of the DevOps loop has completely worn itself out.
I think we need to find a better one. But for lack of one,
let's stick with this model,
ancient as it is. And if we look at the DevOps
loop, we can notice something really alarming about
it because, and by the way, this is just an image.
Like, I think it's one of the first images that you get if you search
for DevOps loop on Google Images. And one thing
you'll notice here is that there's so many tools attached to
each stage of the feedback loop.
So we have plan, build, CI, deployment, all of that. Sorry, the dev loop has a lot of tools associated with it. But smack in the middle of this diagram, we have this segment that says continuous feedback.
As I mentioned in the beginning of the talk, I didn't make
this up. It's not something that I came up with. It was there all
along. But you will notice an alarming lack of any tools associated with it, except, for some reason, for Salesforce, which of course has nothing to do with continuous feedback from operations to planning. Well, maybe it does if you look at it more from a product manager's perspective; from an engineering perspective, it's not related. So we have this
hole in the middle of our DevOps loop that is
actually causing all of these different symptoms. And the
question is, how do we actually solve it? How do we actually
bridge this gap? And going
back to these two examples that I mentioned,
the developer that didn't know if his code was
even running, and evidently it didn't,
and the developer that had to use these primitive
tools to actually try to use a spider sense to
discern if something was off as he was writing the
code, what is the actual
technology that can help us solve this? And to go back
to the way practices are changing, we talked about
how we can use AI to generate code. What is the equivalent
that we can use here? So this
is a perfect segue to start discussing
OpenTelemetry and why it is important. So one thing I would say about OpenTelemetry is that it is not revolutionary. It is more evolutionary. It is the merging together of two standards, OpenCensus and OpenTracing. It didn't
bring anything groundbreaking in terms of the technology,
but it did bring something else that's really groundbreaking to
the software development industry,
and that is that everybody agrees on it, because that didn't exist.
Never before has there been such a gold standard for
observability. Why is that important?
Because it's open. Because it's
accepted by everyone. And that means that all of the
libraries that you all are using, and it doesn't matter if you're using Java or Python or Go. It doesn't matter if you're using FastAPI, or Echo or Mux in Go, or .NET with ASP.NET MVC, or Spring in Java, if you're using Kafka or RabbitMQ, if you're using Postgres databases or MongoDB. All of these libraries have already introduced support for OpenTelemetry
because it was a simple choice to support that standard.
But why is that important? Because that means that all
of the data that you need to make all of these
decisions to understand how the code is working, to know that it's
working correctly, to know that you haven't introduced a regression in
dev test and prod, all of that data is already
there. It's just a matter of flipping a switch, turning the light on, because the minute you activate OpenTelemetry, and we'll take a look at an example in this talk which is based on Java, you already have access to a lot of data. Now I've
been in many organizations where yeah, all of these things were very
important to us and we treated it like a project. Yeah, we have this observability
project and we needed to prioritize it and we needed to create backlog
items for it, and it was always less urgent
than other things and the technical debt just kept piling up.
This completely obliterates that problem, because we no longer
have this issue where we need to do something. You don't
need to make any code changes in some languages to just activate it.
And the problem may be transformed from having too
little data to having too much data. But we'll talk about how to
solve that, because having too much data is an easier problem
to solve. So in the context
of today's talk, I'll be talking mostly about tracing.
OpenTelemetry does offer other integrations with
metrics and logging, and the ability to actually
triangulate these three. But in this case
I'll talk about tracing because I think it's a much undervalued
and underused technology that developers can reap a lot of benefit
from. So to maybe
start off, I want to explain what tracing is
and why it is so important. Apologies if
this is already familiar to you. So just a refresher about traces.
So traces are essentially an anatomy of a request. They tell
us what our code is doing. If I
look at an API call, in this case, this is a Spring Boot application. I have an API built with Java Spring.
It does a lot of things. It then goes and makes an external call
to an API, and it uses Hibernate to use a database driver to talk to Postgres. All of these things happen when a request gets handled. And this can extend to other microservices. This is why it's
called distributed tracing. So we can follow along as the
request gets handled by multiple microservices.
But this entire kind of flow we call a trace.
And another piece of terminology that we should be
familiar with is a span. And a span is kind of
an activity that happens within that trace.
In fact, if you're using .NET, then activity is the terminology they chose for it. It's synonymous with span, for some reason.
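To make the terminology concrete, here is a minimal sketch of creating a span by hand with the OpenTelemetry Java API. The tracer and span names are illustrative assumptions, not taken from the demo project.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class VaccinationLookup {

    // Tracer names are arbitrary; this one is just for illustration.
    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("petclinic.vaccination");

    public void fetchVaccines(String petId) {
        // One span = one unit of work inside the overall trace.
        Span span = tracer.spanBuilder("fetch-vaccines").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            // ... call the external vaccination service here ...
            span.setAttribute("pet.id", petId);
        } catch (Exception e) {
            span.recordException(e);
            throw e;
        } finally {
            span.end(); // the span's duration becomes one segment in the trace
        }
    }
}
```

In practice, the auto-instrumentation we'll enable in a moment creates most of these spans for you; the manual API is only needed for custom segments.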
So in this case I would look at the request anatomy
and say okay, so this is what happens when my code gets
executed. It first handles the HTTP request, and it might have some checking of permissions, validating with other sources. Maybe some queries are running.
Each of these is a segment in the trace and this kind
of gives me a breakdown of
exactly what occurred when my code was invoked.
This is exactly what would tell me whether, when somebody was running my new refactoring of the data access layer, my new code was actually triggered.
And this is an example trace. We'll look at an actual code
example in a sec, but this is what a trace looks like. We have
the different segments or different spans
that happen throughout that request handling. We have things like
queries and we'll actually see that we're able to go down and see
all the way to the actual query that was running and
we have the different processes where we can identify where the issues are
or what is taking so long.
Enabling instrumentation is very easy. Again, this is a Java example; you can easily find examples for other languages. I like the Java implementation of OpenTelemetry because it doesn't require any code changes. If you want to run it in dev or test, you can just use a Java agent. So you download the agent, you just add an environment variable, as you can see in the box above here. And immediately you start
seeing information about your code. Then if you
want to track specific functions you can just add an annotation
and that will start tracking how that piece of code behaves.
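As a rough sketch of that flow (the exact JVM flags and the annotation dependency are my assumptions based on the standard OpenTelemetry Java setup, not the slide contents):

```java
// Run the service with the agent attached, no code changes required, e.g.:
//   JAVA_TOOL_OPTIONS="-javaagent:/path/to/opentelemetry-javaagent.jar"
//   OTEL_SERVICE_NAME=petclinic
//   OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
//
// To track a specific function, add the annotation from the
// opentelemetry-instrumentation-annotations artifact:
import io.opentelemetry.instrumentation.annotations.WithSpan;

public class VaccinationService {

    @WithSpan // the agent wraps this method in its own span automatically
    public VaccinationStatus checkVaccinationStatus(String petId) {
        // ... existing business logic; no other telemetry code needed ...
        return VaccinationStatus.UP_TO_DATE;
    }

    // Hypothetical return type, just to keep the sketch self-contained.
    enum VaccinationStatus { UP_TO_DATE, DUE }
}
```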
All of that will be made clear in a sec when we look at the
example, but I just want to make it
clear that my goal in this session is
not to show you how to activate telemetry or
observability in your code.
There are way too many talks
and guides on how to do that and I feel
I don't really have anything useful to add to that.
I do think that where many of these guides
are lacking or what we are missing is more information
about how to actually use it. When we're developing,
it's one thing to collect information, but then we're just left with a lot of pretty dashboards. The situation we want to get to is one where we actually generate data and then we can use it as we code.
So I'm going to show you a quick example in my IDE
and then we can take it from there and see kind of where does
that leave us? And is that enough? What else do we need to actually make
the process better? In the course of this demo I
will use Jaeger, which is an open source tool.
And in fact everything that will be on
this talk will be either open source or free tooling.
So there are no commercial tools that I'll use
in this talk that actually require you to pay money to use.
We may have some time to talk about the Grafana and Prometheus open source versions, we may not. And I'll also show you a library that I'm working on called Digma, which is also a free library that you can use
as developers to maybe make use of that data.
So without further ado, let's open the IDE and stop
looking at slideware and maybe start looking at code for a change.
So let's take a look at some code to demonstrate how
we can really manage this feat of actually using observability when we code. I've opened a project familiar to every Java developer out there, called the Pet Clinic project. It's kind of a go-to sample for the Spring Boot framework, and
to make things interesting and also to kind of
simulate some kind of a realistic scenario,
I've actually added some functionality to it. So this
is a pet shop that allows you to see
pets and owners and visits and things like that.
I have added the functionality of vaccinations.
So I used some mock API to create this vaccination external service. We're going to asynchronously get the data about the pet vaccines from that service, and we can see we created this adapter that does all of the work of retrieving the object. We've also extended the schema, adding a table to the database to keep the vaccination data.
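Just to make the shape of this concrete, here is a hypothetical sketch of what such an adapter might look like; the class name, endpoint, and types below are my own illustration, not the actual code of the modified Pet Clinic project.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

// Illustrative adapter: fetches vaccination data for a pet from an
// external (mock) API without blocking the request thread.
public class VaccinationAdapter {

    private final HttpClient client = HttpClient.newHttpClient();
    private final String baseUrl; // e.g. the mock API's base URL

    public VaccinationAdapter(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    public CompletableFuture<String> fetchVaccinesJson(int petId) {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/vaccines?petId=" + petId))
                .GET()
                .build();
        // Asynchronous call; the caller decides when (and whether) to join.
        return client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                .thenApply(HttpResponse::body);
    }
}
```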
And after working on this feature for a few hours, I'm kind of proud to say that it seems to be working. Let's take a look at what it is. So you may notice some text here. This is because I've actually enabled OpenTelemetry for this project. Again, if you want to see how to do that, go to continuousfeedback.org. This is where I have all of the OpenTelemetry 101 things that I'm not touching on in this talk. I've set up
all of the links so you can very easily follow.
And let's take a look at what this Pet Clinic project
looks like. So I've
started my app. I can see pretty
straightforward interface. My own
small contribution was to add the vaccination
status. And that means when I look at a specific owner,
I can see whether they need a vaccine for their pets, which is awesome.
And I can also add a new pet,
which now will also include
the vaccination data or kind of trigger a
call to this vaccination service to get the vaccine information for
this pet. So let's add Lucky the dog. Whoa, this looks like an issue. Oh, there is already a Lucky. So let's call him Lucky Two. Perfect. And we've
added this dog. I can assure you that in the background we're
running some processes to get the vaccination status
for this pet. Pretty neat.
Now the question is, again,
to go back to what we were talking about before.
What do I actually know about this code? So I
can check via tests. And indeed I'm using Testcontainers here. I wrote some integration tests to test this very scenario. I can show you.
So, for example, we have some tests here that
check that the owner is created. There are kind of end to end tests
and they get all of the data, kind of
making sure it's persisted in the database, making sure it's rendered in the view.
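A rough sketch of the kind of Testcontainers-backed integration test being described (the test class, SQL, and table names are illustrative assumptions, not the project's actual test code):

```java
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.test.context.DynamicPropertyRegistry;
import org.springframework.test.context.DynamicPropertySource;
import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;

import static org.assertj.core.api.Assertions.assertThat;

@SpringBootTest
@Testcontainers
class OwnerCreationIntegrationTest {

    // Throwaway Postgres instance, started just for this test run.
    @Container
    static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:15");

    @DynamicPropertySource
    static void dbProps(DynamicPropertyRegistry registry) {
        registry.add("spring.datasource.url", postgres::getJdbcUrl);
        registry.add("spring.datasource.username", postgres::getUsername);
        registry.add("spring.datasource.password", postgres::getPassword);
    }

    @Autowired
    JdbcTemplate jdbc;

    @Test
    void ownerIsPersistedEndToEnd() {
        // In the real tests this would drive the controller or service layer;
        // here we only check that the data ends up in the database.
        jdbc.update("insert into owners(first_name, last_name) values (?, ?)",
                "George", "Franklin");
        Integer count = jdbc.queryForObject("select count(*) from owners", Integer.class);
        assertThat(count).isGreaterThan(0);
    }
}
```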
We've kind of done our due diligence about the
testing part here. This code was also reviewed.
So all in all, I would say that I've
kind of reached the limit of what I'm able to do as a developer.
And to go to the previous examples we were discussing,
this is pretty much the limit of what any developer would do in a similar situation. So, to recap: I've written this code.
Maybe not the best because my Java was a little rusty, but it
works. Somebody's reviewed it. I have written the
tests. I'm ready to move on to the next feature. But like we mentioned
in the previous examples, that isn't
really getting feedback about my code. So let's see how OpenTelemetry can help me, because as I mentioned, I was running OpenTelemetry in the background. So the first tool I'm going to use is an open source tool called Jaeger. And Jaeger is just a simple tool to visualize the traces that I've been collecting while debugging and running this code. So I'm going to open up the Jaeger interface. And this is just a container that's running on my machine. And immediately I can see the different actions that were carried out here. And we can see the adding of the pet, the post request, and then the getting of the pet details, all of that looks fine.
I may notice some very strange things here.
The first thing is that the number of spans, and remember, a span is like a segment inside the trace, is exceedingly high. This one is 256 spans, that one is 164 spans. That's quite a lot for a request that adds a new pet.
And we can drill in to see exactly
why that is and what was going on here.
So the first thing I may notice, and I'll make this a little bigger, but if you are familiar with ORMs, then this particular pattern will be familiar to you. We have one select statement followed by many select statements along a specific relationship. And this is usually an indication that there is a select N+1 issue. And I can go in a little deeper and actually see what the queries were. And if I take a look at that, I will see that there is indeed, because of the lazy loading of the objects, an N+1 issue in here. In dev, it's not that problematic. But you can imagine that if I had a lot of visits, the problem would escalate.
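For context, here is a minimal sketch of the pattern behind such an N+1 and one common fix; the entity and repository definitions are my own assumptions modeled on the Pet Clinic domain, not the project's exact code.

```java
import jakarta.persistence.Entity;
import jakarta.persistence.FetchType;
import jakarta.persistence.Id;
import jakarta.persistence.ManyToOne;
import jakarta.persistence.OneToMany;
import java.util.List;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;

@Entity
class Pet {
    @Id Long id;

    // Lazy collection: touching pet.visits later fires one extra select per
    // pet, which is exactly the 1 + N query pattern visible in the trace.
    @OneToMany(fetch = FetchType.LAZY, mappedBy = "pet")
    List<Visit> visits;
}

@Entity
class Visit {
    @Id Long id;
    @ManyToOne Pet pet;
}

interface PetRepository extends JpaRepository<Pet, Long> {

    // One common fix: fetch the relationship in a single query up front.
    @Query("select distinct p from Pet p left join fetch p.visits")
    List<Pet> findAllWithVisits();
}
```

Whether eager fetching is the right fix depends on the access pattern; the point is that the trace is what surfaces the pattern in the first place.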
The next thing you may notice, and by the way, we can see the same thing around types, which seems to be also related to the issue that we saw before.
But what I completely missed out on was this amazing number of HTTP requests that are also happening here in the background. And this seems to indicate a problem in the way that I'm fetching the vaccine data. I would have to guess that this is kind of a leaky abstraction, where I'm accessing the vaccine record and not noticing that it's actually triggering an HTTP call. We can see
more of these crazy queries right around
the types table and the visits table, both of which I've highlighted as areas that are problematic.
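To illustrate the kind of leaky abstraction being guessed at here (this is a hypothetical sketch, not the actual cause found in the project): an innocent-looking getter that quietly performs an HTTP call every time it is read, so a loop over pets turns into a flood of requests.

```java
// Hypothetical illustration only: this looks like a plain value object,
// but reading the status silently hits the network every single time.
public class VaccineRecord {

    private final String petId;
    private final VaccineClient client; // thin wrapper around the external API

    public VaccineRecord(String petId, VaccineClient client) {
        this.petId = petId;
        this.client = client;
    }

    public String getStatus() {
        // Called from a template or inside a loop, this fires one HTTP request
        // per access, which is the flood of calls the trace made visible.
        return client.fetchStatus(petId);
    }

    interface VaccineClient {
        String fetchStatus(String petId);
    }
}
```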
So as you can see, I was just about ready to check in my
code, but a closer inspection of
the anatomy of a request or what is happening when my code gets handled
shows me that there were a lot of issues just hiding behind
the scenes that are very easy to surface, if you're just able to take a look at the request as a whole.
Let's take a look at the other operation that I've modified where
I'm actually looking at or rendering
the owner view as well as whether they need a vaccination.
So here again we see the familiar issues with
the select statements around visits and types,
but we also see another anti-pattern. So this is the rendering stage. We can see, by the way, all of this data without having changed our code almost at all. So all of this comes from the built-in support that all of these libraries have for OpenTelemetry. So Hibernate has, or, sorry, Spring Data has support for OpenTelemetry, so we automatically get the repository operations. Here we have the Postgres driver, the JDBC driver, actually reporting the queries. And here Spring itself is reporting that the rendering phase is happening. And then after that
we can see calls that are being made to my newly added
table, the pet vaccines. And what that
means is that in the view, we're accessing a lazy reference to an object and triggering queries during the rendering phase, which is generally considered an anti-pattern called Open Session in View.
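A minimal sketch of one common way to address this in a Spring Boot application, assuming a standard Spring Data JPA setup rather than the project's actual fix; the `pets` and `vaccines` associations on the Owner entity are my assumptions here.

```java
// One possible direction: in application.properties, disable the filter:
//   spring.jpa.open-in-view=false
// With that off, lazy references can no longer be resolved while the template
// renders, so whatever the view needs has to be fetched eagerly up front.
import java.util.Optional;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

interface OwnerRepository extends JpaRepository<Owner, Integer> {

    // Hypothetical query: load the owner together with pets and their vaccine
    // data in one go, instead of letting the view trigger extra queries.
    @Query("select o from Owner o left join fetch o.pets p left join fetch p.vaccines where o.id = :id")
    Optional<Owner> findByIdWithVaccines(@Param("id") Integer id);
}
```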
So all of these kind of new
things that I found out about my code I would categorize
as things that are pretty easy to pick up on.
They're not very complex, and you can detect them just by looking
at a single trace. Now the
question is, will this be useful?
So just to be clear, these are the easy things.
We just mentioned three or four different issues here that
are pretty easy to pick up on if you have
the expertise, if you know what you're looking for, and if
you know where to look. I'm not talking here at
all yet about these things that would
require more processing to understand. For example,
does this scale well?
Did my changes introduce a regression in terms of the system performance?
Are there any new errors that are happening here behind the scenes that I'm not
seeing? All sorts of things that are not
necessarily manifesting within a single request.
But if I look at a collection of requests, I might pick up on them. So they may not be visible here; this is just a random trace, a specific action that I've taken. But it could be that more interesting things can be picked up by looking at 1,000 or 100,000 traces and seeing the stream of data and what we can discern from it about the code changes. But for now, let's stay in simple territory. And the truth is that I
took this tool and I took these methods
and I went back to the team and I told them, here you are,
you can go and apply them. The developer who was printing to the console how long each iteration of the loop was taking can now use this, or maybe just add a few metrics. We didn't talk about metrics yet. The developer who didn't see if their code was even running within the flow can kind of check and see.
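As a small illustration of what "just add a few metrics" could look like with the OpenTelemetry metrics API (my own sketch of the idea, with illustrative names, not something from the talk's demo): record each batch iteration as a histogram measurement instead of printing raw numbers to the console.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.Meter;

public class BatchJob {

    private static final Meter meter =
            GlobalOpenTelemetry.getMeter("petclinic.batch");

    // A histogram captures the distribution of iteration times, so a drift
    // towards 70-90 ms shows up in the backend instead of as log spam.
    private static final DoubleHistogram iterationDuration =
            meter.histogramBuilder("batch.iteration.duration")
                 .setUnit("ms")
                 .build();

    public void processItem(Runnable work) {
        long start = System.nanoTime();
        work.run();
        double elapsedMs = (System.nanoTime() - start) / 1_000_000.0;
        iterationDuration.record(elapsedMs);
    }
}
```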
And the truth is that nobody used this.
And this is, I think, one of the problems with
what we have today, because sure,
collecting the data is a prerequisite,
but as you recall, the name of this
methodology is continuous feedback. It's not just
feedback, meaning what I've
just demonstrated is manual feedback. It's reactively going into an equivalent of a dashboard and drilling in to find issues, which is nice, but the reason it doesn't work is twofold. And you could kind of see that by the number of developers who actually used this tool when I tried to introduce it: two developers used it, twice. One of them used it and then saw it highlighted too many issues, many of which he did not want to fix because they were not related to him, and eventually it was more painful, more of a headache than it was worth. The other developer didn't find anything. He tried again, didn't find anything again. And then he assumed that that means there are no issues in his code. So the
problem is, how do we take this information and make
it a part of our dev workflow, make it
continuous. And I'm kind of reminded of tests, because with tests we faced the same issue back when they were manual. And I remember that phase in the DevOps cycle, where there was no DevOps cycle, and we actually had to run tests manually as preparation for each release. And no developer wanted to be the one stuck with running the test build, seeing all of the red tests because nobody was maintaining them, and then going to fix them all one by one. That was a chore that everybody wanted to avoid. And only when tests became continuous were we able to overcome that barrier, and people internalized that tests are just a part of what you do when you develop. In the same manner, the question is, how do we get this
data and make it automatic, if you will? Why do I
need to have a developer go and review
these traces and find these issues,
both the ones that are easy to detect and the ones that require me to
do some statistical modeling and anomaly detection and regressions and removing
outliers and so on, to look at a huge number of traces
when we can have some kind of an intelligent agent that does
that. And going back to the original point where we're talking about the changing
landscape, we all are witnessing the
change in generating code and how easy it is to use Copilot or ChatGPT or other technologies to generate code. But at the same time we can use the same technology
in order to analyze what our code is doing. So we
start getting some feedback back
from it and there are many commercial
tools that are moving in this direction.
Recognizing that this is possible, I'm going to
show you a library that I'm working on and that library is free
for developers, which is why I feel very comfortable showing it to you and
in fact would encourage you to let me know what you think because
it's still under development and something that I think
would benefit from a lot of developer feedback.
So going back to the original code, I'm just going to
repeat these actions that I've just performed.
So we're going to be adding a new pet, and then I want to see what that pet looks like for a specific owner. That's fine.
Now you may notice that in the IDE here,
I see a green dot pop up next
to a telescope icon, which is where I can see observability.
So to get to that point, and in fact to
kind of streamline this entire scenario, all I did was
install an IntelliJ plugin, a free one called Digma, which is the project that I'm working on. So what Digma does essentially is collect OpenTelemetry data and analyze it, and then show it back to you based on the analysis that it performs. Similar to what we just did here when we reviewed the traces, but also going beyond, to look at what happens across numerous traces. So I've installed Digma here in my IDE. It only supports Java at the moment, by the way, but we are expanding to other languages,
and immediately what you can see is exactly all of
the different actions that I
just performed. I'm going to move this a little away so that you
can see this better, so we can see all
of these actions that I've performed, the post request,
the get request and so on. And if I go
to the post request for a second and you can see by
the way that there are some issues found here,
I can start seeing a lot of things that are now plain
and obvious and I'm going to make this a little bigger so you guys can
see. But basically it's telling me that there are excessive HTTP calls, which is exactly what we saw. We can kind of see
a breakdown of what's going on. And because this is asynchronous,
we actually never noticed that this was taking too long.
But we now can see that DB queries are taking
133 milliseconds, which is quite a lot as well.
But the HTTP clients running in the background and asynchronously are
taking over 7 seconds, which is insane. Of course we
can see what the bottlenecks are, and of course it's the mock API that is
slowing me down and also the
call to get the vaccines, which is again, it makes sense.
And then we can actually go and see what's going on with
this N+1 issue. We can see the query that's causing it, we can see the affected endpoints, and we can also go and see what the trace looks like. And in that we can actually see these select statements that are repeating themselves, and we can go in a little deeper to understand exactly where in the trace it is happening. So in fact what we've done is we've inverted the pyramid.
Instead of looking at or sifting through a lot of traces and logs and metrics, trying to find issues, which is time consuming and reactive, we're doing that automatically and bringing that data to us, so that we can see in the IDE exactly what the issues are. And then if we want to go to the related traces, logs and
metrics or understand more about the impact and who's affected
by it, that's very easy and effortless to
do. At the same time we've also streamlined and removed the whole
kind of boilerplate around enabling OpenTelemetry. So once I've installed the plugin, to collect all of this data, all I need to do is click to enable observability, and
that would start collecting all of that data for me, which is neat
and nice. Now at the same time
I've been collecting information from other sources, because OpenTelemetry
is so easy to use. I also collected information from CI and
I can see some of those results here and they're also
quite intriguing. So for example here we can see
that during my performance test
I've found a scaling issue. So this specific
area of the code is actually scaling badly. And what does that mean?
It means, and this again takes this a step further, to not
only just look at the immediate suspects, which is analyzing traces
and understanding what's wrong there, but this actually looks at
kind of a whole lot of traces and
what the agent was doing here. It was just looking at concurrent calls and trying to find out whether, whenever this code was called concurrently, it exhibited a degradation in performance, and by how much.
So is it scaling linearly? Is it scaling exponentially?
Like what can we figure out here? And I know it's a
little small, but you may be able to pick that up. Basically what we're seeing is that there's a constant performance degradation of about 3 seconds per execution when we tested it
at about 31 concurrent calls. And there
is also a root cause that was identified just by studying the
traces, which is the validate owner with external service function.
And we can actually see exactly where the bad scaling is happening,
how that specific function scaling is correlated to the entire
action. We can see a trace again, or we can
go see exactly where that is happening in the code to find
out that yes, this is scaling badly and is also a bottleneck.
So all of these examples are just meant to show you where continuous feedback is taking observability: not towards its traditional role of being dashboards, pretty as they are, but towards how we can actually take that data and include it in the development cycle. So it actually gives us an opportunity to
improve our code on the one hand to catch issues earlier
in dev or even earlier in test, or earlier before
they manifest to their full degree in production
and get to that level of code ownership very similar to that developer
I was describing, the ten X developer, where
we actually know our code and how it behaves,
and we don't need to use our peripheral vision
or spider sense to get those insights.
And I think this is kind of the key takeaway,
which is there is an opportunity here that did not exist
before, in the same way that before we had containers,
we did not have the opportunity to run tests so much and
with such ease, and we had to do config management
and things like that. And today we can use immutable
infrastructure and we don't need to worry about these things. And anybody can include
integration tests very easily into their project. In the
same way, I think today the ability to both collect
data, which is what we saw with open telemetry on the one hand,
but also to analyze it, using, in this case, data science. Some of it is statistics, some of it is anomaly detection and analytics. And the different models that we're running here, which are very basic, can still provide so much value, so that we can have a different type of development process that allows us to shine as developers, to assume more responsibility over our code, and to be that kind of developer. Now, practicing continuous feedback myself, I would really appreciate your feedback.
So please do try it. It is free. If you happen to be
using Java, you can just go ahead and pick it up from the IntelliJ marketplace and let me know what you think. If you are
using a different technology or if you just want to experiment
and not use this specific tool, I would love to hear from
you as well. And that is why I created the
continuous feedback website, and you can find it, again, at continuousfeedback.org. It's just a Notion page where I set up all of these links that I want to make sure that people have access to, and it just contains a lot of useful data. It covers setting up OTel, collecting observability data, some tooling. I've even included an open source stack of
various tools that you can use, some blog posts on the subject,
my contact details, as well as more
information about the continuous feedback manifesto
that I'm trying to work out with other folks and phrase, so that we can understand what it is that we're trying to create here,
that is really new and needs a
lot of other minds to consider and think about.
Now, the last thing I want to mention is that I do have a Udemy course that covers these basics, as well as some more about how to enable OpenTelemetry and work with it, what are the things that you can detect, and what are some anti-patterns you should avoid. If you're interested in that, please email me. You can email me either at ronidover@gmail.com or my work email, which is rdovr at digma. Just email me there and I will gladly send you a coupon with free access to that Udemy
course. That's it. It was a pleasure to be hosted here on Conf 42 to talk about what is quite possibly my favorite subject in the world, and again, I would love
to hear from you and provide more information.