Transcript
Hello and welcome to my Conf 42 talk on continuous feedback and OpenTelemetry. If you're not familiar with the term continuous feedback, I promise you I didn't make it up, and I hope that by the end of this talk you will have a firm grasp of what it is and why I think it's absolutely essential for developers to master it.
Now, just a few words about me.
Most of my background is in development. I've been a developer for
over 25 years. Throughout that time
I've been pretty obsessed with how to create the right
development processes, and I've
seen a lot of situations where we're doing things pretty wrong.
And one of the things I did notice was that even when
we were following best practices and kind of developing according to the latest GitOps processes, doing continuous deployment, we were
missing something in that loop. And that particular something
is what I want to discuss today.
Now, I've been a developer for 25
years, and one thing I can say is that coding
practices really changed throughout that time.
So for those of you who are old enough to remember, this was our GitOps process back in the day. Like, we basically took whatever release we managed to stabilize, we burned it onto a CD and we gave it to the folks across the hall called the QA department, and they would test it, sometimes
give us back some errors to fix, and eventually this would get
deployed. Now, obviously this
isn't the case anymore, and as of yesterday at least,
developers have much more involvement in how to test
and validate their code. So your job as a developer
doesn't end when you finish the coding part.
Usually you at least write some unit tests, integration tests perhaps. And then many developers are also very involved in how their code gets deployed. So that means that to be a kind of a full-stack, whole-product developer, you need to worry about Helm files and Terraform as well.
But the question remains, and I'm going to
leave that out there for a second and we'll come back to that point.
And this is a really pivotal point, which is what happens
next. So what happens after you deploy
to production? And I mentioned before that I, or we, adopted the latest and greatest dev practices: we're doing continuous deployment, we're releasing pretty fast.
But if there is no kind of continuation to this
process, it seems that all we were doing, in fact, is actually just
throwing features over the fence, maybe at a higher rate
or a higher velocity. Because what
really strikes me about this particular diagram
is that it's pretty linear.
And if you recall from all of what we've been taught about how we learned our craft, development is supposed to be nonlinear. It's supposed to be kind of like
a loop, right? So where is the loop in this straight line
if all you're doing is just following these
stages and moving on? But we'll get back to that.
However, all of this was happening yesterday, and I think today there are a lot of new tools that we can leverage to actually make us even more productive. So definitely code generation tools are helping me, helping us, create code faster. Oftentimes we create pieces of code with Copilot, and if we're lucky enough, then we don't just delete them because they're not good enough and rewrite them. But sometimes there are actually things we can use, or at least we
get a working model. But at the same
time, it looks like we're just beginning to scratch the
surface about what these same AI tools can do
to make information about our code more accessible.
So if you think about it this way, it's kind of input and output.
So the output is the code that you write. That's fine, we're doing that much
faster. But how are we improving on
actually analyzing what the code does?
How are we improving on actually learning from our
code? And the question is, what do we know about
our code, really? Now this
may seem to be a very dumb question. Like, what do you mean,
what do I know about my code? I wrote it, I compiled
it. If it's a language that requires compilation,
I debugged it, I ran my tests.
Is that enough? That's a really good question.
Right? Because in the spectrum of what is
enough to know about your code, you can
kind of put knowing that your code
works or makes the world better or makes your software better,
or doesn't cause any regressions to be something on one
end of the spectrum, which may be requiring a little bit more analysis.
But at a very basic level, it can also mean, do I know
if my code even runs?
So I'll give you an example. I was working on one
project where one of the developers wrote this meticulous
piece of code that is a complete refactoring of the data access layer.
And he completed this piece of code. He did
a code review, everybody looked at it. He wrote
the test that validated that it's working as expected.
Then he pushed that into production.
And eventually, and when I say eventually, I mean like a month later,
we find out that it never ran due to some bad if statement.
So on the spectrum of what do I know about my code, sometimes it doesn't even go beyond: do I know whether my code runs? And I actually confronted that developer and asked him, why didn't you check? You actually wrote this great piece of code, you rolled it into production. Why didn't you check that your code is running? And he gave me an answer that was very hard for me to challenge. And his answer was,
well, how? And I
think this is something that is in the very
center of this talk about continuous feedback,
because the how is really
a missing piece here. So if the developer asks me
how do I actually test that my code is running
and I don't have a good answer for him,
then that's a problem.
And what adds to that problem is the fact that what that
developer did right after he rolled that feature into
production is just to take the next feature from
the backlog and continue on.
And it's no fault of his own, because we as software organizations are pretty forward-leaning. I was a product manager at one stage, and as a product manager I kind of contributed to this problem, because I put a lot of pressure on engineers to move forward. Like, I don't think any product manager that I've met in the world has a roadmap that says, okay, roll out the user management feature, and then another box that says get technical feedback about whether it's scalable. No, nobody has that, right?
So when a developer tells me he's done, I'm already putting pressure
on how do we get to the next stage?
And this creates a gap both in
the code itself and in what
we know about it. And when we don't understand our
code, when we start pushing features into production
and only discovering later that they're
lacking, or that we didn't quite get things right,
then we end up troubleshooting. When we troubleshoot a problem in production in the middle of the night, it's because we don't know how our code works and performs.
Now this is a problem.
And in some teams this escalates to the
point where it's actually interfering with regular work.
I worked in one team where we were actually using the
phrase BDD, and I don't mean behavior
driven design, I mean bug driven development, because all we were doing was actually chasing the bugs that we were introducing by just quickly
throwing features over the fence, or in this case rolling in
features without getting any feedback about how
that actually works, about whether our code
is good or bad. And this
feedback doesn't need to exist in production.
We can actually get a lot of feedback from our code much earlier. So I will give you an example
of that. And this is, well, I'm the last person to believe a ten X engineer exists. I actually think engineers are very versatile, and usually they have different traits that mesh together to create maybe a ten X team when you have the right combination of skills. But in this instance, I was working with a team of about ten people, and one of the developers really stood out to me as a ten X engineer, and he did something amazing.
He also did something terrible that I wouldn't actually recommend anybody
to do. And here's the story.
He was working on a feature, he was working on
some kind of a refactoring
and adding features to a batch job that was
doing data processing. And as he was working
on that, he decided to do something strange.
And I witnessed that, or I was kind of exposed to that, when I was looking at the logs and I started seeing rows upon rows upon rows of numbers appear. It was completely spamming the application output, and it was something like 30, 40, 60, 40; it kind of looked like the screen from The Matrix. And obviously I confronted him about it and asked, like, why are you generating so many numbers to the output? And he said two things. First of all, that he was sorry, that he had checked it, and that he was working locally on some customer database, or customer-like database. And here is what he was
doing. He was actually,
as he was working, keeping track of
how much time each iteration of
this batch job took. And somehow in
his peripheral vision, in his kind of spider sense
that he had developed, he was able to discern whether the numbers were going a little crazy, whether there were too many seventies and eighties or nineties and too few of the values he expected, in milliseconds. And whenever that happened,
he knew that the code that he was adding all the
time into the code base and adding into
the loop was actually degrading the application
performance. Now, first of all,
I do not recommend this practice, but second,
I think that this is a great thing
that he did, because unlike the other ten developers
in the team, he actually cared enough. He had
that sense of ownership in order
to check, to validate, to continuously
try to get feedback about what
he was trying to do. And in
fact, if we kind of examined where we
have feedback and where we don't have feedback, so sure when
we develop, we get some feedback because we can debug our code and
we can see it working. But even that is kind of limited
because if you've ever worked with a debugger, then you introduce a breakpoint, and it's kind of like an instant in time and space where you've captured a specific state of the system. It doesn't tell you anything about how it will behave the minute after you resume from your breakpoint. And the same goes for tests. Tests offer very limited pass/fail results,
but they don't really tell you anything about how the system
behaved during the test, which is maybe the more interesting part.
Now, if you talk about production feedback, then that's where developers
are clueless. Like, there's so many observability systems and so many
instances where I've seen developers reaping none of those benefits at all. And it's through no fault of their own. I think for generations, software generations, APM software kind of evolved towards IT, SREs, DevOps, and less towards developers and their code. I was actually visiting one company in Israel where I was talking to the VP of engineering, and he was showing me these amazing observability dashboards they had, and I was actually awestruck by that. I told him, look,
you guys are way ahead of the curve, I need to learn from you.
And he said, well, I'm not so sure because I
just had an interview or a review with one
of the engineers on the team, and what came out was that the engineer actually believed these dashboards were just screenshots.
So it seemed that for some organizations, for some engineers,
the point of observability is to
create dashboards or to support production monitoring
rather than integrate them as feedback towards the developers.
And I was kind of surprised by that and
kind of not, because at the same time I've been feeling the same pain again
and again in all of those examples: from continuous deployment lacking that one piece of feedback, through developers not knowing if their code was even running, to developers resorting to very strange solutions just to be able to capture what their code is doing even while in dev.
So I went back to kind of the source, the model we always like to go back to. And I have to be frank, by now,
this analogy of the DevOps loop has completely worn itself out.
I think we need to find a better one. But for lack of one,
let's stick with this model,
ancient as it is. And if we look at the DevOps
loop, we can notice something really alarming about
it because, and by the way, this is just an image.
Like, I think it's one of the first images that you get if you search
for DevOps loop on Google Images. And one thing
you'll notice here is that there's so many tools attached to
each stage of the feedback loop.
So we have plan, build, CI, deployment, all of that. Sorry, the dev loop has a lot of tools associated with it. But smack in the middle of this diagram, we have this segment that says continuous feedback.
As I mentioned in the beginning of the talk, I didn't make
this up. It's not something that I came up with. It was there all
along. But you will notice an alarming lack of any tools associated with it, except, for some reason, for Salesforce, which of course has nothing to do with continuous feedback from operations to planning. Well, maybe it does if you look at it more from a product manager's perspective; from an engineering perspective, it's not related. So we have this
hole in the middle of our DevOps loop that is
actually causing all of these different symptoms. And the
question is, how do we actually solve it? How do we actually
bridge this gap? And going
back to these two examples that I mentioned,
the developer that didn't know if his code was
even running, and evidently it didn't,
and the developer that had to use these primitive
tools to actually try to use a spider sense to
discern if something was off as he was writing the
code, what is the actual
technology that can help us solve this? And to go back
to the way practices are changing, we talked about
how we can use AI to generate code. What is the equivalent
that we can use here? So this
is a perfect segue to start discussing
OpenTelemetry and why it is important. So one thing I would say about OpenTelemetry is that it is not revolutionary. It is more evolutionary. It is the merging together of two standards, OpenCensus and OpenTracing. It didn't
bring anything groundbreaking in terms of the technology,
but it did bring something else that's really groundbreaking to
the software development industry,
and that is that everybody agrees on it, because that didn't exist.
Never before has there been such a gold standard for
observability. Why is that important?
Because it's open. Because it's
accepted by everyone. And that means that all of the
libraries that you all are using, and it doesn't matter if you're using Java or Python or Go. It doesn't matter if you're using FastAPI, or Echo or Mux in Go, or .NET with ASP.NET MVC, or Spring in Java, if you're using Kafka or RabbitMQ, if you're using Postgres databases or MongoDB. All of these libraries have already introduced support for OpenTelemetry
because it was a simple choice to support that standard.
But why is that important? Because that means that all
of the data that you need to make all of these
decisions to understand how the code is working, to know that it's
working correctly, to know that you haven't introduced a regression in
dev test and prod, all of that data is already
there. It's just a matter of flipping a switch, turning the light on, because the minute you activate OpenTelemetry, and we'll take a look at an example in this talk which is based on Java, you already have access to a lot of data. Now I've
been in many organizations where yeah, all of these things were very
important to us and we treated it like a project. Yeah, we have this observability
project and we needed to prioritize it and we needed to create backlog
items for it, and it was always less urgent
than other things and the technical debt just kept piling up.
This completely obliterates that problem, because we no longer
have this issue where we need to do something. You don't
need to make any code changes in some languages to just activate it.
And the problem may be transformed from having too
little data to having too much data. But we'll talk about how to
solve that, because having too much data is an easier problem
to solve. So in the context
of today's talk, I'll be talking mostly about tracing.
OpenTelemetry does offer other integrations with
metrics and logging, and the ability to actually
triangulate these three. But in this case
I'll talk about tracing because I think it's a much undervalued
and underused technology that developers can reap a lot of benefit
from. So to maybe
start off, I want to explain what tracing is
and why it is so important. Apologies if
this is already familiar to you. So just a refresher about traces.
So traces are essentially an anatomy of a request. They tell
us what our code is doing. If I
look at an API call, in this case, this is a Spring Boot application. I have an API built with Java Spring.
It does a lot of things. It then goes and makes an external call
to an API, and it uses Hibernate to use a database driver to talk to Postgres. All of these things happen when a request gets handled. And this can extend to other microservices. This is why it's
called distributed tracing. So we can follow along as the
request gets handled by multiple microservices.
But this entire kind of flow we call a trace.
And another piece of terminology that we should be
familiar with is a span. And a span is kind of
an activity that happens within that trace.
In fact, if you're using .NET, then activity is the terminology they chose for it. It's synonymous with span, for some reason.
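To make the terminology concrete, here is a minimal sketch of creating a span by hand with the OpenTelemetry Java API. The tracer and span names are illustrative assumptions, not taken from the demo project.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class VaccinationLookup {

    // Tracer names are arbitrary; this one is just for illustration.
    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("petclinic.vaccination");

    public void fetchVaccines(String petId) {
        // One span = one unit of work inside the overall trace.
        Span span = tracer.spanBuilder("fetch-vaccines").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            // ... call the external vaccination service here ...
            span.setAttribute("pet.id", petId);
        } catch (Exception e) {
            span.recordException(e);
            throw e;
        } finally {
            span.end(); // the span's duration becomes one segment in the trace
        }
    }
}
```

In practice, the auto-instrumentation we'll enable in a moment creates most of these spans for you; the manual API is only needed for custom segments.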
So in this case I would look at the request anatomy
and say okay, so this is what happens when my code gets
executed. It first handles the HTTP request, and it might have some checking of permissions, validating with other sources. Maybe some queries are running.
Each of these is a segment in the trace and this kind
of gives me a breakdown of
exactly what occurred when my code was invoked.
This is exactly what would tell me whether, when somebody was running my new refactoring of the data access layer, my new code was actually triggered.
And this is an example trace. We'll look at an actual code
example in a sec, but this is what a trace looks like. We have
the different segments or different spans
that happen throughout that request handling. We have things like
queries and we'll actually see that we're able to go down and see
all the way to the actual query that was running and
we have the different processes where we can identify where the issues are
or what is taking so long.
Enabling instrumentation is very easy. Again, this is a Java example; you can easily find examples for other languages. I like the Java implementation of OpenTelemetry because it doesn't require any code changes. If you want to run it in dev or test, you can just use a Java agent. So you download the agent, you just add an environment variable, as you can see in the box above here. And immediately you start
seeing information about your code. Then if you
want to track specific functions you can just add an annotation
and that will start tracking how that piece of code behaves.
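As a rough sketch of that flow (the exact JVM flags and the annotation dependency are my assumptions based on the standard OpenTelemetry Java setup, not the slide contents):

```java
// Run the service with the agent attached, no code changes required, e.g.:
//   JAVA_TOOL_OPTIONS="-javaagent:/path/to/opentelemetry-javaagent.jar"
//   OTEL_SERVICE_NAME=petclinic
//   OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
//
// To track a specific function, add the annotation from the
// opentelemetry-instrumentation-annotations artifact:
import io.opentelemetry.instrumentation.annotations.WithSpan;

public class VaccinationService {

    @WithSpan // the agent wraps this method in its own span automatically
    public VaccinationStatus checkVaccinationStatus(String petId) {
        // ... existing business logic; no other telemetry code needed ...
        return VaccinationStatus.UP_TO_DATE;
    }

    // Hypothetical return type, just to keep the sketch self-contained.
    enum VaccinationStatus { UP_TO_DATE, DUE }
}
```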
All of that will be made clear in a sec when we look at the
example, but I just want to make it
clear that my goal in this session is
not to show you how to activate telemetry or
observability in your code.
There are way too many talks
and guides on how to do that and I feel
I don't really have anything useful to add to that.
I do think that where many of these guides
are lacking or what we are missing is more information
about how to actually use it. When we're developing,
it's one thing to collect information, but then we're just left with a lot of pretty dashboards. The situation we want to get to is one where we actually generate data and then we can use it as we code.
So I'm going to show you a quick example in my IDE
and then we can take it from there and see kind of where does
that leave us? And is that enough? What else do we need to actually make
the process better? In the course of this demo I
will use Jaeger, which is an open source tool.
And in fact everything that will be on
this talk will be either open source or free tooling.
So there are no commercial tools that I'll use
in this talk that actually require you to pay money to use.
We may have some time to talk about the Grafana and Prometheus open source versions, we may not. And I'll also show you a library that I'm working on called Digma, which is also a free library that you can use
as developers to maybe make use of that data.
So without further ado, let's open the IDE and stop
looking at slideware and maybe start looking at code for a change.
So let's take a look at some code to demonstrate how
we can really manage this feat of actually using observability when we code. I've opened a project familiar to every Java developer out there, called the Pet Clinic project. It's kind of a go-to sample for the Spring Boot framework, and
to make things interesting and also to kind of
simulate some kind of a realistic scenario,
I've actually added some functionality to it. So this
is a pet shop that allows you to see
pets and owners and visits and things like that.
I have added the functionality of vaccinations.
So I used some mock API to create this vaccination external service. We're going to asynchronously get the data about the pet vaccines from that service, and we can see we created this adapter that does all of the work of retrieving the object. We've also extended the schema, adding a table to the database to keep the vaccination data.
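Just to make the shape of this concrete, here is a hypothetical sketch of what such an adapter might look like; the class name, endpoint, and types below are my own illustration, not the actual code of the modified Pet Clinic project.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

// Illustrative adapter: fetches vaccination data for a pet from an
// external (mock) API without blocking the request thread.
public class VaccinationAdapter {

    private final HttpClient client = HttpClient.newHttpClient();
    private final String baseUrl; // e.g. the mock API's base URL

    public VaccinationAdapter(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    public CompletableFuture<String> fetchVaccinesJson(int petId) {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/vaccines?petId=" + petId))
                .GET()
                .build();
        // Asynchronous call; the caller decides when (and whether) to join.
        return client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                .thenApply(HttpResponse::body);
    }
}
```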
And after working on this feature for a few hours, I'm kind of proud to say that it seems to be working. Let's take a look at what it is. So you may notice some text here. This is because I've actually enabled OpenTelemetry for this project. Again, if you want to see how to do that, go to continuousfeedback.org. This is where I have all of the OpenTelemetry 101 things that I'm not touching on in this talk. I've set up
all of the links so you can very easily follow.
And let's take a look at what this Pet Clinic project
looks like. So I've
started my app. I can see pretty
straightforward interface. My own
small contribution was to add the vaccination
status. And that means when I look at a specific owner,
I can see whether they need a vaccine for their pets, which is awesome.
And I can also add a new pet,
which now will also include
the vaccination data or kind of trigger a
call to this vaccination service to get the vaccine information for
this pet. So let's add Lucky the dog. Whoa, this looks like an issue. Oh, there is already a Lucky. So let's call him Lucky Two. Perfect. And we've
added this dog. I can assure you that in the background we're
running some processes to get the vaccination status
for this pet. Pretty neat.
Now the question is, again,
to go back to what we were talking about before.
What do I actually know about this code? So I
can check via tests. And indeed I'm using Testcontainers here. I wrote some integration tests to test this very scenario. I can show you.
So, for example, we have some tests here that
check that the owner is created. There are kind of end to end tests
and they get all of the data, kind of
making sure it's persisted in the database, making sure it's rendered in the view.
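A rough sketch of the kind of Testcontainers-backed integration test being described (the test class, SQL, and table names are illustrative assumptions, not the project's actual test code):

```java
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.test.context.DynamicPropertyRegistry;
import org.springframework.test.context.DynamicPropertySource;
import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;

import static org.assertj.core.api.Assertions.assertThat;

@SpringBootTest
@Testcontainers
class OwnerCreationIntegrationTest {

    // Throwaway Postgres instance, started just for this test run.
    @Container
    static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:15");

    @DynamicPropertySource
    static void dbProps(DynamicPropertyRegistry registry) {
        registry.add("spring.datasource.url", postgres::getJdbcUrl);
        registry.add("spring.datasource.username", postgres::getUsername);
        registry.add("spring.datasource.password", postgres::getPassword);
    }

    @Autowired
    JdbcTemplate jdbc;

    @Test
    void ownerIsPersistedEndToEnd() {
        // In the real tests this would drive the controller or service layer;
        // here we only check that the data ends up in the database.
        jdbc.update("insert into owners(first_name, last_name) values (?, ?)",
                "George", "Franklin");
        Integer count = jdbc.queryForObject("select count(*) from owners", Integer.class);
        assertThat(count).isGreaterThan(0);
    }
}
```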
We've kind of done our due diligence about the
testing part here. This code was also reviewed.
So all in all, I would say that I've
kind of reached the limit of what I'm able to do as a developer.
And to go to the previous examples we were discussing,
this is pretty much the limit of what any developer would do in a similar situation. So, to recap: I've written this code.
Maybe not the best because my Java was a little rusty, but it
works. Somebody's reviewed it. I have written the
tests. I'm ready to move on to the next feature. But like we mentioned
in the previous examples, that isn't
really getting feedback about my code. So let's see how OpenTelemetry can help me, because as I mentioned, I was running OpenTelemetry in the background. So the first tool I'm going to use is an open source tool called Jaeger. And Jaeger is just a simple tool to visualize the traces that I've been collecting while debugging and running this code. So I'm going to open up the Jaeger interface. And this is just a container that's running on my machine. And immediately I can see the different actions that were carried out here. And we can see the adding of the pet, the post request, and then the getting of the pet details, all of that looks fine.
I may notice some very strange things here.
The first thing is that the number of spans, and remember, a span is like a segment inside the trace, is exceedingly high. This one is 256 spans, that one is 164 spans. That's quite a lot for a request that adds a new pet.
And we can drill in to see exactly
why that is and what was going on here.
So the first thing I may notice, and I'll make this a little bigger, but if you are familiar with ORMs, then this particular pattern will be familiar to you. We have one select statement followed by many select statements along a specific relationship. And this is usually an indication that there is a select N+1 issue. And I can go in a little deeper and actually see what the queries were. And if I take a look at that, I will see that there is indeed, because of the lazy loading of the objects, an N+1 issue in here. In dev, it's not that problematic. But you can imagine that if I had a lot of visits, the problem would escalate.
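For context, here is a minimal sketch of the pattern behind such an N+1 and one common fix; the entity and repository definitions are my own assumptions modeled on the Pet Clinic domain, not the project's exact code.

```java
import jakarta.persistence.Entity;
import jakarta.persistence.FetchType;
import jakarta.persistence.Id;
import jakarta.persistence.ManyToOne;
import jakarta.persistence.OneToMany;
import java.util.List;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;

@Entity
class Pet {
    @Id Long id;

    // Lazy collection: touching pet.visits later fires one extra select per
    // pet, which is exactly the 1 + N query pattern visible in the trace.
    @OneToMany(fetch = FetchType.LAZY, mappedBy = "pet")
    List<Visit> visits;
}

@Entity
class Visit {
    @Id Long id;
    @ManyToOne Pet pet;
}

interface PetRepository extends JpaRepository<Pet, Long> {

    // One common fix: fetch the relationship in a single query up front.
    @Query("select distinct p from Pet p left join fetch p.visits")
    List<Pet> findAllWithVisits();
}
```

Whether eager fetching is the right fix depends on the access pattern; the point is that the trace is what surfaces the pattern in the first place.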
The next thing you may notice, and by the way, we can see the same thing around types, which seems to be also related to the issue that we saw before.
But what I completely missed out on was this amazing number of HTTP requests that are also happening here in the background. And this seems to indicate a problem in the way that I'm fetching the vaccine data. I would have to guess that this is kind of a leaky abstraction, where I'm accessing the vaccine record and not noticing that it's actually triggering an HTTP call. We can see
more of these crazy queries right around
the types table and the visits table, both of which I've highlighted as areas that are problematic.
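To illustrate the kind of leaky abstraction being guessed at here (this is a hypothetical sketch, not the actual cause found in the project): an innocent-looking getter that quietly performs an HTTP call every time it is read, so a loop over pets turns into a flood of requests.

```java
// Hypothetical illustration only: this looks like a plain value object,
// but reading the status silently hits the network every single time.
public class VaccineRecord {

    private final String petId;
    private final VaccineClient client; // thin wrapper around the external API

    public VaccineRecord(String petId, VaccineClient client) {
        this.petId = petId;
        this.client = client;
    }

    public String getStatus() {
        // Called from a template or inside a loop, this fires one HTTP request
        // per access, which is the flood of calls the trace made visible.
        return client.fetchStatus(petId);
    }

    interface VaccineClient {
        String fetchStatus(String petId);
    }
}
```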
So as you can see, I was just about ready to check in my
code, but a closer inspection of
the anatomy of a request or what is happening when my code gets handled
shows me that there were a lot of issues just hiding behind
the scenes that are very easy to surface, if you're just able to take a look at the request as a whole.
Let's take a look at the other operation that I've modified where
I'm actually looking at or rendering
the owner view as well as whether they need a vaccination.
So here again we see the familiar issues with
the select statements around visits and types,
but we also see another anti-pattern. So this is the rendering stage. We can see, by the way, all of this data without having changed our code almost at all. So all of this comes from the built-in support that all of these libraries have for OpenTelemetry. So Hibernate has, or, sorry, Spring Data has support for OpenTelemetry, so we automatically get the repository operations. Here we have the Postgres driver, the JDBC driver, actually reporting the queries. And here Spring itself is reporting that the rendering phase is happening. And then after that
we can see calls that are being made to my newly added
table, the pet vaccines. And what that
means is that in the view, we're accessing a lazy reference to an object and triggering queries during the rendering phase, which is generally considered an anti-pattern called Open Session in View.
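A minimal sketch of one common way to address this in a Spring Boot application, assuming a standard Spring Data JPA setup rather than the project's actual fix; the `pets` and `vaccines` associations on the Owner entity are my assumptions here.

```java
// One possible direction: in application.properties, disable the filter:
//   spring.jpa.open-in-view=false
// With that off, lazy references can no longer be resolved while the template
// renders, so whatever the view needs has to be fetched eagerly up front.
import java.util.Optional;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

interface OwnerRepository extends JpaRepository<Owner, Integer> {

    // Hypothetical query: load the owner together with pets and their vaccine
    // data in one go, instead of letting the view trigger extra queries.
    @Query("select o from Owner o left join fetch o.pets p left join fetch p.vaccines where o.id = :id")
    Optional<Owner> findByIdWithVaccines(@Param("id") Integer id);
}
```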
So all of these kind of new
things that I found out about my code I would categorize
as things that are pretty easy to pick up on.
They're not very complex, and you can detect them just by looking
at a single trace. Now the
question is, will this be useful?
So just to be clear, these are the easy things.
We just mentioned three or four different issues here that
are pretty easy to pick up on if you have
the expertise, if you know what you're looking for, and if
you know where to look. I'm not talking here at
all yet about these things that would
require more processing to understand. For example,
does this scale well?
Did my changes introduce a regression in terms of the system performance?
Are there any new errors that are happening here behind the scenes that I'm not
seeing? All sorts of things that are not
necessarily manifesting within a single request.
But if I look at a collection of requests, I might pick up on them. So they may not be visible here; this is just a random trace, a specific action that I've taken. But it could be that more interesting things can be picked up by looking at 1,000 or 100,000 traces and seeing the stream of data and what we can discern from it about the code changes. But for now, let's stay in simple territory. And the truth is that I
took this tool and I took these methods
and I went back to the team and I told them, here you are,
you can go and apply them. The developer who was printing to the console how long each iteration of the loop was taking can now use this, or maybe just add a few metrics. We didn't talk about metrics yet. The developer who didn't see if their code was even running within the flow can kind of check and see.
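As a small illustration of what "just add a few metrics" could look like with the OpenTelemetry metrics API (my own sketch of the idea, with illustrative names, not something from the talk's demo): record each batch iteration as a histogram measurement instead of printing raw numbers to the console.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.Meter;

public class BatchJob {

    private static final Meter meter =
            GlobalOpenTelemetry.getMeter("petclinic.batch");

    // A histogram captures the distribution of iteration times, so a drift
    // towards 70-90 ms shows up in the backend instead of as log spam.
    private static final DoubleHistogram iterationDuration =
            meter.histogramBuilder("batch.iteration.duration")
                 .setUnit("ms")
                 .build();

    public void processItem(Runnable work) {
        long start = System.nanoTime();
        work.run();
        double elapsedMs = (System.nanoTime() - start) / 1_000_000.0;
        iterationDuration.record(elapsedMs);
    }
}
```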
And the truth is that nobody used this.
And this is, I think, one of the problems with
what we have today, because sure,
collecting the data is a prerequisite,
but as you recall, the name of this
methodology is continuous feedback. It's not just
feedback, meaning what I've
just demonstrated is manual feedback. It's reactively going into an equivalent of a dashboard and drilling in to find issues, which is nice, but the reason it doesn't work is twofold. And you could kind of see that by the number of developers who actually used this tool when I tried to introduce it: two developers used it, twice. One of them used it and then saw it highlighted too many issues, many of which he did not want to fix because they were not related to him, and eventually it was more painful, more of a headache than it was worth. The other developer didn't find anything. He tried again, didn't find anything again. And then he assumed that that means there are no issues in his code. So the
problem is, how do we take this information and make
it a part of our dev workflow, make it
continuous. And I'm kind of reminded of tests, because with tests we faced the same issue back when they were manual. And I remember that phase in the DevOps cycle, where there was no DevOps cycle, and we actually had to run tests manually as preparation for each release. And no developer wanted to be the one stuck with running the test build, seeing all of the red tests because nobody was maintaining them, and then going to fix them all one by one. That was a chore that everybody wanted to avoid. And only when tests became continuous were we able to overcome that barrier, and people internalized that tests are just a part of what you do when you develop. In the same manner, the question is, how do we get this
data and make it automatic, if you will? Why do I
need to have a developer go and review
these traces and find these issues,
both the ones that are easy to detect and the ones that require me to
do some statistical modeling and anomaly detection and regressions and removing
outliers and so on, to look at a huge number of traces
when we can have some kind of an intelligent agent that does
that. And going back to the original point where we're talking about the changing
landscape, we all are witnessing the
change in generating code and how easy it is to use Copilot or ChatGPT or other technologies to generate code. But at the same time we can use the same technology
in order to analyze what our code is doing. So we
start getting some feedback back
from it and there are many commercial
tools that are moving in this direction.
Recognizing that this is possible, I'm going to
show you a library that I'm working on and that library is free
for developers, which is why I feel very comfortable showing it to you and
in fact would encourage you to let me know what you think because
it's still under development and something that I think
would benefit from a lot of developer feedback.
So going back to the original code, I'm just going to
repeat these actions that I've just performed.
So we're going to be adding a new pet, and then I want to see what that pet looks like for a specific owner. That's fine.
Now you may notice that in the IDE here,
I see a green dot pop up next
to a telescope icon, which is where I can see observability.
So to get to that point, and in fact to
kind of streamline this entire scenario, all I did was
install an IntelliJ plugin, a free one called Digma, which is the project that I'm working on. So what Digma does essentially is collect OpenTelemetry data and analyze it, and then show it back to you based on the analysis that it performs. Similar to what we just did here when we reviewed the traces, but also going beyond, to look at what happens across numerous traces. So I've installed Digma here in my IDE. It only supports Java at the moment, by the way, but we are expanding to other languages,
and immediately what you can see is exactly all of
the different actions that I
just performed. I'm going to move this a little away so that you
can see this better, so we can see all
of these actions that I've performed, the post request,
the get request and so on. And if I go
to the post request for a second and you can see by
the way that there are some issues found here,
I can start seeing a lot of things that are now plain
and obvious and I'm going to make this a little bigger so you guys can
see. But basically it's telling me that there are excessive HTTP calls, which is exactly what we saw. We can kind of see
a breakdown of what's going on. And because this is asynchronous,
we actually never noticed that this was taking too long.
But we now can see that DB queries are taking
133 milliseconds, which is quite a lot as well.
But the HTTP clients running in the background and asynchronously are
taking over 7 seconds, which is insane. Of course we
can see what the bottlenecks are, and of course it's the mock API that is
slowing me down and also the
call to get the vaccines, which is again, it makes sense.
And then we can actually go and see what's going on with
this N+1 issue. We can see the query that's causing it, we can see the affected endpoints, and we can also go and see what the trace looks like. And in that we can actually see these select statements that are repeating themselves, and we can go in a little deeper to understand exactly where in the trace it is happening. So in fact what we've done is we've inverted the pyramid.
Instead of looking at or sifting through a lot of traces and logs and metrics, trying to find issues, which is time consuming and reactive, we're doing that automatically and bringing that data to us, so that we can see in the IDE exactly what the issues are. And then if we want to go to the related traces, logs and
metrics or understand more about the impact and who's affected
by it, that's very easy and effortless to
do. At the same time we've also streamlined and removed the whole
kind of boilerplate around enabling OpenTelemetry. So once I've installed the plugin, to collect all of this data, all I need to do is click to enable observability, and
that would start collecting all of that data for me, which is neat
and nice. Now at the same time
I've been collecting information from other sources, because OpenTelemetry
is so easy to use. I also collected information from CI and
I can see some of those results here and they're also
quite intriguing. So for example here we can see
that during my performance test
I've found a scaling issue. So this specific
area of the code is actually scaling badly. And what does that mean?
It means, and this again takes this a step further, to not
only just look at the immediate suspects, which is analyzing traces
and understanding what's wrong there, but this actually looks at
kind of a whole lot of traces and
what the agent was doing here. It was just looking at concurrent calls and trying to find out whether, whenever this code was called concurrently, it exhibited a degradation in performance, and by how much.
So is it scaling linearly? Is it scaling exponentially?
Like what can we figure out here? And I know it's a
little small, but you may be able to pick that up. Basically what we're seeing is that there's a constant performance degradation of about 3 seconds per execution when we tested it
at about 31 concurrent calls. And there
is also a root cause that was identified just by studying the
traces, which is the validate owner with external service function.
And we can actually see exactly where the bad scaling is happening,
how that specific function scaling is correlated to the entire
action. We can see a trace again, or we can
go see exactly where that is happening in the code to find
out that yes, this is scaling badly and is also a bottleneck.
So all of these examples are just meant to show you where continuous feedback is taking observability: not towards its traditional role of being dashboards, pretty as they are, but towards how we can actually take that data and include it in the development cycle. So it actually gives us an opportunity to
improve our code on the one hand to catch issues earlier
in dev or even earlier in test, or earlier before
they manifest to their full degree in production
and get to that level of code ownership very similar to that developer
I was describing, the ten X developer, where
we actually know our code and how it behaves,
and we don't need to use our peripheral vision
or spider sense to get those insights.
And I think this is kind of the key takeaway,
which is there is an opportunity here that did not exist
before, in the same way that before we had containers,
we did not have the opportunity to run tests so much and
with such ease, and we had to do config management
and things like that. And today we can use immutable
infrastructure and we don't need to worry about these things. And anybody can include
integration tests very easily into their project. In the
same way, I think today the ability to both collect
data, which is what we saw with open telemetry on the one hand,
but also to analyze it, using, in this case, data science. Some of it is statistics, some of it is anomaly detection and analytics. And the different models that we're running here, which are very basic, can still provide so much value, so that we can have a different type of development process that allows us to shine as developers, to assume more responsibility over our code, and to be that kind of developer. Now, practicing continuous feedback myself, I would really appreciate your feedback.
So please do try it. It is free. If you happen to be
using Java, you can just go ahead and pick it up from the IntelliJ marketplace and let me know what you think. If you are
using a different technology or if you just want to experiment
and not use this specific tool, I would love to hear from
you as well. And that is why I created the
continuous feedback website, and you can find it, again, at continuousfeedback.org. It's just a Notion page where I set up all of these links that I want to make sure that people have access to, and it just contains a lot of useful data. It covers setting up OTel, collecting observability data, some tooling. I've even included an open source stack of
various tools that you can use, some blog posts on the subject,
my contact details, as well as more
information about the continuous feedback manifesto
that I'm trying to work out with other folks and phrase, so that we can understand what it is that we're trying to create here,
that is really new and needs a
lot of other minds to consider and think about.
Now, the last thing I want to mention is that I do have a Udemy course that covers these basics, as well as some more about how to enable OpenTelemetry and work with it, what are the things that you can detect, and what are some anti-patterns you should avoid. If you're interested in that, please email me. You can email me either at ronidover@gmail.com or my work email, which is rdovr at digma. Just email me there and I will gladly send you a coupon with free access to that Udemy
course. That's it. It was a pleasure to be hosted here on Conf 42 to talk about what is quite possibly my favorite subject in the world, and again, I would love
to hear from you and provide more information.