Conf42 Observability 2023 - Online

Use the Observability, Luke!

Abstract

Do you know what your Lambda is doing? How long it takes to connect to DynamoDB, for example? No? Do not wait, join me and learn how to make this information visible!

Summary

  • We will talk about observability for serverless. How good is your monitoring? Are you happy with it? Do you know how to improve it? Those are the questions we need to answer, right?
  • Pawel Piwosz is a DevOps Institute ambassador, AWS Community Builder and CD Foundation ambassador. He says we have fewer and fewer places where we can observe and understand what our system is doing. The belief that central logging is passé is, in his opinion, a problem which came with agile.
  • Decoupling the system (microservices, serverless) brings more and more complexity, at least in the communication patterns. We need to think not only about the communication between the components themselves, but also between the teams. To have a full structured log, we need to go into observability.
  • Three elements of observability: logs, traces and metrics. They need to be consistent throughout the whole system, constructed for automated systems and collected consistently. And, a very important element, they need to be relevant for the business.
  • Context is key, right? It's the heart of everything. I built something which I called, very creatively, meal. Meal contains four elements, not three like observability: logs, events, metrics, and also actions. And these actions can be automated.
  • We can measure cold start with Lambda Insights and see how many invocations suffer from it. Using X-Ray, we can see the impact of cold start on each invocation. To enable X-Ray for the API, we go into our API stage, Logs/Tracing, and enable X-Ray tracing.
  • AWS Lambda Powertools is an open source project from AWS which is very close to OpenTelemetry. It is ready for Python, TypeScript, Java and .NET, and is best used with AWS Lambda. "Who is monitor your monitoring server?" is a quote from DevOps Borat.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, I hope you are having a good day. Now we will talk about observability for serverless. But before I talk about how to do it with serverless in AWS, I need to introduce you to my approach to observability: what observability really is, and what we still kind of miss in it. And before we go there, let me ask you a very small but important question: how good is your monitoring? Are you happy with it? Do you know how to improve it? Do you know what you are missing there? Are you sure that your monitoring is aligned with your business needs? Those are the questions we need to answer, right?
So, my name is Pawel Piwosz and I have been working for Spacelift as a developer advocate for a couple of days now, so this is quite new for me. I am also a DevOps Institute ambassador, an AWS Community Builder and a CD Foundation ambassador. So now you know why I will talk about AWS, right?
Okay, so what is the problem? Today we have less visibility, right? How do I understand it, and how should we understand it? First of all, when we had bare metal, these racks and some servers in a rack in the data center, we could observe everything, starting from the air conditioning, through the power flow, to the end behavior of the application for every single user, right? Then we went to virtual machines. As long as we are not responsible for the virtualization platform, we do not have visibility there. So we can see everything that is above it, not below, right? Then we have containers, and we can see even less. For example, if you think about ECS, especially Fargate ECS, or just ECS, we have fewer and fewer places where we can observe and understand what our system is doing. Especially if we talk about Fargate, we know almost nothing. And finally we have serverless, and in serverless we deliver the code, we deliver the information about what the endpoint of the API is and what it should do, and that's all. We don't see anything else, right?
And why is that? Because we have fewer interactions with the systems across their whole scope, we have less access to the system, and we have more and more tools. We start to think that central logging is passé. And this is a problem which came with agile, in my opinion, especially when we say that all teams should be self-organizing, they know what they do, and that's it, right? Yes, but at the end of the day it's the company that sells the product, not the team. And if you have multiple teams working together, you see the problem if you monitor only your own component without caring about anything else, because it's not your responsibility, it's another team's responsibility. What can go wrong, right? I saw examples of this, and believe me, today I can laugh about it, but it wasn't funny that day.
Okay, and finally we have decoupling. This decoupling of the system, microservices, serverless, et cetera, is bringing more and more complexity, at least in the communication patterns, right? And here we also need to think not only about the communication between the components themselves, but also between the teams. So, Conway's law. I don't want to go deeper into this topic, but it is a really interesting one.
So what does less visibility mean? Well, exactly this, right? We have different computing models. On this picture you can see that the greens are the elements which we control and the reds are what we do not control in each of these computing models. We have a couple of approaches here. We have on-premises: simple, we control everything. We have infrastructure as a service, platform as a service, and software as a service at the end, right, where serverless lies. And here, as you can see, in this approach with software as a service, we do not control anything except code. And this brings this complexity, right? However it sounds. Okay. So first of all, we need to have, like, a cultural shift.
Why is this important? Because we cannot catch logs as we used to. So we need to create new, sometimes even more complex, more responsive ways to do that. And the first element which I need to present before we go to observability is something called structured logs. Okay, what is structured logging? This is the definition from Sumo Logic: structured logging is the practice of implementing a consistent, predetermined message format for application, blah, blah, blah, right? Generally it means that the messages from all our systems should be as informative as possible, as organized as possible, and follow the standard as much as possible. The message also has this standard, right? We will see how it looks in an example. But please remember one thing: a structured message is not a structured log. A structured log is a little bit more. We have the structured message, and around it we have structured information: why we have this message, what happened in the system, what the system is doing at this point.
So, a standard log line example. Especially for sysadmins this is very common and very familiar, right? When we want to structure the message, we can do something like that. And this is the hint on how to become senior: you just put your message into JSON and that's it. It sounds silly, but believe me, it's already a huge upgrade. Why? Because the systems which work with your logs don't need to process the message as heavily. With JSON it is very simple. You have fields, you have values of those fields, and that's it. You don't need to create patterns to search through the logs, et cetera. It makes things easier. Then we have a structured log example. It's not fully structured yet, I mean, it's not the structured log itself, but the line is pretty nice, right?
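As an illustration of the "put your message into JSON" step, here is a minimal sketch using only the Python standard library. It is not the example from the slides: the logger name and the context fields (`service`, `user_id`, `request_id`) are hypothetical and chosen only to show fields and values replacing a free-text line.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical context fields passed via `extra=`; names are illustrative.
        for key in ("service", "user_id", "request_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(name: str = "orders") -> logging.Logger:
    """Return a logger that writes JSON lines to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("payment accepted", extra={"service": "checkout", "user_id": "u-123"})
```

A log processor can then filter on `level` or `user_id` directly instead of matching patterns against free text.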
So we have information about the message and also, for example, who it is from, et cetera. It's a lot better, right? This is a good start, but it is still not a structured log. And to have a full structured log, we need to go into observability. But please remember one thing: observability is not Grafana on Kubernetes. If you have Grafana on Kubernetes, you don't have observability. You have Grafana on Kubernetes, period. Yes, it is a tool, a very nice tool, which we can use to complete the whole approach to observability. But it is only a tool, nothing more.
Okay, so what are the elements of observability? Three of them: logs, traces and metrics. What can we say about logs? This is a very common scenario: we write everything to logs, right? But we write errors only, because we must save money. How many of you are or were in this situation? The problem is that if we write errors only, we can forget about everything. Honestly speaking, and I know it sounds tough, this is really the truth in terms of observability. So what are the requirements for logs here? First of all, they need to be structured, as we mentioned. They need to be consistent throughout the whole system. They need to cover all the information needed. They need to be constructed for automated systems, because in the end we want all of these elements, all of these components, to be managed and processed by automation, not by us. We want to sleep at night, right? And they need to be collected consistently.
What about performance and metrics in general? How many of you, again, were in a situation when you were asked about the performance metrics and you said "it's great", because in fact you don't really know? Because, for example, you have only collected errors from NGINX. So what about the metrics then?
They need to be structured, they need to be consistent, they need to have context, they need to have full information. And as you have probably already started to think: hey, this is almost the same slide as the one before. Yes, you're right, it is. I'm lazy. Again, they should be constructed for automated systems and collected consistently. And a very important element: they need to be relevant for the business, because at the end of the day, the business is the one paying for your systems. Of course, we can collect them directly as metrics, or we can convert logs to metrics as well.
What about traces? Traces are the element which we use not that often, right? Logs are quite obvious; metrics, we collect them; but what about traces? If your business comes to you and says, "Hey, I have this John Doe who is claiming that the request took too long for him, and he is annoying, and he is paying a lot of money to us. Please tell me if he is right," what will you do to prove or disprove that everything is okay with your system, with your complicated microservices system, for this specific request? Probably you have a problem. So what about traces then? Of course, they need to be consistent, and they need to be collected consistently throughout the whole system. They allow us to track the request through the whole system. And we can use them as performance measures. And we have a zooming option: if we are very close to the system, we can observe one request for one specific user; if we zoom out, we can see the whole system in general.
And now, what about the context? I mentioned context, and I believe that for observability, for monitoring, for all of these aspects, context is key, right? It's the heart of everything. So tell me, please, what this is. I'll give you about five seconds to think about it. Five, four, three, two, one, zero. Those were very quick seconds.
Some of you probably thought a stone or, I don't know, maybe something else, right? But how many of you thought that it could be a planet, or a fragment of a planet? Without context you just guessed, correctly or not, but it was only a guess. And without context, you lose something, right? So what is the context in observability? First of all, logs allow us to understand the surroundings, traces allow us to understand the path, the journey, and finally, metrics allow us to understand the scale. So please don't say that there is "something in the logs", because if there is something in the logs, you should do something, and what matters is what those somethings are. So please avoid this headache. Just know what you have there.
Around this I built something which I called, very creatively, meal. And honestly speaking, I built it before observability became a buzzword. Meal contains four elements, not three like observability: logs, events, metrics, and also actions. Generally it means that in this framework of mine we don't only collect stuff, we also want to act on it automatically. Well, that is somehow in observability too, but I was first. So we have metrics, events and logs, we collect them together, and based on them we take action. And this action can be automated.
So, enough theory. We have a little bit more than ten minutes, so let's go through the tech stack which I want to show you. First of all, we have Lambda, AWS Lambda. This is the serverless compute engine from AWS. And we have one issue here which we can observe very nicely with a proper approach to observability, and which is not possible to observe without it. If we are experienced enough, we can see this issue, but we are not really able to measure it. What I mean by that is cold start. We can measure cold start with Lambda Insights.
And there we will see how many invocations suffer from this cold start. And using X-Ray, excuse me, we can see what the impact of this cold start is on each invocation. Generally, cold start looks like this: it is the time AWS needs to prepare everything for us to execute our runtime. There is also another type of cold start, but this is not a talk about cold starts. So we can have a shortened cold start. And when the Lambda is, as we call it, warm, there is no cold start at all; we just go to the execution.
In order to deploy the environment, I used the AWS Serverless Application Model (SAM), a kind of framework which is in fact CloudFormation, so infrastructure as code. It is an extension to CloudFormation. With that I created something I call the standard example, with standard logging from AWS. And this is the code of this SAM template. We have a couple of elements here. First of all, I define the resource, which is the serverless function, the Lambda function. I say where the code is, so where it needs to be deployed from; what the handler of my function is, so what will be executed first when the request comes; what the runtime, memory size and timeout are; and the event. The event means that I assign the API to my function: in order to reach my function, I need to go to this simple path with the method GET. Simple as that. And this is 20 lines, right?
So what do we have after that? We have the Lambda function, which is created in AWS. Very nice. I'm sorry, that was the API Gateway; it was also created for us, with the proper endpoint, slash, like I said. Then we have our Lambda function. And when we execute it and I want to go into the metrics, measurements, logs, et cetera, I will see something like that. Nice. I see that my API Gateway was triggered. Good. What more? I see that my Lambda function was triggered, and I have some information here.
How many invocations, what the duration is. But why is it about 2.2 here and a little more than 1.5 there? It doesn't say anything about that. So maybe the logs? Those three elements opened here show us all the information which we have by default from a Lambda execution. Very informative. All right, so we can agree, I believe, that it's useless, or very close to useless, right?
So how can we improve this? First of all, we will enable X-Ray for our Lambda. We need to go into Configuration, Monitoring and operations tools, click Edit, and just enable X-Ray tracing. With one click we become a regular engineer. Then we can do the same thing for the API: we go into our API stage, Logs/Tracing, and enable X-Ray tracing. Two clicks and we are a regular engineer, and we have something like that. We can see here the request path, the response distributions, et cetera. Nice, very nice. And we also have information like this: we have the traces, we have the information about all our executions. Please don't look at the last one here, because, as you can see, there is no GET, which means that this execution was done without the API. We are interested in all of those which were run through the API. So we have the fastest execution around 60 milliseconds, and the slowest 28 milliseconds. Why, for the same execution, I mean the same function? Let's try to find out. We go to the traces. Now we are in the trace which is the longest, and we see a kind of gap here. And when we go into the shortest one, we see this gap as well. It's a little bit shorter, but again, it doesn't say anything, right? We have the invocation somewhere here. Is this a cold start or not? What happened here? I know, because I have worked with Lambda for, I don't know, eight years or something. But not everyone knows. So what can we do? We can enable Lambda Insights.
Again we go into the same configuration for the Lambda and enable enhanced monitoring, and after some time we will get another screen, another, let's say, board. There we can see more detailed information: the memory used, the CPU time used, the network I/O, but also the duration and the init duration. And the init duration we can understand as the cold start. So these invocations suffered from a cold start, and those did not.
We can also enable tracing and logging in the API. We go again to the same config screen in the API and enable all of them: enable CloudWatch Logs, et cetera. Additionally, what I suggest is to add a log format for your logs from the API. It's called custom. We enable it by clicking this tick here, and we put this in. And now the hint on how to become a senior engineer: this was filled in by clicking the JSON button here. Nice one, right? I added only one thing here, the trace ID, just to be able to see the trace ID throughout the whole system. And of course we remember to keep tracing enabled. After that we have information from the API. Good progress, right? We can use something called Contributor Insights to have different boards, different views, different understandings of what's going on in our system.
But all of this was only about the exterior. What about the things happening inside the function? Here we have AWS Lambda Powertools. It is an open, oh my, I forgot, open source project from AWS which is very close to OpenTelemetry. And we have multiple implementations: it is ready for Python, TypeScript, Java and .NET. We can build observability almost out of the box, right? And the best use of it is with AWS Lambda. Finally, with that we can start implementing the code.
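To make the idea of looking inside the function concrete, here is a minimal standard-library sketch of a handler that reports whether an invocation hit a cold start and attaches the X-Ray trace ID to its log line. This is not the Powertools API and not the code from the talk; the field names are assumptions. It relies on two Lambda behaviors: module-level code runs once per execution environment, and the runtime exposes the trace header in the `_X_AMZN_TRACE_ID` environment variable.

```python
import json
import os
import time

_COLD_START = True  # module scope runs once per execution environment


def build_log_entry(context, started):
    """Return one structured log entry and mark the environment as warm."""
    global _COLD_START
    cold_start, _COLD_START = _COLD_START, False
    return {
        "level": "INFO",
        "message": "request handled",
        "cold_start": cold_start,
        # aws_request_id is an attribute of the Lambda context object.
        "function_request_id": getattr(context, "aws_request_id", None),
        # Set by the Lambda runtime; None when run outside Lambda.
        "xray_trace_id": os.environ.get("_X_AMZN_TRACE_ID"),
        "duration_ms": round((time.perf_counter() - started) * 1000, 3),
    }


def handler(event, context):
    started = time.perf_counter()
    body = {"message": "hello"}  # placeholder for the real business logic
    print(json.dumps(build_log_entry(context, started)))
    return {"statusCode": 200, "body": json.dumps(body)}
```

Only the first invocation in a fresh execution environment logs `cold_start: true`; later invocations in the same environment log `false`, which gives a per-request answer to the "is this a cold start or not?" question.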
After the implementation we have full information about what's going on in our function, even going into specific sub-functions, information about the initialization time, et cetera. It's much, much richer than it was before. We can have additional metrics, custom metrics. Those metrics are created by Powertools, of course through instrumentation done by me. So, as a simple example, I can count how many times the sub-functions were called. For the logs, the information we have is so much richer: we have all the surroundings, and we have very organized output which is always the same. And for the SAM template, I know it is quite small on the slide, but the change to the infrastructure as code which I made is 67 added lines. It could be fewer, but I have the log format described here as well, and I didn't do that in one line but in multiple lines. If you want to implement it yourself with Python, you can try with this article.
And what can we do with that? A lot of things, really. We have CloudWatch; we can analyze the data through Athena; we can go into QuickSight, through OpenSearch and Kinesis; we can put it into Timestream and publish the data through Grafana, or send it to Prometheus, whatever. We can build alerts and alarms on it and act on it using, for example, Lambdas. So there is a lot. For the instrumentation itself, what can we use besides Powertools? Powertools, of course, but Jaeger also has the possibility, Prometheus has the possibility to instrument your functions, and OpenTelemetry can do that too. For visualization, we can use Grafana, we can use Prometheus, and many, many other tools.
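The custom metric mentioned above, counting how many times sub-functions were called, can be sketched without any library by printing a CloudWatch Embedded Metric Format (EMF) record, which CloudWatch extracts from Lambda's standard output. This is a hand-rolled illustration rather than the Powertools metrics API; the namespace `DemoApp`, the service name `orders` and the sub-functions `validate` and `price` are all hypothetical.

```python
import functools
import json
import time
from collections import Counter

# Hypothetical names throughout: namespace, service and metric names are
# illustrative and not taken from the talk.
CALLS = Counter()


def counted(fn):
    """Count every call to the wrapped sub-function."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        CALLS[fn.__name__] += 1
        return fn(*args, **kwargs)
    return wrapper


def emf_record(namespace, service, metrics):
    """Build one Embedded Metric Format record holding several Count metrics."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["service"]],
                "Metrics": [{"Name": n, "Unit": "Count"} for n in metrics],
            }],
        },
        "service": service,
    }
    record.update(metrics)  # metric values live at the top level of the record
    return record


@counted
def validate(order):
    return bool(order)


@counted
def price(order):
    return 42


def handler(event, context):
    CALLS.clear()
    validate(event)
    price(event)
    price(event)
    # Printing the record to stdout is enough inside Lambda.
    print(json.dumps(emf_record("DemoApp", "orders", dict(CALLS))))
    return {"statusCode": 200}
```

Because the metric travels inside a log line, the same record can feed both the log pipeline and the alarms built on the metric.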
For the databases, we should use NoSQL databases, I believe that is obvious; mainly time series databases, especially for metrics, but for logs, for example, OpenSearch. And we also have all-in-one tools like Prometheus, like Jaeger, like Honeycomb.io, a very nice tool which allows you to control the whole process. Dynatrace as well, for example. And Splunk is quite new here, but Splunk also allows us to build observability.
And finally, the question for you at the end: "Who is monitor your monitoring server?", a quote from DevOps Borat. If you have questions, I'll be happy to discuss them with you. You can contact me and connect with me through LinkedIn or on my webpage. And also I strongly recommend, well, I ask you to subscribe to the podcast which I host with my two friends. We talk there about IT and different aspects of it. Thank you very much for your time. Enjoy the rest of the day, and I hope this talk was useful for you. Thank you.
Pawel Piwosz

Developer Advocate @ Spacelift
