Observability across serverless asynchronous managed services

Abstract

While applying distributed tracing to your own code is relatively simple, the real challenge is how to trace a full transaction across services. In this session, we’ll look at the technical challenges of gaining observability with managed services and see together how we can solve them.

Serverless computing transactions are a combination of owned code (like AWS Lambda) and managed services (like SNS, SQS, EventBridge, DynamoDB and more). Applying distributed tracing to your own code through instrumentation is relatively simple (though it requires a lot of work). The challenge lies with the managed services: how to trace a full transaction across services like queues, streams, and databases.

In this session we will discuss:
- The technical challenges of gaining observability with managed services.
- Methods to build the full trail of transactions across managed services.
- Ideas on how to obtain observability in a highly async distributed world.
- A technical drill-down into some managed services examples.

Let’s have an async observability discussion, to be continued on Discord.

Summary

We're going to talk about observability in serverless applications, with a special focus on the asynchronous parts, which are much harder to observe. There are different solutions that you can use for serverless observability, and we'll go over them so you can decide what's best for your own usage.
Lumigo is a serverless monitoring and observability platform. Serverless is not only Lambda: it can also be function as a service from other cloud vendors like Google or Azure, and it includes managed cloud services and third-party SaaS. Almost any application uses some third party, and you need to know what's going on there.
Going serverless means that you have nanoservices in your environment, usually connected asynchronously so that your services are decoupled. The second impact is that you use a lot of fully managed services, so you have less control. The third impact is the change in the cost paradigm. This brings new challenges: identifying and resolving issues, and getting visibility into the application as a whole.
When we troubleshoot serverless, how do those challenges show up and how can we solve them? The first thing you want to know is how a failing Lambda impacts your customers. Then, what do you need in order to debug it?
CloudWatch is actually a number of different services which you can use: metrics, logs, Insights. There's also X-Ray, which allows you to do some distributed tracing. The cons are that it's complicated to use and has only partial asynchronous support.
Option number two is homebrewed solutions. You add different data points to your own code, which in the end allows you to correlate all that information on your own.
The third option is serverless monitoring platforms. These are SaaS platforms focused on these kinds of solutions. Instead of doing all of this on your own, you get it by integrating with these platforms, which automate the distributed tracing.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hi everybody, thank you for joining me today. I'm Aviad, CTO and co-founder at Lumigo, and we're going to talk about observability in serverless applications, with a special focus on the asynchronous parts, which are much harder to observe. Today what we're going to do is see how serverless actually changes everything. Why serverless environments are different is very important to understand, so we can understand why we need a special tool in order to have serverless observability. What we used to have until now isn't good enough. Today we'll talk about the main challenges when doing observability, but more importantly, we're going to talk about different solutions. There are different solutions that you can use for serverless observability, and we'll go over them so you can decide what's best for your own usage. Now, before we continue, a few words about myself. As I said, I'm CTO at Lumigo. Lumigo is a serverless monitoring and observability platform, and not only do we do observability for serverless users, but our own back end is also serverless. I've been in software companies for the last 15 years, and in the last three years I've been doing serverless all day long and usually all night long as well. At Lumigo we work with a lot of different companies. Some of them are very big and well known, like the ones you see here, but there are also a lot of small startups, sometimes four-person startups. As long as you have a serverless or any cloud native environment, we like to be there and try to help you out. When we say serverless, what do we mean? I'm sure that you know what serverless is, but let's make sure that we're all on the same page, because a lot of times different people mean different things when they say serverless. So I'll keep it short. Serverless is not only Lambda; it can also be function as a service from other cloud vendors like Google or Azure. It also includes managed cloud services.
So when you're building an application, compute is not enough. Of course Lambda is the main glue that everything is built around, but you also need DynamoDB, for example, for your data, S3 for your files, and so on. So all those different services that you get from AWS or any other cloud are, for me, an integral part of serverless. The last part, also very important, is third-party SaaS. When you build your own application today, almost any application uses some third party, and the way I see it, it's part of your serverless environment. You need to know what's going on there. So if you're calling PayPal, or if you're using Twilio, for example, and you have errors there, you need to understand that it doesn't matter that it's a third party; you still need to understand how it's affecting your overall application. So now that we've talked about what we mean by serverless, let's see how going serverless impacts your application. There's a lot of impact. I want to talk about three main ones. The first one is that you have nanoservices in your environment. What does that mean? You can call it microservices on steroids, you can call it whatever you want. But now you have a lot of very small parts, a lot of small Lambdas and DynamoDB tables, all of those atomic parts, each of which runs on its own. And you need to know what's going on with each and every one of them. But also, in order to enjoy the fact that these services are separated from each other, you need to make sure that they're connected in the right way. That usually means they're connected in some asynchronous way. This allows you to decouple your services, so if something goes wrong with one of them, or one of them has a high load, it doesn't mean that your whole environment is now affected. A second impact is that you're using a lot of fully managed services. So again, that's great.
It helps you to focus on what you want to focus on and not on all the infrastructure under the hood. But it does mean that you have a lot less control. The third impact is the change in the cost paradigm: what you do now has a very close impact on how much you pay. The pay-as-you-go model means that any change in the code, even a bug in your code, can cause a spike in the payment. And of course, any improvement in the code means that you can now pay less. This brings with it new challenges. The first challenge is to identify and resolve issues. Now of course this is not new with serverless, but this challenge does take a new twist when going serverless, and I'll show you an example in two minutes. It certainly makes finding the issue, and especially the root cause, much, much harder. The second part is visibility. When I say visibility, I mean what the application looks like as a whole. You have a lot of different components, but how do all these components combine into one single application? And what happens if, for example, one of those components stops working? Does that mean that my application stopped working? Does it mean that my application is fine and it doesn't really matter? So getting that holistic visibility of my application became much harder. Before we continue: I mentioned asynchronous, and I just want to make sure that we're all on the same page. When I say asynchronous, I mean not only when a single Lambda calls another Lambda asynchronously, but also when it's implicitly asynchronous, usually when there's a service between two Lambdas. Although each Lambda is being called synchronously, together we have an asynchronous system. This example is a pattern taken from Jeremy Daly's website, which I really recommend. Here we have a first Lambda calling DynamoDB synchronously.
But because that DynamoDB table has a DynamoDB stream, a second Lambda will also be called, and there's an asynchronous connection between them. So just to be clear, asynchronous doesn't only mean that the Lambda itself is calling another thing asynchronously. Now, let's see, when we troubleshoot serverless, how the challenges that we talked about show up and how we can solve them. Let's be a little bit dramatic. It's 2:00 a.m. and something is going wrong. You're getting a notification that a Lambda has stopped working. This is all you know about that Lambda: that it stopped working. Maybe you also know that that Lambda sends email, and you want to understand how severe it is that that Lambda is not working. How is it affecting your system? Let's even say that it's not only 2:00 a.m., it's also Saturday night. So you really want to know if this is something that you need to solve right away, or whether you can wait till the morning, or till Monday morning, to solve it. So the first thing you want to know is how this Lambda impacts your customers. Maybe you have a lot of services that your customers are using, so you want to know which of those services are being affected. In this example, this Lambda, which all it does is send emails, is actually being used by two different services. The first one is process payment, which is processing all the payments of your customers. So that's very, very important. You're losing money if it's not working correctly. But it's also used by another service, the lunchbot. The lunchbot is only used internally by your developers to make sure that they remember to go and eat lunch every day, so it sends them an email. So of course, if it's the second service, it's not that important. You can wait for Monday morning. You somehow need to know how this Lambda is connected. So you start looking: who invokes this Lambda?
So you see maybe some sort of queue, and then you look for a Lambda that uses this queue, and you get up to that Lambda, but very quickly you understand this is taking too long. If you're going to try to understand the connections this way, you're going to waste all night on it. So hopefully you have some sort of drawing, a schema of your entire environment. Let's say you have a great architect who is very diligent and keeps an up-to-date picture of your entire environment. By the way, this is a real environment, published by Yan Cui, and this is only a part of the environment. So you can see very quickly it's not a simple one, it's a little bit complex. But if you have an up-to-date drawing, now you can maybe start to understand how this Lambda is related to the different services. But still you need something that makes this exact connection, right? It's not only the fact that there is a connection; you want to know how it is connected, and through which different services it is going. Now, once you've made this connection and you know that this Lambda right now is failing for process payment, maybe you'll still be able to go to sleep, if it's affecting only, let's say, a test user and not a real user. So you want to know the exposure: you somehow want to understand how many users are now being affected. Maybe only one. Is this one user maybe not an important one, or maybe it's a VIP user? So it's not enough to know that this Lambda is failing process payment. You also somehow want to know exactly what was happening in the API call every time this Lambda failed. Okay, let's say you checked it and you see there's a VIP customer, and you know you need to fix it as quickly as possible. Let's try to debug it. What do you need in order to debug it? You need to zoom in on the specific failure.
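
The question raised above, which customer-facing services sit upstream of a failing Lambda, can be sketched as a walk over a wiring map. This is only an illustration under stated assumptions: the component names and the `UPSTREAM` map are hypothetical, and in practice the map would have to come from your own architecture data rather than a hard-coded dictionary.

```python
from collections import deque

# Hypothetical wiring: each key is a component, each value lists the
# components that invoke or feed it (its direct upstream callers).
UPSTREAM = {
    "send-email-lambda": ["email-queue"],
    "email-queue": ["process-payment-lambda", "lunchbot-lambda"],
    "process-payment-lambda": ["payments-api"],
    "lunchbot-lambda": ["lunchbot-schedule"],
}


def affected_entry_points(failing, upstream):
    """Walk upstream from a failing component and return the entry points
    (components with no recorded callers) that can reach it."""
    seen = {failing}
    queue = deque([failing])
    entry_points = set()
    while queue:
        node = queue.popleft()
        callers = upstream.get(node, [])
        if not callers and node != failing:
            entry_points.add(node)
        for caller in callers:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return sorted(entry_points)


print(affected_entry_points("send-email-lambda", UPSTREAM))
```

With this made-up map, both the payments API and the lunchbot schedule reach the email Lambda, which is exactly why knowing that a connection exists is not enough: you still need to know which path a specific failing invocation came through.
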
Okay, you want to find not only the fact that there was a failure; out of all the different invocations this Lambda had, you want to find the data points it output during its failure. So you can go, for example, to CloudWatch and take a look at the metrics of that Lambda, see if there was a failure at a specific time, and then go to the logs. Now, there's no direct connection, so based on the timestamp of the failures you see here, you can try to find the specific logs. Hopefully there are not a lot of invocations at the same time, so maybe you'll be able to find them. The next thing you want to do is to extract debugging info, because just taking the logs is usually not enough. So maybe you need to add some more logs and then get that Lambda running again, hopefully getting the same error very quickly, and then understand what's going on. Again, you'll be doing that using the CloudWatch logs. And then the last thing is that you probably need distributed tracing, because if you find the issue in the Lambda itself, that would be easy. But a lot of times in this very distributed environment, and serverless is usually very distributed, the root cause is not in the same Lambda where you see the issue. So you need to somehow start going up the system and finding the exact problem, maybe in different parts, in different components. Again, you'll be able to do some quick looks through CloudWatch Logs and CloudWatch Insights, and we'll also see how X-Ray can hopefully help you do that distributed tracing. As I said, we talked about the challenges, we started talking about the solutions, and now I want to show you different types of solutions which you can use in your environment. So the first option, the first family of solutions, is CloudWatch and friends. CloudWatch is actually a number of different services which you can use: metrics, logs, Insights.
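
Matching a failure seen in the metrics to log lines by timestamp can be sketched as a simple time-window filter. This is a minimal illustration, not a CloudWatch API call: the event shape (`timestamp`, `request_id`, `message`) and the sample values are assumptions made up for the example.

```python
from datetime import datetime, timedelta


def logs_near_failure(log_events, failure_time, window_seconds=5):
    """Return log events whose timestamp is within the window around the failure."""
    window = timedelta(seconds=window_seconds)
    return [e for e in log_events if abs(e["timestamp"] - failure_time) <= window]


# Hypothetical data: one failure seen in the metrics and a few log events.
failure_time = datetime(2021, 4, 29, 2, 0, 10)
log_events = [
    {"timestamp": datetime(2021, 4, 29, 1, 59, 0), "request_id": "req-1", "message": "email sent"},
    {"timestamp": datetime(2021, 4, 29, 2, 0, 8), "request_id": "req-2", "message": "sending email"},
    {"timestamp": datetime(2021, 4, 29, 2, 0, 9), "request_id": "req-3", "message": "sending email"},
    {"timestamp": datetime(2021, 4, 29, 2, 0, 10), "request_id": "req-3", "message": "SMTP timeout"},
]

candidates = logs_near_failure(log_events, failure_time)
print(sorted({e["request_id"] for e in candidates}))
```

Two different invocations fall inside the same window here, which illustrates the weakness of timestamp matching mentioned above: with concurrent invocations the window alone does not tell you which one failed.
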
There's also X-Ray, which is not exactly CloudWatch, but goes together with it. They're not easily connected, but they're out of the box. We saw those examples right before. All of these examples are actually CloudWatch, but there's also X-Ray, allowing you to do some distributed tracing. X-Ray is a great place to start, but you'll see that, especially around asynchronous connections, you won't be able to see all the different connections. So the main advantage of using CloudWatch is that it's out of the box, it's right there. If you're using AWS, it has AWS support, which is very cool. The cons are that it's complicated to use and it has only partial asynchronous support. And if you're looking for specific issues, it's not very easy to query. While it shows you the technical impact, a lot of times it's very hard to understand the high-level business impact, like which API it was related to. So now let's talk about option number two, homebrewed solutions. With these solutions you add different data points to your own code, which in the end will allow you to correlate all that information on your own. I won't go into the code, of course, but usually what you'd want to do is add a correlation ID to all of your functions and to all the different services which are being used. So you need to make sure that that ID is being passed somehow between your Kinesis, SQS, SNS, DynamoDB streams and so on. You generate it at the earliest stage, for example, when the Lambda that is being called by API Gateway runs for the first time, and then propagate it throughout your different transactions. You want to make sure that you're outputting that ID to each and every log, or else that ID has no meaning, because you need to somehow consume that ID.
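
The generate-once-then-propagate idea described above can be sketched as follows. This is a minimal illustration under stated assumptions: plain dictionaries stand in for the message attributes of a queue or stream, and the attribute name `x-correlation-id` and the helper names are hypothetical, not part of any AWS API.

```python
import uuid

# Hypothetical attribute name used to carry the ID between components.
CORRELATION_KEY = "x-correlation-id"


def get_or_create_correlation_id(incoming_attributes):
    """Reuse the ID from the incoming message, or create one at the entry point."""
    return incoming_attributes.get(CORRELATION_KEY) or str(uuid.uuid4())


def build_outgoing_message(body, correlation_id):
    """Attach the ID to every message sent to a downstream service."""
    return {"body": body, "attributes": {CORRELATION_KEY: correlation_id}}


# First function in the transaction (for example, the one behind the API):
# no ID arrives, so one is generated here.
first_id = get_or_create_correlation_id({})
message = build_outgoing_message({"order": 42}, first_id)

# A downstream function triggered by that message picks up the same ID
# and passes it on again.
second_id = get_or_create_correlation_id(message["attributes"])
next_message = build_outgoing_message({"email": "receipt"}, second_id)

print(first_id == second_id == next_message["attributes"][CORRELATION_KEY])
```

The design point is that only the entry point ever generates an ID; every other function must read it from whatever carried the event and forward it unchanged, which is what makes this approach high touch.
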
As I said, you'll be adding it to your code at some place, creating a unique ID and then passing it to each and every function which is running, and also to all the different services that are being called, and making sure to log it each and every time. Probably the easiest way to do that is to add it to your logger. Then when you look at your logs, you'll see that ID in all of them. So if you find, for example, a log of an issue and you want to see all the different logs which are related to it, maybe even in different Lambdas, you can search for that ID in any Elastic-based solution like Logz.io, Elastic on AWS and so on, and you'll be able to make your life easier. Now, if you are going to do it, that means a lot of changes in your code. I highly recommend you use some kind of open source. There's the Powertools open source, which is great. In this case we see it in npm, meaning in Node. You'll have it for different services, so make sure you add it to all the services that you're using. And of course there's a second kind of open source, like OpenTracing, OpenCensus, and of course the new OpenTelemetry, which you can add to your Lambda. Now remember, it's not specific to Lambda, so you'll need to add it on your own and make sure that you're adding it to all the different places where the services are being called. This is an example of Jaeger. Once you've added it to all the right places, you'll be able to see this timeline, which of course is very helpful when trying to troubleshoot an issue in a serverless, asynchronous environment. Let's talk about the pros and the cons. The pro is that it's tailor fit. You added it to your own code, so of course it will be exactly the way that you need it. It's supported by many different vendors and it's not cloud specific. It's not something that you will get only on AWS, for example. And the con is that it's very high touch.
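
Adding the ID to the logger, as suggested above, can be sketched with Python's standard `logging` module. This is only an illustration: the logger names, the ID values and the log format are assumptions, and an in-memory stream stands in for whatever log store you would actually search.

```python
import io
import logging


def make_logger(name, correlation_id, stream):
    """Return a logger that stamps every line with the given correlation ID."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(name)s correlation_id=%(correlation_id)s %(message)s")
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers = [handler]
    logger.propagate = False
    # LoggerAdapter injects the extra field into every record it emits.
    return logging.LoggerAdapter(logger, {"correlation_id": correlation_id})


# One shared in-memory stream stands in for a central log store.
stream = io.StringIO()
payment_log = make_logger("process-payment", "abc-123", stream)
email_log = make_logger("send-email", "abc-123", stream)
lunch_log = make_logger("lunchbot", "zzz-999", stream)

payment_log.info("payment accepted")
email_log.error("failed to send receipt")
lunch_log.info("lunch reminder sent")

# Searching by ID pulls together the lines from different functions
# that belong to the same transaction.
related = [line for line in stream.getvalue().splitlines()
           if "correlation_id=abc-123" in line]
print("\n".join(related))
```

Because the ID is attached once when the logger is created, individual log calls do not need to remember to include it.
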
You need to add it to all your different Lambdas, and you need to make sure that it's added in all the right places. It's not good enough to do it one time, because you need every new Lambda to include it and every new team member to remember to add it. And of course if a different team starts to use it, you need to make sure that they use it as well. So keeping it up to date at all times is not that easy. And not all components are covered by these solutions. If you're going that way, I again highly recommend Yan Cui, also known as the Burning Monk. He has a great blog post about it, so look for it; it's very helpful. So let's talk about the third option, which is serverless monitoring platforms. These are SaaS platforms focused on these kinds of solutions. Basically it's the classic buy versus build. Instead of doing all of this on your own, you get it just by integrating with these platforms. It does everything automatically. It automates the distributed tracing. Between these different platforms there's a common implementation: you add a library to your code, you have an IAM role, and by doing that you're able to get a solution for the different challenges that we mentioned before. So the pros: it's serverless focused, not a generic solution that is good for everything but becomes very hard to use when you need it for your own specific environment. It gives you the best of breed for serverless environments. It's more than just tracing: it does correlation for your logs and it identifies the issues automatically. It sends you the information that you need, and it's very, very low touch. All you need to do is the first integration, and then you get all the rest automatically. The cons are that you need to integrate with another third party, and it's another screen you need to look at. And it's more than just tracing, so if you were looking only for the tracing part, you'll still be getting a lot of other parts with it.
Now let's take a look at an example. This is an example of Lumigo, which is this kind of platform, and how it's being used at Medtronic in a live environment; these examples are from their dev environment. So for example, here you can see an automatically generated transaction. If before we saw a schema that the architect drew of all the different components and how they're connected, this you get automatically. Once you integrate with Lumigo, you see how everything is connected to each other. For example, here S3 triggers a Lambda, then Kinesis, DynamoDB, another Kinesis, and so on. So by seeing how everything is connected automatically, you have an up-to-date understanding of what's going on with your system. And if something goes wrong, you're able to follow the data. So, for example, if something goes wrong with a specific Lambda, you can see the data that was passed to this invocation of the Lambda, see how it looked in Kinesis, and then see what exactly happened in that Lambda. By following that data, a lot of times you're able to understand what went wrong between the different asynchronous events. You can click on each and every Lambda, and then do a deep dive and understand exactly what was happening: what the return value was, how much time it took to run, and what the event, meaning the input, was. You can also see the outputs of the Lambda, and this you get automatically. With these platforms, you can also focus on the actions. Sometimes you don't want to see only the data, but also exactly what happened, the story of what happened, by starting at this Lambda right at the top and knowing exactly how the story of this transaction rolled out. So a lot of times you maybe still start with, in this example, CloudWatch Insights. But then when you get to a specific issue, you can go and pinpoint that issue.
In Medtronic's case, they have 1 billion invocations, and very quickly they understood that using Lumigo was much, much easier for them. They're able to do a specific search according to the issue, the request ID or anything else, get to that specific invocation, and see all the information they need about it. Another thing that you can do with these platforms is see the timeline, so not only can you see exactly who called whom, like a DynamoDB stream triggering another Lambda, but you can also see exactly how much time each call took. Then you can focus on the bottlenecks if you want to improve the latency, and not spend your time fixing something that took only one or two milliseconds. You can also track deployment effects. In serverless environments there are a lot of changes, because it's so easy to change each and every part on its own. So you want an easy way to track those changes and see the exact point of every deployment, like you see here. Then you're able to see, for example here, that we deployed something, and once we deployed it, the issues stopped. So basically we understand that the fix that we deployed actually did the job, and now we can go back to sleep. So the main takeaways: serverless, like we said, changes everything. You have a lot of moving parts, a lot of nanoservices, there are a lot of asynchronous patterns, and the environments are highly distributed. There are different solutions which you can use. You have the out-of-the-box ones, like AWS X-Ray. You have the homebrewed solutions, different open sources or things that you can do on your own, where you change your code and get that distributed tracing. And you have serverless monitoring platforms which you can integrate with, and then all the monitoring, observability and distributed tracing is done automatically for you. You can pick which one is best for you. Thank you.
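
The timeline idea mentioned above, seeing how long each call took and focusing on the bottleneck, can be sketched as a sort over span durations. This is a minimal illustration: the span names and timings are invented for the example, and a real tracing tool would supply them.

```python
def slowest_spans(spans, top=1):
    """Return (duration_ms, name) pairs for the longest spans, longest first."""
    timed = [(s["end_ms"] - s["start_ms"], s["name"]) for s in spans]
    return sorted(timed, reverse=True)[:top]


# Hypothetical timeline of one transaction, in milliseconds from its start.
SPANS = [
    {"name": "api-gateway -> process-payment", "start_ms": 0, "end_ms": 40},
    {"name": "dynamodb put", "start_ms": 40, "end_ms": 42},
    {"name": "stream -> send-email", "start_ms": 42, "end_ms": 950},
]

print(slowest_spans(SPANS))
```

In this made-up timeline the write to the table takes 2 ms while the asynchronous hop to the email function takes 908 ms, so that hop is where latency work would pay off.
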
Because of the way that Conf42 is done, there won't be any questions. But feel free to reach out through either my email or my Twitter and ask any questions. I'm very happy to answer. So I hope you enjoyed it, and have a great day.

Conf42 Cloud Native 2021 - Online

April 29 2021

Aviad Mor

CTO @ Lumigo
