Transcript
This transcript was autogenerated.
Hi everybody, thank you for joining me today. I'm Aviad,
CTO and co-founder at Lumigo, and we're going to
talk about observability in serverless applications, with a
special focus on those asynchronous parts,
which are much harder to observe.
Today we'll see how serverless changes everything.
It's very important to understand why serverless environments are
different, so we can understand why we need a special
tool in order to have serverless observability; what
we used to have until now isn't good enough. We'll talk
about the main challenges when doing observability, but more
importantly, we're going to talk about different solutions.
There are different solutions that you can use for serverless
observability, and we'll go over them so you can decide
what's best for your own usage. Now,
before we continue, a few words about myself. As I said, I'm CTO
at Lumigo. Lumigo is a serverless monitoring platform,
a serverless observability platform, and not only do we do
observability for serverless users, but our own back
end is also serverless. I've been in the software industry for
the last 15 years, and for the last three years I've been doing serverless
all day long, and usually all night long as well. At Lumigo we
work with a lot of different companies. Some of them are very big
and well known, like the ones you see here, but also a lot of
small startups, sometimes four-person startups.
As long as you have a serverless or any cloud native environment,
we like to be there and try to help you out. When we say serverless,
what do we mean? I'm sure that you know what serverless is, but let's
make sure that we're all on the same page, because a lot of times different people mean
different things when they say serverless. So I'll keep it short.
Serverless is not only Lambda, in the sense that it can
also be functions as a service from other cloud vendors
like Google or Azure. It also includes managed
cloud services. When you're building an application,
compute is not enough. Of course Lambda is
the main glue that everything is surrounded by,
but you also need DynamoDB, for example, for your data,
S3 for your files, and so on. All those
different services that you get from AWS or any
other cloud, for me, are an integral part of serverless.
And the last part, also very important: those third-party
SaaS services. When you build your own
application today, almost any application uses some
third party, and the way I see it, it's part of your serverless
environment. You need to know what's going on there. So if you
call PayPal, or if you're using Twilio,
for example, and you have errors there, it
doesn't matter that it's a third party; you still need to understand how it's affecting
your overall application. So now that we talked about
what we mean by serverless, let's see how going serverless
impacts your application. There's a lot of impact.
I want to talk about three main ones. The first one
is that you have nanoservices in your environment.
What does that mean? You can call it microservices
on steroids, you can call it whatever you want, but now you have a lot
of very small parts: a lot of small Lambdas,
DynamoDB tables, all of those atomic parts,
each of which runs on its own. And you need
to know what's going on with each and every one of them. But also, in
order to enjoy these services and the fact
that they're separated from each other, you need to make sure that they're
connected in the right way. That usually means they're
connected in some asynchronous way. This allows you to decouple
your services, so if something goes wrong
with one of them, or one of them has a high load, it doesn't
mean that your whole environment is now affected.
A second impact is that you're using a lot of fully managed
services. Again, that's great. It helps you
focus on what you want to focus on and not on all the infrastructure
under the hood, but it does mean that you have a lot less
control. The third impact is the change in the cost paradigm:
what you do now has a very direct impact
on how much you pay. With a pay-as-you-go model,
any change in the code, even a bug in your code, can cause
a spike in your bill. And of course, any improvement
in the code means that you can now pay less.
This brings with it new challenges. The first challenge is
identifying and resolving issues. Of course this is
not new to serverless, but this challenge does
take a new twist when going serverless, and I'll show you
an example in two minutes. It certainly makes finding
the issue, and especially the root cause, much, much harder.
The second part is visibility. When I say visibility, I mean
what the application looks like as a whole.
You have a lot of different components, but how do
all these components combine into one single application?
And what happens if, for example, one of those components
stops working? Does that mean that my application stopped
working? Or is my application fine and it doesn't really
matter? So getting that visibility into the holistic
picture of my application became much harder.
So before we continue: I mentioned asynchronous.
I just want to say again, to make sure that we're all on the same
page, that when I say asynchronous, I mean not only when
a single Lambda calls another Lambda asynchronously,
but also when it's implicitly asynchronous,
usually when there's a service between two Lambdas. Although
each Lambda is being called synchronously,
together we have an asynchronous system. In this example,
a pattern taken from Jeremy Daly's
website, which I really recommend, we have
the first Lambda calling DynamoDB synchronously.
But because that DynamoDB table has a DynamoDB
stream, a second Lambda will also be called, and there's
an asynchronous connection between them.
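To make the pattern concrete, here is a minimal sketch (in Python, with invented field values; not code from the talk) of what the second Lambda in that pattern might look like. Nothing in the first Lambda's code calls it, yet it runs for every write, because the DynamoDB stream invokes it:

```python
# Hypothetical handler for the second Lambda in the pattern above.
# The first Lambda only does a synchronous PutItem; AWS invokes this
# handler asynchronously with a batch of DynamoDB Stream records.

def handle_stream(event, context=None):
    """Collect the new items from INSERT records in a stream batch."""
    items = []
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # ignore MODIFY / REMOVE in this sketch
        new_image = record["dynamodb"]["NewImage"]
        # Stream records use DynamoDB's typed format, e.g. {"S": "abc"};
        # unwrap the single type tag to get plain values.
        items.append({key: next(iter(val.values()))
                      for key, val in new_image.items()})
    return items
```

Notice that the two Lambdas never reference each other, which is exactly why observability tooling has to reconstruct the connection for you.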
So just to be clear, asynchronous doesn't mean that the Lambda itself
is calling another thing asynchronously. Now let's see how,
when we troubleshoot serverless, those challenges
that we talked about show up, and how we
can solve them. Let's be a little bit dramatic. It's 2:00 a.m.
and something is going wrong. You're getting a notification
that a Lambda has stopped working. Okay, this is all
you know about that Lambda: that it stopped working. Maybe you also know
that the Lambda sends email, and you want
to understand how severe it is that the Lambda is not working.
How is it affecting your system? Let's even say that it's
not only 2:00 a.m., it's also Saturday night. So you
really want to know if this is something that you need to solve right
away, or maybe you can wait a little bit, till the morning or
till Monday morning, before you solve
it. The first thing you want to know is how this Lambda
impacts your customers. Maybe you have a lot of services that
your customers are using, and you want to know which
of those services are being affected. In this example,
this Lambda, which all it does is send emails,
is actually used by two different services.
The first one is process payment, which processes
all the payments of your customers. That's very,
very important; you're losing money if it's not working correctly.
But it's also used by another service, the lunch bot.
All the lunch bot does is send your developers an email
to make sure they remember to go and eat lunch every day,
so it's only used internally.
Of course, if it's only the second service, it's not that important; you can
wait for Monday morning. Somehow you need to know how this Lambda
is connected. So you start looking: who invokes this Lambda?
You see maybe some sort of queue, and then you
look for a Lambda that uses that queue, and you get up to this Lambda,
but very quickly you understand this is taking too long.
If you try to work out the connections this way,
you're going to waste all night on it. So hopefully
you have some sort of drawing, a schema of your
entire environment. Let's say you have a
great architect who is very diligent and keeps an
up-to-date picture of your entire environment.
By the way, this is a real environment, published by Yan Cui,
and it's only a part of the environment. You can
see very quickly it's not a simple one, it's a little bit complex.
But if you have an up-to-date drawing, now you can maybe start
to understand how this Lambda is related to
the different services. Still, you need something that makes
the exact connection, right? It's not only the
fact that there is a connection; you want to know how it's connected,
which different services it goes through. Now, once you've
made this connection and you know that this Lambda is failing
for process payment right now, maybe you'll still be able
to go to sleep, if it's affecting only, let's say, a test user
and not a real user. So you want to know the exposure; you somehow
want to understand how many users are now being affected.
Maybe it's only one user, maybe not an important one,
but maybe it's a VIP user.
So it's not enough to know that this Lambda is failing
process payment. You also somehow want to know exactly
what was happening in the API call every time
this Lambda failed. Okay, let's say you checked it and you
see there's a VIP customer, and you know you need to fix it as
quickly as possible. Let's try to debug it. What do you need in
order to debug it? You need to zoom in on the specific
failure: find not only the fact that there was a failure,
but also, out of all the different invocations this Lambda had,
what data it may have output during its failure.
So you can go, for example, to CloudWatch and take a
look at the metrics of that Lambda, see if there was a failure at
a specific time, and then go to the logs. Now,
there's no direct connection, so maybe based on the
timestamps of the failures you see here, you can try to
find the specific logs. Hopefully there aren't a lot of
invocations at the same time, so maybe you'll be able to find
them. The next thing you want to do is extract debugging
info, because just taking the logs is usually not enough. So maybe you
need to somehow add some more logs, then get
that Lambda running again, hopefully hitting the same error very
quickly, and then understand what's going on. Again,
you'll be doing that using CloudWatch logs. And the
last thing is you probably need distributed tracing, because if
you find the issue in the Lambda itself, that would be easy.
But a lot of times in this very distributed environment,
and serverless is usually very distributed,
the root cause is not in the same Lambda where you
see the issue. So you need to somehow start going up the
system and finding the exact problem,
maybe in different parts, in different components.
Again, you'll be able to do some quick looks through
CloudWatch logs and CloudWatch Insights, and we'll
also see how X-Ray can hopefully help you do that distributed
tracing. As I said, we talked about the challenges
and started talking about the solutions, and now I want to show you different types
of solutions which you can use in your environment.
So the first option, the first family of solutions,
is CloudWatch and friends. CloudWatch is actually a
number of different services which you can use: metrics,
logs, Insights. There's also X-Ray, which is not
exactly CloudWatch, but goes together with it. They're not easily connected,
but they're out of the box. We saw those examples
right before; those were
actually all CloudWatch, but there's also X-Ray, allowing you
to do some distributed tracing. X-Ray is a great place to start,
but you'll see, especially around asynchronous connections,
that you won't be able to see all the different connections. The
main advantage of using CloudWatch is that it's out
of the box, it's right there if you're using AWS,
and it has AWS support, which is very cool. The
cons are that it's complicated to use, it has only partial asynchronous
support, and if you're looking for specific issues,
it's not very easy to query. While it shows you
the technical impact, a lot of times it's very hard to understand the
high-level business impact, like which API it
was related to. So now let's talk about option number two,
homebrewed solutions. These solutions mean adding
different data points to your own code, which in
the end will allow you to correlate all that information
on your own. I won't go into the
code, of course, but usually what you'd want to do is add a correlation id
to all of your functions and to all the
different services which are being used.
You need to make sure that this id is being passed somehow
between your Kinesis, SQS,
SNS, DynamoDB streams and so on.
You generate it at the earliest stage, for example
when the Lambda that is called by API Gateway
runs for the first time, and then propagate it throughout your
different transactions. You want to make sure that you're outputting
that id to each and every log, or else the id
of course has no meaning, because you
need to somehow consume it. As I said,
you'll be adding it to your code at some place,
creating a unique id, then passing
it to each and every function which is running
and to all the different services that are being called,
and making sure to log it each and every time. Probably the
easiest way to do that is to add it to your logger;
then when you look at your logs, you'll see that id in all
of them. So if you find, for example,
a log of an issue and you want to
see all the different logs which are related to it, maybe even
across different Lambdas, you can search
for that id in any Elastic-based solution
like Logz.io, Elastic on AWS
and so on, and you'll be able to make your life easier.
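As a sketch of that homebrewed approach (the field name and helper functions here are my own invention for illustration, not a standard): generate the id once at the edge, pass it inside every outgoing message, and stamp it on every log line.

```python
import json
import uuid

def get_correlation_id(event):
    # Reuse the id if an upstream component already put one in the event
    # (the "correlation_id" field name is an assumption of this sketch);
    # otherwise we are the earliest stage, so generate it here.
    return event.get("correlation_id") or str(uuid.uuid4())

def with_correlation_id(payload, correlation_id):
    # Embed the id in the message body before sending it to SQS, SNS,
    # Kinesis, etc., so the next Lambda can pick it up from its event.
    return {"correlation_id": correlation_id, **payload}

def log(correlation_id, message, **fields):
    # Structured log line that carries the id; searching for the id in an
    # Elastic-based store then returns the whole transaction's logs.
    line = json.dumps({"correlation_id": correlation_id,
                       "message": message, **fields})
    print(line)
    return line
```

The first Lambda in a transaction mints the id with `get_correlation_id({})`; every Lambda downstream calls the same helper on its own event and gets the same id back.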
Now, if you're going to do it, that means a lot of changes
in your code, so I highly recommend you use some kind
of open source. There's the Lambda Powertools open source,
which is great; in this case we see it on npm, meaning in Node.
You'll have it for different services, and you should make sure you add it to
all the services that you're using. And there's a second
kind of open source, like OpenTracing,
OpenCensus, and of course the new OpenTelemetry, which
you can add to your Lambda. Remember, it's not specific
to Lambda, so you'll need to add it on your own and
make sure that you're adding it to all the different places
where the services are being called. This is an example from
Jaeger: once you've added it in all
the right places, you'll be able to see this timeline,
which of course is very helpful when trying to troubleshoot
an issue in a serverless, asynchronous environment.
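To show what such a timeline is made of (a toy illustration of the data model, not the OpenTelemetry or Jaeger API): a trace is just a set of spans, each with a parent, a start time and a duration, and the timeline view is those spans sorted and indented.

```python
# Toy model of a trace timeline: each span records what ran, under which
# parent, when it started and how long it took. Real tracers (Jaeger,
# OpenTelemetry) collect the same fields for you via their SDKs.

def render_timeline(spans):
    """Return text lines: spans sorted by start time, indented by depth."""
    by_id = {s["id"]: s for s in spans}

    def depth(span):
        d = 0
        while span.get("parent"):
            span = by_id[span["parent"]]
            d += 1
        return d

    lines = []
    for s in sorted(spans, key=lambda s: s["start_ms"]):
        lines.append("  " * depth(s) + f"{s['name']} ({s['duration_ms']} ms)")
    return lines
```

On a rendering like this, the bottleneck jumps out as the longest duration, which is what the timeline view in these tools gives you at a glance.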
Let's talk about the pros and the cons. The pro is that it's tailor
fit: you added it to your own code, so of course it
will be exactly the way that you need it. It's supported by
many different vendors and it's not cloud specific; it's not
something that you get only on AWS, for example.
The cons are that it's very high touch. You need
to add it to all your different Lambdas and make sure
that it's added in all the right places. It's not
good enough to do it one time, because you need
every new Lambda and every new team member to remember
to add it. And of course, if a different team starts to
use it, you need to make sure that they use it as well. So keeping
it up to date at all times is not that easy, and not
all components are covered by these solutions.
If you're going that way, I again highly recommend Yan
Cui, also known as the Burning Monk. He has
a great blog post about it, so look for it;
it's very helpful. So let's talk about the third option,
which is serverless monitoring platforms. These are SaaS
platforms focused on these kinds of solutions;
basically the classic buy versus build. Instead of doing
all of this on your own, you get it just by integrating
with these platforms. It does everything automatically:
it automates the distributed tracing.
These different platforms share a common implementation:
you add a library to your code, you add an
IAM role, and by doing that you're able
to get a solution for the different challenges that we
mentioned before. So the pros: this is serverless focused.
It helps you not like a generic solution
that is good for everything but becomes very hard when you need
it for your own specific environment; it gives you the best of breed
for serverless environments. It's more than just tracing:
it does correlation for your logs, it
identifies issues automatically, and it sends you the information
that you need. And it's very, very low touch; all you need
to do is the first integration, and then you get all the rest automatically.
The cons are that you need to integrate with another third
party, and it's another screen you need to look at. And it's more than just tracing,
so if you were looking only for the tracing part, you'll still be getting
a lot of other parts with it. Now let's take a look at an
example. This is an example of Lumigo, which is this kind
of platform, and how it's being used at Medtronic;
these examples are from their dev environment.
For example, here you can see an automatically
generated transaction. Where before we saw a
schema the architect drew of all the different
components and how they're connected, this you get automatically.
Once you integrate with Lumigo, you see how everything is connected
to each other: for example, here S3 triggers a
Lambda, then Kinesis, DynamoDB,
another Kinesis, and so on. By seeing how everything
is connected automatically, you have an up-to-date understanding
of what's going on in your system. And if something
goes wrong, you're able to follow the data. For example,
if something goes wrong with a specific Lambda, you can see what
data was passed to this invocation of the
Lambda, see how it looked in the Kinesis, and then what
exactly happened in that Lambda. By following that data,
a lot of times you're able to understand what went wrong between
the different asynchronous events. You can click on each
and every Lambda and do a deep dive to
understand exactly what was happening: what was the return value,
how much time it took to run, what was the event,
meaning the input. You can also see the outputs of the Lambda, and this you
get automatically. With these platforms, you can
also focus on the actions. Sometimes you don't want to
see only the data, but also exactly what happened,
the story of what happened, by starting at this Lambda
right at the top and knowing exactly how the
story of this transaction rolled out.
A lot of times you'll maybe still start with,
in this example, CloudWatch Insights.
But then, when you get to a specific issue, you can
go and pinpoint that issue. In Medtronic's case,
they have a billion invocations, and very quickly they
understood that using Lumigo was much, much easier
for them: they're able to do a specific
search according to the issue, the request id
or anything else, get to that specific invocation, and
see all the information they need about it.
Another thing that you can do with these platforms is see the
timeline. Not only can you see exactly who
called whom, like a DynamoDB table calling another Lambda,
but you can also see exactly how much time each call took.
Then you can focus on the bottlenecks if you want
to improve the latency, and not just spend your time
maybe fixing something that took only one or two
milliseconds. You can also track deployment effects,
because when you look at serverless environments
there are a lot of changes; it's so easy to change each
and every part on its own.
So you want an easy way to track those changes
and see the exact point of every deployment,
like you see here, and then you're able to see, for example:
okay, we deployed something, and once we
deployed it, the issues stopped. So basically we understand
that the fix we deployed actually did its job,
and now we can go back to sleep. So the main takeaways:
serverless, like we said, changes everything.
You have a lot of moving parts, a lot of nanoservices,
there are a lot of asynchronous patterns, and the environments
are highly distributed. There are different solutions which
you can use. You have those out of the box, like AWS
X-Ray; you have the homebrewed solutions,
different open source projects or things that you can do on your own,
where you change your code and get that distributed
tracing; and you have serverless monitoring platforms which you
can use and integrate, and then all the monitoring,
observability and distributed tracing is done automatically
for you. You can pick which one is best for you.
Thank you. Because of the way that Conf42 is
done, there won't be any questions, but feel free to reach
out either through my email or my Twitter
and ask any questions. I'm very happy to answer.
I hope you enjoyed it, and have a great day.