Transcript
Hi everyone, this is Erez Berkner, CEO and co-founder of Lumigo.
Today we're going to talk about observability, monitoring,
and what we should do when things go wrong.
So think about this: it's 2:32
in the morning, PagerDuty wakes you up, and DynamoDB is failing
again. What do you do?
I like to call this talk "what happens when DynamoDB
explodes", and I use DynamoDB,
but I want to make it a bit broader. It's not just
about DynamoDB. DynamoDB is different in that it
is a managed service: you don't control the server, you don't control
the operating system, and you're very limited in
your visibility. So when we
talk about DynamoDB, when we talk about managed services,
I want to broaden this to everything that is "as
a service". We all know function as a service.
But think about queue as a service, or data as a
service like Snowflake and DynamoDB, storage as a service,
even Stripe and Twilio, all the SaaS services: everything that
you don't control, where you don't deploy an agent, you don't
maintain the server, you don't write the code over there, you don't
define the API. All of this creates a real
challenge when it comes to monitoring, debugging, and
troubleshooting, and that's what we're going to focus on today.
I like to call this broader sense of managed
services "serverless". It's a broad definition,
but it really helps me define the core:
there are no servers here, and these applications are
usually very distributed, with dozens or hundreds
of services that keep changing, no longer the three-tier
monoliths we used to have. So across all these services,
when DynamoDB actually explodes, it's very hard to zoom
out and understand the context and the overall application health. How can
you assess the impact without the actual connection
and context between the different services?
So that would actually be the first thing we need to have
in order to understand what's going on. When something
goes wrong, we first of all have to implement tracing
one way or another, and because the system is distributed, that means distributed tracing. The
point is that this tracing will allow us to
go back, go upstream, and understand not
just that something failed, but what happened before
that, where it originated from, and what it impacted.
In this case I can find out whether it is a customer-facing API
and whether it is business critical, and take a decision and assign a priority
based on that. Now I want to take the concept
of distributed tracing to the next level and
be able to look at what I like to call
a virtual stack trace. In the past, in a monolith environment, I could
see exactly which functions were called;
I want to be able to do the same in a distributed environment.
That means seeing the inputs, the outputs, the environment variables,
the stack trace, the logs, everything I can on
each and every service along the path of that failure.
And there are different ways to do that. One, you can use
CloudWatch and other cloud-native friends.
The main point here is that
you can write code in your containers and in your Lambdas,
before and after the managed service, to emit
different outputs and different logs that will allow you to understand what's going on. So you can
log the request, you can log the payload, the output and
the environment variables, and of course you can catch and log
exceptions and any additional details.
And that would actually work.
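To make that concrete, here is a minimal sketch of that manual approach in a Python Lambda. The do_work function and the STAGE environment variable are placeholders for your own logic, not anything from the talk:

```python
import json
import logging
import os

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def do_work(event):
    # Placeholder for the real business logic.
    return {"ok": True}

def handler(event, context):
    # Log the incoming payload together with the Lambda request id, so logs
    # from concurrent invocations can still be told apart later.
    logger.info(json.dumps({
        "requestId": context.aws_request_id,
        "event": event,
        "stage": os.environ.get("STAGE"),
    }))
    try:
        output = do_work(event)
        logger.info(json.dumps({"requestId": context.aws_request_id,
                                "output": output}))
        return output
    except Exception:
        # Catch and log the exception (with stack trace) before re-raising.
        logger.exception("invocation failed, requestId=%s", context.aws_request_id)
        raise
```

Every service along the path needs similar code, which is exactly the maintenance burden described next.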
The main problem is that it takes a lot of time,
it requires a lot of maintenance, and you still need
to connect the dots. It can work when you have, let's say,
a few thousand invocations a month, but when you have millions or billions,
you can no longer connect the dots via timestamps. It's no
longer one request going through in a given second; there are
thousands going through, so how can you differentiate
the logs of these different executions? That becomes a problem:
drilling down into the specific log groups and log streams
to understand how
it all connects. You have these data islands,
but you're not able to connect them.
Still, it's good for many cases, especially for dev
and early production, before you scale.
It's out of the box, and it's easy, relatively easy
I would say, to get started. It's supported
by AWS, and it lives in the same cloud as your workloads.
The challenges: it's complicated,
it's time consuming to implement, and it's time consuming to
make sense of. You really need to know how to
configure it to get proper
visibility. And the biggest challenge is that there's no
good way to do event correlation, to trace, to understand the bigger picture.
There are tools like X-Ray, for example, within AWS that allow
you to do that, but they're also very limited. They don't
go across DynamoDB and
S3 and EventBridge and the other services we started talking about,
and therefore it's hard to understand the business impact, whether it's a customer-facing
API, how critical it is.
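For reference, here is a minimal sketch of what using X-Ray from code can look like with the aws_xray_sdk library, assuming active tracing is enabled on the function; the handler and annotation names here are hypothetical:

```python
from aws_xray_sdk.core import xray_recorder, patch_all

# Patch boto3, requests, etc. so downstream AWS calls appear as subsegments.
patch_all()

@xray_recorder.capture("post_to_social")
def handler(event, context):
    subsegment = xray_recorder.current_subsegment()
    if subsegment is not None:
        # Annotations are indexed, so you can search traces by them later.
        subsegment.put_annotation("writeId", event.get("writeId", "unknown"))
    return {"statusCode": 200}
```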
The second option is to implement a homebrewed solution. I can actually build
a distributed tracing system myself, or better yet,
I can use something open source, and that's great
because there are different frameworks for this. If you don't know them,
you might want to read about Zipkin, about Jaeger,
and in general about OpenTelemetry.
OpenTelemetry provides you a very nice
common ground for implementing distributed tracing
and getting the information back to you in
a visual way: seeing a latency
breakdown of the environment, getting traces.
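As a rough idea of what getting started looks like, here is a minimal OpenTelemetry sketch in Python that just prints spans to the console. In a real setup you would swap the console exporter for an OTLP exporter pointing at your collector or vendor; the service and span names are only examples:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider; replace ConsoleSpanExporter with an OTLP exporter
# pointing at your collector or tracing backend in a real deployment.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def handle_order(order_id: str) -> None:
    # Each unit of work becomes a span; child spans share the same trace id,
    # which is what lets you stitch the end-to-end request back together.
    with tracer.start_as_current_span("handle_order") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("write_to_dynamodb"):
            pass  # call the database here

handle_order("1234")
```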
So if you're planning to implement
distributed tracing, I suggest starting with OpenTelemetry
as a first step. And if
you do that, I really suggest reading "A Consistent Approach
to Track Correlation IDs Through Micro-Services" by Yan Cui.
If you're not following Yan Cui, that's a great time to start.
If you are interested in distributed tracing, in managed services,
in serverless, he is, for me,
the number one guy out there, and he
blogs a lot. He also has great workshops and
books to consider. But to our point, he blogs a lot about correlation
IDs and distributed tracing, so this would be a very
strong read on that.
I think the pros of a homebrew solution: it gets as tailored
as you can possibly get, because you build it, so it will have
all the different perks you want.
It's your solution, so it will be the best fit for
your needs. OpenTelemetry is supported by
many vendors, so it's great as a standard to use,
and you can base future work
on it. And it's not cloud specific, as opposed to
CloudWatch for example, so you can actually use
it across clouds and move between clouds.
The main challenge is that because you build it yourself, it's tailor fit
but very high touch. And it does not solve the problem of
managed services. If you need to trace across DynamoDB, you still need
to figure out a way to get a correlation ID, a request ID,
across to the other side, across S3, across EventBridge,
across Kinesis.
So it doesn't tackle managed services and
components that are not supported, like API Gateway,
DynamoDB and the others we mentioned.
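To make that pain concrete, here is a rough sketch of the manual plumbing with EventBridge in Python; the source, detail type, and field names are hypothetical. You have to carry the correlation ID inside the payload yourself, and every consumer on the other side has to read it back out and attach it to its own logs:

```python
import json
import uuid
from typing import Optional

import boto3

events = boto3.client("events")

def publish_order_event(order: dict, correlation_id: Optional[str] = None) -> str:
    # Start a new correlation id if we are not already inside a traced request.
    correlation_id = correlation_id or str(uuid.uuid4())

    # The id travels inside the event detail; EventBridge will not propagate
    # trace context for you, so the consumer must pull it out again.
    events.put_events(Entries=[{
        "Source": "shop.orders",
        "DetailType": "OrderPlaced",
        "Detail": json.dumps({"order": order, "correlationId": correlation_id}),
    }])
    return correlation_id
```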
Luckily, there are several cloud-native
monitoring solutions out there that
were built for the modern environment to solve exactly that
problem: the distributed
tracing, observability, monitoring and troubleshooting of
modern cloud-native architectures, which are distributed
and granular, with multiple services in play.
What you can expect
is that those are usually SaaS platforms,
so they get the traces to their
back end and process them to generate the view for you. That's
great in terms of maintenance. At
the same time, because it's a SaaS, you need to be able
to send information to those services.
That's also something for you to know,
especially around privacy. Most of the vendors have
very good policies in place for GDPR,
for ISO 27001, et cetera.
Most of the vendors are also solving a larger problem, not just the distributed
tracing and showing you what happened, but, as I mentioned, kind of
building the virtual stack trace: getting the inputs, the outputs
and everything you need to know. It's much more, with cost analysis,
latency breakdown, et cetera, much more than just a
map of services.
And usually they integrate via
code libraries in one way or another, and an API
using an IAM role.
I think the pros are: you'll find that those tools
are usually very opinionated. They come with a
set of preconfigured alerts that
you should know about; that's what it
means to be niche focused.
They provide more than just tracing, so many times you'll
be amazed at what you can get, and you'll say, wow, I can get this,
and I was just looking for tracing. And they're very low touch, very easy
to get started with, usually a few minutes, ten or
fifteen minutes, to get started and actually see what's going on within
your environment. On the other side, this is yet
another third-party platform among
the many others you probably have. And because they provide more
than just tracing, you
do get additional data, you do have,
I want to say, more layers to the tool, which
might be great, but might also be beyond what
you're looking for if you're at an early stage. So just
remember that,
and this is where I want to take one tool,
our tool, Lumigo, and share how we actually do this in Lumigo.
The main point is to see and
understand what you should be aiming for. It doesn't matter if you
implement this with CloudWatch, or you're using OpenTelemetry or
Jaeger, or using Lumigo or another tool;
I want to share what the capabilities are if you do
it right, what you should expect from a well-observed
system. That's the main reason
I want to show how this can look.
The main thing is that, one, you're getting the monitoring,
the alerts, the things that tell you whether everything
is okay or not and what's not okay. And then it gives you
the ability to drill down and actually debug and troubleshoot to find the
root cause. So let me very
quickly show you how that looks
with Lumigo. This is our dashboard. It takes
literally five minutes and five clicks, no code changes required,
to get started, and you get value:
you get alerts about errors that you probably never knew you
had. As I mentioned, the dashboard is
focused on alerts, on things that matter
for cloud native. It's no longer about CPU or I/O,
et cetera. It's about the number of failed invocations,
it's about cold starts, it's about showing
me the biggest latency offenders across the dozens
of services I have, because that becomes a really big
problem. Runaway cost,
function duration, timeouts: all of these are things you
get a very easy view of, and alerts on, with
cloud-native tools.
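For contrast, if you were hand-rolling just one of those signals, failed invocations, with plain CloudWatch, a minimal boto3 sketch might look like this; the function name and SNS topic ARN are placeholders, and you would need one such alarm per function:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on any failed invocation of a single function over a 5-minute window.
# Vendor dashboards generate this kind of coverage for you automatically.
cloudwatch.put_metric_alarm(
    AlarmName="post-to-social-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "post-to-social"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],
)
```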
But at the same time, let's take a scenario where something actually goes wrong. Let's
look at our issues and find an issue that
is occurring. We would normally get an alert on this
via Slack or PagerDuty, but let's assume we
want to dive into it, so we click on this specific error
that is happening.
It last happened three minutes ago. This is actually a live environment
I'm showing, based on
a cloud-native architecture in AWS. When we drill
down, we can see a lot of information about that error. It's in
a specific Lambda. I can see that there was a deployment
over here, I can see the number of failures, and I can see, one by
one, the actual failures that happened.
And this is where we actually move to troubleshooting. If
we click on any of the invocations, we're actually starting
what I like to call a debugging session,
which is what I mentioned
about the virtual stack trace: this is where you can actually start
looking at that virtual stack trace. So what do we have over
here? You can see this Lambda, which is why we got here:
this post-to-social Lambda failed. And you can see this
EventBridge; this is the service that triggered the
Lambda. As I mentioned, we want to see the
full transaction, the end-to-end transaction, so I'll ask Lumigo
to calculate, go back upstream,
and build the entire request from the very beginning,
all the way through the different nodes.
And this is what Lumigo built over here. This
is the core of what you should be targeting:
having a direct view going from a failure,
an internal failure, a DynamoDB that failed, or a Lambda, whatever, and then
immediately being able to zoom out and understand: okay,
this is a customer-facing API,
I can tell it's a business-critical API, so I need to fix it
now; and at the same time to be able to
drill down and understand. Okay, let's click on this.
And this is the added layer I mentioned: I
don't just get a map,
I can actually click on any service and see a lot
of information, like the issue,
the actual stack trace and the exception,
the variables, the event that triggered the Lambda, the environment
variables, the logs, everything that has to
do with this invocation. These are things that are generated by
the vendor, in this case Lumigo, and most of them
do not exist in AWS or in
other regular tools. So in this
case, the error is "details write id cannot be an empty string"; that's the
failure. And if I look at the event and
click to understand what message the Lambda
actually got, I can see that the write id was empty
to begin with. So the Lambda got an empty write id.
Just by having that visibility, which you get only
in tools that are focused on cloud-native applications,
just by seeing that, I now know that this
Lambda is not the problem. I need to understand why
this is empty and where it is coming from. So let's
go upstream, go to the EventBridge, and click on
it. Now we can dive
into the properties it
got in the message, and we can see that the write id
is already empty in EventBridge as well.
So we can go further upstream and look at the Lambda that triggered it, and
we can go one by one and check all the different
services, including things like DynamoDB: what did it try
to write, and what was the outcome?
In this case there's a failure: "the provided key element
does not match the schema". There was probably a retry, and I can see the second
call was successful. So it really tells you the
story of what happened in each and every
service along the way, all the way to things like Twilio, for example,
where you can actually see that an SMS was sent to this number, and the response,
and so on. At the same time, you can also check
this out in a timeline view, to see if there are any
latency issues.
You can see this is taking a second, so maybe
I need to dive deeper and understand what's taking the time, and so on and
so forth, in a latency view.
I want to stop here. Again,
this is just to give you context on what you should expect from
a modern distributed tracing solution that is focused on cloud native.
To summarize, serverless and
managed services in general really
change the way we develop, really change the way we
do things. There are many strong benefits
we didn't touch on, a lot of accelerated development,
but there are some new challenges, especially around visibility
and troubleshooting of an application.
We talked about three approaches to monitoring and troubleshooting
distributed services: cloud-native tools,
homebrewed and open source solutions,
and third-party SaaS vendors.
There are two or
three main things I wanted you to leave this session with.
One, I think you saw what
you should expect in this modern environment. Don't settle for
what you used to have, with logs
sitting in a log aggregator and that's it.
You can expect more; there are better tools and better technology
to serve you. Number two, think about this upfront.
It's much better to bake it in during dev
and pre-prod rather than after the fact. And third,
consult. There are a lot of companies that are going through what you
are going through, or that already have solutions.
So consult with the community. There are a lot of resources, and from
my experience everybody is really happy to help.
In that spirit, I also want to offer my
help. I really enjoy talking
with folks in our community and hearing about new projects and new applications
being developed. So feel free to reach out, even if you're not
using Lumigo, even if it's just about managed services, distributed tracing,
or observability. I'm available
at this email, or direct message me on Twitter,
and I would love to try and do our best to help.
Thank you very much, and enjoy the rest of the conference.