Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, my name is Erez Berkner and I'm CEO and co-founder of Lumigo.
Welcome to this session about monitoring and debugging Kubernetes
versus serverless. As context, we're going to talk about the evolution
of the cloud, about monitoring Kubernetes and how serverless changes
everything, and then about monitoring serverless. And hopefully we'll
do a quick demo at the end. So when we talk about the evolution of
transportation as an example,
we can see different things that evolve over time. And the reason I'm talking
about transportation is that I think it's a very relevant, very similar metaphor
for serverless. So when we think about transportation:
we used to ride in a car we owned; we bought it, we fueled
it, we navigated, and we got to our destination. Or we could
rent a car; then we don't own it, but we still need to take care
of all the maintenance. We can ride trains or
buses, and that means just figuring out how to get there. Or
we can get an Uber, and that is focusing only on getting there. And this
is very much what's happening in the cloud. We used to work with
physical servers, where we owned the hardware,
deployed the operating system, and were in charge of scaling and
our code. Virtual machines took the hardware away, so we were renting
servers. Containers brought this to a higher
level of virtualization and abstraction,
so you care mostly about the runtime, the scaling and the code. And serverless
is really about "upload your code to the cloud and we've got you covered."
So these are similar trends. At the
same time, when we talk about containers, managing them
with Kubernetes is very popular. And I want
to mention five layers of Kubernetes that we need to monitor.
We have the infrastructure, the actual hardware, where we have
vital signs that we need to monitor. We have the cluster,
Kubernetes itself. We have the pods.
We have the overall readiness and availability of the pods.
And we have the application itself, where we see the application logs.
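As an aside from the talk: the pod-readiness layer can be checked programmatically. Here is a minimal sketch (my illustration, not the speaker's) that flags unready pods from the JSON that `kubectl get pods -o json` emits; the pod names in the sample are made up, but the field names follow the Kubernetes pod-status schema:

```python
import json

def unready_pods(pods_json: str) -> list:
    """Return names of pods whose Ready condition is not True.

    Expects the JSON shape produced by `kubectl get pods -o json`.
    """
    unready = []
    for pod in json.loads(pods_json).get("items", []):
        conditions = pod.get("status", {}).get("conditions", [])
        ready = any(c["type"] == "Ready" and c["status"] == "True"
                    for c in conditions)
        if not ready:
            unready.append(pod["metadata"]["name"])
    return unready

# Hypothetical two-pod cluster: one healthy, one failing its readiness probe.
sample = json.dumps({"items": [
    {"metadata": {"name": "api-7f9c"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "worker-x2"},
     "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
]})
print(unready_pods(sample))  # prints ['worker-x2']
```

In a real cluster you would feed this the output of `kubectl get pods -A -o json`, or use the official Kubernetes client library instead of shelling out.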
And you need to remember that whenever you go and work with Kubernetes, you need
to make sure that you cover all these five layers in order to really understand
what's going on across hardware, infrastructure and
application. I want to share with you a couple of tools that
work really well with Kubernetes. I'm sure some of you have heard of
Prometheus, a great open source tool
that basically allows you to pull metrics from the different nodes and
aggregate them. You can save them to storage, you can analyze them,
and you can push alerts through Prometheus
Alertmanager to Slack or to PagerDuty. And you can use Grafana, which is also
very popular with Prometheus, in order to visualize that and have dashboards
that show you the health of your system. This is an example of Grafana
monitoring a Kubernetes cluster. Note
that there are the different layers over here that we talked about
before: the pod usage, the cluster's
availability, the CPU, the hardware, and so on.
In terms of getting the application logs, you can use
Logstash, basically the ELK stack, in order
to get an aggregation of all the logs and make them available and
searchable. That's another great best practice: have monitoring of the
application logs of Kubernetes. This is a really cool open
source tool that I want to share. It's not that well
known in the community, but I found it very helpful,
especially when we need to understand how traffic is flowing across
different containers. The
tool is called Vizceral, it's relatively easy to connect,
and again, it's open source. So go check it out. It might
be very useful for your scenario. I won't finish the Kubernetes part without
mentioning service mesh. When you're monitoring a service-mesh-based architecture,
your life is much easier, because you have a centralized point where
you can deliver and collect information and metrics, and you don't need to
go and add a layer that does that; it's actually
integrated in your architecture. Istio has some
great tools for that. So that's another point to remember
when you're architecting your environment and you
care about monitoring: if you have a service mesh
baked in, or are planning to use a service mesh, use it also for your monitoring.
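To make that "centralized point" concrete, here is a hedged sketch (mine, not from the talk) of querying mesh metrics through the Prometheus HTTP API. Istio exports a standard counter, `istio_requests_total`; the endpoint URL and service name below are hypothetical, and only the URL building and response parsing are shown:

```python
import json
from urllib.parse import urlencode

PROM_URL = "http://prometheus.example:9090"  # hypothetical Prometheus endpoint

def instant_query_url(promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return PROM_URL + "/api/v1/query?" + urlencode({"query": promql})

def request_rates(response_json: str) -> dict:
    """Map destination service -> value from a Prometheus vector response."""
    result = json.loads(response_json)["data"]["result"]
    return {r["metric"].get("destination_service", "unknown"): float(r["value"][1])
            for r in result}

# A sample response for: sum by (destination_service) (rate(istio_requests_total[5m]))
sample = json.dumps({"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"destination_service": "cart.default.svc.cluster.local"},
     "value": [1700000000, "12.4"]},
]}})
print(request_rates(sample))  # prints {'cart.default.svc.cluster.local': 12.4}
```

In practice you would issue the request with any HTTP client and point it at the Prometheus instance that scrapes your mesh.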
Now let's talk a bit about serverless. When we talk about
serverless, I just want to frame it: it's not just AWS
Lambdas, it's a variety of services, from Lambdas
to managed containers like Fargate, to DynamoDB, API
gateways, Stripe, Twilio. All of these are
ways to consume functionality without
actually maintaining a server, and this is what I define as a serverless
environment. And when we talk about this environment, we can understand that serverless is
different. It's ephemeral, meaning there is no server that is always
up and running. There are hundreds of components that work
together, not just the three-tier application that we used to have. And
there are actually no servers, so you cannot deploy agents anywhere.
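One such method, sketched here as a toy example of library-based instrumentation (my illustration, not Lumigo's actual implementation): wrap a Lambda-style handler in a decorator that emits one structured record per invocation through the logs, since there is no host for an agent.

```python
import json
import time

def traced(handler):
    """Emit a structured record for every invocation of a Lambda-style handler."""
    def wrapper(event, context=None):
        record = {"event": event}
        start = time.time()
        try:
            result = handler(event, context)
            record["status"] = "ok"
            return result
        except Exception as exc:
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            record["duration_ms"] = round((time.time() - start) * 1000, 2)
            print(json.dumps(record, default=str))  # shipped via the log stream
    return wrapper

@traced
def handler(event, context=None):  # hypothetical handler for illustration
    return {"statusCode": 200, "body": event["name"]}
```

A real tracer would also propagate a trace ID across services and send the record to a collector API rather than stdout.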
You need new methods, and in order to monitor
serverless the right way, you need to make sure that you have the
right tools. So here is a quick comparison of server-based
versus serverless. In serverless
you have many, many small parts, so you need distributed tracing.
In a container environment you can use good old
agent-based monitoring, and there are many good open source
or proprietary solutions that solve that. In serverless,
agents don't work anymore, so you need to use
APIs or libraries in order to integrate and
infer what's going on within a service. When we talk about
costs: containers are billed per resource,
serverless is billed per request, which really makes a difference. When you
think about what to monitor, on containers and Kubernetes you
need to monitor the hardware and the operating system;
on serverless, it's only the application that's your responsibility.
Service discovery: again, you have different
tools for containers, the legacy tools, and
of course it's covered when you use a service mesh; in serverless you
can do that based on APIs from a central point,
like AWS for example. I think the most important
thing to remember is that you still own the
monitoring part. Nobody will do that for you, in containers
or in serverless. And that's an important point to remember when you're offloading
things to the cloud provider. So what do we really need when we talk
about serverless monitoring, or modern cloud monitoring?
First of all, we need to be able to identify and fix issues in
minutes. And for this, because we have so many different services, we need somebody
to connect the dots and make debugging data available for
us on demand; I'm going to show that in a second in a quick demo.
Second, we have hundreds of services, so we need to do distributed tracing,
but it has to be automatic; I cannot chase after every new service
that is popping up. And the third point: we need
to make sure that we're able to identify bottlenecks, because there are so many
potential bottlenecks in those environments. And as I mentioned,
all of this needs to be agentless and based on APIs and code libraries.
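As an illustration of the API-based approach (an assumption of mine, not shown in the talk): Lambda already publishes `Invocations` and `Errors` metrics to CloudWatch, so a failure ratio can be computed from datapoint sums pulled over the provider's API. The boto3 call is sketched in a comment rather than executed, and the function name in it is hypothetical.

```python
def error_rate(invocations, errors):
    """Failure ratio from per-period CloudWatch datapoint sums."""
    total = sum(invocations)
    return sum(errors) / total if total else 0.0

# With boto3 (not run here), the per-period sums would come from calls like:
#   cw = boto3.client("cloudwatch")
#   cw.get_metric_statistics(
#       Namespace="AWS/Lambda", MetricName="Errors",
#       Dimensions=[{"Name": "FunctionName", "Value": "my-fn"}],
#       StartTime=start, EndTime=end, Period=3600, Statistics=["Sum"])

print(error_rate([4000, 3000], [2000, 1500]))  # prints 0.5
```

This is agentless by construction: nothing runs next to your code, you only poll the cloud provider's metrics API.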
So this is what we do at Lumigo.
We basically take metrics, tracing and logs, and we
connect the dots in order to make sure that you're
able to understand when things are going wrong and be able to fix it.
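"Connecting the dots" can be pictured as grouping telemetry by a shared trace ID. This toy sketch (mine; the record shape and service names are invented, and this is not Lumigo's pipeline) turns a flat stream of log and span records into one time-ordered story per request:

```python
from collections import defaultdict

def build_stories(records):
    """Group records by trace ID and sort each group by timestamp."""
    stories = defaultdict(list)
    for rec in records:
        stories[rec["trace_id"]].append(rec)
    return {tid: sorted(recs, key=lambda r: r["ts"])
            for tid, recs in stories.items()}

records = [
    {"trace_id": "req-1", "ts": 2, "service": "DynamoDB", "msg": "query ok"},
    {"trace_id": "req-1", "ts": 1, "service": "lambda-a", "msg": "invoked"},
    {"trace_id": "req-2", "ts": 1, "service": "lambda-b", "msg": "invoked"},
]
story = build_stories(records)["req-1"]
print([r["service"] for r in story])  # prints ['lambda-a', 'DynamoDB']
```

The hard part in production is getting that trace ID propagated through every managed service in the first place, which is what the demo below is about.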
And I want to show you this in a very quick demo.
So I have over here our demo environment,
sorry, one second. So this is a Lumigo
environment: basically, a demo environment that is connected
to AWS Wild Rydes, a serverless application.
And I want to take you through one scenario that is very popular with
our customers. I'm just refreshing the dashboard to make sure we
observe the last seven days instead of the
last hour. That basically takes us from live monitoring,
where I want to have this as a dashboard that is kind of always open,
to something that is more of an investigation. What I
want to show you over here is an example of
what we have in terms of the environment: the invocations,
the number of failures, the functions that fail the most,
where I have latencies, whether I have cold starts, and what the
main issues with cold starts are. The same goes for cost analysis,
slow APIs and timeouts; the dashboard shows you, out of
the box, the main things that you should care about in a serverless environment.
Now let's suppose we got an alert to PagerDuty from Lumigo about
this failure. If we want to understand what's going on over here,
we click on that specific service, which is a
lambda, in order to start investigating what actually happened and
in which cases it failed. We can see
that this lambda ran 7,000 times in the
last seven days, with almost 50% failures.
And over here we have the actual invocations and the results;
we want to drill down into a specific failure to understand what happened.
And this is where we move from just monitoring to debugging. Lumigo
builds the end-to-end story of this
request, the request that failed: a specific invocation failed
within that request, and now Lumigo will show me
the story of that request across all the different services.
What we see over here is the actual failure, the reason
we got here, because this lambda failed. At the same time,
I can understand what the customer-facing API is and decide whether
this is critical to fix now or not. When I want to understand
what happened, I can click on a specific service, and then Lumigo
generates a lot of debugging information for a post mortem:
things like the stack trace, the parameters on the
stack trace when it failed, the event that triggered this
lambda, the environment variables, the logs, a lot of it. I call
it debugging heaven, because it's always there with all the information that you
need in order to understand what happened and solve the
issue, and you don't need to go across thousands of logs and try to find
what you want. This is without any code changes, without the need to
write logs. You basically can go into every service and
see the inputs and outputs of that service. So this is an
EventBridge: I can see the message that went to EventBridge.
I can look at DynamoDB and see the actual
data: this is a query to DynamoDB, and this was the response.
And this is also true for external services like Twilio. So I
can click on Twilio and see the request to Twilio and the response
coming from Twilio: a successful SMS sent to this number,
and so on. And I can also see the specific
logs of that specific request. Maybe I have a million requests,
but this one request is the one I want to see. It's like a story
with 62 logs over here, starting from the very first
authentication that was done over here, all the way
through the different services. And I can look at this
and read what's actually happening, like a story across
all the different services. Great. So going back to summarize
where we are, a couple of takeaways. We talked about the five layers of
Kubernetes: you need to monitor all of them.
Microservices require distributed tracing, whether it's containers,
whether it's cloud native, whether it's serverless.
There's an emerging monitoring challenge around tracing of
managed services; we saw DynamoDB and EventBridge.
How do I trace across them? How do I know the
messages that go through them when I need to investigate? This is growing, and
you need to make sure this is covered in your environment. Use existing frameworks
and open source; we talked about a couple of tools that are available,
commercial or non-commercial, but make sure you bake them in. And serverless
requires you to also be able to trace managed services. So serverless
is not just Lambdas: it's DynamoDB,
it's API gateways, it's EventBridge, it's Stripe, it's Twilio. Make sure
that you are able to do distributed tracing across those services.
I want to thank you, and if
you have any questions, please feel free to reach out.
These are all my details, my mail and my Twitter. If you want
to try out Lumigo, we have a free trial and a free
tier. You're welcome to just go to lumigo.io, click "start a
free trial", and it's five minutes to connect the system: no code
changes, and everything is automated, so you can have a full view of
your environment. With this, I thank you very much
and wish you a great week.