Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, Amir here.
I'm the CEO and co-founder at Senser.
It's a pleasure to be here as part of the Conf42 Kube Native conference.
What we're going to talk about is the deficiencies and the problems of working solely with logs when trying to solve Kubernetes incidents.
In terms of what we're going to cover: the trouble with logs in isolation and what's missing in them, what additional context looks like and why it is important (a lot of you are probably familiar with many of these elements), and then a short story, a short real-life example of an investigation. Eventually we'll wrap it all up, explain the pitfalls, and explain how to get started with more advanced techniques.
If we start by speaking about logs, in Kubernetes in particular but really in general, they're of course one of the most helpful tools in troubleshooting. But as we all know, they're not ideal, and most of the time they're not enough. A good way to think about it is that the problem with logs is that they usually require more logs. And the problem, especially with regard to Kubernetes or any distributed system, is that the rest of the logs you need are usually not placed in the same location. You need to look at another node, another pod, another element.
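As a rough illustration of what that means in practice, here is a minimal sketch using the Kubernetes Python client that pulls recent logs from every pod behind a single label selector; the `app=frontend` label and the `default` namespace are assumptions for illustration.

```python
# Minimal sketch: pull recent logs from every pod behind one label selector,
# since the logs you need rarely live on a single node or pod.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(namespace="default", label_selector="app=frontend")
for pod in pods.items:
    print(f"--- {pod.metadata.name} on node {pod.spec.node_name} ---")
    print(v1.read_namespaced_pod_log(
        name=pod.metadata.name,
        namespace="default",
        tail_lines=50,  # just the most recent lines from each pod
    ))
```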
And individual logs lack context for a few reasons. One, which is probably clear: a human wrote them, and humans are not very good at predicting the future. A log, if you think about it, is very similar to a line of code, because it is a line of code, but there's more to it. Logs exist on the basis of experience, what have I seen in the past, or on assumption, what kind of corner case am I anticipating my code flow will go through. And looking at it at the process level, even though we're sometimes very good at reviewing code changes or any kind of code addition, we're not very good at accompanying that with reviewing the logs: what context do they bring in, what is their meaning, and what are the different paths the code or the program could take?
Another aspect is that other issues are happening at the same time, whereas a log is an encapsulation of a single moment, most likely a single point in time for a single event. Other things are happening in tandem, and they matter a lot when we're trying to analyze the issue. The same goes for similar prior incidents: if we have the ability to correlate, or the ability to look at past incidents and capture that knowledge with regard to a specific log, that would be very beneficial.
And, of course, the connection to topology. If I have some way or some technique to know something about the environment, about both the infrastructure layer and the application layer, and about which elements are interconnected, that of course is not something that appears inside the log, but it's a very important aspect of doing the analysis. If I know where something happened and how it is connected to the other elements in the system, I can start to build the flow of analysis.
Essentially, my goal is to give you a little bit of inspiration, or a little bit of a methodical way of thinking about how to do this in real life. We're going to take a real example. In this example we're looking at part of our system, but it could be done in any other observability system, or aligned, from a technique perspective, with the tools you're familiar with.
What we're starting with is a customer-affecting issue. In this case, the front end, the element that is responding to or interacting with the user, is essentially returning a server error to the user. Of course that is bad, because we're not fulfilling some sort of business flow, but that is where we are starting. And the first thing we will do, if we're relying on logs, is try to find errors that have some sort of correlation to that.
We'll deep dive into the log soon, but from the customer's perspective, we've impacted our mean time to detect and our mean time to recover. There's going to be a long investigation if we're relying solely on logs. And eventually even the SREs, our people, even though they're sometimes required to do a lot of crazy things, their sanity is of course in danger: not only because the logs were not written by them, but because this entire process is pressuring to begin with, and now there are a lot of other elements within it.
So what do I mean when I say context? Context is how I take the essence of a log, which is a single moment, a single element, a single point in time, and tie it in with the fact that the system is much larger. When we're trying to put this into context, we're essentially enriching the log. The log by itself is not enough. What other elements, what other views of the system, are important for that analysis?
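To make "enriching the log" concrete, here is a minimal sketch of what an enriched log record could carry alongside the raw message: topology, metrics, and deployment context. The field names are illustrative assumptions, not a standard schema.

```python
# A minimal sketch of an "enriched" log record: the raw line plus the
# surrounding views of the system that the analysis actually needs.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class EnrichedLogEvent:
    timestamp: datetime
    message: str                    # the raw log line itself
    service: str                    # which service / pod emitted it
    node: str                       # where it ran
    upstream_dependencies: list = field(default_factory=list)  # topology view
    metrics_snapshot: dict = field(default_factory=dict)       # CPU, latency, error rate at that moment
    recent_deployments: list = field(default_factory=list)     # CI/CD changes around that time
```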
So we brought three examples here. Each and every one of them is important, and each and every one of them works together as part of the analysis.

First of all, topology. If I'm a subject matter expert, which is hard, I have a mental view, if you will, of how the system is really deployed, of what the system looks like. That's very hard to maintain: in reality we're speaking about auto-scaling systems, systems with ephemeral behavior. So it's hard to do in real life, or even as a thought experiment, but topology is a very important part. If I have a concrete, up-to-date, real-time, high-level overview of how the system is structured, which infrastructure elements are being used, like the type of nodes, where the nodes are located, what infrastructure and what type of services are being used, and how they're interconnected, I start to have a picture that allows me to understand the technical flow, and from that to move on to the next step.
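As a very rough sketch of getting a "what runs where" view, you can at least walk Services and their Endpoints through the Kubernetes API. This only covers Kubernetes-level wiring, not live traffic between services, and the namespace is an assumption.

```python
# Rough sketch: map each Service to the pods and nodes that back it,
# using only the Kubernetes API objects (Services and Endpoints).
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for svc in v1.list_namespaced_service(namespace="default").items:
    endpoints = v1.read_namespaced_endpoints(svc.metadata.name, "default")
    backends = []
    for subset in endpoints.subsets or []:
        for addr in subset.addresses or []:
            target = addr.target_ref.name if addr.target_ref else addr.ip
            backends.append(f"{target} (node {addr.node_name})")
    print(svc.metadata.name, "->", backends)
```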
Another important part is incorporating metrics into this process.
Essentially, I have a log; something could have gone wrong, and we'll speak about an example, but what happened at the metric level? Was there a performance issue in tandem with that log appearing? Were resources being consumed, or lacking, like CPU or memory? Was there a network latency issue, or an error rate that stems from a specific session or a specific API? Any of these can give me some insight into how to move forward in my decision tree, my path toward where the root cause is.
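If your metrics live in Prometheus, a sketch like the one below could pull a couple of "golden" signals around the moment the log appeared. The Prometheus URL and the metric expressions are assumptions; adjust them to whatever your stack actually exposes.

```python
# Sketch: ask Prometheus what CPU and error rate looked like around the
# timestamp of the suspicious log.
import time
import requests

PROM_URL = "http://prometheus:9090"   # assumed in-cluster address
log_ts = time.time() - 300            # e.g. the log appeared 5 minutes ago

queries = {
    "cpu":        'sum(rate(container_cpu_usage_seconds_total{pod=~"frontend.*"}[5m]))',
    "error_rate": 'sum(rate(http_requests_total{status=~"5..",service="frontend"}[5m]))',
}
for name, expr in queries.items():
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": expr, "start": log_ts - 600, "end": log_ts + 600, "step": 60},
    )
    print(name, resp.json()["data"]["result"])
```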
Another important element is the deployment history. We all know it, but one of the most important things to enrich a log with is what happened from a deployment perspective: which elements were recently deployed by the CI/CD pipeline. A lot of the time there is a strong correlation between an error in the system and a recent deployment. It could be a deployment in the form of a new image coming into the production environment, meaning a new piece of software, which makes sense, or it could be a configuration change, which is also deployed by the pipeline. All of these are elements we want to enrich our logs, or our log view, with.
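One lightweight way to surface recent rollouts is to list ReplicaSets, since each Deployment rollout creates one, and sort them by creation time. The six-hour window and the namespace are assumptions for illustration.

```python
# Sketch: what rolled out recently? Each Deployment rollout creates a new
# ReplicaSet, so recent ReplicaSets are a decent proxy for deployment history.
from datetime import datetime, timedelta, timezone
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

cutoff = datetime.now(timezone.utc) - timedelta(hours=6)
recent = [
    rs for rs in apps.list_namespaced_replica_set(namespace="default").items
    if rs.metadata.creation_timestamp > cutoff
]
for rs in sorted(recent, key=lambda r: r.metadata.creation_timestamp, reverse=True):
    owner = rs.metadata.owner_references[0].name if rs.metadata.owner_references else "?"
    image = rs.spec.template.spec.containers[0].image
    print(rs.metadata.creation_timestamp, owner, "->", image)
```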
Why is it hard to get this kind of context? The way I've decided to look at it is through maybe the two common ways of observing the system. There's a lot of overlap between them, but just for the benefit of the discussion: the commercial observability type of solution, and the open-source observability stack.

So it could be any one of the incumbent solutions; it involves being able to deploy the observability agents, take the logs, and retain the logs within the analysis system. There are a lot of common challenges with doing that at scale. Of course the cost, or more specifically the cost to get good coverage: sometimes you see people employing only infrastructure monitoring and part of the logs, and employing APM only in case of need. Since you don't have all of these verticals at the same time, it's hard to get to the solution.
And of course, configuring and maintaining dashboards. How am I building, to begin with, dashboards that are good enough, so that they encapsulate enough valuable data for me to go from a behavioral symptom, like a user error, to a complete analysis and being able to solve the issue?
Open-source observability stacks, and we gave a few examples here, could involve the equivalent of deploying an agent or sometimes even manually instrumenting the production environment, which is hard, not only at scale, but especially if you have various versions of software in your production environment, legacy environments, and so on. And again, the challenges here would be the coverage gaps, the blind spots, the places you weren't able to completely cover with instrumentation, and the instrumentation overhead, both the effort of covering everything and the production impact.
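To give a taste of what "manual instrumentation" means in practice, here is a minimal OpenTelemetry sketch: every flow you care about needs code like this, which is why partial coverage and blind spots are so common. The tracer name, the function, and the console exporter are assumptions for illustration.

```python
# Every flow you want visibility into needs explicit spans like this one.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-flow")  # name is an assumption

def add_to_cart(user_id: str, item_id: str):
    # each such function needs its own span if you want end-to-end coverage
    with tracer.start_as_current_span("add_to_cart") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("item.id", item_id)
        # ... actual business logic would go here ...
```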
So, spoiler: what kinds of things could help you here? We are, of course, working a lot around automated topology. Not only in our context, but in general it's a very important part of enriching the analysis environment.
So let's take a look. In the system I showed you in the beginning, we had a front-end service. That front-end service exposes a lot of services to the user. What we saw in the example, to begin with, is that we're returning an error, a server error. If we look into the log, and logs always tend to be messy and tend to be big, what we'll essentially see here is that the API that was exposed to the user and is returning an error works with the cart service in this e-commerce environment. What the log is essentially telling us is that there is a connection failure between the front-end service and the service that represents the cart. We can go and deep dive into it and get some clues about where the problem could be, but it's not very straightforward, and most likely it's never very clear.
The second thing we will do right now is try to learn a little bit about the environment. I'm going to show it at a greater zoom, but essentially, on opening the environment, we would see the services, including the front end, and we would see the rest of the services they're interacting with. What I'm trying to do first is get a glance: the front end needs the cart service, so which elements are implementing the cart service? Is there an actual database in play? Is there another layer of business logic, a business service, that encapsulates the cart behavior? If there is, and if I can understand what those elements were doing at the same time, I will for sure get closer and closer to understanding the issue.
Relying on the log alone, what would the troubleshooting look like, or what should it look like? To get started, I need to look at the log, as I just did, and I will need to assume that the issue is local, that it is happening on the node that showed that error. Then I'll need to go and scrub through the logs: which logs happened just prior to that event, and which logs happened immediately after it. The concept here is essentially proximity in time. Elements that happened just before might tell me something about the reason. Elements that happened immediately after can add data for me to move forward with the analysis. There will be a chain of events triggered immediately after, but they add information that is valuable from an analysis perspective: it could be an application error coming right along with it, a resource issue, a network problem, or whatever the challenge is.
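Here is a small sketch of the "proximity in time" idea: given the timestamp of the suspicious log, keep only entries shortly before and after it. The record shape (timestamp, message) is an assumption about your log pipeline.

```python
# Keep only log entries within a short window around the event of interest.
from datetime import datetime, timedelta

def around_event(entries, event_time: datetime,
                 before=timedelta(minutes=2), after=timedelta(minutes=2)):
    """entries: iterable of (timestamp, message) tuples from any source."""
    window_start, window_end = event_time - before, event_time + after
    return [(ts, msg) for ts, msg in entries if window_start <= ts <= window_end]

# Usage: earlier lines hint at the cause, later lines show the chain of effects.
# context = around_event(all_log_entries, datetime(2024, 5, 1, 12, 30, 0))
```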
Now I need to go and make sense out of it in the different layers. What happened from an infrastructure perspective? Was there a resource issue? What happened from an application perspective? Was there some sort of authentication problem? Was there a network error? And so on. I essentially need to work my way up from the bottom, which is the infrastructure and all the elements I control to a certain extent, all the way up to the application. Essentially, it's a process of elimination: ruling out that it could be some sort of infrastructure issue, hardware infrastructure, software infrastructure, all the way up to the application level.
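A sketch of that bottom-up elimination might look like the following: rule out node-level problems, then pod-level problems, before blaming the application. The namespace is an assumption.

```python
# Bottom-up elimination: infrastructure first, then workloads, then the app.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Layer 1: infrastructure -- any node not Ready, or under pressure?
for node in v1.list_node().items:
    for cond in node.status.conditions:
        bad = (cond.type == "Ready" and cond.status != "True") or \
              (cond.type != "Ready" and cond.status == "True")
        if bad:
            print("node issue:", node.metadata.name, cond.type, cond.status)

# Layer 2: workloads -- pods restarting or not ready?
for pod in v1.list_namespaced_pod(namespace="default").items:
    for cs in pod.status.container_statuses or []:
        if cs.restart_count > 0 or not cs.ready:
            print("pod issue:", pod.metadata.name, cs.name,
                  "restarts:", cs.restart_count)

# Layer 3: only then dig into application logs for the remaining suspects.
```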
And eventually, the outcome is that this type of troubleshooting is very cumbersome, and it's also very time-consuming. I need to go and test these cases; I need to verify my thesis against what I'm seeing in real life. In the end, it's very similar to triangulating: I have various points in the plane, and I'm trying to triangulate where it might have started. And of course, most of the time, especially in a Kubernetes-based, distributed system, the problem I'm seeing is just the after-effect. The first element I'm seeing is not the source of the problem; it's already stage two, if you will. I'm experiencing an issue, but there is a neighboring node or a neighboring service that is the source of all evil.
Adding context: what type of contextual enrichment could be done here? We said the first stage was understanding the environment. So we looked at the topology, a real-time topology, and we understood what is connected and what is not. The second stage would be starting to understand what the connectivity looks like and what I can learn from it in order to move myself forward. What I chose to do in this case: if I'm the front end and I'm returning an error to the user for a cart API, I most likely use some sort of internal service that implements the cart. So let's jump forward and understand how to do that, assuming in this case that the issue stems from a neighboring service. Let's try to understand how we're connecting to it.
So what I've done here: I quickly filtered, using an ability here to look at the APIs the front end is a client of, which services the front end is using internally. And here is the list of APIs. I know from the log that there was a cart issue, so here is the cart service. And if this is the cart service, it's interesting, because I can start to understand which services are working with me, and I can now look at these APIs and try to understand whether they have errors as well, which, as you'll see here, they do. This was a very important step, because essentially two things happened: I've cleared the front end from being the source of the issue, and I've moved myself to the next element in the chain. I know where to focus now. Once I've made this jump, I've moved to the cart service.
Another interesting aspect of incorporating topology is that if the topology is built out to the point we're seeing here, where we also have live feedback about the connectivity and the liveness of the various layers between the elements, I can already stop here, because we can see that the front end is going through that gRPC transaction, the one we saw the log of, to the cart service. We can also immediately see that the cart service lost its connection, in this case to the Redis cart store.
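Without a topology view, a quick sanity check of that last hop might look like the sketch below: can the Redis instance backing the cart be reached at all? The `redis-cart` hostname and port 6379 are assumptions; adjust them to your environment.

```python
# Quick connectivity check against the Redis backing store of the cart.
import redis

try:
    r = redis.Redis(host="redis-cart", port=6379, socket_timeout=2)
    r.ping()
    print("Redis reachable -- look elsewhere for the fault")
except redis.exceptions.ConnectionError as err:
    print("Redis unreachable from here:", err)
```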
At that point, we're already 90 percent of the way to solving the issue. So the important part here was how I incorporate various different views of the system, not only to move myself forward, but also to reason about what is probably happening. I started from the front end and looked at the cart service. We know there's an error there now. OK, that completes the picture: I understand where the loss of connectivity, whether at the application level, the networking level, or the infrastructure level, has taken place.
Common pitfalls and how to avoid them. We spoke about a lot of different strategies we can take, but first, incomplete instrumentation. I want to be super clear: OpenTelemetry and distributed tracing are a very important part of the tool set. It's also, I would say, a very complete, or at least very beneficial, strategy for understanding problems at the level of a technical flow, even at the message level or the transaction level. But a lot of the time what I see, and I believe most of you are seeing as well, is that environments are only partially instrumented. There will be elements, there will be flows, that are instrumented, but there are a lot of blind spots. And when there is a blind spot in the instrumentation, it's very hard to fill the gap. Here, eBPF will be, and already is, a very important element, because of the ability to auto-instrument the environment, to attach to the existing tech stack and do that instead of going and manually putting instrumentation inside of the code.
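As a minimal taste of the eBPF idea, the sketch below uses BCC to attach a kprobe to tcp_v4_connect and watch outbound TCP connection attempts on the node, with no change to application code. It requires root and BCC installed, and it's a toy: real tools and products build far richer views on the same mechanism.

```python
# Toy eBPF example with BCC: trace every outbound TCP connect attempt.
from bcc import BPF

prog = r"""
int kprobe__tcp_v4_connect(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    bpf_trace_printk("tcp_v4_connect by pid %d\n", pid);
    return 0;
}
"""

b = BPF(text=prog)
b.trace_print()   # stream the events; Ctrl-C to stop
```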
Another aspect that is usually a pitfall is the reliance on tribal knowledge. Essentially, a system that relies only on what is already encoded in its code, its history, and its automation is a way of blinding yourself. So you need to make sure that the analysis takes previous incidents into account, but that it also has enough degrees of freedom, or enough new data, for you to cope with new situations.
The last one is incorrect heuristics. Essentially, if you're looking at a subset of the picture, looking, I would say, through a small hole into the system, you're lacking the wider system context. Troubleshooting is often focused on a single point, on a single node or a single service in which the issue was observed, but in most cases that's not what's happening in real life in a distributed system. You will almost always jump to the next node, to the neighboring service, and the root cause will most probably be there. It could be a node or a service under your responsibility, or it could be a third party, an external service, or a cloud service. So that has to be the assumption, and it has to be something you keep in mind and focus on, because that's most likely where things stem from.

Tips for getting started.
As I said, regardless of the specific system, try to have a very strict process for thinking about and incorporating topology: how topology comes into play, how you're using it, not only to ramp up or onboard new people, new engineers, and educate them, but also as part of the analysis process. How am I using topology? How am I gaining some sort of reasoning, some sort of logic, around how what I'm seeing is connected to the system flow? Then the use of metrics: how to choose the right metrics, what the golden metrics are that I'm applying to the issue analysis. And deployment: as we said, a lot of the time a configuration change or a software change is going to be, either directly or indirectly, the reason for the fault.
So that's a very important part of enriching the log with context. Adopt the layered approach and try to go bottom-up: is it an infrastructure, node-level issue? Is it a specific service issue? If not, use the elements we spoke about before: what is the surrounding environment, what is the high-level insight I could get and understand? Don't try to solve the issue at that stage; instead, ask what would be the most logical next step to take in order to move forward and continue the analysis.
And, as I said, avoid the context-building pitfalls. Consider approaches that give complete coverage, or services that automate your topology, to avoid manual instrumentation and the pitfall of not having end-to-end instrumentation in the environment. And of course, as you've seen with the elements we're working on and developing, AIOps in general: what kind of artificial intelligence elements could be added to the analysis part, at the metric level, the topology level, or the deployment level, to help automate getting to the root cause.
I hope you enjoyed this. It was a very enjoyable and wonderful event for me. I wish you all a great day, and hopefully you'll spend more time deploying new production environments and new production capabilities, and less time analyzing.