Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, I hope you are having a good day. Today we will talk about observability for serverless. But before I talk about how to do it with serverless in AWS, I need to, let's say, introduce you to my approach to observability: what observability really is, and what we still kind of miss in observability. And before we go there, let me ask you a very small question, but an important one: how good is your monitoring? Are you happy with it? Do you know how to improve it? Do you know what you are missing there? Are you sure that your monitoring is aligned with your business needs? Those are the questions we need to answer, right? So, my name is Pawel Piwosz, and as of today I've been working for Spacelift as a developer advocate for a couple of days, so this is quite new for me. I am also a DevOps Institute ambassador, an AWS Community Builder, and a CD Foundation ambassador. So now you know why I will talk about AWS, right?
Okay, so what is the problem today? We have less visibility, right? How do I understand it, and how should we understand it? First of all, when we had bare metal, these racks and some servers in the racks in the data center, we could observe everything: starting from the air conditioning, through the power flow, to the end behavior of the application for every single user, right?
Then we went to virtual machines. And as long as we are not responsible for the virtualization platform, we do not have visibility there. We can see everything that is above it, but not below, right? Then we have containers, and we can see even less. For example, if you think about ECS, and especially Fargate: even with plain ECS we have fewer and fewer places where we can observe and understand what our system is doing, and if we talk about Fargate, we know almost nothing.
And finally we have serverless. In serverless, we deliver the code, and we deliver the information about what the endpoint of the API is and what it should do. And that's all. We don't see anything else, right?
And why is that? Because we have fewer interactions with the system across its whole scope, we have less access to the system, and we have more and more tools. We start to think that central logging is passé. And this is a problem which came with agile, in my opinion, especially when we say that all teams should be self-organizing, that they know what they do, and that's it, right?
Yes, but at the end of the day, it's the company that sells the product, not the team. And if you have multiple teams working together, you can see the problem: if you monitor only your own component, without caring about anything else, because it's not your responsibility, it's the other team's responsibility... what can go wrong, right? I saw examples of this, and believe me, today I can laugh about it, but it wasn't funny that day. Okay, and finally we have decoupling. And this decoupling of the system, into microservices, serverless, et cetera, brings more and more complexity, at least in the communication patterns, right? And here we also need to think not only about the communication between the components themselves, but also between the teams. So, Conway's law. I don't want to go deeper into this topic, but it is a really interesting one. So what does less visibility mean?
Well, exactly this, right? We have different computing models. And in this picture you can see that the green elements are the ones we control, and the red ones are what we do not control, in each of these computing models. So we have a couple of approaches here. We have on-premise, simple: we control everything. We have infrastructure as a service, platform as a service, and software as a service on the other end, right, which is where serverless lies. And as you can see, in this approach with software as a service, we do not control anything except the code. And this is what brings the complexity, right? However it sounds.
Okay. So first of all, we need to have, like, a cultural shift. Why is this important? Because we cannot catch logs the way we used to. So we need to create new, sometimes even more complex, more responsive ways to do that. And the first element which I need to present before we get to observability is something called structured logs.
Okay, and what is structured logging? This is the definition from Sumo Logic: structured logging is the practice of implementing a consistent, predetermined message format for application logs... blah, blah, blah, right? So generally, what it means is that our messages, from all the systems, should be as informative as possible, as organized as possible, and should follow the standard as much as possible. This message also follows such a standard, right?
We will see how it looks in this example. But please remember one thing: a structured message is not a structured log. A structured log is a little bit more. So we have the structured message, and around it we have structured information: why we have this message, what happened in the system, right? What is the system doing at this point?
So, here is a standard log line example; for sysadmins especially, this is very common and very familiar, right? When we want to structure the message, we can do something like this. And here is the hint for how to become a senior: you just put your message into JSON, and that's it. It sounds silly, but believe me, it's already a huge upgrade. Why? Because the systems which work with your logs don't need to process the message as heavily. Okay? With JSON it is very simple: you have fields, you have values for those fields, and that's it. You don't need to create those patterns to search through the logs, et cetera; it makes things easier.
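As a small illustration of that hint (my own example, not the slide's), here is the same event as a plain line and as a JSON message:

```python
import json

# Plain, unstructured log line: easy for humans, painful for machines.
print("2024-01-15 12:00:01 ERROR payment failed for user 42")

# The same event as a structured JSON message: fields and values,
# no custom parsing patterns needed downstream.
print(json.dumps({
    "timestamp": "2024-01-15T12:00:01Z",
    "level": "ERROR",
    "message": "payment failed",
    "user_id": 42,
}))
```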
Then we have a structured log example. It's not fully structured yet; I mean, it's not the structured log itself, but the line is pretty nice, right? We have information about the message and also, for example, who it came from, et cetera. It's a lot better, right? It is a good start, but it is not a structured log yet. Okay?
And to have a fully structured log, we need to go into observability. But please remember one thing: observability is not Grafana on Kubernetes. If you have Grafana on Kubernetes, you don't have observability; you have Grafana on Kubernetes, period. Yes, it is a tool, a very nice tool, which we can use in order to, let's say, complete the whole approach to observability. But it is only a tool, nothing more.
Okay, so what are the elements of observability? There are three of them: logs, traces, and metrics.
So what can we tell about logs? This is a very common scenario: we write everything to logs, right? But then we write errors only, because we must save money. How many of you are, or were, in this situation? The problem is that if we write errors only, we can forget about everything else. Honestly speaking, and I know it sounds tough, this is really the truth in terms of observability, right? So what about these logs? First of all, they need to be structured; we mentioned that. They need to be consistent throughout the whole system. They need to cover all the information needed. They need to be constructed for automated systems, because in the end we want all of these elements, all of these components, to be managed and, let's say, processed by automation, not by us; we want to sleep at night, right? And they need to be collected consistently.
What about performance and metrics, metrics in general? So how many of you, again, were in the situation where you were asked about performance metrics and you said "that's great", because in fact you don't really know? Right. Because, for example, you have only collected errors from Nginx, right? So what about the metrics, then? They need to be structured, they need to be consistent, they need to have context, they need to have full information. And as you probably already start to think: hey, this is almost the same slide as the one before. Yes, you're right, it is; I'm lazy, right? They should be constructed, again, for automated systems and collected consistently. Okay. And a very important element: they need to be relevant for the business, because at the end of the day, the business is the one paying for your systems,
right? Okay. Of course, we can collect them directly as metrics, or we can, let's say, convert logs to metrics as well.
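As one way to do that conversion (a sketch of mine, not from the talk; the log group name and namespace are made up), a CloudWatch metric filter can turn a field of your structured JSON logs into a metric:

```yaml
Resources:
  ErrorCountFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: /aws/lambda/simple-function   # hypothetical log group
      FilterPattern: '{ $.level = "ERROR" }'      # matches structured JSON lines
      MetricTransformations:
        - MetricNamespace: SimpleApp              # hypothetical namespace
          MetricName: ErrorCount
          MetricValue: '1'                        # each matching line counts as 1
```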
What about traces? Traces are the element which we, let's say, use not that often, right? Logs are quite obvious; metrics, we collect them; but what about traces? So, if your business comes to you and says: hey, I have this John Doe who is claiming that the request took too long for him, and he is annoying, and he is paying us a lot of money, please tell me if he is right. What will you do to prove or disprove that everything is okay with your system, with your complicated microservices system, for this specific request? You probably recognize this problem.
So, what about traces, then? Of course, they need to be consistent, and they need to be collected consistently throughout the whole system. They allow us to track a request through the whole system, really, right? And we can use them as performance measures. And we have a zooming option: if we are very close to the system, we can observe one request for one specific user; if we zoom out, we can see the whole system in general. And now, what about context? I mentioned this context, and I believe that for observability, for monitoring, for all of these aspects, context is the key, right? It's the heart of everything. So tell me, please, what this is.
I'll give you about 5 seconds to think about it. Five, four... okay, those were very quick seconds. So some of you probably thought it was a stone or, I don't know, maybe something else, right? But how many of you thought that it could be a fragment of a planet? So without context, you just guessed, correctly or not, but it was only a guess. And without context, you lose something, right?
So what is the context in observability? First of all, logs allow us to understand the surroundings; traces allow us to understand the path, the journey; and finally, metrics allow us to understand the scale, okay? So please don't just say that "there's something in the logs", because if there is something in the logs, you should do something about it. And what is important is what those somethings are, okay? So please avoid this headache; just know what you have there. Around this, I built something which I called, very creatively, MEAL. And honestly speaking, I built it before observability became a buzzword. So, MEAL contains four elements, not three like observability: logs, events, metrics, and also actions. And generally, what it means is that in this framework, my framework, we don't only collect stuff, but we also want to automatically act on it, right? Well, this is somehow in observability too, but I was first.
So generally, what do we have? We have metrics, events, and logs, right? We collect them together, and based on them we take action. And this action can be automated.
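As a rough sketch of that automated "action" part (my own illustration with assumed names, not something shown in the talk): an alarm on a collected metric notifies an SNS topic, and a remediation Lambda subscribed to it acts automatically.

```yaml
Resources:
  RemediationTopic:
    Type: AWS::SNS::Topic

  ErrorRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      Namespace: SimpleApp                 # hypothetical metric namespace
      MetricName: ErrorCount               # hypothetical collected metric
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref RemediationTopic            # the "A" in MEAL

  RemediationFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/
      Handler: remediate.handler           # hypothetical remediation code
      Runtime: python3.12
      Events:
        OnAlarm:
          Type: SNS
          Properties:
            Topic: !Ref RemediationTopic
```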
So, enough of the theory. Now we have a little bit more than ten minutes, so let's go through the tech stack which I want to show you. First of all, we have Lambda, right? AWS Lambda, the serverless compute engine from AWS. And we have one issue here which we can very nicely observe with the proper approach to observability, and which is not really possible to observe without it. If we are experienced enough, we can see this issue, but we are not really able to measure it. And what I mean by that is the cold start, right? We can measure cold starts with Lambda Insights, and with this we will see how many invocations suffer from a cold start. And using X-Ray, excuse me, we can see what the impact of this cold start is on each invocation. So generally, a cold start looks like this: it is the time when AWS needs to prepare everything for us before executing our runtime. There is also another type of cold start, but this is not a talk about cold starts. So, we can have a shortened cold start, and when the Lambda is, as we call it, warm, there is no cold start at all; we just go straight to the execution.
In order to deploy the environment, I used the AWS Serverless Application Model (SAM), a kind of framework which is, in fact, CloudFormation. So, infrastructure as code; it is an extension to CloudFormation. And with that, I've created something which I call the standard example, with standard logging from AWS. And this is the code of this SAM template.
So what do we have here? We have a couple of elements. First of all, I define the resource, which is a serverless function, a Lambda function. I say where the code is, right, from where it needs to be deployed; what the handler of my function is, so what will be executed first when the request comes in; what the runtime is; the memory size; the timeout; and the event. The event means that I somehow assign the API to my function: in order to reach my function, I need to go to the /simple path with the GET method. Simple as that.
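The template itself is on the slide rather than in this transcript, but a SAM template with exactly those properties looks roughly like this (resource names and values are my assumptions):

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  SimpleFunction:                  # hypothetical resource name
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/                # where the code is deployed from
      Handler: app.handler         # what is executed first on a request
      Runtime: python3.12
      MemorySize: 128
      Timeout: 10
      Events:
        SimpleApi:
          Type: Api                # creates the implicit API Gateway
          Properties:
            Path: /simple
            Method: get
```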
So, what do we have after that? And this is about 20 lines, right? We have a Lambda function created in AWS. Very nice. I'm sorry, that was the API Gateway; it was created for us as well, with the proper endpoint path, like I said. Then we have our Lambda function. And when we execute it, and I want to go into the metrics, measurements, logs, et cetera, I will see something like this. Nice.
I see that my API Gateway was triggered. Good. All right, what more? I see that my Lambda function was triggered, and I have some information here: how many invocations there were, what the duration was. But why is it around 2.2 here and a little bit more than 1.5 here? Why? It doesn't say anything about that, right? So maybe the logs? And those three elements here, opened, show us all the information we get by default from a Lambda execution.
Very informative... All right, so we can agree, I believe, that it's useless, or very close to useless, right? So how can we improve it? First of all, we will enable X-Ray for our Lambda. We need to go into Configuration, then Monitoring and operations tools, click Edit, and just enable X-Ray tracing. And with one click, we become regular engineers.
Then we can do the same thing for the API, right? We go into our API stage, then Logs/Tracing, and enable X-Ray tracing. Two clicks, and we are regular engineers. And then we have something like this: we can see the request path, the response distributions, et cetera. Nice, very nice.
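By the way, if you would rather keep those clicks in code, both settings have SAM equivalents; a sketch, under my assumed resource names:

```yaml
Resources:
  SimpleFunction:
    Type: AWS::Serverless::Function
    Properties:
      Tracing: Active            # X-Ray tracing for the Lambda function
      # ... rest as before ...

  SimpleApi:
    Type: AWS::Serverless::Api
    Properties:
      StageName: prod            # hypothetical stage name
      TracingEnabled: true       # X-Ray tracing for the API stage
```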
And we also have information like this: we have the traces, we have the information about all our executions. Please don't look at the last one here because, as you can see, there is no GET; that means this execution was done without the API. We are interested in all of those which
were run through the API. So we have the fastest execution at around 28 milliseconds, and the slowest at around 60 milliseconds. Why such a difference for the same execution, I mean, the same function?
Let's try to find out. We go to the traces. Now we are in the trace which is the longest, and we see a kind of gap here. And when we go into the shortest one, we see this gap here as well; it's a little bit shorter. But again, it doesn't tell us anything, right? We have the invocation somewhere here. Is this a cold start or not? What happened here? I know, because I have worked with Lambda for, I don't know, eight years or something. I know, right? But not everyone knows. So what can we do? We can enable Lambda Insights. Again, we go into the same configuration for the Lambda and we enable enhanced monitoring, and after some time, some time, we will receive another screen, another, let's say, board. Here we can also see more detailed information: what memory was used, the CPU time used, the network I/O; but we also have the duration and the init duration. And the init duration we can understand as the cold start. Okay? So this invocation suffered from a cold start, and those did not.
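For reference, enabling this same enhanced monitoring from the SAM template means attaching the Lambda Insights extension layer and its IAM policy, roughly like this (the layer version is region-specific, so treat it as a sketch):

```yaml
Resources:
  SimpleFunction:
    Type: AWS::Serverless::Function
    Properties:
      Layers:
        # Lambda Insights extension layer; pick the version for your region
        - !Sub arn:aws:lambda:${AWS::Region}:580247275435:layer:LambdaInsightsExtension:38
      Policies:
        - CloudWatchLambdaInsightsExecutionRolePolicy  # lets it write to CloudWatch
```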
We can also enable tracing and logging in the API, right? So we go again to the same place, the same config screen in the API, and we enable all of them: enable CloudWatch logs, et cetera. And additionally, what I suggest is to add a log format for your logs from the API. It's called custom. We enable it by clicking this tick here, and we put this in. And now, the hint for how to become a senior engineer: this was filled in by clicking the JSON button here. Nice one, right? I added only one thing here, the trace id, just to have the trace id visible throughout the whole system. Okay? And of course, we remember to keep the tracing enabled. And after that, we have information from the API.
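In SAM terms, that custom access log format ends up looking something like this sketch (my assumptions for everything except the xrayTraceId field):

```yaml
Resources:
  SimpleApi:
    Type: AWS::Serverless::Api
    Properties:
      StageName: prod
      TracingEnabled: true
      AccessLogSetting:
        DestinationArn: !GetAtt ApiAccessLogs.Arn
        Format: >-
          {"requestId":"$context.requestId","ip":"$context.identity.sourceIp",
          "requestTime":"$context.requestTime","httpMethod":"$context.httpMethod",
          "path":"$context.path","status":"$context.status",
          "xrayTraceId":"$context.xrayTraceId"}

  ApiAccessLogs:
    Type: AWS::Logs::LogGroup
```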
So, good progress, right? We can also use something called Contributor Insights to have different boards, different views, different understandings of what's going on in our system.
But all of this was only about the exterior. What about the things happening inside the function? Here we have AWS Lambda Powertools. It's an, oh my, I forgot, an open source project from AWS, which is very close to OpenTelemetry. And we have multiple implementations: it is ready for Python, TypeScript, Java, and .NET. With it, we can build observability almost out of the box, right? And it is best used with AWS Lambda.
Finally, with that, we can start instrumenting the code. What we will have after the implementation is full information about what's going on in our functionality, right? Even going into specific sub-functions, information about the initialization time, et cetera. It's much, much richer than it was before. We can also have additional metrics, custom metrics. Those metrics are created by Powertools, of course by me instrumenting it; so, as a simple example, I can count how many times the sub-functions were called, right?
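A minimal sketch of that instrumentation with Powertools for Python, under my own function and metric names (the talk's actual code is on the slides):

```python
from aws_lambda_powertools import Logger, Metrics, Tracer
from aws_lambda_powertools.metrics import MetricUnit

logger = Logger(service="simple-service")    # structured JSON logs
tracer = Tracer(service="simple-service")    # X-Ray subsegments
metrics = Metrics(namespace="SimpleApp")     # custom CloudWatch metrics via EMF

@tracer.capture_method
def sub_function():
    # Every call becomes its own subsegment in the trace,
    # and we count how many times it was called.
    metrics.add_metric(name="SubFunctionCalls", unit=MetricUnit.Count, value=1)
    logger.info("sub_function called")
    return "hello"

@logger.inject_lambda_context     # adds request id, cold start flag, etc.
@tracer.capture_lambda_handler    # traces the whole invocation
@metrics.log_metrics              # flushes metrics when the handler returns
def handler(event, context):
    return {"statusCode": 200, "body": sub_function()}
```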
For the logs, this is the information which we have now: so much richer. We have all the surroundings, and we have a very, let's say, organized output, which is always the same, right? And for the SAM model, I know it is quite small on the slide, but the change to the infrastructure as code which I made is about six or seven lines, right? It could be fewer, but I have the log format described there as well, so I didn't do it in one line but in multiple lines.
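Those few lines aren't visible in the transcript, but with Powertools they are typically just tracing plus a couple of environment variables; a hedged guess at what they look like:

```yaml
Globals:
  Function:
    Tracing: Active
    Environment:
      Variables:
        POWERTOOLS_SERVICE_NAME: simple-service    # hypothetical service name
        POWERTOOLS_METRICS_NAMESPACE: SimpleApp    # hypothetical namespace
        POWERTOOLS_LOGGER_LOG_EVENT: "true"        # also log the incoming event
```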
And if you want to, let's say, implement it yourself with Python, you can try this article. And what can we do with all of that? A lot of things, really. We have CloudWatch; we have the possibility to analyze it through Athena; we can go into QuickSight, through OpenSearch and Kinesis; we can put it into Timestream and publish the data through Grafana, right? Or send it to Prometheus, whatever. We can build alerts and alarms on it and act on them using, for example, Lambdas. So there is a lot.
So, for the instrumentation itself, what can we use apart from Powertools? Of course, Powertools, right? But Jaeger also has this possibility, Prometheus has the possibility to instrument your functions, and OpenTelemetry has the possibility to do that as well. For visualization, we can use Grafana, we can use Prometheus, we can use many, many other tools, right? For the databases, we should use NoSQL databases, which is, I believe, obvious; mainly time series, especially for metrics; but for logs, for example, OpenSearch. And we also have all-in-one tools like Prometheus, like Jaeger, like Honeycomb.io, a very nice tool which allows you to control the whole process, right? Dynatrace, for example, as well. And Splunk is quite new here, but Splunk also allows us to build observability.
And finally, a question for you at the end: "Who is monitor your monitoring server?" It is a quote from DevOps Borat. If you have questions, I'll be happy to discuss them with you. You can contact me and connect with me through LinkedIn or my webpage. And also, ah, I strongly recommend, well, I ask you to subscribe to the podcast which I host with my two friends. We talk there about IT, about different aspects of IT. Thank you very much for your time. Enjoy the rest of the day, and I hope this talk was useful for you. Thank you.