Transcript
This transcript was autogenerated. To make changes, submit a PR.
Welcome to this talk. Today we're talking about
observability, one of the strongest muscles for SRE.
So hello everybody, my name is Jonathan Jill.
First, let me start by talking about myself:
who is Jonathan, or jtan 24?
I'm very enthusiastic about Linux and
very enthusiastic about the community. I also like to share
a lot of the concepts that I learn every day on
my YouTube channel, in my repository, and on my Twitter.
So please follow me if you want to
keep in touch with everything
that I share. So welcome
to this talk about observability
and how we can use it to speed up our teams.
Before we go further, I want to share with you
this awesome phrase: "Life is really simple,
but we insist on making it complicated," by Confucius.
This is really true, because with
every challenge
that we have every day, some days could be
harder than others. That is powerful, and that is why I
like this phrase from Confucius.
So what is the agenda for today?
Today we're talking about SRE,
then we continue with DORA: what is DORA and
how can we use it? Then we continue with observability,
the golden triangle, why observability matters for
us, the CNCF landscape, and a little demo.
So with this agenda we
define exactly what topics we want to
cover in this talk, and how, from
my perspective, I check them
and apply them in my current job.
So the first
topic is SRE. SRE
is site reliability engineering. Maybe you have heard
this definition in other talks,
but the objective of SRE is exactly that: reliability.
Because with reliability we need to think every
day about what happens with my application,
with my infrastructure, with my support. Every
day we face new challenges for our application. How do
we move faster with the application? How
can we support it faster? How do we generate reliability
for the application? So I take this definition
from the glossary at
glossary.cncf.io, which has a very
large community behind these
definitions. So what is this?
Site reliability engineering is a discipline that combines
operations and software engineering. The latter is applied
to infrastructure and operations problems specifically.
Meaning, instead of building product features,
site reliability engineers build systems to run
applications. There are similarities with DevOps,
but while DevOps focuses on getting code to production,
SRE ensures that code running in production works properly.
So you have here the two
mixed, DevOps and SRE. But this is
the biggest focus for us: how do we
support the code that we generate from DevOps?
Given the result of DevOps, how do we take
this code and ensure reliability
for that code in production, in staging,
or also in dev? Because it depends on what
kind of tests I need to generate for every application,
and how I can test that application for every
part of my software, right? So that is the objective for us:
how to make the components
inside the company more reliable, how to make
our teams more reliable, and how to make that
very useful.
You can read more in depth about this concept. The
problem it addresses: ensuring applications run reliably
requires multiple capabilities, from performance monitoring, alerting, debugging to
troubleshooting. So that is the
biggest part here: performance monitoring,
alerting, debugging and troubleshooting,
right? Because when we have DevOps
and we start the DevOps journey, we enable
the capability to push the code
more easily to every environment: dev, staging and
production. But what happens once I have deployed
that application in the environment that I requested?
Then we need to enable other capabilities
for that. These capabilities answer the question:
how can I see what happens with my
infrastructure, my application, and what happens
with the communication between the application and the other components that
the application needs to connect to? So that is the
principal objective for us.
How it helps: an SRE approach minimizes
the cost, time, and effort of the software development
process by continuously improving the underlying system.
The system continuously measures and monitors
infrastructure and application components. When something goes wrong,
the system points site reliability engineers to when,
where, and how to fix it. This approach helps create
highly scalable and reliable software systems
by automating operations tasks.
If you check these three parts of
how SRE helps
the company, every part takes
a little piece of monitoring, or observability,
or alerting, which is part of observability.
So that is our focus for SRE.
I invite you to read more about
site reliability engineering,
especially at glossary.cncf.io. That is very
special because there is a lot of community
behind it. So here is a
little quote about the importance of
SRE implementation: SRE is a critical function
for ensuring the reliability, availability and performance of IT systems
and applications. Teams and tools are important to
SRE, but equally important is the ability to
look at the big picture.
The ability to look, right? So every
part that we're talking about is: hey, I need to look at what
happens with my software, my infrastructure,
my communication, my networking. So the objective
is very big, because in every part of
SRE you need to apply some
concepts of observability. How can you see that?
How can you take that information from the application,
from the infrastructure, from the database,
from the message queue or something like that?
You need to identify exactly how your application works
and how your application is
connected to other tasks or other systems.
So that is the objective, and this is part of
why this piece of SRE is so important.
So with this in mind we have
the DORA metrics. The DORA metrics come from
DevOps Research and Assessment. This is a very big part
of DevOps and it also applies to SRE,
because this research
has four metrics:
deployment frequency, lead time for changes, mean time
to recover, and change failure rate. From
the names you can probably get a
general idea of these items.
But the most important thing here is: hey,
if I apply this research to
my company, what is the objective?
When you improve and take a look at
the DORA metrics, you take a look at your maturity,
the maturity that you have inside the company: what is
the capability that you have to deliver
one new feature, what is the time that you take to
resolve a new bug in production,
or, if production is down, what is the time that you take to
recover from that, and how
can you measure that? That is the objective, because you
have here the four components.
But when you start with these metrics
and ask how you can measure them inside
your company, that is the hard part, because you need to identify
your repository, identify the process for requirements,
identify the repository for features. So you
need to take a look at all the scopes
behind the IT process: how you deliver
a new feature to the clients, the customers,
and also your internal customers. That
is the biggest part that you need to cover: how you can
take a
look at the repository,
at Jira
or Confluence, depending on which requirements
management tool
you use internally or something like
that, and ask: hey, how
can I take those metrics, how can I store those metrics,
and how can I build something on top of them to
see what happened with one commit or
one branch or something like that? That is the biggest
challenge behind this, because every day you need to
say: hey, do I have a process for generating that?
Maybe I have a standard for commits, I don't know. You need to
identify what happens inside your company
and whether you need to move beyond
DevOps and take the next step
with SRE, because with SRE you say: hey,
I need more information than I currently
have. I need to start covering the repository,
the management of my tasks or something
like that. So that is very beautiful,
because along the way, as you build your report
on that, you find a lot of conceptual
definitions that people hold but that
are not really true. So when you start to
read and you start to identify and you start
to define, that process is a very nice
path. On the
other hand, when you identify
and start to collect those metrics, you also take a
look at the maturity,
right? The maturity tells you whether you
have a low performance, or maybe a medium performance,
or a high one, or you are a leader
in SRE and follow best practices for the
current IT ecosystem and you deploy
every day, or maybe a couple of times
during the day, or maybe every hour.
That is the objective of this research: identify exactly where
you are and how you could
move to the next step, and what is your
new challenge for every part of this
research.
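To make the four metrics more concrete, here is a minimal Python sketch of how they could be computed from deployment records. This was not shown in the talk: the `Deployment` record, its field names and the sample data are assumptions made up for illustration, and in practice the data would come from your own repository and ticketing tools.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a failure in production?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def dora_metrics(deployments: List[Deployment], period_days: int) -> dict:
    """Compute the four DORA metrics over a window of `period_days` days."""
    total = len(deployments)
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]

    def mean(deltas):
        # Average of a list of timedeltas, or None when there is no data.
        return sum(deltas, timedelta()) / len(deltas) if deltas else None

    return {
        "deployment_frequency_per_day": total / period_days,
        "lead_time_for_changes": mean(lead_times),
        "mean_time_to_recover": mean(recoveries),
        "change_failure_rate": len(failures) / total if total else 0.0,
    }


# Made-up sample data: four deployments in four days, one of which failed.
sample = [
    Deployment(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 15)),
    Deployment(datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 11),
               failed=True, restored_at=datetime(2024, 1, 2, 12)),
    Deployment(datetime(2024, 1, 3, 9), datetime(2024, 1, 3, 13)),
    Deployment(datetime(2024, 1, 4, 9), datetime(2024, 1, 4, 13)),
]
metrics = dora_metrics(sample, period_days=4)
```

The hard part described above is not this arithmetic but filling the records: the commit time, deploy time and incident times usually live in different tools.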
Okay, let's start talking about observability.
So observability is a system property,
right? It's not just a definition.
It's a property of my system: the degree to
which the system can generate actionable insights. It allows
users to understand a system's state from its external
outputs and take corrective action.
I think that is the best part here,
because we currently have other
tools to collect information from
my application. But is that observability?
Let's check during the session.
Computer systems are measured by observing low-level
signals such as CPU time, memory and disk space, and higher-level
and business signals, including API response times,
errors, transactions per second, et cetera.
These observable systems are observed, or monitored.
"Monitored" is the key word in
the last sentence, because currently we monitor:
maybe I enable, in AWS,
CloudWatch, or maybe I take the logs, or maybe I look at
the information about the CPU or the
throughput of my network, but that is
just monitoring your application. So that is what
we're talking about in this session. There are specialized
tools, so-called observability tools. A list of these tools can be
viewed in the Cloud Native Landscape's observability section.
We will see that landscape in a couple
of slides. Observable systems yield meaningful,
actionable data to their operators, allowing them to achieve
favorable outcomes (faster incident response, increased developer productivity)
and less toil and downtime. That is one
of the parts of reliability for
SRE. One of the pieces could be what
SRE sells to us: what happens
if I implement SRE in my company?
Hey, you reduce the toil. That is one scope.
But the other scope is: hey, how can we observe,
how can we take those metrics? How can we build that dashboard?
How can we generate the ROI for our company?
How can we sell to the managers
that SRE is the right path for us?
If we have DevOps, maybe the next step
is SRE. Maybe, but you need to identify
exactly, with DORA, the maturity that you currently have.
Okay. Consequently, how observable a system is will significantly
impact its operating and development costs,
right? You can deliver this to
other teams, and the other teams get the most
value out of it, because you enable the
system to be observable. And you also
get rid of some of your current
chores, like: hey, I need to take a look at
the information about my
accounts in production; how can I get that information?
If you observe, you can enable a couple of dashboards
and deliver those dashboards. Then the teams can
see what happens in the application without
access to the underlying infrastructure. That is a nice
part, because you enable the capability, and
if you sell that part to security, they open their eyes
and say: hey, that is the way, because they can
observe the application but they don't need
access to the infrastructure behind it, or access to
the code or something like that. They just observe what
happens inside. It's also pretty good when you
sell it to your company and to your manager:
what is the best part that you can take
from this key feature for your team,
and how do you reduce the current operational
tasks behind every need that the
team has? That is a
big one. Next is what relates to APM,
application performance monitoring. So is observability
different from application monitoring? Up to
this point we talked about monitoring, but is it
the same? Not exactly. This is where
we open the path and make the split between
what is observability and what is APM. APM is the
capability to observe, to see,
to obtain, but not exactly to generate
an impact for the organization, because you just see
what happens inside your components but you don't
do anything more with that. But when you start to
move to the next step with observability,
you start to generate alerts, and maybe take
an alert and trigger an automatic
task to resolve every alert,
or resolve a couple of alerts that you have identified. That
is the biggest difference here between
observability and APM. Other differences exist too,
but this is the part
where they split,
right? We talked about APM; for observability, let's talk
about the golden triangle. The golden triangle is about
the information behind my application or
my infrastructure, right? When we talk
about infrastructure, we talk about the components behind my application: databases,
queues,
networking,
maybe firewalls or
other networking tools or elements that you need
to deploy to support your application. So the first part:
we need to obtain the logs and also store
those logs. Logs are
unstructured data that provide a record of events and
actions within a system, right? These logs
are typically text-based and are used to capture
information about system behavior, errors and other relevant
events. Logs are very
useful for troubleshooting issues and identifying patterns or trends
over time. How can we obtain
these logs and how can we store them? The
next one is metrics. Metrics are
structured data that provide a quantitative
measure of system performance or behavior.
Metrics are typically numerical values that can
be aggregated over time and are used to monitor
key indicators of system health and performance.
Metrics are very useful for detecting
anomalies, setting performance targets and making data-driven
decisions. So that is
metrics. Next are traces.
Traces are a record of the interactions between components
or services within a distributed system.
Traces capture timing and context information for each
interaction, allowing developers and operators to identify
bottlenecks and performance issues.
Traces are useful for understanding how requests
flow through a system.
That is very important, because when you have the traces
you identify exactly where the request
goes. For example, you have
access to a login page; the request goes down
to the back end, and the back end takes that request
and performs some operations. What kind of operations?
Maybe it stores the access of
a user, maybe it reads the database to
obtain information from the users table, or checks
whether the user logged in successfully and obtains
information about the user. How can you identify
that? You enable the traces. And
how can you enable the traces? That is the biggest challenge
here, because you need to instrument your application. That is
a key part of observability. How can you
enable this capability for your application? How can you
enable this bigger scope for the application?
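As a rough idea of what instrumenting that login request could look like, here is a minimal Python sketch using only the standard library. It is not the speaker's code, and a real project would normally use a library such as OpenTelemetry instead. The names `log`, `span` and `handle_login`, and the metric names, are hypothetical. The log lines are written as JSON here for convenience, although logs are often plain text.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

metrics = Counter()   # metrics: numeric values that can be aggregated over time
logs = []             # logs: a record of events (one JSON line per event here)
spans = []            # traces: timed steps that share one trace id per request


def log(event, **fields):
    """Append one log line describing an event."""
    logs.append(json.dumps({"ts": time.time(), "event": event, **fields}))


@contextmanager
def span(name, trace_id):
    """Record how long the wrapped block took, tagged with the trace id."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_login(user, known_users):
    """Hypothetical back-end handler emitting all three signals."""
    trace_id = uuid.uuid4().hex
    metrics["login_requests_total"] += 1
    with span("backend.login", trace_id):
        with span("db.read_user", trace_id):
            found = user in known_users   # stands in for a database lookup
        if not found:
            metrics["login_failures_total"] += 1
        log("login", user=user, success=found, trace_id=trace_id)
    return found


handle_login("alice", {"alice"})
handle_login("mallory", {"alice"})
```

The shared `trace_id` is what lets you follow one request across the back end and the database step, which is the point of the traces described above.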
So that is a very nice topic to
discuss with your dev team and also with architecture,
because you may need to modify your company's
framework, or maybe create
a new framework for that, and also use an
architectural design pattern for that
part. That is a very nice part because you
define with the other teams how the
system will be observed, how you can expose
the metrics and how you can collect those metrics for storage.
That is a very awesome and very nice path
for you. So why
observability? That is the
main part of this session. We
talk about observability: hey, observability seems pretty good,
but do I really need observability? Do I really need
to take advantage of observability? What happens if I don't have
observability in my company? Does
anything change for me? Not really. But you gain
a lot of new features to enable for your teams and also
for you when you enable observability. For example,
this is a simple application that you might
have in your company. You have here the customer;
the customer comes to your application through your
cloud provider. In the cloud provider you may have a
firewall to identify whether the request
is legitimate or not, and then
there are a couple of load balancers to
send the request to the correct
application. The application takes
that request and runs the logic behind it, and
may consume the database or make some
requests to your storage.
But what happens if
you move to the next step?
You start to split your application, you split your
business capabilities and you expand
your business. What happens here? You have the same application,
or maybe this part is just
a little part of the application, and
you have a challenge here. Maybe you
did that, but what happens when you
start doing that? What happens with this diagram
when you need to start to
identify: hey, the firewall is wrongly
configured, or, I don't know, maybe the database is
down or a replica is down? What happens
if my application is down? How do I identify that?
Or maybe it's the load balancer, or maybe my
region is down. How can you identify that?
Or maybe my archive has
gone down too. How can you deal with that? How do
you identify it? You need to enable observability,
because otherwise how can you see that? How can you take
action on the application, or the infrastructure,
or the components that support the application?
Because normally when the request
fails you take a look and say: hey, the
request arrived fine at the app, so the problem
is the app. Then you pass the request to
the developer team. The developer team says: hey, I ran it
locally and locally it worked for me. That is normal.
Or maybe you can replay the same request in the
development environment or the QA environment.
In that environment it works well. But how can you see
whether the archive, or the database,
or the load balancer, or the firewall, or the cloud provider is wrong?
How do you identify that? That is exactly the
question. Does the application work
well, do the components the application uses work well, does my
infrastructure work well? Hey, how can I take a look at
that? I need to have this very important
observed system for my company, right?
Behind that there could be other components, like
DNS, or you may add security compliance,
or you release new fixes. I
don't know, every day you need to cover new features from
different users, not from the customer,
just from different users: security, maybe,
or compliance, or the legal team of your company,
or someone else. Then how
can you identify exactly at what point the
application went wrong? Then you
can observe that system and you identify: hey, yesterday
we made some little changes
for a new legal requirement that
the company requested. When we deployed
that, the configuration of the firewall could be wrong,
or maybe the security team released a
new feature too, because they also needed to
update the firewall or the security group or the
rule behind my load balancer.
Then the application starts to fail. So how
can you see that? Again: why observability.
Another part of observability is alerting.
When we only observe, it's just: hey, I
can see. But when you enable observability you
observe and you alert on that. When you start
alerting, you start to move further along the observability
path, right? That is the objective of
observability. Another one is: hey, we need to take a
look at dashboards, we need to automate some remediation
for case X or something like that.
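A very small Python sketch of that idea, automating a remediation for known cases and falling back to a human otherwise, could look like this. The alert names, the alert fields and the two remediation functions are all hypothetical; real handlers would call your own tooling rather than return a string.

```python
from typing import Callable, Dict, List


def restart_service(alert: dict) -> str:
    # Placeholder action: a real handler would restart the service.
    return f"restart {alert['service']}"


def scale_out(alert: dict) -> str:
    # Placeholder action: a real handler would add capacity.
    return f"scale out {alert['service']}"


# Map each known alert name to its automated remediation.
REMEDIATIONS: Dict[str, Callable[[dict], str]] = {
    "ServiceDown": restart_service,
    "HighCPU": scale_out,
}


def handle_alerts(alerts: List[dict]) -> List[str]:
    """Return the action chosen for each alert; unknown alerts go to a person."""
    actions = []
    for alert in alerts:
        remedy = REMEDIATIONS.get(alert["name"])
        if remedy is not None:
            actions.append(remedy(alert))
        else:
            actions.append(f"page on-call for {alert['name']}")
    return actions


actions = handle_alerts([
    {"name": "ServiceDown", "service": "api"},
    {"name": "DiskFull", "service": "db"},
])
```

Keeping the mapping explicit makes it easy to start with one or two well-understood cases and leave everything else to the on-call person.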
Maybe you move to observability
as code. That is a new movement for our
companies, where you could auto-instrument the application,
or maybe you have a Kubernetes cluster
and Kubernetes has operators inside it that take information from
the applications and generate
the instrumentation for you. That is one way of
doing that.
So bye-bye to this little
diagram, and let's start to talk about the CNCF
landscape. There is a very large set of tools
for observability and monitoring. Here we also have
the tools that every
cloud provider has enabled today. So we have
Amazon CloudWatch, AppDynamics, Application
High Availability Service,
Applications Manager, AppNeta, AppOptics,
AppSignal, Aternity, Azure Monitor, Beats.
You have here a lot of tools from the
community and can see how the biggest companies
use them for their services. In another
section, for logging, you can use
Graylog, Humio, LOGIQ, Loggly, Fluentd, Grafana
Loki, Elastic or Logstash. What you
enable depends on your expertise, or maybe you
need to take a look more in depth at this landscape to define
the current stack for your company. Right.
How do we move to tracing? How
can we enable tracing? Hey, there is a very large
community behind this component: OpenTelemetry.
The community behind it is the
second biggest after Kubernetes:
Kubernetes is first, OpenTelemetry is second. I recommend
that you take a look at this community. They
are a very, very active community,
shipping new features every day for
you, and they support a lot of languages
and components.
Okay, another part of observability is
chaos. Yes, chaos is
part of observability, so we need to take a look at that too.
It's pretty awesome.
Here you take a look at the CNCF landscape
and what is in it, and also at information
about the continuous optimization behind observability. Take
a look and ask: hey, how do I configure that? How
can I collect that log? How can I collect that trace?
You enable other capabilities when you
start a journey toward observability. You also
enable the capability for optimization, because you
see: hey, I have maybe one
instance that is overloaded by
the application; is that normal or not?
Or maybe you identify that the pods you
deployed in the Kubernetes cluster are
overloaded too, or maybe you create an HPA
definition for that, or maybe you identify that the processes
they execute could be moved
to another node. So that is another
capability that you enable when you start the journey toward
observability. But then you need
to take a look at your company.
You need to look at exactly how you move toward
that. But first you start by defining what observability is
and how the company uses observability.
Right. So we
move to the demo session. This is the repository.
If we click here we move quickly to the
repository, which is here.
And you can see here the information about
how you can execute this demo.
For this part we use the
capabilities of Docker. You need to install
Docker first. And this is the disclaimer:
don't run this in production, because it's
not for a production environment, just for local use. It's for your POC,
to identify how you can enable observability for the Docker
stages. Right. So we deploy first
cAdvisor, second Prometheus,
and third Grafana. Right?
So with that in mind, I put here
some references that I used for doing this and
for producing this README file for you. Also,
the license here is the Apache license, for you and your time
and your team. You can copy and drop
this into your code, and I don't mind.
That is exactly what happens
when you work with the community, right?
So I cloned the repository here. You can also
download the repository. Let me check
here: when you run git remote,
you see here the current information about the repository
that I cloned, and we also have the files
here. I have one extra
folder, because in the README file
we talk about it, exactly
in this part, the optional
point, which is for workloads for your Docker environment.
I downloaded it beforehand to make this
easier for you.
I also added an entry here
in the .gitignore so this folder is not committed,
because it's only used when you download the
folder, and it should not generate garbage in the repository.
So with that in mind we start to execute the
instructions in the repository.
The first is to install cAdvisor. So let
me check here: docker ps, docker
ps. We have here cAdvisor,
Grafana and Prometheus from before. Let me remove those:
cAdvisor,
Prometheus and Grafana, right?
That is the optional point, which
doesn't matter for now. docker ps, docker
ps. Here we
go, we cleaned our
environment. So the first part is install Docker,
sorry, install cAdvisor. What is
cAdvisor? cAdvisor can be seen as
a proxy that obtains metrics from the
current Docker installation and
exports those metrics, right? When we move
to the presentation, here we have the localhost
address, so we can copy and paste it into a new tab.
Here we go. We currently have the metrics that
cAdvisor is taking from the current Docker
environment, right? So the next
one is the GUI for
cAdvisor, and these are the current metrics.
That is a lot of metrics that you can see here, because it takes
these metrics from Docker and
exposes them so you can use them as you
prefer, right? For that we enable
Prometheus. So we copy the Prometheus
YAML file. But what does the Prometheus YAML file contain?
Let me see here. This is the Prometheus YAML
file. For this part you need to check what exactly your
internal IP address is for that deployment.
In my case I have done that, and
we can start to deploy this part.
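The exact file from the repository is not reproduced in this transcript, so the following is only a sketch of what such a Prometheus configuration might contain, written as a small Python helper that renders it for a given internal IP. The cAdvisor port 8080 (its usual default), the 15s scrape interval, the job names and the sample IP are assumptions, not values taken from the demo.

```python
def render_prometheus_config(host_ip: str,
                             cadvisor_port: int = 8080,
                             scrape_interval: str = "15s") -> str:
    """Render a minimal Prometheus config with two scrape jobs.

    One job scrapes Prometheus itself, the other scrapes cAdvisor on the
    host's internal IP address (assumed port: 8080).
    """
    return (
        "global:\n"
        f"  scrape_interval: {scrape_interval}\n"
        "scrape_configs:\n"
        "  - job_name: prometheus\n"
        "    static_configs:\n"
        "      - targets: ['localhost:9090']\n"
        "  - job_name: cadvisor\n"
        "    static_configs:\n"
        f"      - targets: ['{host_ip}:{cadvisor_port}']\n"
    )


# 192.168.1.10 is a placeholder; use your own internal IP address.
config_text = render_prometheus_config("192.168.1.10")
print(config_text)
```

The internal IP matters because, from inside the Prometheus container, `localhost` points at the container itself and not at the host where cAdvisor is published.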
So we copy the Prometheus config here,
cp the Prometheus conf to /tmp, and
then we can start to run the Docker container.
So let me paste the command
here. We run Prometheus, publish port
9090, and mount
the Prometheus conf into
the container at /etc/prometheus/prometheus.yml.
And the image that we use
is from Prometheus. Here we go. Then
docker ps: we currently have two services
in Docker. What does cAdvisor do?
It reads the metrics; these are
the metrics you would get from docker stats.
docker stats, and
those are the current metrics behind it. When we bring up
the optional point we see more metrics here.
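The command pasted in the demo is not captured in the transcript. Based only on the description (publish port 9090, mount the config into the container, use the Prometheus image), a plausible reconstruction is sketched below as a Python helper that builds the argument list. The container name, the `/tmp/prometheus.yml` path, the detached flag and the `prom/prometheus` image name are assumptions.

```python
import shlex


def prometheus_run_command(config_path: str, name: str = "prometheus") -> list:
    """Build a `docker run` argument list for a local Prometheus container."""
    return [
        "docker", "run", "-d",            # run detached, in the background
        "--name", name,                   # container name (assumed)
        "-p", "9090:9090",                # publish the Prometheus port
        "-v", f"{config_path}:/etc/prometheus/prometheus.yml",  # mount config
        "prom/prometheus",                # official image name (assumed)
    ]


command = prometheus_run_command("/tmp/prometheus.yml")
printable = shlex.join(command)   # a copy-pasteable shell command line
print(printable)
```

To actually start the container you could pass `command` to `subprocess.run(command, check=True)`; the sketch only prints it so it stays safe to run anywhere.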
Okay. The next point is: hey, we need access to
see what kind of metrics we have, and we cover that
with Prometheus. Right.
We see here the status of the components and
the configuration. I have here the information
about the scrape configs: one for
Prometheus's own metrics and
one for the cAdvisor metrics at my
internal IP address. Right. When you look at
the menu here: Graph,
Alerts (none), and under Status: runtime,
configuration, and targets.
We have two endpoints:
one is the endpoint for cAdvisor and the other is the
endpoint for Prometheus itself, which currently collects its own metrics.
Here it takes the current Docker
name, localhost here,
localhost. You see another
metric here. If you follow
the instructions, you just need to copy and paste.
That is the easiest way to do that, right?
That is for targets. When you look at the targets you
can check them in exactly the same way.
So now we need to deploy Grafana
to see what happens in Prometheus. Because, hey, that is
very awesome, the metrics are text-based, but
sincerely I don't have time to read that, and that's very normal.
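To give a feel for that text-based format, here is a small Python sketch that parses a few lines in the Prometheus exposition style into names, labels and values. It is a simplified parser for illustration only (it ignores timestamps and escaped characters inside labels), and the sample values are made up rather than taken from the demo.

```python
import re

# metric_name{label="value",...} 123.4   (the label part is optional)
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        match = LINE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            labels = dict(LABEL.findall(raw_labels or ""))
            samples.append((name, labels, float(value)))
    return samples


# Illustrative sample in the style of what cAdvisor exposes.
sample_text = """
# HELP container_memory_usage_bytes Current memory usage in bytes.
# TYPE container_memory_usage_bytes gauge
container_memory_usage_bytes{name="grafana"} 52428800
container_memory_usage_bytes{name="prometheus"} 104857600
up 1
"""
samples = parse_metrics(sample_text)
```

In the demo nobody parses this by hand: Prometheus scrapes and stores it, and Grafana draws it, which is why the next step is to deploy Grafana.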
We can copy this and move to the demo
session here. Here we
go. docker ps: we
currently have Grafana on port 3000.
Move to the presentation and
copy. Okay,
and open Grafana. Here you need
to log in to your Grafana account. That is admin,
admin; that is the default
password for us:
admin, admin. Here we
go. We now have access to Grafana.
That is pretty awesome, because, hey, this is a
very nice ecosystem for us.
And you see here
exactly Grafana. Let me move here at this part,
because the presentation doesn't work
well. Okay. That is the first part.
When we move to the repository we
see: hey, we need to create
a couple of configurations inside Grafana.
This part may look
wrong here, but you can open the raw README
file and view it in raw mode, which is
more readable for you, right
here. So, down here: first we
go to Administration, Data sources, add a new data source,
select Prometheus, in URL put http://
your IP, port 9090, navigate to the bottom,
Save & test, and done. Right,
let me put this here and move to
that. So how do we get there? Here, the
three lines at the top.
Okay, three lines at the top.
From there move to
Administration here, and
Data sources. Right. When you start here you can
add a data source and select Prometheus.
Those are all the settings. Then you need to put
here HTTP: http, colon,
slash slash,
then
my IP, which is where it is going to take
the information from Prometheus. You can
copy and paste it here and
just change the port. Right. The port is the one
we saw in the previous information
here: port 9090.
Okay, I navigate to the bottom,
Save & test. That is green. That is
awesome for us, because green means
everything is going well. Okay.
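The demo does this through the UI. If you wanted to script the same step, Grafana also has an HTTP API for data sources; the sketch below only builds the JSON body you might send to it and does not call Grafana. The field names follow that API as I understand it, so check them against the documentation for your Grafana version; the data source name and the sample IP are assumptions.

```python
import json


def prometheus_datasource_payload(host_ip: str, port: int = 9090) -> str:
    """Build a JSON body describing a Prometheus data source for Grafana."""
    return json.dumps({
        "name": "Prometheus",                  # display name (assumed)
        "type": "prometheus",                  # data source plugin type
        "url": f"http://{host_ip}:{port}",     # same URL typed in the UI
        "access": "proxy",                     # Grafana's server makes the requests
    })


# 192.168.1.10 is a placeholder for your internal IP address.
payload = prometheus_datasource_payload("192.168.1.10")
print(payload)
```

Scripting it this way is one small step toward the observability-as-code idea mentioned earlier, since the data source definition can live in the repository.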
Return to the home, because the next step is to enable
the new dashboards
that we need to take a look at today. So let me move
here very quickly. We
follow the
information here. That is the reference, but we deploy these
monitors for our system. For that
you can go to this web
page, and
here appears a little button for
the dashboard: you copy the ID to the clipboard and move
here. How can you import that dashboard?
Go to the three lines, go to Dashboards here,
and in the New section (let me move
this pretty quickly),
in the New menu you go
to Import and paste here
exactly the ID that you copied on the previous
page. Then Load.
On the page that loads you select
your data source, which is Prometheus. The name you
can change, or you can create a new folder for it,
then Import, and voila. That is the information about
the current metrics inside Docker. That is
pretty awesome, because you can see here exactly what happens
with your Docker. Here there is another
dashboard for Docker monitoring using
Grafana too, so you can copy that
in the same way: three lines, Dashboards,
save changes,
New, Import, paste
the information. For this other one I want
to choose Prometheus and Import.
Voila. This is another dashboard where you can see the
same information with another view. That is
pretty awesome, because we enable the capability to observe
our system, right? That is the first part. But with
Grafana and also with Prometheus you can move
here very quickly and set up alerting.
You can put your alerts here. You can put here
the rules that make the most sense for you and your
company, based on the behaviors you
identify in the current Docker, or
your application, or your Kubernetes cluster, or your infrastructure.
You have the possibility to enable here
all the alerting that you have. You can also put every rule here.
In the contact points you have a lot of
ways to deliver that information:
using Alertmanager, or communicating with another
ticket manager, or using Discord or email.
Or you could send it to Kafka or PagerDuty,
or send an alert to Slack for your team, or generate an alert
in Telegram or something like that. You have
the capability to connect Grafana with the whole
scope of your current stack
for your IT infrastructure. The webhook
is pretty awesome, because with it you have the capability
to send to another provider that Grafana
doesn't exactly support out of
the box. You can send the information about
the alert and what happened with that alert, and generate
maybe a custom payload for that, sending just
the related information that is most valuable for you, so
the alert will be more useful for you,
right? So we can move on here
very quickly. That is the demo.
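As an illustration of the custom payload idea, here is a minimal Python sketch that trims an alert down to the fields a receiving system might care about and prepares an HTTP POST for it. This is not Grafana's own webhook format: the alert fields, the payload shape and the URL are hypothetical, and the request is only built, never sent.

```python
import json
import urllib.request


def build_webhook_payload(alert: dict) -> bytes:
    """Keep only the information that is valuable to the receiver."""
    payload = {
        "title": f"[{alert['severity'].upper()}] {alert['name']}",
        "summary": alert.get("summary", ""),
        "source": alert.get("instance", "unknown"),
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def build_webhook_request(url: str, alert: dict) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST carrying the custom payload."""
    return urllib.request.Request(
        url,
        data=build_webhook_payload(alert),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Made-up alert and a placeholder URL for some ticketing system.
alert = {
    "name": "HighCPU",
    "severity": "critical",
    "summary": "CPU above 90% for 5 minutes",
    "instance": "cadvisor",
}
request = build_webhook_request("https://tickets.example.com/hooks/alerts", alert)
body = json.loads(request.data)
```

Sending it would be a single `urllib.request.urlopen(request)` call once the URL points at a real receiver.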
Please tell me in the comments
how this little demo looked to you.
Now the summary for observability. Observability is a hidden
muscle for SRE success. We need to enable observability
in our companies. We saw how observability makes
our teams stronger, how we can enable other
things for the company, and also the
view of SRE and DORA, observability, the golden
triangle, the alerting and remediation that we talked about, and
a little demo with some practice for
observability. Another
point: effective observability requires a focus on business value,
alignment with organizational goals, and
a culture of continuous improvement. How does it fit,
and is the objective of observability the same as the
company's? Not just: hey, I have one
way to build an observability stack, but the
company doesn't need observability. I don't know.
You need to align this with your company.
Another point is the best practices, which include investing in the right
tools, prioritizing data quality and embracing a collaborative
and data-driven approach. That is another thing you
can enable with observability. And thank you.
That is all for this session.
Please feel free to drop a comment or contact me
as jtan 24 on GitHub,
YouTube or Twitter.
Or you can find me as Johnny Pong on LinkedIn.
Please feel free to contact me. I have
some time to respond to you. So thank you for your time.
I hope you have enjoyed this session. Thank you
and see you.