Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, welcome to my talk about observability of microservices using
open source tools. Hope you're having a good time at the conference so far
and that you take away something valuable from my talk. So, a quick
introduction to break the ice: hi, I'm Shubham and I'm a developer
relations engineer at Zenduty. It's an advanced incident management and response
orchestration platform. I'm an expert in making mistakes, learning from
them, and advocating for best practices for setting up DevOps,
SRE and production engineering teams. So the burning question:
what is this talk about? Firstly, we'll have a quick brief on
observability, something I'm sure the brilliant speakers before me have done a good job
covering, but we'll quickly glance over it and why
we need observability from day one. Then we're going to be setting
up an observability system using open source tools in record time.
And finally, we'll talk about what comes after you have
observability set up.
So again, what is observability?
It's the ability to measure the internal states of the system by just
examining its outputs. So observability gives engineers a
rather proactive approach to optimizing the system.
It provides a connected, real time view of all the operational data in your software
system, as well as the flexibility to ask questions on
the fly about your applications and get the answers that you need.
And that is the key part: not having to know ahead of time what you
wanted to ask. It's also about understanding the service
level required to meet the customer needs, and then instrumenting
the necessary components to ensure that the desired level of
performance is achieved. It is not about collecting a colossal amount
of data just to have it all the time, but rather about tracking the
right metrics to monitor the
key performance indicators relevant to your software and adapting the
system to meet customer expectations.
So why do I keep hearing about it all the time now?
The word has been thrown around everywhere, and every fourth post on
LinkedIn has observability as its core concept.
It's a fair question. The concept has been around for a considerably long time now, and the
buzz around the word has begun to catch on only very recently,
like about five to six years ago. Why now? Well,
simply put, software complexity is increasing at an exponential
rate, and products are innovating in a crazy,
unpredictable direction. On the infrastructure side,
microservices, polyglot persistence and containers are enabling the
decomposition of monoliths into agile, complex, ever evolving
systems. Meanwhile, on the product and platform
side, creative solutions are empowering users to do new
and exciting things. However, this creates challenges
for developers to build stable and reliable components.
As recently as five years ago, I think most systems were much
simpler. You'd have a classic LAMP stack:
one big database, an app tier, a web layer, a caching layer,
some basic load balancing, and you could predict most failures.
You could craft a few dashboards that addressed nearly every
performance RCA that you might need
over the course of time, your team did not spend a lot of
time chasing unknown unknowns, and
it was relatively easy to find out what was
really going on behind the scenes. But now, with a platform
or a microservices architecture, or millions of unique
users or applications connected, you have a long fat
tail of unique questions to answer all the time.
There are many more potential combinations of things going wrong, and sometimes
they connect in a manner that is hard to decipher.
Now why do we need day one observability?
Why are we solving a problem before it even exists, right? When
environments are as complex as they are today, simply monitoring for
known problems doesn't cover the growing number of new issues that arise.
These new issues, the unknown unknowns that we talked about earlier,
mean that without an observable system
you don't know what is causing the problem and you don't
have a standard starting point to find out.
So without deep observability, it is natural to
make assumptions about production system behavior,
including what we think may be potential performance bottlenecks or failure scenarios.
When failures do occur, we are often in the dark as
to why they have occurred, and about both the impact and potential fixes.
This leads to wasted time and effort throughout the organization,
jumping from one theory to another, one change to another maybe.
And you really do not understand how
any of this is going to impact the system or whether customers are
impacted. The cost of this guesswork is exponentially
high and it can escalate quickly
at any stage, especially in production. While Kubernetes
can help recover from some failures, and we've seen it be
a lifesaver for a lot of organisations, there are many scenarios where
that can cause the system to run suboptimally or fail continuously.
And even when service availability is maintained,
performance bottlenecks can result in premature auto scaling,
resulting in the excessive use of costly cloud computing resources,
to say the very least. Indeed, there are cases where the first
sign of failure is an astronomically high cloud
computing bill. So now, without any further stalling,
let's move on to the meat of it: building a complete open source solution for
extracting and shipping traces, metrics and logs, and
correlating between them. So, a few prerequisites
to fulfill for this demo. First up we have the kind tool.
Kind is a tool for running local Kubernetes clusters using Docker container
nodes. Kind was primarily designed for testing Kubernetes
itself. And yeah, if you have Go 1.17
plus and Docker installed, all you have to do is run this command to
install and run kind locally. I'll be sharing
all links, snippets and resources I mentioned in the talk in a gist and
you can find the link for the same down below.
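For reference, the install-with-Go route mentioned above usually looks something like this; the kind version pin and the cluster name are just examples, so check the kind releases page and the gist for the exact command:

    # install the kind binary with Go 1.17+ (version shown is only an example)
    go install sigs.k8s.io/kind@v0.20.0
    # spin up a local cluster backed by Docker container nodes
    kind create cluster --name observability-demo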
Next up we have kubectl, the Kubernetes
command line tool which allows us to run commands against Kubernetes clusters.
And finally we have Helm. Helm helps
us manage Kubernetes applications via Helm charts, which help
us define, install and upgrade even
the most complex Kubernetes applications. So let's
move on to the observability backend. As there is currently no one
database that can store all logs, traces and metrics,
we will deploy three different databases and a visualization
tool. So we have Grafana
for dashboards and visualization of the data.
We have Prometheus, which records real time metrics in a
time series database. It allows for high dimensionality
with flexible queries and real time alerting.
We'll use Loki, which is a horizontally scalable,
highly available and multi tenant log aggregation system inspired
by Prometheus. And finally we'll use Tempo,
which is an easy to use and highly scalable
distributed tracing backend. Tempo is cost efficient
and requires only object storage to operate, and
naturally it offers very deep integrations with Grafana, Prometheus and
Loki. So now we need the observability
control plane which will automatically instrument
our applications, and for that we'll be using Odigos.
So Odigos is an open source observability control plane that helps
us in two ways. Firstly, automatic instrumentation:
Odigos automatically instruments your applications and produces
distributed traces and metrics without any code changes.
And secondly, collector management: Odigos automatically deploys
and scales collectors according to application traffic. So we
don't need to spend a lot of time deploying and configuring collectors.
So now for this demonstration, our target application
will be a microservices based application written in
Java and Python. We'll be using a fork of the Bank of
Anthos example application which was made by Google, and you can find
the link in the resources section. You can deploy
the target application by simply running the following command
and you should be good to go.
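The exact fork and command are in the gist; as a rough sketch, assuming the upstream Bank of Anthos repository layout, deploying it onto the kind cluster looks roughly like this:

    # clone the application (the talk uses a fork; upstream shown here as an example)
    git clone https://github.com/GoogleCloudPlatform/bank-of-anthos.git
    cd bank-of-anthos
    # apply the demo JWT secret and the service manifests, per the upstream README layout
    kubectl apply -f ./extras/jwt/jwt-secret.yaml
    kubectl apply -f ./kubernetes-manifests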
To install the observability backend,
we'll execute the following helm chart, which deploys Tempo, the traces
database, Prometheus, the metrics database, and Loki,
the logs database, as well as a pre configured Grafana instance with
those databases as the data sources.
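The single pre-configured chart used in the talk is linked in the resources. Purely as an illustration of what it bundles, a comparable backend could be assembled by hand from the public Grafana and Prometheus community charts, something like:

    # illustrative substitute for the single chart used in the talk
    helm repo add grafana https://grafana.github.io/helm-charts
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
    # one release per backend, all in a shared namespace
    helm install tempo grafana/tempo -n observability --create-namespace
    helm install loki grafana/loki -n observability
    helm install prometheus prometheus-community/prometheus -n observability
    helm install grafana grafana/grafana -n observability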
So once our test application is up and running and our observability databases
have been set up and are ready to receive data, we'll install
Odigos as the control plane to collect and transfer logs,
metrics and traces from our applications to the observability
databases. So to install Odigos via the Helm chart,
you just need to execute the following commands, and after all
the pods in the odigos-system namespace are running, we need to
open the Odigos UI by running the port forward command.
Once that's done, all you need to do is navigate to localhost
3000 to see your Odigos setup, ready to go.
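For reference, the Odigos Helm install and the UI port forward mentioned here generally look like the following; the chart repository URL and the UI service name are taken from the Odigos docs as I recall them, so double check them against the gist:

    # install Odigos into its own namespace via Helm (repo URL as documented by Odigos)
    helm repo add odigos https://odigos.io/odigos-charts
    helm install odigos odigos/odigos --namespace odigos-system --create-namespace
    # once the pods are running, expose the UI locally on port 3000
    # (the service name below is an assumption; adjust to whatever the chart creates)
    kubectl port-forward svc/odigos-ui 3000:3000 -n odigos-system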
So now there are two ways to select which applications Odigos should
instrument. The first one is opting out, basically,
which will instrument everything, including every new application that will be deployed in
the future. You can still exclude any application that you
do not want to be instrumented. And the other way is opting in, where
you select only a certain few applications that you want to be instrumented.
For the purposes of this tutorial, we'll use the simple way,
which is opting out and instrumenting everything.
So the next step is to tell Odigos how to reach the three databases
that we just deployed. And to do that we'll just add
the following three destinations
inside Odigos: Loki, Prometheus and Tempo.
To do that, you just need to enter these URLs as mentioned over here.
These are also present in the resources section, so feel free to use them
from there.
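The actual URLs are the ones shown on the slide and in the resources; purely as a hypothetical illustration, in-cluster destinations normally point at the Kubernetes service DNS names of the backends, along these lines:

    # hypothetical examples only; use the URLs from the slide and the resources
    Loki:        http://loki.observability:3100
    Prometheus:  http://prometheus-server.observability:9090
    Tempo:       tempo.observability:4317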
And all we need to do after this is wait a few seconds for Odigos
to finish deploying the required collectors and instrument the target
applications. And now finally,
the last step is to explore observability
data in Grafana. We can now see and correlate metrics
to traces to logs in order to dive deeply into how our application behaves.
And to do that, we'll need to port forward the Grafana instance by
running the following command. Then we navigate to localhost,
enter admin as the default username, and for the password, you need to enter
the output of the following command. Please note that there's a percentage
symbol at the end of the string. Make sure that you remove that while you're
entering it as the password.
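A sketch of what those two commands usually look like with the standard Grafana chart; the release name, namespace and local port here are assumptions, so adapt them to your setup. The trailing percentage symbol the speaker mentions is just the shell marking a missing newline, not part of the password.

    # expose Grafana locally (service name and namespace depend on how the chart was installed)
    kubectl port-forward svc/grafana 3000:80 -n observability
    # print the auto-generated admin password stored by the Grafana chart
    kubectl get secret grafana -n observability -o jsonpath="{.data.admin-password}" | base64 --decode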
And now time to see the power of data.
So just log on to your Grafana instance and
we'll start by viewing a service graph of the microservices
application. To do that, just go to Explore over here,
make sure Tempo is selected on top,
choose the service graph and just run
the query. And there you go, you have a basic
node graph available to you, with a single user service application.
Now let's use some metrics, right? So we'll click on the user service
and we'll choose request rate. And there
you go. A graph for all of the metrics that we want is presented
to us. There are three kinds of metrics that Odigos supports.
Firstly, metrics related to the running of the application,
that is, number of HTTP requests, latency, DB connections.
Secondly, metrics related to the language runtime, that is, threads,
heap, all of that. And thirdly, metrics related to the host environment,
which is CPU, memory and disk usage.
So now let's look at some traces. So this time
we'll click on the user service and just select histogram.
And in order to correlate metrics to traces we'll use a
feature called exemplars. To show exemplars we'll
need to click on Options over here and just
enable exemplars. These tiny diamonds will
now be visible to you. You can just select any of these and click
on Query with Tempo. And there you go. A trace like
this should be presented to you. You can see exactly how much time
each part of the entire request took. And digging into one of
the sections will show additional information such as database queries,
HTTP headers and so on. Now if
you want to drill down further, we'll go into logs,
which is through Loki, and you can just simply
query the relevant logs as we need them. To do that we'll just first
choose our namespace,
use this trace ID as an identifier,
and just run the query. And there you go, you have
all the data relevant to this particular trace and you can map
what it did end to end. And that is your observability
framework for you, guys.
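For reference, the kind of Loki query being run here is a label selector on the namespace plus a line filter on the trace ID; a hypothetical example, where both the namespace and the trace ID are placeholders, would be:

    {namespace="bank-of-anthos"} |= "3f1c9a2b7d4e8f01"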
So we've learned how easy it can be to extract and ship
logs, traces and metrics using only open source solutions.
And in addition we were also able to generate traces, metrics and logs from an
application within minutes. We also have the ability to correlate
between different signals. We correlated metrics to traces and
traces to logs, and most importantly we have all the data we
need to quickly detect and fix production issues on
target applications. So great,
we will now be able to detect issues and have all the data to diagnose and
fix them. But how do we approach incident management here for quicker resolution?
An appropriately observable system is definitely a must for quick
resolution. But make sure that you don't lose the war against
time during firefights due to human errors or miscommunication.
Assess whether your organization could benefit from incident response
tooling like Zenduty to keep responders on the same page,
harness context rich data from your observability plane and enable
your team to bounce back from issues as fast as they can.
And that's all my time. Thanks a lot for tuning in,
and I hope you take something valuable away from this session.
The resources link is right here. Again, feel free to reach out to me
on Twitter or LinkedIn in case you have any questions. And have
a great day.