Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey guys, my name is Pradeep Gaddamedi.
I'm a site reliability engineer at a startup called Radar.
So, every day brings new challenges and victories.
And here I'll share my adventure in deploying monitoring and logging
platforms in the cloud without breaking the bank, as budget could be the
constraint for many startups and not everyone can afford paid versions.
So with that, I'm going to go over my agenda for today.
I just did my introduction. Next, I'm going to discuss how to deploy
cost-effective monitoring and logging solutions.
Then I'm going to talk a little bit about automation,
and about how I set up the SLA, SLO, and SLI monitoring dashboards.
I'm also going to discuss my experience with incidents
in my organization, and then toil,
time constraints, and challenges, and how I manage them.
So, let's dive into it.
The first thing I want to discuss is the software,
the open source software, and the infrastructure we choose for
monitoring and logging, or in fact for any other setup.
Especially when the company or startup you're working for
is budget constrained, you're compelled
to choose the open source version.
The main criterion to keep in mind is to always research
a lot before you actually choose the software, and to check off all the
requirements your organization has against it. Only then choose
the software. Let me take an example.
I chose Elasticsearch for logging and set it up
on virtual machines. Everything went fine, but there
were some problems I encountered.
I could not send Slack alerts from Elasticsearch, because
only the paid version allows it.
Since I was using the open source version, I couldn't send Slack alerts,
so I had to use a workaround.
I wrote a Python script that read the Kibana server log,
filtered out the alerts, and sent them
to the Slack channel through a Slack webhook.
So it was kind of a backdoor way.
So that's one example, but there could be something more important for your
organization. So make sure you check off all those requirements
before choosing any open source software.
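A workaround like the Slack one described above might look roughly like the following sketch. It is not the script from the talk: the log path, the `alert` match string, and the webhook URL are hypothetical placeholders, and it assumes a standard Slack incoming webhook that accepts a JSON body with a `text` field.

```python
import json
import urllib.request

# Hypothetical values: the log path, the match string, and the webhook URL
# are placeholders, not details from the talk.
KIBANA_LOG = "/var/log/kibana/kibana.log"
ALERT_MARKER = "alert"
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def extract_alerts(lines, marker=ALERT_MARKER):
    """Return the log lines that contain the marker (case-insensitive)."""
    return [line.strip() for line in lines if marker.lower() in line.lower()]


def build_payload(alert_line):
    """Build the JSON body for a Slack incoming webhook: {"text": ...}."""
    return json.dumps({"text": f"Kibana alert: {alert_line}"}).encode("utf-8")


def send_to_slack(alert_line, url=SLACK_WEBHOOK_URL):
    """POST one alert to the Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=build_payload(alert_line),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def main():
    """Read the Kibana log once and forward every matching line to Slack."""
    with open(KIBANA_LOG, encoding="utf-8") as log_file:
        for alert in extract_alerts(log_file):
            send_to_slack(alert)
```

Calling `main()` from a cron job or a small loop would forward the alerts. A real version would also need to remember how far into the log it has already read, so the same alert is not sent twice.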
Next is choosing the right hardware.
There are certain ways you can save some money
by choosing the correct hardware.
You have to identify the correct platform to run your application on.
You also have to compare the costs of the different
infrastructure options available in the cloud,
and weigh the advantages against the cost.
When I did the Elastic setup, I read through their documentation
and found that not all the nodes require
the same amount of memory or CPU.
Some consume less and some consume more. Master nodes
don't require a lot of resources,
so I used virtual machines of a smaller SKU for the master nodes.
The data nodes, on the other hand, are CPU intensive, because a lot of
data comes in there and a lot of indexing and searching happens there.
So for those I chose virtual machines with high CPU.
So you have to weigh the advantages and
disadvantages, look at what costs you more, and then choose
accordingly and try to save costs there.
That's all about how to select software
and hardware to save cost.
So let's talk about the automation.
So far I've talked about logging.
For monitoring, I again chose open source:
Grafana, Prometheus, Mimir, InfluxDB.
I installed all of those components on a Kubernetes cluster.
Whether it's the monitoring or the logging solution, I use Terraform for everything.
It's better to go with Terraform or Ansible.
I use Ansible to configure the Elastic instances.
Sometimes you feel that creating infrastructure
manually is very easy, but in the long run it's
not a good practice, because you tend to forget the whole
creation process, and if something happens, it's going to take time to rebuild.
Documenting it all can also be tedious, since there are a lot of things to cover.
So it's better to go with automation of some kind, Terraform,
Ansible, or anything else. For my Grafana dashboards
I use CI/CD. I checked the dashboard code into Git
and use a CI/CD pipeline, so that if anybody messes with the Grafana dashboards,
I can recreate them by running the CI/CD pipeline.
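As an illustration of how a pipeline step could restore dashboards from Git, here is a minimal sketch that pushes dashboard JSON files to Grafana's `POST /api/dashboards/db` HTTP endpoint. The Grafana URL, the `dashboards/` directory, and the use of an API token passed in by the pipeline are all assumptions for the example, not details from the talk.

```python
import json
import pathlib
import urllib.request

# Assumptions for illustration: the Grafana URL and the dashboards/ directory
# layout are hypothetical; the token would come from the CI/CD secret store.
GRAFANA_URL = "https://grafana.example.com"
DASHBOARD_DIR = pathlib.Path("dashboards")


def build_import_body(dashboard):
    """Wrap a dashboard definition in the body used by POST /api/dashboards/db."""
    dashboard = dict(dashboard)
    # Clear the numeric id so Grafana matches on uid rather than on an
    # instance-specific database id.
    dashboard["id"] = None
    return {"dashboard": dashboard, "overwrite": True}


def push_dashboard(path, token, base_url=GRAFANA_URL):
    """Upload one dashboard JSON file and return the HTTP status."""
    body = build_import_body(json.loads(path.read_text(encoding="utf-8")))
    request = urllib.request.Request(
        f"{base_url}/api/dashboards/db",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status


def push_all(token, directory=DASHBOARD_DIR):
    """Push every *.json dashboard in the directory, in sorted order."""
    return {p.name: push_dashboard(p, token) for p in sorted(directory.glob("*.json"))}
```

Because `overwrite` is set, rerunning `push_all` replaces whatever is currently in Grafana with the version in Git, which is what makes manual changes recoverable.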
And I used Helm charts to deploy the monitoring
components into Kubernetes.
Using all this automation could take some time at first,
or it could seem a little bit hard in the beginning, but in the
long run it is really useful, and it makes your system very resilient.
All right.
In the same context, I would like to highlight
my Elasticsearch cluster.
Let's say I'm running out of capacity on my data nodes
because a lot of logs are coming in.
All I have to do is add entries for the new nodes in Terraform and
run it, and it creates those data nodes in just a few minutes.
So automation makes things really easy.
Let me discuss a little bit about SLA, SLO, and SLI.
I know these are very important parameters in the DevOps world.
Let me start with SLI.
An SLI is basically the metric that measures how well a service performs.
Based on those metrics, you set up SLOs, and you
also agree on SLAs with the customers.
So how do you go about setting up SLIs?
I did set up a lot of SLIs, but what I recommend is
to start at the basic level. If you have VMs, or if you are
running data pipelines, set up the basic alerts first
instead of going for fancy dashboards. That way you cover the basic
stuff. There are a lot of SLIs.
In fact, you can set up availability, latency, error rate, throughput, saturation,
and traffic, and they all help you decide on SLOs. But when you're starting
up, in a startup, I think saturation is the important SLI
we should focus on first. Saturation indicates how full a service is.
It can be measured with various resource metrics, such as CPU
load, memory usage, or network bandwidth. Saturation helps predict
and prevent potential performance degradation. This way
we make sure our system is well designed to support the load before we consider
setting up other SLIs like error rate, availability, or throughput.
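To make the saturation idea concrete, here is a small sketch of how it could be computed as "fraction of capacity in use" and checked against a threshold. The 80% threshold and the choice of CPU load and disk as the sampled resources are assumptions for the example. `os.getloadavg` is only available on Unix-like hosts.

```python
import os
import shutil

# Hypothetical threshold: flag a resource when it is more than 80% full.
SATURATION_THRESHOLD = 0.80


def saturation(used, capacity):
    """Saturation as the fraction of capacity in use (0.0 = idle, 1.0 = full)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return used / capacity


def collect_readings(path="/"):
    """Sample CPU and disk saturation on a Unix-like host."""
    load_1min, _, _ = os.getloadavg()  # 1-minute load average
    disk = shutil.disk_usage(path)
    return {
        # Load per core; this can exceed 1.0 when processes are queueing.
        "cpu": saturation(load_1min, os.cpu_count() or 1),
        "disk": saturation(disk.used, disk.total),
    }


def breaches(readings, threshold=SATURATION_THRESHOLD):
    """Return the names of resources above the threshold, sorted."""
    return sorted(name for name, value in readings.items() if value > threshold)
```

For example, `breaches(collect_readings())` returns the list of resources that need attention, which could then feed whatever alerting channel is already in place.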
All right.
So let's move on to incidents. Incidents happen
everywhere in the DevOps world.
There are two types of incidents: some are
self-inflicted and some are natural.
The reason I call them self-inflicted is that
there could be some toil behind them.
Toil is basically repetitive manual work that could be automated.
Automating it could prevent the occurrence of an
incident, and setting up proper alerts for the infrastructure
could prevent some incidents too.
Let me take an example.
I had time constraints and a lot to do,
so I missed setting up the disk alerts on Elastic, and every time
the /tmp disk filled up,
I manually went in and cleaned it. I could have automated it, but
I had my own reasons for not doing that; like I said, time constraints.
So one day an incident happened, and I had to
go in manually and fix things.
This kind of repetitive task is called toil,
and we should address it.
There are various ways.
Maybe we could create a separate project with separate Jira tickets and
work as a team to address toil,
and set up proper alerts to prevent these self-inflicted events.
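The manual disk cleanup described above is the kind of toil a small scheduled script could take over. The following is only a sketch under assumptions: the `/tmp` path, the 85% threshold, and the one-day file age are hypothetical values, and it only touches regular files directly under the directory.

```python
import os
import shutil
import time

# Hypothetical settings: path, threshold, and file age are assumptions.
TMP_PATH = "/tmp"
USAGE_THRESHOLD = 0.85          # act when the disk is more than 85% full
MAX_AGE_SECONDS = 24 * 60 * 60  # only delete files untouched for a day


def usage_fraction(path):
    """Return the fraction of the disk holding path that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def stale_files(path, max_age_seconds, now=None):
    """Return regular files directly under path older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                age = now - entry.stat(follow_symlinks=False).st_mtime
                if age > max_age_seconds:
                    stale.append(entry.path)
    return sorted(stale)


def clean_if_needed(path=TMP_PATH, threshold=USAGE_THRESHOLD,
                    max_age_seconds=MAX_AGE_SECONDS):
    """Delete stale files only when usage exceeds the threshold; return removals."""
    if usage_fraction(path) <= threshold:
        return []
    removed = []
    for file_path in stale_files(path, max_age_seconds):
        try:
            os.remove(file_path)
            removed.append(file_path)
        except OSError:
            pass  # file vanished or is not removable; skip it
    return removed
```

Run on a schedule, `clean_if_needed()` replaces the manual login-and-clean step, and its return value can be logged or sent to a chat channel so the cleanup stays visible.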
And then there are natural incidents, which simply occur.
When natural incidents occur, proper root cause
analysis (RCA) and proper postmortems are really important.
Moreover, we should play a blameless game,
because blamelessness is very important to the success
of a team or organization.
So postmortems should be blameless.
For RCAs, we generally follow the Five Whys method. You could search
for Five Whys root cause analysis on Atlassian,
or you could Google it. It applies to basically any scenario.
You ask the same question five times until you
get to the bottom of the problem.
So if the disk fails, you ask why it failed.
Okay, the disk was not scaled. Why was it not scaled? There was no time. Why
was there no time for the DevOps engineer? Then it could be a planning
issue. You get to the bottom of the issue after asking the question five times.
You could follow such strategies to improve RCAs and postmortems.
Finally, let me discuss toil, time constraints, and challenges.
Workload is one important aspect in DevOps.
You might already be overburdened with a
vast amount of work to accomplish in a limited time frame.
Commit only to what you can realistically handle, and set clear expectations to
avoid the stress of overcommitting,
although it's not always possible to avoid overcommitting.
And if an urgent task requires you to implement a solution
with minimum functionality, inform management about the limitations of
the setup. This ensures they do not perceive it
as a perfect and resilient system.
Let me take an example. I was given a week or two to set up a
complete logging and monitoring system. So I had to make a hard decision and set up
a system that was not resilient, one that had a single point of failure. But I
made it clear to the team and management that the system I was going to design
could fail at any time.
So we were in agreement, and that way,
when the system fails, there are no surprises for management.
Also, proper tracking, with Jira tickets
or whatever ticketing system you use,
will help you track the issues and track your work.
It will also showcase your work, so management understands
what you're working on, what your workload is, and whether you could take on more. That
actually helps you organize your work and reduce your stress level.
Proper tracking and documentation are also critical.
You document whatever you do and hand it over
to your teammates, so that you don't get disturbed while
you're on vacation. These are some of the important aspects you
should consider in the DevOps world.
All right, so these are some of the
lessons I learned at this startup.
I know I'm still learning.
We are all learning.
So please let me know if you have any questions.
Please feel free to send me an email or
to connect with me on LinkedIn.
Alright, thank you very much for listening.
You guys have a good day.
Bye-bye.