Mastering the Chaos: A DevOps Engineer's Journey to Cost-Effective Cloud Monitoring and Logging

Abstract

Explore how a budget-tight startup successfully implemented an open-source monitoring and logging system. I’ll share challenges, triumphs, and best practices from my journey, offering insights for any company weighing these options.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hey guys, my name is Pradeep Gaddamidi. I'm a site reliability engineer at a startup called Radar. Every day brings new challenges and victories, and here I'll share my adventure in deploying monitoring and logging platforms in the cloud without breaking the bank. Budget can be a constraint for many startups, and not everyone can afford paid versions.

With that, let me go over my agenda for today. I just did my introduction. Next, I'm going to discuss how to deploy cost-effective monitoring and logging solutions. Then I'll talk a little bit about automation, and about how I set up the SLA, SLO, and SLI monitoring dashboards. After that I'll discuss my experience with incidents in my organization, and then toil, time constraints, and challenges, and how I manage them. So, let's dive into it.

The first thing I want to discuss is the software: the open-source software and the infrastructure we choose for monitoring and logging, or in fact for any other setup. When the company or startup you work for is budget-constrained, you're compelled to choose the open-source version. The main criterion to keep in mind is to always research a lot before you actually choose the software, and to check off all the requirements your organization has before you commit to it.

Let me take an example. I chose Elasticsearch for logging, and I set it up on virtual machines. Everything went fine, but there were some problems I encountered. I could not send Slack alerts from Elasticsearch, because only the paid version allows it. Since I was using open source, I had to use a workaround for that.
So I wrote a Python script that read the Kibana server log, filtered out the alerts, and sent them to a Slack channel through a Slack webhook. It was kind of a backdoor way. That's one example, but there could be something more important for your organization, so make sure you check off all those requirements before choosing any open-source software.

Next is choosing the right hardware. There are certain ways you can save some money by choosing the correct hardware. You have to identify the correct platform to run your application on, compare the cost of the different infrastructure options available in the cloud, and weigh the advantages against the cost. When I did the Elastic setup, I read through their documentation and found that not all the instances require the same amount of memory or CPU. Some consume less, some consume more. Master nodes don't require a lot of resources, so I used virtual machines of a smaller SKU for the master nodes. The data nodes, on the other hand, are CPU-intensive, because a lot of data comes in there and a lot of indexing and searching happens there, so I chose virtual machines with high CPU for them. So you have to weigh the advantages and disadvantages, see which option costs you more, choose accordingly, and try to save costs there. That's all about how you select software and hardware to save cost.

Now let's talk about automation. So far I've talked about logging. For monitoring, I again chose open source: Grafana, Prometheus, Mimir, and InfluxDB. I installed all of those components on a Kubernetes cluster.
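As a rough illustration of the Slack workaround mentioned above, here is a minimal Python sketch. It is not the script from the talk. It assumes the Kibana server log holds one JSON object per line and that alert entries can be recognized by a keyword in the `message` field; the log format, `ALERT_KEYWORD`, and the webhook URL placeholder are all assumptions.

```python
import json
import urllib.request

# Hypothetical values: replace with your own keyword and Slack incoming-webhook URL.
ALERT_KEYWORD = "alert"
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/REPLACE/ME"


def extract_alerts(log_lines, keyword=ALERT_KEYWORD):
    """Return the messages of log entries whose message mentions the keyword.

    Assumes one JSON object per line with a "message" field (an assumption
    about the log format). Lines that are not JSON objects are skipped.
    """
    alerts = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        message = str(entry.get("message", ""))
        if keyword.lower() in message.lower():
            alerts.append(message)
    return alerts


def build_slack_payload(message):
    """Slack incoming webhooks accept a JSON body with a "text" field."""
    return json.dumps({"text": message}).encode("utf-8")


def send_to_slack(message, webhook_url=SLACK_WEBHOOK_URL):
    """POST one alert message to the Slack channel behind the webhook."""
    request = urllib.request.Request(
        webhook_url,
        data=build_slack_payload(message),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A scheduled job could read the newest log lines, pass them to `extract_alerts`, and call `send_to_slack` for each result. Tracking which lines were already sent is left out of this sketch.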
Whether it's the monitoring or the logging solution, I use Terraform for everything. It's better to go with Terraform or Ansible; I use Ansible to configure the Elastic instances. Sometimes you feel that creating infrastructure manually is very easy, but in the long run it's not a good practice. You tend to forget the whole process of how you created it, and if something happens it's going to take time to rebuild. The documentation can be tedious too. So it's better to go with automation of some kind, whether that's Terraform, Ansible, or anything else.

For my Grafana dashboards, I use CI/CD. I checked the dashboard code into Git, so if anybody messes with the Grafana dashboards, I can recreate them by running the CI/CD pipeline. And I used Helm charts to deploy the monitoring components into Kubernetes. Using automation can take some time at first, and it can seem a little hard in the beginning, but in the long run it is really useful, and it makes your system very resilient.

In the same context, I'd like to highlight my Elasticsearch cluster. Let's say the data on my data nodes is growing because a lot of logs are coming in. All I have to do is add entries for the new nodes in Terraform and run it, and it creates those data nodes in just a few minutes. So automation makes things really easy.

Let me discuss a little bit about SLA, SLO, and SLI. These are very important parameters in the DevOps world. Let me start with SLI. An SLI is basically the metric that measures how well a service performs. Based on those metrics, you set up SLOs, and you also set up SLAs with your customers.
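To make the dashboards-in-Git idea above concrete, here is a minimal sketch of what a CI/CD step might run. The talk does not say how the pipeline restores dashboards, so this is an assumption: it reads dashboard JSON files from a repository directory and pushes each one through Grafana's HTTP API endpoint `POST /api/dashboards/db`. The directory layout, function names, and token handling are hypothetical.

```python
import json
import urllib.request
from pathlib import Path


def load_dashboards(directory):
    """Read every *.json dashboard definition checked into the given directory."""
    return [json.loads(p.read_text()) for p in sorted(Path(directory).glob("*.json"))]


def build_import_payload(dashboard):
    """Wrap a dashboard model in the body expected by POST /api/dashboards/db.

    The numeric id is cleared so Grafana matches the dashboard by uid, and
    overwrite=True replaces whatever version is currently on the server.
    """
    model = dict(dashboard)
    model["id"] = None
    return {"dashboard": model, "overwrite": True}


def push_dashboard(grafana_url, api_token, dashboard):
    """Send one dashboard to Grafana; grafana_url and api_token come from CI secrets."""
    request = urllib.request.Request(
        f"{grafana_url.rstrip('/')}/api/dashboards/db",
        data=json.dumps(build_import_payload(dashboard)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Because the import overwrites the server copy, any manual change made in the Grafana UI is replaced by the version in Git the next time the pipeline runs.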
So how do you go about setting up SLIs? I set up a lot of SLIs, but what I recommend is to start at the basic level. If you have VMs, or if you are running data pipelines, set up the basic alerts first instead of going for fancy dashboards. That way you cover the basics. There are a lot of SLIs you can set up: availability, latency, error rate, throughput, saturation, traffic. They all help you decide on SLOs. But when you're starting out in a startup, I think saturation is the important SLI to focus on first. Saturation indicates how full a service is. It can be measured with various resource metrics, such as CPU load, memory usage, or network bandwidth. Saturation helps predict and prevent potential performance degradation. This way we make sure our system is well designed to support the load before we consider setting up other SLIs like error rate, availability, or throughput.

All right, let's move on to incidents. Incidents happen everywhere in the DevOps world. There are two types of incidents: self-inflicted ones and natural ones. The reason I call some incidents self-inflicted is that there could be some toil behind them. Toil is basically repetitive manual work that could be automated, and automating it could prevent the occurrence of an incident. Setting up proper alerts for the infrastructure could prevent some incidents too. Let me take an example. I had time constraints and a lot to do, so I missed setting up the disk alerts on Elastic, and the /tmp disk kept failing.
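A basic saturation check of the kind described above, including a disk check like the one that was missing in this example, might look like the following sketch. It uses only the Python standard library; the thresholds, the function names, and the choice of `/tmp` as the watched path are assumptions for illustration, and the CPU check relies on `os.getloadavg`, which is only available on Unix-like systems.

```python
import os
import shutil

# Hypothetical thresholds, expressed as a fraction of capacity.
WARN_AT = 0.80
CRITICAL_AT = 0.90


def classify(used_fraction, warn_at=WARN_AT, critical_at=CRITICAL_AT):
    """Map a saturation value to an alert level."""
    if used_fraction >= critical_at:
        return "critical"
    if used_fraction >= warn_at:
        return "warning"
    return "ok"


def disk_saturation(path):
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def cpu_saturation():
    """1-minute load average divided by the CPU count (Unix only).

    The value can exceed 1.0 when more work is queued than there are CPUs.
    """
    load_1min, _, _ = os.getloadavg()
    return load_1min / (os.cpu_count() or 1)


def check(paths=("/tmp",)):
    """Return an alert level for CPU and for each watched path."""
    report = {"cpu": classify(cpu_saturation())}
    for path in paths:
        report[f"disk:{path}"] = classify(disk_saturation(path))
    return report
```

Running `check()` on a schedule and forwarding anything other than "ok" to a chat channel covers the basic alerts before any dashboards are built.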
Each time, I manually went in and cleaned it up. I could have automated it, but I had my own reasons for not doing that; like I said, time constraints. Then one day an incident happened, and I had to go in manually and fix things. That repetitive task is what's called toil, and we should address toil. There are various ways to do that. Maybe we could create a separate project with separate Jira tickets and work as a team to address toil, and set up proper alerts to prevent these self-inflicted incidents.

Then there are natural incidents, which simply occur. When natural incidents occur, proper root cause analysis and proper postmortems are really important. Moreover, we should play a blameless game, because blamelessness is very important to the success of a team or organization. So postmortems should be blameless. For RCAs, we generally follow the Five Whys method. You can search for Five Whys root cause analysis in Atlassian Jira, or you can Google it; it applies to basically any scenario. You ask the same question five times until you get to the bottom of the problem. If the disk fails, you ask why it failed. Okay, the disk was not scaled. Why was it not scaled? There was no time. Why was there no time for DevOps? Then it could be a planning issue. You get to the bottom of the issue after you question five times. You can follow such strategies to improve RCAs and postmortems.

Finally, let me discuss toil, time constraints, and challenges. Workload is one important aspect of DevOps. You might already be overburdened with a vast amount of work to accomplish in a limited timeframe. Commit only to what you can realistically handle, and set clear expectations to avoid the stress of overcommitting. It's not always possible to avoid overcommitting, though.
If an urgent task requires you to implement a solution with minimum functionality, inform management about the limitations of the setup. This ensures they do not perceive it as a perfect and resilient system. Let me take an example. I was given a week or two to set up a complete logging and monitoring system, so I had to take a hard decision and set up a system that was not resilient and had a single point of failure. But I made it clear to the team and management that the system I was going to design could fail at any time. We were in agreement, so that when the system fails, there are no surprises for management.

Proper tracking also matters, whether it's Jira tickets or whatever ticketing system you use. It helps you track the issues and track your work, and it showcases your work, so management understands what you're working on, what your workload is, and whether you could take on more. That helps you organize your work and reduce your stress level. Proper documentation is also critical. Document whatever you do and hand it over to your teammates, so that you don't get disturbed while you're on vacation. These are some of the important aspects you should consider in the DevOps world.

All right, these are some of the things I learned in this startup. I know I'm still learning; we all are. Please let me know if you have any questions. Feel free to send me an email or connect with me on LinkedIn. Thank you very much for listening. You guys have a good day. Bye-bye.

Conf42 DevOps 2025 - Online

January 23 2025 - premiere 5PM GMT

Pradeep Gaddamidi

DevOps Engineer @ Evernote
