Shaping the Future of System Resilience: A DevOps Journey into Advanced Site Reliability Engineering

Abstract

Discover how advanced Site Reliability Engineering (SRE) principles, combined with cutting-edge DevOps practices, can revolutionize system resilience and scalability. Learn real-world strategies in chaos engineering, AI-driven automation,

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hello, everyone. I'm Nagarjuna Malladi, from Dallas, Texas. I work as a Principal Site Reliability Engineer at Oracle America. Welcome to the Conf42 DevOps 2025 conference. Today I'm going to discuss shaping the future of system resilience: a DevOps journey into advanced site reliability engineering. Welcome to a dynamic exploration of how advanced SRE principles are redefining system resilience in the DevOps era. This session will dive into the key technologies and methodologies shaping the future of scalable, secure, and reliable cloud-native systems.
Let's discuss the convergence of DevOps and SRE. Shared principles: DevOps and SRE converge around transformative principles: radical collaboration, intelligent automation, and a relentless commitment to systematic optimization. These strategic alignment points enable organizations to build adaptive, self-healing technology ecosystems.
Let me explain. When I was working at Cisco Systems, on the networking side at DNAC, I had to build CI/CD pipelines. With those Jenkins pipelines I deployed my code to virtual machines or the AWS cloud in different locations, and after deployment I had to go and monitor how the deployment went and how the application was doing: CPU, memory, disk, everything about system performance. So basically I had to do a complete end-to-end role. That's how DevOps and SRE come together: I had to do both for internal customers. When you serve internal customers in any large organization, in most cases, I would say 99 percent, you have to deal with both the DevOps side and the SRE side as an engineer.
Next, synergetic practices. Strategic SRE methodologies, including sophisticated monitoring, predictive incident response, and dynamic capacity orchestration, integrate seamlessly with DevOps workflows like advanced CI/CD pipelines and infrastructure as code, creating a holistic approach to technological resilience. As I mentioned in the previous example, and to give some more examples: every engineer is highly dependent on a ticketing system, because it is practically impossible for us to go into an environment of a thousand-plus servers and VMs and look for an issue or error, whether it's DevOps or SRE. So everything is based on a proper ticketing system and incident response. You may depend on Slack or Webex chats, getting messages into your messaging system so that you can look at them and fix your code, or confirm that whatever you are doing is correct. Then you can go on to the next steps and do deployments in a strategic way, going from lower environments to higher environments. That's how I see DevOps and SRE having the same shared principles and synergetic practices.
Cloud-native infrastructure: the foundation of resilience. Serverless computing: leveraging serverless platforms like AWS Lambda or Google Cloud Functions allows for automatic scaling and reduces operational overhead. Container orchestration: Kubernetes and other container orchestration tools ensure the efficient deployment and management of containerized applications across a cluster of servers. Infrastructure as code: tools like Terraform and CloudFormation enable automated provisioning and configuration management of infrastructure resources, reducing errors and increasing consistency.
If you are part of this Conf42 DevOps conference, I assume you have already come across, or are already working with, or may already be a master of Kubernetes, Docker, Terraform, Ansible, and a lot of other tools. I'll tell you the advantage. If you think about any automation, it is practically not possible for us to log in to each server and make the changes or enhancements. So we have to go for a tool like Ansible, Terraform, Puppet, or Chef cookbooks, whatever you want to call it, and make our automation or changes there so that it can talk to all our servers. Then take the example of Kubernetes. What is the advantage of having Kubernetes? You can keep your application always running. Kubernetes has a lot of pods running, so if you want to do a rollover or a deployment, you can do it pod by pod, so that your application keeps running while you deploy. That's the advantage the Kubernetes type of clustering concept gives us: you can keep your application always up and running.
So let's talk about key SRE strategies: building robust systems. Security: implementing security best practices like identity and access management, encryption, and vulnerability scanning is crucial for system resilience. Observability: tools like Prometheus, Grafana, and Jaeger provide comprehensive monitoring, alerting, and tracing capabilities, enabling rapid issue detection and resolution. Capacity planning: proactive capacity planning based on historical data and anticipated growth ensures optimal resource allocation and prevents performance degradation. Chaos engineering.
Introducing controlled failures and disruptions to systems helps identify vulnerabilities and test recovery mechanisms, leading to greater resilience.
Take Prometheus and Grafana. If you want to monitor anything, Prometheus has many exporters. If you want to monitor URLs, certificate expiry dates, your keytool certificates, your system performance, your network performance, your application performance, your Linux services and applications, Windows services, you name it, with Prometheus you can have an exporter, and with that exporter you can monitor it. That's called observability: we can observe end to end what's happening. On top of this, we can visualize everything with Grafana. Grafana is a powerful visualization tool, and you can send alerts directly from Grafana itself. It can integrate with tools like Slack, Webex, email, and ticketing systems, so you can send alerts, tickets, whatever you want. That's how you can make observability work for you.
Then security. I now work on the healthcare side, and in healthcare the security protocols are very important. Previously I worked in mobile technologies and on the banking side, and everywhere the security protocols are very important. Whether on the SRE or the DevOps side, we have to maintain those security protocols.
The biggest thing for companies now is cost management, which comes with capacity planning. Is our storage scaling? Is our CPU scaling? Why is it scaling? How much money does that scaling cost?
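As a minimal sketch of the capacity questions raised here, the following Python fits a straight line to daily storage usage and estimates how many days remain before a volume fills. The function names, the sample numbers, and the assumption of linear growth are illustrative and not from the talk; in practice the history would come from a monitoring system such as Prometheus.

```python
from typing import Optional, Sequence, Tuple


def fit_line(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for values sampled once per day."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(daily_usage_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Days from the last sample until the fitted trend reaches capacity.

    Returns None when usage is flat or shrinking (no projected fill date).
    """
    slope, intercept = fit_line(daily_usage_gb)
    if slope <= 0:
        return None
    day_full = (capacity_gb - intercept) / slope
    return max(0.0, day_full - (len(daily_usage_gb) - 1))


if __name__ == "__main__":
    # Hypothetical history: usage grows by 10 GB per day on a 200 GB volume.
    print(days_until_full([100, 110, 120, 130, 140], capacity_gb=200))
```

A straight-line fit is the simplest possible model; seasonal or bursty workloads would need something richer, but even this answers "why is it scaling, and when does it run out?" with a number.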
And how many CPUs is the system using? Can we reuse capacity anywhere? These are all questions when we think about capacity planning. So capacity planning is very important, and monitoring capacity with tools like Prometheus and Grafana is even more important, so that we can see what is taking a lot of CPU and memory for the application. The same goes for chaos engineering, introducing controlled failures and disruptions to systems. Here we can think of using machine learning or AI to see exactly what types of failures are happening and at what timestamps, so that we can go back and review, find a pattern, and fix them. That's chaos engineering.
Metrics that matter: measuring system health and performance. 99.99 percent availability: represents the percentage of time the system is operational and accessible to users. 100 milliseconds latency: measures the time it takes for the system to respond to a request, impacting user experience and system performance. 100 throughput: indicates the number of requests a system can handle per unit of time, reflecting its processing capacity. And zero error budget: defines the acceptable level of system failures or errors, providing a framework for balancing reliability and innovation.
So think about the platform where you are working. Let's say I'm working in healthcare, trials are going down, and availability is not 100 percent. Think about the impact of that on the healthcare industry, on patient information and patient requirements. If any healthcare trials go down, it's a huge impact on the healthcare industry.
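To make the availability figure concrete, here is a small sketch of how an availability target translates into allowed downtime, which is what an error budget measures. The function names and the 30-day window are assumptions for illustration. Note that a 99.99 percent target leaves a small but non-zero budget, about 4.32 minutes per 30 days.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed by an availability target over the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Error budget left after the downtime already spent (negative if overspent)."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% over 30 days allows "
              f"{error_budget_minutes(target):.2f} minutes of downtime")
```

A negative result from `budget_remaining` is the usual signal to slow down releases and spend time on reliability work instead.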
Set healthcare aside and think about systems going down on Black Friday at Walmart, Amazon, or Best Buy. If their website is not working, or the payment systems are not working, think about their losses on those days. So it's not only mobile or healthcare; nowadays all industries need site reliability engineering. Every industry has its own customers, and keeping customer credibility is important with ongoing competition and a lot of things happening around. For them, 99.99 or 100 percent availability is always important. They can't bear a lot of errors, and if those are happening because of manual mistakes, that's very hard to take. With DevOps and SRE technologies and automation, we can minimize or eliminate those manual errors and make systems highly available.
So, the role of automation in SRE: streamlining operations. Automated provisioning and configuration management for infrastructure resources. Automated monitoring and alerting systems for proactive issue detection and resolution. Automated deployment and rollback processes for applications and services. Automated incident response and remediation workflows for faster recovery times.
Let me explain with an example. I'll take a networking company's scenario. Say you want to deliver a networking package that has subcomponents, and you need to collect them from many teams. How do you proceed? It's practically not possible to go to each team, get the packages, test them, validate them, and then merge them. So nowadays, in a fast-moving IT environment, everything has to be automated.
The packages should be certified automatically, deployed into virtual machines or the cloud, and properly monitored with alerting systems that fire alerts when needed. If there is any issue, if a package does not pass basic testing or automation testing, it is immediately sent back to the development team, asking them to remediate the issues on a fast track. It has to be that fast paced nowadays. With automation, SRE, CI/CD, and DevOps, all these different entities coming together, we can fix this kind of thing very quickly.
Think about the transportation industry, like Tesla. If you're riding in a Tesla and it sees an object and the car does not stop on Autopilot, that's a big issue. The machine learning has to catch that event immediately and report it back to Tesla's development team, and they have to work on it, fix the issue, and update the software. That's how these tasks have to be high priority and automated end to end, for faster resilience and to keep operations streamlined.
AI and machine learning in SRE: enhancing predictive insights. Anomaly detection: AI algorithms identify deviations from expected behavior, enabling early detection of potential problems. Predictive maintenance: AI models predict system failures or resource bottlenecks, allowing for proactive intervention and prevention. Automated root cause analysis: AI-powered tools help pinpoint the underlying causes of incidents, leading to faster resolution and improved system understanding.
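The anomaly detection idea above, flagging deviations from expected behavior, can be sketched with a simple statistical baseline. This is a deliberately basic z-score check, not an AI model and not something taken from the talk; the function name, the threshold of 3 standard deviations, and the latency numbers are all assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def find_anomalies(baseline: Sequence[float], new_samples: Sequence[float],
                   threshold: float = 3.0) -> List[Tuple[int, float]]:
    """Return (index, value) for new samples far from the baseline mean.

    A sample is flagged when it lies more than `threshold` standard
    deviations away from the mean of the baseline window.
    """
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return [(i, x) for i, x in enumerate(new_samples) if x != mu]
    return [(i, x) for i, x in enumerate(new_samples)
            if abs(x - mu) / sigma > threshold]


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds.
    normal = [100, 102, 98, 101, 99, 100, 103, 97, 100]
    incoming = [101, 99, 500, 102]
    print(find_anomalies(normal, incoming))
```

Production systems typically replace the fixed baseline with a rolling window or a learned model, but the shape is the same: define "expected", measure the distance, and alert early.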
I just mentioned the Tesla example. Now take a networking example. Say there are 100 Walmarts in the Texas area, with a main Walmart center in the Dallas area from which all the networking signals, routers, and everything else are monitored. Suppose a Walmart router goes down in the Houston area. If somebody has to go from Dallas to Houston, they have to drive or fly to fix that router issue, and until then that Walmart won't have any operations; they have to close. Now say that is automated: the router software is automated, and the person watching the signals and router health gets a notification that the Houston router is down, or that its signals are getting dropped and the firmware or software needs to be updated. Then he can push a button, the Houston router firmware is updated, and they can fix it without any downtime. The business runs normally. The same applies to Apple stores, or any franchise with multiple locations that share a common internet connection or common routers, where keeping the network and servers up is pretty important. In those places you can think of using AI and machine learning to see the issues and fix them automatically.
Then automated root cause analysis. Think about this: say the database is full, and because of the database, healthcare trials or anything else are going down until somebody comes and fixes the database. What about auto-restarting the database?
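The auto-restart idea raised here can be sketched as a small remediation loop: check health, attempt a restart, record what happened in a separate incident log for later debugging, and escalate to a human if the restart does not help. The `is_healthy` and `restart` callables are hypothetical placeholders for whatever health check and restart command a real database would use; nothing here is specific to a particular database.

```python
from datetime import datetime, timezone
from typing import Callable, TextIO


def _note(log: TextIO, message: str) -> None:
    """Append one timestamped line to the incident log."""
    stamp = datetime.now(timezone.utc).isoformat()
    log.write(f"{stamp} {message}\n")


def auto_remediate(is_healthy: Callable[[], bool],
                   restart: Callable[[], None],
                   incident_log_path: str,
                   max_attempts: int = 1) -> str:
    """Try restarting an unhealthy service.

    Returns 'healthy' (nothing to do), 'recovered' (a restart worked),
    or 'escalate' (restarts did not help; a person should look).
    """
    if is_healthy():
        return "healthy"
    with open(incident_log_path, "a", encoding="utf-8") as log:
        _note(log, "service unhealthy, starting auto-remediation")
        for attempt in range(1, max_attempts + 1):
            try:
                restart()
            except Exception as exc:  # keep the error for later debugging
                _note(log, f"restart attempt {attempt} failed: {exc!r}")
                continue
            if is_healthy():
                _note(log, f"recovered after restart attempt {attempt}")
                return "recovered"
            _note(log, f"still unhealthy after restart attempt {attempt}")
        _note(log, "auto-remediation exhausted, escalating to on-call")
    return "escalate"
```

Capping the attempts matters: a restart loop that never gives up can hide the underlying problem, whereas a bounded loop with a log hands the on-call engineer both a running service (when it works) and the evidence (when it does not).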
We can try that as a first step: auto-restart the database and write the error messages to a separate log so that somebody can debug it later and bring the trials up. Or, before it goes down, maybe auto-tune to bring performance up, or remove unwanted events or logs to reduce storage use. We can do something like that. We can get the root causes and try to fix them. All of these are important, and they all involve a proper SRE dealing with the issues or outages with the help of the development team and DevOps. That's how we can enhance predictive insights.
SRE best practices: from theory to action. Embrace continuous improvement: continuously monitor system performance, identify areas for improvement, and iterate on solutions. Foster collaboration: promote open communication and collaboration between development, operations, and SRE teams. Invest in observability: build robust monitoring and tracing capabilities for comprehensive system insights and rapid issue resolution. Prioritize automation: automate as many processes as possible, streamlining operations, reducing errors, and freeing up valuable resources.
Let's discuss the first one, embracing continuous improvement. I would say: have regular requirements gathering and meetings. Nothing is too small. As the Paw Patrol guy says, no task is too small, and it's the same for SRE: nothing is too small. You have to find out what the requirements and issues are and plan properly. You need to keep finding the root cause and fixing it, and it has to be done in an automated way.
Manual actions sometimes cause manual errors, human errors. Nobody is perfect. So we have to work in an automated way so that we can eliminate a lot of human errors. Then there is observability of a lot of things, not only system health: observability of logs using the ELK Stack, Splunk, OpenSearch, or whatever tool your company uses. Using those tools and collaborating between operations, development, and SRE teams is pretty important. The development team has to give proper directions to operations, and the operations and development teams have to give proper requirements to the SRE teams. These all go in a circle; everything depends on the other things. So we have to make sure that the end-to-end workflow is right. On day one that's not possible, but when the complete group wants to be focused on SRE and SRE principles, eventually we can achieve it. Then we can have proper SRE teams and groups, and good DevOps-to-SRE journeys in those teams, as a success mantra.
So, industry-specific considerations: adapting SRE to unique requirements. Take sales, healthcare, or finance, as I mentioned before. All these industries have unique requirements: healthcare with its protocols, and sales, which always has to be up, a hundred percent, and has to be very quick. Say you're trying to buy a MacBook and the page takes five or ten minutes to load before you can enter your card and check out. Would you still be interested in getting that MacBook? You'd log off and log back in; you cannot wait that long.
Then banking: the system always has to be secure. Think about it: if hacks are happening at some bank, would you keep your money there? So all these industries are different, each has unique requirements, and they all need SREs.
So, future trends: SRE in the age of AI and edge computing. AI-driven automation: further advancements in AI will lead to more sophisticated automation, optimizing system performance and resilience. Edge computing: SRE will be crucial for managing distributed systems at the edge, ensuring reliability and low latency for users. Serverless and microservices: SRE practices will continue to evolve alongside the adoption of serverless architectures and microservices.
These are definitely future trends, as we discussed before: Kubernetes, serverless deployments, microservices, and AI-driven automation. If you look at any company now, it's all AI driven, because we can do more work in less time. Think about this: I could write a hundred-page book on SRE, DevOps, and a lot of related things myself, or I could ask an AI, "Give me a hundred-page book on SRE and DevOps." The AI will give it in a few minutes, where I would spend a month or two to produce those pages. That's the difference. In the end you can do the same thing either way: you can get the book, correct the data, and publish, or you can take one or two months and write your book. So with AI-driven automation, you can make things faster.
Key takeaways: a resilient future. By embracing advanced SRE principles and integrating them seamlessly with DevOps practices, organizations can build systems that are not only reliable but also highly scalable, secure, and adaptable to the ever-changing technology landscape. The future of system resilience lies in the convergence of these powerful methodologies, enabling businesses to innovate and grow with confidence.
So these are the key takeaways. I mean to say that this is the time when everybody, whatever your profession, whether you are in DevOps or development, a manager or VP, or even in IT sales or program management, needs to have an SRE mindset. Your service or system has to be up a hundred percent. And you always need to be looking for the root cause of any issue: instead of a quick fix, how can we fix the root cause and fix it permanently? That's my takeaway for getting to a resilient future in IT systems.
Thank you very much for your time, and thank you for listening. If you have any questions, you can reach out to me on LinkedIn or any social media platform, and we can discuss any questions or doubts you have. Thanks for your time. I'm looking forward to the other speakers.

Conf42 DevOps 2025 - Online

January 23 2025 - premiere 5PM GMT

Nagarjuna Malladi

Principal Software Engineer SRE @ Oracle
