Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, everyone.
I'm Nagarjuna Malladi.
I'm from Dallas, Texas.
I work as a Principal Site Reliability Engineer at Oracle America.
Today, I welcome all of you to the Conf42 DevOps 2025 conference.
Today, I'm going to discuss shaping the future of system
resilience, a DevOps journey into advanced site reliability engineering.
Welcome to a dynamic exploration of how advanced SRE principles are redefining
system resilience in the DevOps era.
This session will dive into the key technologies and methodologies shaping
the future of scalable, secure, and reliable cloud native systems.
Let's discuss the convergence of DevOps and SRE.
Shared principles.
DevOps and SRE converge around transformative principles:
radical collaboration, intelligent automation, and a relentless commitment
to systematic optimization.
These strategic alignment points enable organizations to build adaptive,
self-healing technology ecosystems.
Let me explain.
When I was working at Cisco Systems, on the networking side at DNAC, I had to build CI/CD pipelines.
With those Jenkins pipelines, I needed to deploy my code to virtual machines or the AWS cloud, in different locations.
After the deployment, I had to go and monitor how the deployment went and how the application was doing:
CPU, memory, hard disk, everything about system performance.
So basically, I had to do a complete end-to-end role.
That's how I got to know both DevOps and SRE.
I had to do both things for internal customers.
When you serve internal customers in any large organization, in most cases, I'd say 99%,
you have to deal with both the DevOps side and the SRE side as an engineer.
Next, let me talk about synergistic practices. Strategic SRE
methodologies, including sophisticated monitoring, predictive incident
response, and dynamic capacity orchestration, seamlessly integrate
with DevOps workflows like advanced CI/CD pipelines and infrastructure
as code, creating a holistic approach to technological resilience.
Beyond the previous example, here are some more.
Every engineer is highly dependent on a ticketing system, because it's practically impossible for us to
go into an environment of a thousand-plus servers and VMs and look for an
issue or error, whether you're in DevOps or SRE.
So everything is based on a proper ticketing system
and incident response.
If you depend on Slack or Webex chats, you get
messages in your messaging system so that you can look
at them and fix your code, or be confident that
what you are doing is working.
Then you can go on to the next steps.
And you can do deployments in a strategic way,
going from lower environments to higher environments.
So that's how I see DevOps and SRE having the same shared principles
and synergistic practices.
Cloud native infrastructure: the foundation of resilience.
Serverless computing. Leveraging serverless platforms like AWS Lambda or Google
Cloud Functions allows for automatic scaling and reduces operational overhead.
Container orchestration. Kubernetes and other container orchestration tools
ensure the efficient deployment and management of containerized applications
across a cluster of servers.
Infrastructure as code. Tools like Terraform and CloudFormation enable
automated provisioning and configuration management of
infrastructure resources, reducing errors and increasing consistency.
If you are part of this Conf42 DevOps conference, I
assume you have already come across, are already working with,
or may already be a master of Kubernetes, Docker,
Terraform, Ansible, and a lot of other tools.
I'll tell you the advantage. If you
think about any automation, it's not practical
for us to log in to each server and make the changes
or enhancements by hand.
So we have to go for a tool like Ansible, Terraform,
Puppet, or Chef cookbooks, whatever you want to use.
We put our automation or changes there so that
it can talk to all our servers.
Then take the example of Kubernetes.
What is the advantage of having Kubernetes?
You can keep your application always running. Kubernetes has a lot
of pods running, so if you want to do a rollover
or a deployment, you can do it pod by pod. That way
your application keeps running while you do the deployment.
That's the advantage the Kubernetes type of clustering concept gives
us: you can keep your application always up and running.
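The pod-by-pod idea can be sketched as a small simulation. This is not how Kubernetes is implemented; it is a minimal illustration, assuming a hypothetical `Pod` record and a `max_unavailable` limit, of why replacing one batch at a time keeps most of the application serving traffic during a rollout.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    # Hypothetical stand-in for a running pod; not a Kubernetes API object.
    name: str
    version: str
    ready: bool = True


def rolling_update(pods, new_version, max_unavailable=1):
    """Replace pods batch by batch and record how many stayed ready at each step."""
    ready_counts = []
    for start in range(0, len(pods), max_unavailable):
        batch = pods[start:start + max_unavailable]
        # Take only this batch out of service; the other pods keep serving traffic.
        for pod in batch:
            pod.ready = False
        ready_counts.append(sum(p.ready for p in pods))
        # Bring the batch back on the new version before touching the next one.
        for pod in batch:
            pod.version = new_version
            pod.ready = True
    return ready_counts


pods = [Pod(f"web-{i}", "v1") for i in range(4)]
history = rolling_update(pods, "v2")
print(history)                    # ready pods while each batch was down: [3, 3, 3, 3]
print({p.version for p in pods})  # {'v2'}
```

With four pods and one pod replaced at a time, three pods are always ready, which is the "always up and running" property described above.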
So let's talk about key SRE strategies: building robust systems.
Security. Implementing security best practices like
identity and access management, encryption, and vulnerability scanning
is crucial for system resilience.
Observability. Tools like Prometheus, Grafana, and Jaeger provide
comprehensive monitoring, alerting, and tracing capabilities, enabling
rapid issue detection and resolution.
Capacity planning. Proactive capacity planning based on historical data and anticipated growth
ensures optimal resource allocation and prevents performance degradation.
Chaos engineering. Introducing controlled failures and disruptions to systems helps identify
vulnerabilities and test recovery mechanisms, leading to greater resilience.
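The capacity planning point, sizing from historical data and anticipated growth, can be illustrated with a simple trend fit. This is a sketch under assumptions: growth is treated as linear, the samples are evenly spaced, and the disk figures are made up for the example. A real forecast would need to account for seasonality and step changes.

```python
def fit_trend(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ...

    Returns (slope, intercept). Needs at least two samples.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(values, capacity):
    """Periods after the last sample until the trend reaches capacity.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = fit_trend(values)
    if slope <= 0:
        return None
    crossing = (capacity - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))


disk_used_gb = [400, 420, 440, 460, 480]  # hypothetical weekly samples
print(periods_until(disk_used_gb, capacity=600))  # 6.0 more weeks at this growth rate
```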
While we're on this, take Prometheus and Grafana.
If you want to monitor anything, Prometheus has many exporters.
Say you want to monitor URLs, certificate expiry dates,
your keytool certificates, your system performance,
your network performance, your application performance,
your Linux services and applications, or your Windows services.
You name it.
With Prometheus, you can have an exporter for it, and with that exporter you can
monitor it. Taking advantage of those exporters
is what we call observability.
We can observe end to end what's happening.
On top of this, we can visualize everything with
Grafana.
Grafana is a powerful visualization tool,
and you can send alerts directly from Grafana itself.
It can be integrated with tools like Slack, Webex,
email, and ticketing systems,
so you can send alerts, tickets, whatever you want.
So that's how you can make observability work for you.
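To make the exporter idea concrete, here is a minimal sketch of what an exporter hands to Prometheus: plain text in the Prometheus text exposition format, with `# HELP` and `# TYPE` lines followed by labeled samples. The metric name `tls_cert_days_until_expiry` and the host are hypothetical. A real exporter would normally use an official client library and serve this text over HTTP, and this sketch assumes label values contain no quotes, backslashes, or newlines.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to a numeric reading.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical reading: days left before a certificate expires.
print(render_gauge(
    "tls_cert_days_until_expiry",
    "Days until the TLS certificate expires",
    {(("host", "example.com"),): 42},
))
```

An alert rule or a Grafana panel could then be built on that gauge, for example to warn when the value drops below a chosen number of days.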
Then security. I now work on the healthcare side, and in healthcare
the security protocols are very important.
Previously I worked in mobile technologies and on the banking
side, and everywhere the security protocols are very important.
Whether we are on the SRE or the DevOps side,
we have to maintain those security protocols.
The biggest thing for companies now is cost management, which
comes with capacity planning.
Is our storage scaling? Is our CPU scaling? Why is it scaling?
How much money does that scaling cost?
How many CPUs is the system using?
Can we reuse capacity anywhere?
Those are all the questions we ask when we think about capacity planning.
So capacity planning is very important, and monitoring capacity with tools like
Prometheus and Grafana is even more important,
so that we can see what is taking a lot of CPU and memory
for the application.
The same goes for chaos engineering, introducing controlled
failures and disruptions to systems.
Here we can think of using machine learning or AI to
see exactly what type of failures are happening and at what
timestamps, so that we can go back and review,
find a pattern, and fix them.
So that's chaos engineering.
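Before reaching for machine learning, the "which failures, at what time" question can often be started with plain counting. The sketch below is that simpler stand-in, not an AI technique. It assumes failure events arrive as hypothetical `(ISO timestamp, failure type)` pairs and groups them by type and hour of day.

```python
from collections import Counter
from datetime import datetime


def failure_hotspots(events, top=3):
    """Count failures per (type, hour of day) and return the most frequent buckets."""
    counts = Counter(
        (kind, datetime.fromisoformat(timestamp).hour) for timestamp, kind in events
    )
    return counts.most_common(top)


# Hypothetical failure log collected during experiments or real incidents.
events = [
    ("2025-01-06T02:05:00", "db_timeout"),
    ("2025-01-07T02:40:00", "db_timeout"),
    ("2025-01-07T14:10:00", "disk_full"),
    ("2025-01-08T02:15:00", "db_timeout"),
]
print(failure_hotspots(events, top=1))  # [(('db_timeout', 2), 3)]
```

Here the database timeouts cluster around 02:00, which would point the review at whatever runs at that hour.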
Metrics that matter: measuring system health and performance.
99.99 percent availability. Represents the percentage of time the system is operational
and accessible to users.
100 milliseconds latency. Measures the time it takes for the system
to respond to a request, impacting user experience and system performance.
100 throughput. Indicates the number of requests a system can
handle per unit of time, reflecting its processing capacity.
Zero error budget. Defines the acceptable level of system failures
or errors, providing a framework for balancing reliability and innovation.
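The availability and error budget figures above are linked by simple arithmetic: the error budget is whatever the availability target leaves over. The sketch below works that out, assuming a 30-day window and a request-based budget. The function names and the request counts are illustrative only.

```python
def allowed_downtime_minutes(slo_percent, window_days=30):
    """Minutes of downtime a given availability target permits in the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def budget_remaining(slo_percent, total_requests, failed_requests):
    """Fraction of the request-based error budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * (1 - slo_percent / 100)
    if allowed_failures == 0:
        # A 100 percent target leaves no budget at all.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1 - failed_requests / allowed_failures


print(round(allowed_downtime_minutes(99.99), 2))           # 4.32 minutes per 30 days
print(round(budget_remaining(99.9, 1_000_000, 250), 2))    # 0.75 of the budget left
```

So 99.99 percent availability allows roughly four minutes of downtime a month, and a target of 100 percent corresponds to an error budget of zero.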
So think about a platform you are working on.
Say I'm working in healthcare,
and the trials are going down and it's
not 100 percent available.
Think about the impact of that on the healthcare industry,
or on patient information and patient requirements.
If any healthcare trials go down, it's a huge impact
on the healthcare industry.
Set healthcare aside and think about systems going
down on Black Friday at the Walmarts, the Amazons, the Best Buys.
If their website is not working, or their payment
systems are not working, think about their losses
on those days.
So it's not only mobile or healthcare. Nowadays all industries need
site reliability engineering, and every industry has its own customers.
Keeping that customer credibility is important,
with ongoing competition and a lot of things happening around.
So for them, 99.99 or 100 percent availability is always important.
They can't bear a lot of errors, and if those errors come from manual
mistakes, that's very hard to take.
With DevOps and SRE technologies and automation, we can
minimize or eliminate those manual errors and make systems highly available.
So, the role of automation in SRE: streamlining operations.
Automated provisioning and configuration management
for infrastructure resources.
Automated monitoring and alerting systems for proactive issue
detection and resolution.
Automated deployment and rollback processes for applications and services.
Automated incident response and remediation workflows for faster recovery times.
So, let me explain with an example.
I'll take the scenario of a
networking company.
Say you want to deliver a networking package that has
subcomponents, so you need to collect them from many teams.
How do you proceed? It's not practical to go
to each team, get the packages, test them,
validate them, and then merge them.
In today's fast-moving IT environment,
everything has to be automated.
The packages should be certified automatically,
deployed to virtual machines or the cloud,
and properly monitored with alerting systems
that fire alerts if needed.
If there is any issue, if a package does not pass
the basic testing or automation testing, it is immediately sent back to the development
team, asking them to remediate the issues on a fast track.
It has to be that fast paced nowadays.
With automation, SRE, CI/CD, and DevOps, all
these different entities coming together, we can
fix this kind of thing very quickly.
Think about the transportation industry, like Tesla.
If you're riding in a Tesla and it sees an object,
and the car does not stop while on autopilot, that's a big issue.
The machine learning has to catch that event immediately and report it
back to Tesla's development team, and they have to work on it, fix
the issue, and update the software.
So that's how these tasks have to be high priority
and automated end to end, for
faster resilience and to make operations streamlined.
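The package flow described above, where a package either passes every automated check and is deployed or is sent straight back to development, can be sketched as a gate that stops at the first failing stage. The stage names and package fields are hypothetical. In practice a CI/CD tool such as Jenkins would run these stages.

```python
def run_pipeline(package, stages):
    """Run stages in order and stop at the first failure.

    `stages` is a list of (name, check) pairs where check(package)
    returns True on success.
    """
    passed = []
    for name, check in stages:
        if not check(package):
            # Fail fast: report which stage failed so the owning team can fix it.
            return {"status": "returned_to_dev", "failed_stage": name, "passed": passed}
        passed.append(name)
    return {"status": "deployed", "failed_stage": None, "passed": passed}


# Hypothetical package record and checks, for illustration only.
package = {"name": "routing-core", "unit_tests_ok": True, "integration_ok": False}
stages = [
    ("unit_tests", lambda p: p["unit_tests_ok"]),
    ("integration_tests", lambda p: p["integration_ok"]),
    ("deploy", lambda p: True),
]
print(run_pipeline(package, stages))
```

Because the integration check fails, the deploy stage never runs and the result names the stage the development team needs to look at.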
AI and machine learning in SRE: enhancing predictive insights.
Anomaly detection. AI algorithms identify deviations from expected behavior, enabling
early detection of potential problems.
Predictive maintenance. AI models predict system failures or
resource bottlenecks, allowing for proactive intervention and prevention.
Automated root cause analysis. AI-powered tools help pinpoint the underlying causes of incidents,
leading to faster resolution and improved system understanding.
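As a small illustration of the anomaly detection item, the sketch below flags values that deviate strongly from a recent baseline. It uses a rolling z-score, a basic statistical method chosen here as a stand-in for the AI algorithms mentioned. The window size, threshold, and latency samples are assumptions for the example.

```python
from statistics import mean, pstdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes whose value is more than `threshold` standard deviations
    away from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu, sigma = mean(baseline), pstdev(baseline)
        if sigma == 0:
            # Flat baseline: treat any change at all as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


latency_ms = [100, 102, 98, 101, 99, 100, 250, 101]  # hypothetical samples
print(find_anomalies(latency_ms))  # [6]
```

Only the 250 ms spike is flagged. The sample after it is not, because the spike itself has widened the baseline it is compared against.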
I just mentioned the Tesla example. Now take a networking
example. Say there are 100 Walmarts in the Texas
area, the main Walmart center is in the Dallas area, and
all the networking signals, routers, and everything
are monitored from there.
Think about a Walmart router going down in the Houston area.
If somebody has to go from Dallas to Houston, they have to drive or fly
and then fix that router issue.
Until then, that Walmart won't have any operations. They have to close.
Now say that is automated, the routers and the
software, everything. The person watching
the signals and router health gets a notification that the
Houston router is down, or its signals
are getting dropped, and the firmware or
software needs to be updated.
Then they can push a button, the Houston router's firmware
is updated, and they can fix it without any downtime.
And the business runs normally.
The same thing applies if you think of Apple stores, or
any franchise with multiple locations that share
a common internet connection or common routers.
Keeping the servers and the network up is pretty important.
In those places you can think of having AI and machine learning see
the issues and fix them automatically.
Then, automated root cause analysis.
Think about this. Say the
database is full, and because of the database, healthcare
trials or anything else are going
down. Until somebody comes
and fixes the database, what about auto-restarting it?
We can try that as a first step: auto-restart the database and
capture the error messages in a separate log so that somebody can debug it later and
bring the trials up. Or maybe, before it goes down, auto-tune
it to bring the performance up, or remove unwanted events or
logs to reduce storage use.
We can do something like that.
Then we can get the root
causes and try to fix them.
All of these are important,
and they all involve a proper SRE dealing with the issues
or outages with the help of the development team and DevOps.
So that's how we can enhance predictive insights.
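The "restart first, keep the errors for later" idea can be sketched as a small remediation loop. This is a minimal sketch, not a production runbook: `check_health`, `restart`, and `fetch_errors` are hypothetical hooks supplied by the caller, and `FakeDatabase` exists only so the example runs on its own.

```python
def remediate(check_health, restart, fetch_errors, max_restarts=1):
    """Try a restart before escalating, and keep the error output for later debugging.

    All three callables are hypothetical hooks supplied by the caller.
    """
    report = {"restarts": 0, "errors": [], "resolved": check_health()}
    while not report["resolved"] and report["restarts"] < max_restarts:
        report["errors"].extend(fetch_errors())  # save the evidence before restarting
        restart()
        report["restarts"] += 1
        report["resolved"] = check_health()
    # If the restart did not help, a person still has to look at it.
    report["escalate"] = not report["resolved"]
    return report


class FakeDatabase:
    """Stand-in used only for this example; it recovers after one restart."""

    def __init__(self):
        self.healthy = False

    def is_healthy(self):
        return self.healthy

    def restart(self):
        self.healthy = True

    def recent_errors(self):
        return ["disk usage at 100%"]


db = FakeDatabase()
print(remediate(db.is_healthy, db.restart, db.recent_errors))
```

Capping the number of restarts matters: a restart can bring the service back, but the saved errors are what let someone find and fix the root cause afterwards.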
SRE best practices: from theory to action.
Embrace continuous improvement. Continuously monitor system performance,
identify areas for improvement, and iterate on solutions.
Foster collaboration. Promote open
communication and collaboration between development, operations, and
SRE teams.
Invest in observability. Build robust monitoring and tracing
capabilities for comprehensive system insights and rapid issue resolution.
Prioritize automation. Automate as many processes as possible, streamlining
operations, reducing errors, and freeing up valuable resources.
So let's discuss this one, embracing continuous improvement.
I would say: have regular requirements
gathering and meetings.
Nothing is too small.
If you watch Paw Patrol, the Paw Patrol guy says that
no task is too small. It's the same for SRE: nothing
is too small. You have to find out
what the requirements and issues are, and plan properly.
You need to keep finding the root cause and
fixing it, and it has to be done in an automated way, because
manual actions sometimes cause
human errors.
Nobody is perfect.
So we have to keep things automated so that we can
eliminate a lot of human errors.
Then observability, of a lot of things, not only system
health. Observability in terms of logs, using the ELK Stack,
Splunk, OpenSearch, or whatever tool your company is using.
Using those tools, and collaboration between operations,
development, and SRE teams, is pretty important.
The development team has to give proper
directions to operations.
And the operations and development teams have to give proper
requirements to the SRE teams.
These all go in a circle. Everything
depends on everything else.
So we have to make sure that the end-to-end workflow is right.
On day one it's not possible, but when the
complete group wants to be focused on SRE and SRE principles, eventually
we can achieve that. Then we can have proper SRE teams and groups,
and a good DevOps-to-SRE journey in those teams
as a success mantra.
So, industry-specific considerations: adapting SRE to unique requirements.
If you look at sales, healthcare, or financial, as I mentioned before,
all these industries have unique requirements.
Healthcare has its protocols, and sales has to be up a hundred percent
of the time.
And it has to be very quick.
Say you're trying to buy a MacBook and the page takes
five or ten minutes to load before you can enter your card and check out. Would you
still be interested in getting that MacBook?
You'd log off and log back in. You cannot
wait that long.
Then banking. The system always has to be secure.
If hacks are happening at some bank,
would you keep your money there?
So all these industries are different.
Each has unique requirements, and they all
need SREs.
So, future trends: SRE in the age of AI and edge computing.
AI-driven automation. Further advancements in AI will lead to more sophisticated automation, optimizing
system performance and resilience.
Edge computing. SRE will be crucial for managing distributed systems
at the edge, ensuring reliability and low latency for users.
Serverless and microservices. SRE practices will continue to evolve
alongside the adoption of serverless architectures and microservices.
So yes, these are definitely future trends: as we discussed
before, Kubernetes, serverless deployments, microservices,
and AI-driven automation.
If you look at any company now, everything is AI driven.
It's AI driven because we can do more work in less time.
Think about this. Say I'm writing a hundred-page book
on SRE, DevOps, and a lot of other things, and you ask
an AI, "Give me a hundred-page book
on SRE and DevOps, including a lot of things."
It will give you that in a few minutes, where I would spend a month or two
to get those pages.
So that's the difference.
In the end, you can get to the same result.
You can get the book from the AI, correct the data, and publish it,
or you can take one or two months and write your book.
So with AI-driven automation, you can make
things faster.
Key takeaways: a resilient future.
By embracing advanced SRE principles and integrating them seamlessly
with DevOps practices, organizations can build systems that are not only
reliable, but also highly scalable,
secure, and adaptable to the ever-changing technology landscape.
The future of system resilience lies in the convergence of these powerful
methodologies, enabling businesses to innovate and grow with confidence.
So these are the key takeaways.
What I mean to say is that this is the time. Whatever
your profession is, whether you are in DevOps or development,
a manager or VP, or even in IT sales or program management,
you always need to have an SRE mindset, and your service
or system has to be up a hundred percent.
You always need to be looking for the root cause of any issue:
instead of a quick fix, how can we fix the
root cause and fix it permanently?
So that's my takeaway for getting
a resilient future in IT systems.
Thank you very much for your time, and thank you for listening.
If you have any questions, you can reach out to me on LinkedIn
or any social media platform, and we
can discuss any questions or doubts you have.
Thanks for your time.
Looking forward to the other speakers.