Transcript
This transcript was autogenerated. To make changes, submit a PR.
A recent survey done by KPMG found that 77% of participants thought generative AI is going to have the largest impact on their businesses out of all emerging technologies. In the same survey, 73% of participants believed GenAI will have a large impact on increasing their productivity. Interestingly, 71% of the participants were of the view that they have to implement a GenAI solution within the next two years, and 65% believed GenAI will help their organization gain a competitive advantage. These are some of the numbers that show the impact GenAI is having. GenAI is no longer a hype cycle; it's here to stay, and it will have a large positive impact on any organization.
If you look at the typical business operation functions and where GenAI can typically be implemented, the highest placed is IT, tech and operations. That is by far the standout when compared to other areas like marketing, sales and customer management, product development or R&D, finance and accounting, HR, or risk and legal. So this shows that organizations, and especially the participants of the KPMG survey, viewed IT, tech and operations as the area where most of the GenAI benefits are going to be leveraged.

Hello everyone, my name is Indika Wimalasuriya. Welcome to SRE 2024, organized by Conf42.
As part of my presentation I will discuss SRE 2.0, which is about leveraging GenAI to amplify reliability. GenAI is here to stay, and it definitely allows you to increase the productivity and the reliability of your SRE implementation. As part of this presentation we will discuss the challenges, the roles GenAI is playing, the impact on the key pillars, some of the GenAI use cases and their potential benefits, some implementation strategies, the best practices, and some of the pitfalls you need to avoid.

A quick intro about myself: my name is Indika Wimalasuriya and I'm based out of Colombo, Sri Lanka. I'm a reliability engineering advocate and practitioner. My specializations are site reliability engineering, observability, AIOps and generative AI. I am a passionate technical trainer and an energetic technical blogger; you can find me writing at dev.to. I am a proud AWS Community Builder under cloud operations, and a very proud Ambassador at the DevOps Institute.
If you look at the Gartner hype cycle for SRE, you realize that it's a journey. It's about taking you from understanding the innovation triggers, what is required and what is driving your SRE journey. That can be adopting service level objectives, coming up with monitoring-as-code solutions, infrastructure orchestration, FinOps, or moving toward chaos engineering. As you go through this journey you will pass through a certain set of phases before these practices reach maturity. What I would like to walk through as part of this presentation is that, in some of these areas of a site reliability engineering implementation, we can really expedite things and gain higher impact and higher productivity by leveraging GenAI.
Now, let me take a moment so that we are all on the same page about site reliability engineering. Some of the key principles are: first, reducing organizational silos. That's mainly about getting everyone in the organization to be responsible for the customer experience, because something that is sometimes lacking in organizations is clarity on who owns the customer experience, and the team that owns it has to have the technical skills for it as well. Then it's about accepting failures as normal. Our targets are no longer 100% reliability or availability or 100% of anything; it's about considering the business needs, our deployment architecture and its limitations, and then identifying acceptable service level objectives. Then we follow gradual changes. We are not planning any big bang changes. These gradual changes help us not only to keep a decent velocity with the increased number of changes we make in the production environment, but also, in case of any issue, to revert those changes. And of course we want to leverage tooling and automation. It's about automating this year's job away, and bringing an automation mindset to anything and everything we are doing. Finally, measure everything. It's about having your service level indicators and making them actionable, building your service level objectives and error budgets, and making the error budgets consequences-driven, so that the customer experience and the service level objectives drive not only operations but your entire organization.
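To give a rough example of what consequences-driven error budgets look like: a 99.9% availability SLO over a 30-day window leaves an error budget of about 43 minutes of downtime (0.1% of 43,200 minutes), and once that budget is burned, the agreed consequence, for example pausing risky releases, kicks in.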
Some of the fundamentals when you are trying to achieve all of this: first, you want to work on your observability. The world has moved from monitoring to observability, so you want to make sure you have a comprehensive, solid observability platform. Then you have to identify the service level indicators and service level objectives which are directly correlated with your customer experience, so that in case of any issue you are able to identify it through your SLIs and SLOs. Then you have to define your error budgets accordingly. Here you have to practice the accept-failures-as-normal concept as well, and you will have to define a consequences-driven approach when it comes to error budgets.
Then you will have to look at how you are going to improve your system architecture and your deployment architecture, and what your recovery objectives are, like RPO and RTO. Then it's about how you are going to work on your release engineering and incident engineering. It's about having CI/CD pipelines, integrating all your test automation into those pipelines, and about code quality and everything else that ensures a higher degree of reliable releases. It's also about making sure you have correct, agreed incident automation workflows so that you can automate some of those remediations. Automation is key: it starts from building infrastructure, leveraging things like infrastructure as code or observability as code, deployment automation using your CI/CD pipelines, and managing your capacity growth by embracing techniques like autoscaling.
Then finally it's about resilience engineering: doing chaos engineering and understanding the failure scenarios. Here you will have to do the discovery, understand your steady state, then come up with the hypotheses, the failure scenarios and the test cases, and then go through it in a continuous manner while observing your system. And finally, organizational culture and awareness is very important. You will have to have a blameless culture where you understand that failures are normal, and with every issue you are trying to understand how you can improve and how you can build reliability and fault tolerance into your system.
Modern distributed systems are very complex, and one of the reasons is that we have transformed from monoliths to microservices, and from standard monitoring to observability, which means logs, metrics, traces and events. We have also transformed from on-premise to cloud, whether hosted in multiple clouds, what we call poly-cloud, or in a hybrid approach, on-premise plus cloud. So the microservices, the observability layers and your cloud are responsible for generating a huge amount of data, and that has opened up an expansion of data sources. While this exponential growth is a technical advancement, we are able to manage our systems and our customer experience, deliver things much faster and identify things much faster, we have also built a lot of failure scenarios into these systems. There are a lot of touch points where things can go wrong. So this is a challenge, and it is a challenge for site reliability engineering. This is the area where everyone is focusing on coming up with innovative solutions.
As you might know, when you are starting your SRE journey, you interpret it as "operations is a software problem." I used to always think that if our software were in perfect shape, we might not need large operations teams, or even, to some extent, a high number of SREs. Why? Because usually your incident management, your problem management, your capacity work and most of the operations-related work come down to software that has not been in great shape, or to tactical work, toil, or manual work we have introduced, which could have been built into the system itself. So the idea is that if your system is in better shape, then you have a limited scope when it comes to operations. And this comes down, fundamentally, to the quality of the code we are usually writing.
One of the surveys found that by 2026, 50% of the code in our systems will be code generated through generative AI. That is a big number; it means that half of the code in future will be written by generative AI. Understandably, it will have fewer manual mistakes and will align to better coding practices. And while generative AI might not be able to exceed human creativity, it can definitely ensure that a process and a set of practices are followed, so there will be a higher degree of code quality. What this means is that we are able to move toward that ideal state where our software will work more reliably in the future. And this leads to the point that we want to find more ways to leverage generative AI: not only does it provide this reliability benefit from the software perspective, the other aspects of SRE can also leverage generative AI so that we can amplify this reliability. That's why I firmly believe SRE 2.0, which is about adopting generative AI, will amplify your reliability.
Generative AI is about letting machine learning models come up with new, creative content, in the form of text, images or audio. This is very powerful, because the combinations and the different ways we can leverage it have opened up a lot of opportunities in the software development and maintenance area. This is a nice view of what LLMs can do, as considered by Gartner. What we can see is that we are able to input natural language, structured data, multilingual text and transcriptions. And what are the capabilities of large language models? They are able to do text or code generation, text completion, text classification, text summarization, text translation, sentiment analysis, text correction, text manipulation, named entity recognition, question answering, style translation, format translation and simple analytics. The outputs, naturally, can be natural language text or structured data, again multilingual text, or computer code. So you can see there is a lot of work that can be done around computer code, and that literally means a lot of coding: not only TypeScript or Java, but Python, automation, shell scripting. A lot of these aspects we are able to hand over to GenAI, and this opens up a lot of opportunities. Once you go through these in detail, these are some of the great capabilities we are able to leverage to amplify our reliability.
While large language models are opening up a lot of opportunities, before going into the areas where we can use them to amplify our reliability, I want to flag and highlight some of the risks they carry as well. When it comes to LLMs, models may have model bias: based on the training data set, a model can build a certain degree of bias toward one side or the other. There can be misinformation, lack of context and limits to creativity. When it comes to misuse, it can show up as cyberbullying, fraud or malware creation, and there are usage-related risks as well. If you look at it, some of these are genuine concerns, and some of these risks you are able to eliminate by adopting best practices and following guidelines and ethical GenAI implementation workflows. A few of the limitations and challenges of GenAI we are able to handle using three aspects of generative AI capabilities: the first is RAG, or using a knowledge base; the second is how we can leverage LLM agents; and the third is how we work with prompt engineering.
RAG, which stands for retrieval-augmented generation, allows us to keep our LLMs up to date. Generally what happens is that an LLM has been trained on a certain data set, and once we deploy it in our organization and try to have it work on our organizational data, we might see a problem where the trained data set is not inherently correlated with what we want. So what we can do is integrate a knowledge base, what we call a vector database, with all our organizational data, which can be observability data, design documents and architecture diagrams, your ITSM data, the CMDB, all of that, and feed it in. That way, when we make a prompt to the large language model, we first go and search the knowledge base to retrieve the relevant information, and this relevant information improves the context. We go to the LLM with the prompt plus the enhanced context, and that results in better output generation from the LLM. This ensures you are able to give more up-to-date and relevant data to the LLM. For example, say you want to build code automation and you already have a repository of remediation scripts: you are building a remediation engine, and you have hundreds of remediation scripts your teams have built. You can introduce them as part of the knowledge base so the LLM is aware of that capability. The LLM might already be able to do a good job on its own, but you give it that context, the aspects of what you already have. So this is a great way RAG allows you to keep your LLMs up to date and enhance the context.
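As a rough illustration of that retrieve-then-augment flow, here is a minimal Python sketch. The search_knowledge_base and call_llm functions are hypothetical placeholders standing in for your vector database and your model provider's SDK, so treat this as the shape of a RAG solution rather than working integration code.

# Minimal RAG sketch: retrieve organization-specific context, then prompt the LLM.
# search_knowledge_base() and call_llm() are hypothetical placeholders for your
# vector database and your model provider's SDK.

def search_knowledge_base(query: str, top_k: int = 5) -> list[str]:
    """Placeholder: embed the query and return the top_k most similar documents
    (runbooks, design docs, CMDB extracts, past remediation scripts)."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Placeholder: invoke your LLM endpoint (for example via Amazon Bedrock)."""
    raise NotImplementedError

def answer_with_rag(question: str) -> str:
    # 1. Retrieve the most relevant internal documents for this question.
    documents = search_knowledge_base(question)
    context = "\n\n".join(documents)
    # 2. Augment the prompt so the model answers from up-to-date organizational
    #    data instead of relying only on its training set.
    prompt = (
        "You are assisting an SRE team. Use only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    # 3. Generate the answer with the enhanced context.
    return call_llm(prompt)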
The other aspect is that we have something called LLM agents. Sometimes you need runtime, real-time data when you are dealing with some of these use case implementations. Here, using AWS Bedrock as the representation, when we give a task to the agent, the agent in turn breaks it into a chain of thought, where you have step one through step n, and at each step you can make API calls, and you can again connect to a knowledge base. The idea is that because you are able to make API calls, you are able to connect to different systems and get more up-to-date, runtime, real-time data, so that the responses the LLMs generate are even more accurate. Once the chain of thought and its actions have been followed, the results are fed back into the agent, and with the task plus the results the agent approaches the LLM. Now the context is improved, it has more background information, and it is able to come up with a better response. This allows us to ensure that the responses we get not only have greater context and greater relevance, but can also be made accurate, and in some areas real-time. For example, if you are creating a remediation script but you want to know the exact IP address where you are going to execute it, you are able to use API calls into your CMDB to identify the correct server and the other details. So this is a great way of ensuring that, when you need to make your results more accurate and some of the data has to be obtained at runtime, you keep your data accurate and relevant while leveraging the capabilities provided by LLMs.
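To make the agent idea a bit more concrete, here is a simplified sketch of that loop in Python: gather real-time facts via an API call, then hand the task plus the results to the LLM. The lookup_cmdb and call_llm helpers and the returned fields are assumptions for illustration; a managed offering such as Agents for Amazon Bedrock orchestrates this chain for you.

# Simplified agent step: pull live data via an API call, then let the LLM
# generate its answer with that data in context. lookup_cmdb() and call_llm()
# are hypothetical placeholders, not a specific vendor API.

def lookup_cmdb(hostname: str) -> dict:
    """Placeholder: query your CMDB/ITSM API for live details of a server."""
    return {"hostname": hostname, "ip": "10.0.0.12", "environment": "prod"}  # example data

def call_llm(prompt: str) -> str:
    """Placeholder: invoke your LLM endpoint."""
    raise NotImplementedError

def remediation_agent(task: str, hostname: str) -> str:
    # Step 1: gather up-to-date facts the model cannot know from its training data.
    server = lookup_cmdb(hostname)
    # Step 2: feed the task plus the API results to the LLM so the generated
    #         remediation targets the right machine.
    prompt = (
        f"Task: {task}\n"
        f"Live server details from the CMDB: {server}\n"
        "Write a safe remediation plan and the exact commands to run."
    )
    return call_llm(prompt)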
And finally, obviously, working with LLMs you have to get your prompts right, so we have the whole area of prompt engineering. It's about giving clear objectives, providing context, and evaluating continuously with iterative refinement. That will help you ensure you get the best out of your LLMs. These three aspects, RAG, leveraging LLM agents, and prompt engineering best practices, will allow you to build a comprehensive and better solution which can mitigate some of the challenges you have with LLMs, and you can amplify the results. There are also some inference properties I have listed, things like temperature, top-p and top-k tokens; using all of those in the right combinations will bring a greater benefit.
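As a small, hedged illustration of those inference properties, the snippet below shows a representative generation configuration; the exact parameter names vary between model providers, so these are indicative values rather than a specific API.

# Representative inference parameters; exact field names differ between providers.
generation_config = {
    "temperature": 0.2,  # low randomness: precise, repeatable answers (good for
                         # remediation scripts); raise it for brainstorming tasks
    "top_p": 0.9,        # nucleus sampling: sample only from the most likely tokens
    "top_k": 50,         # cap the number of candidate tokens considered per step
    "max_tokens": 1024,  # upper bound on the length of the generated response
}

# Passed alongside the prompt to a hypothetical client, for example:
# response = call_llm(prompt, **generation_config)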
And now we are at the important part of my presentation. We understand our site reliability objectives, some of the challenges modern distributed systems present us, the capabilities of large language models, and how we can improve LLM results using these three aspects and the inference properties. With all of this, I firmly believe we are able to go to the next level of site reliability engineering by leveraging GenAI; I call it SRE 2.0. In the next part of my presentation we will look at how we can use GenAI to positively impact your observability; how you can leverage it to improve identifying, measuring and tracking your SLIs, SLOs and error budgets; the ways we can use GenAI for system architecture and recovery objectives; and how we can typically use GenAI in release and incident engineering, automation, resilience, and even in blameless postmortems. For each of these pillars I will come up with a set of use cases I have identified, and we will discuss them in terms of implementation feasibility and business benefits. Then we will pick one of the use cases and go a little deeper into the implementation.
Starting off, observability is a very important aspect of your site reliability engineering. It's about using telemetry data such as your logs, metrics and traces to identify internal system state. When you are starting your observability journey you definitely have challenges where GenAI can be the answer; it can expedite your adoption of observability and even improve the results. So what are some of the use cases I think can be handy? I'm looking at feasibility and business value. With high feasibility and high business value, some of the use cases are: automatically generate anomaly detection models for monitoring system metrics; if you want to come up with your own models and you are in the mode of developing them, LLMs have the capability, and you can cut down the development and implementation effort by leveraging GenAI. You are able to use GenAI to predict potential system bottlenecks and recommend proactive optimizations; as we will see, we can feed in a lot of data and GenAI can do a better job. We are able to use GenAI to analyze log data to automatically identify root causes of performance issues; in this day and age we no longer require humans to spend so much time going through data and identifying root causes, and we are definitely tapping into LLMs for that as well. And of course LLMs offer predictive capabilities, such as predicting future resource utilization trends and recommending scaling strategies. In my mind these are the high-feasibility, high-business-value use cases. Some of the other ones are: identify correlations between different system metrics to enhance troubleshooting, analyze network traffic patterns, or even automate the creation of comprehensive dashboards tailored to user needs. And finally, things like recommending dynamic adjustments to monitoring configuration based on workload changes. These are the areas where, by giving a lot of meaningful input, you are able to get a lot out of LLMs. When you are setting up your observability, fine-tuning it, and trying to adjust and do continuous improvement, you are definitely able to tap into these use cases, and this will help you amplify your implementation of observability.
Here I have picked one use case as an example: analyze log data to automatically identify root causes of performance issues. This is traditionally manual work that our SRE teams do, but what are the options we have? We are able to input our logs, along with other data from the various system components as needed, and LLMs are able to identify patterns, root causes and performance issues. This is a capability LLMs have, and we can definitely include feedback loops and go through this in an iterative way to improve the final output. The ways we can improve this implementation are enhancing some of the algorithms: you can pick an LLM which is more suitable for this work, and you can use RAG concepts to provide more domain-specific data so that, if required, it has your known errors and your best practices. You can also embed effective feedback, and with collaborative feedback loops you are able to improve the outputs over a period of time.
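A minimal sketch of that iterative loop might look like the following, where call_llm is a hypothetical placeholder for your model provider and the prompt wording is only illustrative:

# Sketch: send a log excerpt to the LLM, ask for likely root causes, and fold
# reviewer feedback back into the next iteration. call_llm() is a placeholder.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # your model provider's SDK goes here

def analyze_logs(log_lines: list[str], feedback: str = "") -> str:
    excerpt = "\n".join(log_lines[-200:])  # keep within the model's context window
    prompt = (
        "You are helping an SRE team investigate a performance issue.\n"
        f"Recent application logs:\n{excerpt}\n\n"
        "List the most likely root causes, the evidence for each, and the next "
        "diagnostic step."
    )
    if feedback:
        # Iterative refinement: an engineer's review of the previous answer is
        # included so the next output improves.
        prompt += f"\nReviewer feedback on the previous analysis: {feedback}"
    return call_llm(prompt)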
So this is a great use case where you can obtain a lot of benefits.

Moving on to the second pillar, which is service level indicators, SLOs and error budgets. If you look again at the high-feasibility, high-business-value use cases, one of them is to recommend optimal error budget allocations based on business priorities and user expectations; it is always a challenge to understand what the correct target is. Then we have the option of predicting potential violations of service level objectives and recommending proactive measures to prevent them, so identifying the preventive measures. LLMs are able not only to do a better job coming up with these, they can also tap into vast data to provide better options, recommendations and fix details. Then there are use cases like analyzing user satisfaction metrics to determine the impact of SLA violations on customer experience; whenever SLA violations happen, we are able to get LLMs to do the business impact assessment, which is again very important, because with our service level objectives we want to ensure they are correlated with customer experience, and we now have a better way of predicting that customer experience as well. We are also able to automate the tracking and visualization of error budgets and burn-down rates, and there are things like recommending adjustments to error budgets based on usage patterns and system performance, or generating insights on the relationship between SLIs, SLOs and business KPIs. This is again very important: we always want to ensure our SLOs reflect the true customer experience, so if there are specific KPI misses we want to see how that correlation goes. Finally, there is identifying and prioritizing critical service level indicators based on their impact on user experience and business objectives. Those are typical use cases we are able to leverage that will have an amplifying effect when you are implementing this pillar.
Here I have picked one of the important and interesting use cases: recommend optimal error budget allocations based on business priorities and user expectations. It is always very challenging to come up with your error budgets and your SLO targets. What we can do is input our error budget definitions, business priorities and user expectations, and then let the LLM come up with a recommendation that is relevant to everyone. Again, we are able to use the feedback loop, and we can bring in stakeholder feedback here, which is very important, and iterate to improve the final output. We can let our models adapt based on the business priorities and objectives whenever it's time to relook at the error budgets and the targets. So it's an iterative approach, and the continuous refinement will obviously help you achieve your desired objective.
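As a rough sketch of how those inputs could be framed, the example below computes error budget consumption from illustrative SLI numbers and passes it to the model together with business priorities; the figures and the call_llm helper are assumptions for illustration.

# Sketch: compute error budget consumption from SLI data, then ask the LLM to
# recommend allocations aligned with business priorities. call_llm() is a placeholder.

def call_llm(prompt: str) -> str:
    raise NotImplementedError

slo_target = 0.999                                 # 99.9% availability objective
good_events, total_events = 2_994_000, 3_000_000   # illustrative SLI counts
achieved = good_events / total_events              # 99.8% in this example
budget = 1 - slo_target                            # allowed failure ratio (0.1%)
consumed = (1 - achieved) / budget                 # 2.0 here, i.e. budget overspent

prompt = (
    f"SLO target: {slo_target:.3%}, achieved: {achieved:.3%}, "
    f"error budget consumed: {consumed:.0%} of this period's allowance.\n"
    "Business priorities: checkout is revenue critical; internal reporting is not.\n"
    "User expectation: pages load in under 2 seconds during business hours.\n"
    "Recommend error budget allocations per service and any SLO adjustments, "
    "with reasoning that stakeholders can review."
)
# recommendation = call_llm(prompt)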
Moving on, the third pillar of our SRE is system architecture and recovery objectives. Here again GenAI gives us a lot of use cases which will help you really hit this out of the park, as we say in the games, and really get the benefit of GenAI. Some of the use cases are: predict the impact of different failure scenarios on system availability and performance, so that you can have better RPO and RTO and build resilience into your system architecture; recommend proactive measures to enhance system reliability and minimize downtime; and predict potential recovery times for different types of incidents based on historical data. This will really help you determine your RPO and RTO and then build resilience and fault tolerance into the system architecture. A few other ones are: recommend resilience improvements to the system architecture based on failure modes, automate the creation of disaster recovery plans, and analyze the effectiveness of recovery strategies and recommend optimizations based on past historical data. These are some great use cases which are highly feasible and will provide you business value as well. Then there are some less feasible but still high-business-value use cases, things like: generate recovery objectives based on business requirements and SLA commitments, analyze historical data to identify patterns and trends in system failures, and generate personalized recovery playbooks for common incident scenarios. These are very powerful use cases which, without GenAI, we may not be able to fulfill at all.
Here, the use case I picked is: predict the impact of different failure scenarios on system availability and performance. The inputs we have to provide are historical failure data, system architecture data and performance telemetry data. This allows the LLM to predict the impact various failure scenarios can have on our system, and what impact they can have on availability and performance. The ways we can improve the output are, obviously, incorporating the feedback loop, enhancing the model with additional influencing factors, meaning the other factors which can impact your availability, reliability and performance, and of course the continuous refinement of this approach.
Moving on, the next pillar of site reliability engineering is release and incident engineering. This is a very important aspect, and generative AI here again has lots of use cases which can have a lasting impact. If you look at the high-feasibility, high-business-impact use cases, some of them are: automate the creation of incident response runbooks and playbooks for efficient resolution, where you can automate the entire workflow using GenAI; predict potential incident severity based on incoming alerts and historical data, where you can do the priority classification and understand the impact, so the capabilities are very high; and provide real-time incident response recommendations based on the current situation and historical data. Some of the others are: predict potential release risk based on historical release data and code quality metrics, which is a very important aspect, because you are able to come up with a risk factor not in the manual way but by tapping into generative AI and coming up with something smart, and then work backwards to reduce those risks; analyze the impact of releases on user experience and satisfaction metrics, from design to development of the code to testing, so you can see the impact on your release measures; and recommend optimized release cycles and promotion strategies based on system performance. If you look at further use cases: analyze past incident reports and postmortem analyses to identify common failure patterns, recommend preventive measures to mitigate the risk of incidents during failures, and generate insights on root causes of incidents and recommend long-term solutions to prevent recurrence. Those are about how you can improve your release engineering workflows, how you can predict some of the touch points which can impact your customer experience, and how you can take measures to ensure your workflows are safe and we are delivering quality outputs to our customers. These are some great use cases generative AI is opening up.
One of the use cases I have picked is: provide real-time incident response recommendations based on the current situation and historical data. This is very important in the incident workflows when you are trying to automate your incident management. What are the inputs we can give? We can provide the current incident data, historical data, the system status and various telemetry data, and we can expect the LLM output to be real-time recommendations for incident response and actions, plus other things like the user impact, business impact statements and even prevention of recurrence. I have even tested that you are able to get LLMs to do the 5 Whys; you can do a blameless postmortem with an LLM by giving it all the data. Again, real-time feedback loops, fine-tuning models with the right configurations, providing better context in the problem statement, and continuous refinement are a few ways we can improve the final output.
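A hedged sketch of assembling those inputs is shown below; the helper functions and data shapes are assumptions, and in practice the similar past incidents would typically come from a RAG lookup like the one discussed earlier.

# Sketch: combine the current incident, similar past incidents and live system
# status into one prompt and ask for ranked response actions. call_llm() is a
# placeholder and the data shapes are illustrative.

def call_llm(prompt: str) -> str:
    raise NotImplementedError

def recommend_response(incident: dict, similar_incidents: list[dict],
                       system_status: dict) -> str:
    prompt = (
        f"Current incident: {incident}\n"
        f"Similar past incidents and how they were resolved: {similar_incidents}\n"
        f"Live system status: {system_status}\n"
        "Recommend the next response actions in priority order, describe the likely "
        "user and business impact, and flag anything that would prevent recurrence."
    )
    return call_llm(prompt)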
Automation, as we discussed earlier, plays a key role in the site reliability engineering journey. It's about automating this year's job away; it's not only basic automation, but identifying toil. At a high level, automation is about writing scripts, writing automations for some of the workarounds, or identifying some of the tactical or strategic work you are doing and embedding that work into the systems themselves. For the standard automations, writing your shell scripts, Python scripts and other scripting-heavy automation, the effort you put in to develop those scripts you can now hand over to GenAI, so that GenAI does a faster, more productive job, and you can simply take the result, make your changes and get it up and running very quickly. So the content creation aspect of GenAI you can use really powerfully here. Apart from that, some of the use cases I have come up with are: analyze historical automation data to identify opportunities for process optimization, which is a high-feasibility, high-value use case; automate the deployment and management of infrastructure resources, where you are obviously able to use GenAI to produce your infrastructure as code and accelerate your other CI/CD pipeline development work; and analyze the effectiveness of automation workflows and recommend improvements based on performance metrics. A few others are: recommend automated testing frameworks and tools to improve release code quality, predict the impact of automation on operational efficiency and resource utilization, and generate personalized automation playbooks for different operational scenarios. Then there are the more challenging ones to implement, which are less feasible but again high business value: recommend new automation opportunities based on manual workflows by looking at repetitive tasks, predict potential bottlenecks in manual processes and suggest automation solutions, and automate the identification and prioritization of repetitive tasks, the toil. These are some of the great use cases generative AI is able to facilitate that can really accelerate things and help us achieve our automation aspirations.
The use case I have picked up here is: analyze the effectiveness of automation workflows and recommend improvements based on performance metrics. We are very good at identifying automation use cases and then doing the automation, but after that we sometimes lack the ability to understand the effectiveness, how to measure it, and how to come up with further improvements on top of it to really drive it home. Here we can input the automation workflow data, performance metrics and other historical data, and we are able to get LLMs to provide an analysis of automation workflow effectiveness and recommendations for improvements based on the vast knowledge they have. Obviously we have the option of improving this output as well: we can provide more context-driven feedback, and LLM fine-tuning and continuous learning will obviously help us improve the output.
Moving on, resilience engineering is one of the important aspects. We want to understand our failure scenarios, and then we want to do that failure testing, the chaos testing, so that we can understand and improve our system resilience and fault tolerance. This is a great area for implementing generative AI in a chaos engineering workflow. I'm not going to discuss it too much here, but if you look at my previous presentation, as part of Chaos Engineering 2024 at Conf42, I presented an autonomous chaos engineering workflow, where I discussed how we can leverage GenAI from start to end: how we can use it to do the system discovery, to understand your system dependencies, then come up with the system steady state, and then come up with the hypotheses, the failure scenarios and the test cases. You can even get generative AI to automate your test creation, and then you can do the test execution, so that you can create an autonomous workflow and integrate it with your CI/CD pipeline. So this is again a very powerful area where you are able to leverage generative AI, and you can probably automate the entire chaos engineering workflow.
Here again, let me pick up one of the use cases: automate the execution of chaos experiments based on the identified risk factors and failure scenarios. Once you have your test cases identified, you can share your test cases, identified risk factors, failure scenarios, system architecture and other data, and the LLMs are able to automate the execution: they can come up with infrastructure as code, or the automation scripts or solutions we can promptly use to run the experiments, so you cut down the time to develop your execution scripts. This is a great way, and in future this will help us realize autonomous chaos engineering workflows.
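As a rough illustration, the sketch below asks the model to turn an identified failure scenario into an executable chaos experiment; the scenario details, the call_llm helper and the idea of writing the result to a script file are illustrative assumptions, and anything generated would still be reviewed before it touches an environment.

# Sketch: generate a chaos experiment from an identified failure scenario and
# save it for review before execution. call_llm() is a placeholder.

def call_llm(prompt: str) -> str:
    raise NotImplementedError

failure_scenario = {
    "target": "payments-api deployment on Kubernetes",
    "steady_state": "p99 latency < 300 ms and error rate < 0.5%",
    "hypothesis": "losing one pod does not break the steady state",
    "blast_radius": "one pod in the staging namespace only",
}

prompt = (
    f"Failure scenario: {failure_scenario}\n"
    "Write a chaos experiment script that injects this failure, verifies the "
    "steady state before and after, and rolls back automatically. Include clear "
    "comments and abort conditions."
)

experiment_script = call_llm(prompt)
# An engineer reviews the generated script, then it runs from the CI/CD pipeline
# as one step of the autonomous chaos engineering workflow.
with open("chaos_experiment.sh", "w") as f:
    f.write(experiment_script)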
Moving on, the final aspect is how you can use GenAI to improve your culture and awareness when it comes to site reliability engineering. GenAI can definitely add a lot of value here. You can have chatbots which improve knowledge and awareness, and it can embed itself in different aspects of our processes so that it is able to provide better feedback. One aspect is that GenAI has the capability to look at our incident data and come up with blameless postmortems. In one of our examples, we were able to automate the entire blameless postmortem, up to a certain level fully automated with GenAI, and here is how we have done it. You have a major incident, and all the data about what has happened is captured as part of your ticket update data; if it is ServiceNow, the ServiceNow ticket will have all the data. What you can do is feed this data, with other information, to your large language model, and with that it is able to do a proper incident prevention-of-recurrence analysis: it is able to understand the business impact, the workflow of what has happened, the tasks, and how you have fixed the issue. It is able to go through the 5 Whys to understand the root cause, and it is able to come up with the preventive measures and suggest the short-term and long-term fixes you can drive.
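To make that concrete, here is a minimal sketch of feeding ticket data into the model and asking for a structured, blameless postmortem draft; the ticket fields and the call_llm helper are illustrative assumptions rather than the actual ServiceNow API.

# Sketch: turn incident ticket data into a draft blameless postmortem.
# call_llm() is a placeholder; the ticket fields are illustrative only.

def call_llm(prompt: str) -> str:
    raise NotImplementedError

ticket = {
    "summary": "Checkout API returned 5xx errors for 42 minutes",
    "timeline": ["02:10 alert fired", "02:25 rollback started", "02:52 resolved"],
    "work_notes": "Config change raised the DB connection pool beyond the instance limit",
    "business_impact": "Roughly 3% of checkout attempts failed during the window",
}

prompt = (
    f"Incident ticket data: {ticket}\n"
    "Draft a blameless postmortem with these sections: business impact, timeline, "
    "5 Whys root cause analysis, what went well, short-term fixes, and long-term "
    "preventive measures. Focus on systems and processes, not individuals."
)

draft_postmortem = call_llm(prompt)  # reviewed and edited by the team before publishing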
These are some of the great use cases you can implement, and you can see some of them here as well. The use case I have picked is: analyze historical postmortem data to identify recurring patterns and trends in incidents. This is one of the more challenging aspects: we are very good at doing postmortems, but how do you compare and analyze a lot of postmortems, or a certain set of postmortems you have done, to understand the recurring patterns? Here you can provide the LLM with historical postmortem data, incident reports, root cause analyses and other material, and it is able to identify recurring patterns and trends, and then identify incident recurrences and root causes. Here again we can include some of the context-driven observability data and other system context so that the LLM can come up with better outputs, and we can provide a lot of feedback and build a feedback loop to get a better output as well.
With this we have closed out going through our seven pillars: we have looked at the use cases, and for each pillar we have looked at a particular high-value use case and how we can implement it. So finally, what are the benefits? All of this has to translate into tangible benefits. This will amplify your reliability targets; that is what I firmly believe. Once you embrace GenAI and use it to amplify your SRE implementation, you should be able to improve your service level objectives aligned to error budgets, increase your change frequency, reduce change failure rates, and reduce the lead time for changes, because now your developers are able to do faster development using GenAI, and you are also able to manage the risk of a particular change: you have the ability to track it, de-risk it and deploy it into production. And obviously mean time to detect, mean time to repair and mean time between failures can all be positively impacted by leveraging GenAI.
And what are the best practices? While generative AI brings in a lot of opportunities for you to amplify your site reliability experience, you have to ensure you follow some best practices. Have a clear objective. Understand that LLMs are only as good as their training data set, so you have to leverage the techniques I described earlier: leveraging RAG or a knowledge base, using LLM agents, and using the proper prompt configurations, so that you can provide better context and get better outputs. Feedback loops and continuous evaluation are a must; as you have seen, in almost all the use cases I have flagged the feedback loops and the fact that we have to provide more context and more refinement, which helps us get better output. And you have to be very careful about the ethical considerations and ensure you understand ethical generative AI implementation as well. These are the best practices that will obviously help you in the long run.
And finally, what are the pitfalls you want to avoid? First, make sure of the ethical considerations, because that is a big part. You may have seen the meme going around town that says: don't ask a lady her age, a gentleman his salary, or an LLM about its training data set. So that's one of the challenges: how we have trained our LLMs and what training data went into them. There has to be a lot of consideration of the ethical aspects so that you do not fall apart later; this is something you have to be mindful of from day one, when you are coming up with your solution and the design. You also have to have a proper validation plan: whatever output you are getting from generative AI, you validate it and build in feedback loops. That is very important. Then you want to avoid treating generative AI models as static solutions; instead you want to regularly update them, refine them, and adapt them to evolving requirements and environments. You have to understand this is an evolving thing; you have to provide a lot of context and go through these feedback loops and continuous improvements, and that has to be built in. It is not enough that you understand and accept this; it has to be built into your solution workflows so that you can leverage it in future.
Generative AI is obviously already having a full-fledged impact on the way we are working. I firmly believe that by leveraging generative AI smartly, you are able to implement SRE 2.0, which will amplify your reliability targets. With that, thank you very much for listening in. If you have any feedback, you can comment on this video, or you can search for Indika Wimalasuriya on LinkedIn and get in touch with me. I'm happy to hear your feedback and your thoughts, and collectively let's amplify our SRE journey.