Conf42 Site Reliability Engineering (SRE) 2024 - Online

SRE 2.0: Amplifying Reliability with GenAI

Abstract

Elevate your SRE 2.0 practices with GenAI integration. Discover how GenAI addresses traditional challenges, amplifying reliability, scalability, and efficiency. Explore real-world use cases, implementation strategies, and future trends for navigating reliability enhancements in the digital age.

Summary

  • A recent survey by KPMG found that 77% of participants thought GenAI would have the largest impact on their businesses out of all emerging technologies. 73% believed GenAI would have a large impact on increasing their productivity, and 71% were of the view that they would need to implement a GenAI solution within the next two years.
  • Indika Wimalasuriya is a site reliability engineering advocate and practitioner. His specializations are site reliability engineering, observability, AIOps, and generative AI. He discusses SRE 2.0, which is about leveraging GenAI to amplify reliability.
  • Modern distributed systems are very complex. SRE 2.0, which means adopting generative AI, will amplify your reliability. By 2026, 50% of the code in our systems is expected to be written by AI. This opens up opportunities in software development and maintenance.
  • Observability is a very important aspect of site reliability engineering. Use cases include automatically generating anomaly detection models for monitoring system metrics. These are high-feasibility, high-business-value use cases that will help you amplify your observability implementation.
  • For example, analyze log data to automatically identify root causes of performance issues. With collaborative feedback loops you are able to improve the outputs over time. This is a use case where you can obtain a lot of benefits.
  • The second pillar is service level indicators, SLOs, and error budgets. One use case is recommending optimal error budget allocations based on business priorities and user expectations. Another is analyzing user satisfaction metrics to determine the impact of SLA violations on customer experience.
  • The third pillar of SRE is system architecture and recovery objectives. There are many use cases here that can really pay off, such as predicting the impact of different failure scenarios on system availability and performance.
  • The next pillar of site reliability engineering is release and incident engineering. Here again there are many use cases that can have a lasting impact, several of them highly feasible with high business impact.
  • Generative AI also opens up use cases such as providing real-time incident response recommendations based on the current situation and historical data. Automation plays a key role in the site reliability engineering journey, and these are use cases generative AI can facilitate and accelerate.
  • Generative AI can also be implemented in a chaos engineering workflow, for example to automate the execution of chaos experiments based on identified risk factors and failure scenarios. In future this will help us realize autonomous chaos engineering workflows.
  • GenAI can add a lot of bandwidth on culture and awareness. It can power chatbots that improve knowledge and awareness. One use case is analyzing historical postmortem data to identify recurring patterns and trends in incidents.
  • Once you embrace GenAI to amplify your SRE implementation, you should be able to improve your service level objectives aligned to error budgets. You also have to be careful about ethical considerations and ensure you understand ethical generative AI implementation.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
A recent survey done by KPMG found that 77% of participants thought GenAI is going to have the largest impact on their businesses out of all emerging technologies. In the same survey, 73% of participants believed GenAI will have a large impact on increasing their productivity. Interestingly, 71% of the participants are of the view that they need to implement a GenAI solution within the next two years, and 65% believe GenAI will help their organization gain a competitive advantage. These are some of the numbers that show the impact GenAI is having. GenAI is no longer a hype cycle; it is here to stay, and it will have a large positive impact for any organization. If you look at the typical business operation functions and where GenAI is typically implemented, the highest placed is IT, tech and operations. That is by far the standout when compared with other areas like marketing and sales, customer management, product development and R&D, finance and accounting, HR, or risk and legal. So most of the participants of the KPMG survey were of the view that the majority of GenAI benefits will be realized in IT, tech and operations. Hello everyone, my name is Indika Wimalasuriya. Welcome to SRE 2024, organized by Conf42. As part of my presentation I will discuss SRE 2.0, which is about leveraging GenAI to amplify reliability. GenAI is here to stay, and it definitely allows you to increase the productivity and the reliability of your SRE implementation. In this presentation we will discuss the challenges, the role GenAI is playing and its impact on the key SRE pillars, some GenAI use cases and their potential benefits, implementation strategies, best practices, and some of the pitfalls you need to avoid. A quick intro about myself: my name is Indika Wimalasuriya, and I am based out of Colombo, Sri Lanka. I am a reliability engineering advocate and practitioner. My specializations are site reliability engineering, observability, AIOps and generative AI. I am a passionate technical trainer and an energetic technical blogger; you can find me writing at dev.to. I am a proud AWS Community Builder under cloud operations and a very proud ambassador at the DevOps Institute. If you look at the Gartner hype cycle for SRE, you realize that it is a journey. It takes you from the innovation triggers, the things driving your SRE journey, which can be adopting service level objectives, monitoring as code, infrastructure orchestration, FinOps, or chaos engineering. As you go through this journey you work through a set of stages before eventually reaching the plateau of productivity. What I would like to walk through in this presentation is that in some of these areas of a site reliability engineering implementation, we can really expedite things and gain higher impact and higher productivity by leveraging GenAI. Let me take a moment so that we are all on the same page about site reliability engineering. One of its key principles is reducing organizational silos; mainly, it is about getting everyone in the organization to be responsible for the customer experience.
Because something that is sometimes lacking in organizations is clarity on who owns the customer experience, and the person or team who owns it needs to have the technical skills for it as well. Then it is about accepting failures as normal. Our targets are no longer 100% reliability or availability or 100% of anything; it is about considering the business needs, our deployment architecture and its limitations, and then identifying what the acceptable service level objectives are. Then we follow gradual changes. We are not planning any big bang changes. Gradual changes help us not only to handle the volatility from the increased number of changes we make in the production environment, but also, in case of any issue, to revert those changes. And of course we want to leverage tooling and automation. It is about automating yourself out of the job, and about bringing an automation mindset to anything and everything we do. Finally, measure everything: have your service level indicators, make them actionable, build your service level objectives and error budgets, and make the error budgets consequence-driven so that customer experience and service level objectives drive not only operations but your entire organization. When you are trying to achieve all of this, the fundamentals are: first, work on your observability. The world has moved from monitoring to observability, so you want a comprehensive, solid observability platform. Then you have to identify the service level indicators and service level objectives that are directly correlated with your customer experience, so that in case of any issue you are able to identify it through your SLIs and SLOs. Then you have to define your error budgets accordingly; here you have to practice the accept-failures-as-normal concept as well, and you have to define a consequence-driven approach for error budgets. Then you look at how you are going to improve your system architecture and deployment architecture and what the recovery objectives are, such as RPO and RTO. Then there is release engineering and incident engineering: having CI/CD pipelines, integrating all your test automation with those pipelines, code quality, all the things that ensure a higher degree of reliable releases, and making sure you have correct and agreed incident automation workflows so you can automate some of the remediations. So automation is key. It starts from building infrastructure, leveraging things like infrastructure as code, observability as code and deployment automation using your CI/CD pipelines, and managing capacity growth by embracing techniques like autoscaling. Finally there is resilience engineering: doing chaos engineering and understanding your failure scenarios. Here you do discovery, understand your steady state, come up with the hypotheses, the failure scenarios and the test cases, and then run through them in a continuous manner while observing your system.
And finally, organizational culture and awareness are very important. You have to have a blameless culture where you understand that failures are normal, and for every issue you try to understand how you can improve and how you can build reliability and fault tolerance into your system. Modern distributed systems are very complex, and one of the reasons is that we have transformed from monoliths to microservices, from standard monitoring to observability, which means logs, metrics, traces and events, and from on-premise to cloud. You might be hosted in multiple clouds, what we call poly-cloud, or have a hybrid approach of on-premise plus cloud. The microservices, the observability layers and the cloud generate a huge amount of data, which has opened up an expansion of data sources. While this exponential growth is a technical advancement, because we are able to manage our systems and our customer experience, deliver things much faster and identify issues much faster, we have also built a lot of failure scenarios into these systems. There are a lot of touch points where things can go wrong. So this is a challenge, and it is a challenge for site reliability engineering; this is the area where everyone is focusing on coming up with innovative solutions. As you might know, when you start your SRE journey, you interpret it as "operations is a software problem". I used to always think that if our software were in perfect shape, we might not need large operations teams, or even, to an extent, a high degree of SREs. Why? Because usually your incident management, problem management, capacity work and most of the operations-related work comes down to software not being in great shape, or to tactical work, toil, or manual work we have introduced that should have been built into the system itself. The idea is that if your system is in better shape, the scope of operations is limited. So it comes down fundamentally to the quality of the code we are writing, and one survey found that by 2026, 50% of the code in our systems will be generated through generative AI. That is a big number; it means half of the code in future will be written by generative AI. Understandably it will have fewer manual mistakes and it will align to better coding practices. While generative AI might not exceed human creativity, it will follow processes and practices, and there will be a higher degree of code quality. The result is that we can move towards that ideal state where our software works more reliably in future. And this means we want more ways to leverage generative AI: not only does it provide reliability from the software perspective, but the other aspects of SRE can also leverage generative AI so that we can amplify reliability. That is why I firmly believe SRE 2.0, which is adopting generative AI, will amplify your reliability. Generative AI is about letting machine learning models come up with new, creative content, which can take the form of text, images or audio.
This is very powerful, because the combinations and the different ways we can leverage it have opened up a lot of opportunities in the software development and maintenance area. Here is a useful view of what LLMs can do, as described by Gartner. We are able to input natural language, structured data, multilingual text and transcriptions. The capabilities of large language models include text and code generation, text completion, text classification, text summarization, text translation, sentiment analysis, text correction, text manipulation, named entity recognition, question answering, style translation, format translation and simple analytics. The outputs, naturally, can be natural language text, structured data, multilingual text or computer code. As you can see, a lot of work can be done with computer code, and that literally means a lot of coding: not only TypeScript or Java, but also Python, automation and shell scripting. A lot of these aspects we can hand to GenAI, and this opens up many opportunities. These are some of the great capabilities we are able to leverage to amplify our reliability. While large language models open up a lot of opportunities, before going into how we can use them to amplify reliability, I want to flag some of the risks they carry as well. LLMs may have model bias: based on the training data set, a model can build a certain degree of bias one way or another. There can be misinformation, lack of context, and limits to creativity. When it comes to misuse, there is cyberbullying, fraud and other malware, and there are usage-related risks as well. Some of these are inherent, and some of these risks you can eliminate by adopting best practices and following guidelines for ethical GenAI implementation. A few of the limitations and challenges of GenAI we can handle using three aspects of generative AI capability: RAG, or using a knowledge base; leveraging LLM agents; and prompt engineering. RAG, which stands for retrieval augmented generation, allows us to keep our LLMs up to date. Generally what happens is that an LLM has been trained on a certain data set, and once we deploy it in our organization and try to have it work with our organization's data, we might find that the training data is not inherently correlated with what we want. What we can do is integrate a knowledge base, a vector database, with all our organization's data: observability data, design documents and architecture diagrams, ITSM data, CMDB and so on, and feed that in. That way, when we make a prompt to the large language model, we first go to the knowledge base to retrieve the relevant information, and this relevant information improves the context.
When we then go to the LLM, we go with the prompt plus the enhanced context, and that results in better output generation from the LLM. This ensures you are able to give more up-to-date and relevant data to the LLM. For example, say you want to build code automation and you already have a code repository of scripts: you are building a remediation engine and your teams have built hundreds of remediation scripts. You can introduce them as part of the knowledge base so the LLM is aware of that capability. The LLM might already be able to do a good job, but you give it that context, the awareness of what you already have. So RAG is a great way to keep your LLMs up to date and enhance the context. The other aspect is LLM agents. Sometimes you need runtime, real-time data when you are implementing these use cases. Here, and this is the representation used in AWS Bedrock, when we give a task to the agent, the agent in turn breaks it into a chain of thought, step one through step n, and in each step it can make API calls and can again connect to a knowledge base. The idea is that because you can make API calls, you can connect to different systems and get more up-to-date, runtime, real-time data, so that the responses the LLMs generate are even more accurate. Once the chain-of-thought actions are completed, those results are fed back into the agent, and with the task plus the results the agent approaches the LLM. Now the context is improved, there is more background information, and the LLM can come up with a better response. This ensures the responses we get have not only greater context and greater relevance, but also accuracy, and in some areas in real time. For example, if you are creating a remediation script but you need to know the exact IP address you are going to execute it on, you can use these API calls into your CMDB to identify the correct server and other details. This is a great way to keep your data accurate and relevant while leveraging the capabilities provided by LLMs, especially when some of the data has to be obtained at runtime. And finally, working with LLMs you have to get your prompts right, and there is a whole discipline of prompt engineering: giving clear objectives, providing the context, and evaluating continuously with iterative refinement. That will help you get the best out of your LLMs. These three aspects, RAG, LLM agents, and prompt engineering best practices, allow you to build a comprehensive and better solution that can mitigate some of the challenges you have with LLMs and amplify the results. There are also the prompt properties I have listed, things like temperature, top-p and top-k tokens; using them in different combinations brings further benefits.
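To make this concrete, here is a minimal sketch of the RAG-plus-prompt pattern just described. It is illustrative only: the knowledge base entries, the retrieve helper and call_llm() are placeholders rather than any specific product's API (a real setup would use an embedding model, a vector database and your provider's SDK), and the temperature and top-p values are just example settings.

```python
# Minimal RAG sketch: retrieve organization-specific context from a small knowledge
# base, fold it into the prompt, and call the model. Everything here is a placeholder.

KNOWLEDGE_BASE = [
    "Runbook: high checkout latency is usually caused by connection-pool exhaustion on the orders DB.",
    "Architecture note: the payments service talks to the orders DB through pgbouncer.",
    "Remediation script restart_pgbouncer.sh exists in the ops-scripts repository.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Toy retriever: rank documents by keyword overlap with the query.
    A real implementation would use embeddings and a vector database."""
    q_words = set(query.lower().split())
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda doc: len(q_words & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def call_llm(prompt: str, temperature: float = 0.2, top_p: float = 0.9) -> str:
    """Placeholder for whichever LLM endpoint you use (Bedrock, OpenAI, self-hosted...)."""
    return f"[model response for prompt of {len(prompt)} chars]"

def answer_with_rag(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = (
        "You are assisting an SRE team. Use only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer with concrete remediation steps."
    )
    # Low temperature keeps operational recommendations conservative and repeatable.
    return call_llm(prompt, temperature=0.2, top_p=0.9)

print(answer_with_rag("Checkout latency is spiking, what should we check first?"))
```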
And now we are at the important part of my presentation. We understand our site reliability objectives, the challenges modern distributed systems present, the capabilities of large language models, and how we can improve LLM results in three ways, including how we tune the prompt properties. With all of this, I firmly believe we can go to the next level of site reliability engineering by leveraging GenAI; I call it SRE 2.0. In the next part of the presentation we will look at how we can use GenAI to positively impact your observability; how you can leverage it to improve identifying, measuring and tracking your SLIs, SLOs and error budgets; the ways we can use GenAI for system architecture and recovery objectives; and how we can use GenAI in release and incident engineering, automation, resilience, and even blameless postmortems. For each of these pillars I will present a set of use cases I have identified, discuss them in terms of implementation feasibility and business benefit, and then pick one use case and go a little deeper into the implementation. Starting off, observability is a very important aspect of site reliability engineering. It is about using telemetry data such as logs, metrics and traces to understand internal system state. When you are starting your observability journey, you definitely have challenges where GenAI can be the answer; it can expedite your adoption of observability and even improve the results. So what are some of the use cases I think can be handy, looking at feasibility and business value? The high-feasibility, high-business-value use cases are: automatically generate anomaly detection models for monitoring system metrics, where LLMs can cut down the development and implementation effort if you are building these models yourself; use GenAI to predict potential system bottlenecks and recommend proactive optimizations, where we can feed in a lot of data and GenAI can do a better job; use GenAI to analyze log data and automatically identify root causes of performance issues, because in this day and age we no longer need humans to spend so much time digging through logs for root causes, we can tap LLMs for that as well; and of course LLMs can predict future resource utilization trends and recommend scaling strategies. In my mind these are the high-feasibility, high-business-value use cases. Some of the others are: identify correlations between different system metrics to enhance troubleshooting, analyze network traffic patterns, automate the creation of comprehensive dashboards tailored to user needs, and recommend dynamic adjustments to monitoring configuration based on workload changes. Those are the areas where, by giving meaningful input, you are able to get a lot out of LLMs.
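As a flavour of that first use case, here is the kind of simple baseline anomaly detector an LLM could plausibly generate for a metric stream. This is only an illustrative sketch; the rolling window size and z-score threshold are assumptions you would tune for your own metrics.

```python
# Illustrative baseline anomaly detector for a metric stream: rolling mean and
# standard deviation with a z-score threshold. Window and threshold are assumptions.
from collections import deque
from statistics import mean, stdev

def detect_anomalies(values, window=30, z_threshold=3.0):
    """Yield (index, value, z_score) for points that deviate strongly from the
    recent rolling baseline."""
    recent = deque(maxlen=window)
    for i, v in enumerate(values):
        if len(recent) >= window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0:
                z = (v - mu) / sigma
                if abs(z) >= z_threshold:
                    yield i, v, round(z, 2)
        recent.append(v)

# Example: steady latency around 120 ms with one obvious spike.
latency_ms = [120 + (i % 5) for i in range(60)] + [480] + [121, 119, 122]
for idx, value, z in detect_anomalies(latency_ms):
    print(f"sample {idx}: {value} ms looks anomalous (z={z})")
```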
When you are setting up your observability, fine-tuning it, and doing continuous improvement on it, you can definitely tap into these use cases, and they will help you amplify your implementation of observability. Here I have picked one use case as an example: analyze log data to automatically identify root causes of performance issues. This is traditionally manual work that our SRE teams do. What are the options we have? We can input our logs, along with data about the various system components involved, and LLMs can identify patterns, root causes and performance issues. That is a capability LLMs have, and we can include feedback loops and iterate to improve the final output. The ways we can improve this implementation are: pick an LLM that is more suitable for this work; use RAG concepts to provide more domain-specific data, so that if required your known errors and best practices are embedded in the knowledge base; and provide effective feedback. With collaborative feedback loops you can improve the outputs over a period of time. So this is a great use case where you can obtain a lot of benefits. Moving on to the second pillar, which is service level indicators, SLOs and error budgets. Looking again at the high-feasibility, high-business-value use cases: one is to recommend optimal error budget allocations based on business priorities and user expectations, because it is always a challenge to understand what the correct target is. Then there is predicting potential violations of service level objectives and recommending proactive measures to prevent them, identifying the preventive measures. LLMs can not only do a better job here, they can tap into vast data to provide better options, recommendations and fix details. Then there are use cases like analyzing user satisfaction metrics to determine the impact of SLA violations on customer experience: whenever SLA violations happen, we can get LLMs to do the business impact assessment, which is very important, because with service level objectives we want to correlate with customer experience, and we now have a better way of assessing that experience as well. We can also automate the tracking and visualization of error budget burn-down rates, recommend adjustments to error budgets based on usage patterns and system performance, and generate insights on the relationship between SLIs, SLOs and business KPIs. This is again very important: we always want our SLOs to reflect true customer experience, so if specific KPIs are being missed we want to see that correlation. Then there is identifying and prioritizing critical service level indicators based on their impact on user experience and business objectives. Those are the typical use cases we can leverage that will have an amplifying effect when you are implementing this pillar. Here I have picked one important and interesting use case: recommend optimal error budget allocations based on business priorities and user expectations.
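Before looking at how GenAI helps with the allocation itself, it is worth grounding what an error budget looks like numerically. A minimal sketch, assuming a 99.9% availability SLO over a 30-day window; all the figures are illustrative, not from the talk.

```python
# Minimal error budget arithmetic for the SLO/error budget pillar discussed above.
# The 99.9% target, 30-day window and downtime figures are illustrative assumptions.

slo_target = 0.999                # 99.9% availability objective
window_minutes = 30 * 24 * 60     # 30-day rolling window

error_budget_minutes = (1 - slo_target) * window_minutes   # ~43.2 minutes
downtime_so_far = 12.0            # minutes of SLO-violating time consumed this window
elapsed_fraction = 10 / 30        # we are 10 days into the window

budget_consumed = downtime_so_far / error_budget_minutes   # ~27.8%
burn_rate = budget_consumed / elapsed_fraction              # ~0.83x
# A burn rate above 1 means the budget will be exhausted before the window ends;
# that is the kind of signal an LLM can turn into a plain-language recommendation.

print(f"Error budget: {error_budget_minutes:.1f} min, "
      f"consumed: {budget_consumed:.1%}, burn rate: {burn_rate:.2f}x")
```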
It is always very challenging to come up with your error budgets and your SLO targets. What we can do is input our error budget definitions, business priorities and user expectations, and let LLMs come up with a smart recommendation that is relevant to everyone. Again, we can use a feedback loop, and we can bring in stakeholder feedback here, which is very important, and iterate to improve the final output. We can let the models adapt to changing business priorities and objectives whenever it is time to revisit the error budgets and other targets. It is iterative, and continuous refinement will help you achieve your desired objective. Moving on, the third pillar of SRE is system architecture and recovery objectives. Here again generative AI gives us a lot of use cases that will really let you hit this out of the park and get the benefit of GenAI. Some of the use cases are: predict the impact of different failure scenarios on system availability and performance, so that you can set better RPO and RTO and build resilience into your system architecture; recommend proactive measures to enhance system reliability and minimize downtime; and predict potential recovery times for different types of incidents based on historical data, which will really help you determine your RPO and RTO and build resilience and fault tolerance into the architecture. A few other use cases are: recommend resilience improvements to the system architecture based on failure modes, automate the creation of disaster recovery plans, and analyze the effectiveness of recovery strategies and recommend optimizations based on historical data. These are great use cases that are highly feasible and will provide business value as well. Then there are some less feasible but still high-business-value use cases: generate recovery objectives based on business requirements and SLA commitments, analyze historical data to identify patterns and trends in system failures, and generate personalized recovery playbooks for common incident scenarios. These are very powerful use cases, and without GenAI we may not be able to fulfill them at all. The use case I have picked here is: predict the impact of different failure scenarios on system availability and performance. The inputs we have to provide are the historical failure data, system architecture data and performance telemetry. This allows the LLM to predict the impact that various failure scenarios in our system can have on availability and performance. The ways we can improve the output are, obviously, incorporating a feedback loop, enhancing the model with additional influencing factors, meaning the other factors that can impact your availability, reliability and performance, and of course continuous refinement of the approach.
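One practical detail in these failure and recovery use cases is that historical incident data is usually too raw to hand to a model directly. Here is a minimal sketch of the kind of per-failure-type summary you might compute first and pass in as prompt context; the incident records and field names are made up for illustration.

```python
# Sketch of summarizing historical recovery times per failure type before handing them
# to an LLM as context for RTO recommendations. Incident records are made-up examples.
from statistics import median

incidents = [
    {"type": "db-failover",   "recovery_minutes": 18},
    {"type": "db-failover",   "recovery_minutes": 25},
    {"type": "cache-outage",  "recovery_minutes": 6},
    {"type": "cache-outage",  "recovery_minutes": 9},
    {"type": "region-outage", "recovery_minutes": 95},
]

by_type: dict[str, list[int]] = {}
for inc in incidents:
    by_type.setdefault(inc["type"], []).append(inc["recovery_minutes"])

summary_lines = []
for failure_type, times in sorted(by_type.items()):
    summary_lines.append(
        f"{failure_type}: {len(times)} incidents, "
        f"median recovery {median(times)} min, worst {max(times)} min"
    )

# This compact summary (rather than raw tickets) becomes part of the prompt asking the
# model to propose RTO targets and flag scenarios that need architectural work.
prompt_context = "\n".join(summary_lines)
print(prompt_context)
```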
Moving on, the next pillar of site reliability engineering is release and incident engineering. This is a very important aspect, and here again generative AI has lots of use cases that can have a lasting impact. Looking at the high-feasibility, high-business-impact use cases, some of them are: automate the creation of incident response runbooks and playbooks for efficient resolution, where you can automate the entire workflow using GenAI; predict potential incident severity based on incoming alerts and historical data, where you can do the priority classification and understand the impact, so the capabilities are very high; and provide real-time incident response recommendations based on the current situation and historical data. Some of the other use cases are: predict potential release risk based on historical release data and code quality metrics, which is very important because you can come up with a risk factor not in the manual way but by tapping into generative AI, and then work backwards to reduce those risks; analyze the impact of releases on user experience and satisfaction metrics, from design to development of the code to testing, and measure that impact on the release; and recommend optimized release cycles and promotion strategies based on system performance. Other use cases include: analyze past incident reports and postmortem analyses to identify common failure patterns, recommend preventive measures to mitigate the risk of incidents during releases, and generate insights on root causes of incidents and recommend long-term solutions to prevent recurrence. Those cover how you can improve your release engineering workflows, how you can predict the touch points that can impact your customer experience, and how you can take measures to ensure your workflows are safe and you are delivering quality outputs to customers. These are some great use cases generative AI is opening up, and the one I have picked is: provide real-time incident response recommendations based on the current situation and historical data. That is very important in incident workflows when you are trying to automate your incident management. The inputs we can give are the current incident data, historical data, system status and various telemetry data, and we can expect the LLM to come up with real-time recommendations for incident response and actions, along with user impact and business impact statements and even recurrence prevention. I have even tested that you can get LLMs to do the five whys; you can do a blameless postmortem with an LLM by giving it all the data. Again, real-time feedback loops, tuning the models and their configurations, providing better context for the problem statement, and continuous refinement are a few ways to improve the final output. Automation, as we discussed earlier, plays a key role in the site reliability engineering journey. It is about automating the toil away: not only basic automation, but identifying toil. At a high level, automation is about writing scripts, automating some of the workarounds, identifying the tactical or strategic manual work you are doing, and embedding that work into the systems themselves. The standard automation is writing your shell scripts and Python scripts; it involves a lot of scripting.
The effort you put in to develop those scripts you can now hand over to GenAI, so that GenAI does a faster, more productive job, and you can simply take the output, make your changes, and get it up and running very quickly. The content creation aspect of GenAI can really be used powerfully here. Apart from that, some of the use cases I have come up with are: analyze historical automation data to identify opportunities for process optimization, which is a high-feasibility, high-value use case; automate the deployment and management of infrastructure resources, where you can obviously use GenAI to generate your infrastructure as code and accelerate other CI/CD pipeline development work; and analyze the effectiveness of automation workflows and recommend improvements based on performance metrics. A few others are: recommend automated testing frameworks and tools to improve release code quality, predict the impact of automation on operational efficiency and resource utilization, and generate personalized automation playbooks for different operational scenarios. Some are more challenging to implement, less feasible but still high business value: recommend new automation opportunities based on manual workflows and repetitive tasks, predict potential bottlenecks in manual processes and suggest automation solutions, and automate the identification and prioritization of repetitive tasks, the toil. These are some of the great use cases generative AI can facilitate that can really accelerate us towards our automation aspirations. The use case I have picked here is: analyze the effectiveness of automation workflows and recommend improvements based on performance metrics. We are very good at identifying automation use cases and then doing the automation, but after that we sometimes lack the ability to understand the effectiveness, how to measure it, and how to come up with further improvements on top of it to really drive it home. Here we can input the automation workflow data, performance metrics and other historical data, and get LLMs to provide an analysis of automation workflow effectiveness and recommendations for improvements based on the vast knowledge they have. Obviously we have the option of improving this output as well: we can provide more context-driven feedback, and LLM fine-tuning and continuous learning will help us improve the output. Moving on, resilience engineering is one of the important aspects. We want to understand our failure scenarios and then do failure testing, or chaos testing, so we can understand and improve our system's resilience and fault tolerance. There is a great way of implementing generative AI in a chaos engineering workflow. I am not going to discuss it in too much detail, but if you look at my previous presentation, part of Conf42 Chaos Engineering 2024, I presented an autonomous chaos engineering workflow where I discussed how we can leverage GenAI from start to end: how we can use it to do the system discovery, understand your system dependencies, then establish the system steady state, and then come up with the hypotheses, the failure scenarios and the test cases.
You can even get the generative AI to automate your test creation, and then do the test execution, so that you can create an autonomous workflow and integrate it with your CI/CD pipeline. This is again a very powerful area where you can leverage generative AI and potentially automate the entire chaos engineering workflow. Here again, let me pick up one of the use cases: automate the execution of chaos experiments based on the identified risk factors and failure scenarios. Once you have your test cases identified, you can share your test cases, identified risk factors, failure scenarios, system architecture and other data, and the LLMs can automate the execution: they can come up with the infrastructure as code, or the automation scripts or solution you can promptly use to run the experiments, so that you cut down the time needed to develop the execution scripts. This is a great approach, and in future it will help us realize autonomous chaos engineering workflows. Moving on, the final aspect is how you can use GenAI to improve culture and awareness when it comes to site reliability engineering. GenAI can definitely add a lot of bandwidth here. You can have chatbots that improve knowledge and awareness, and it can get embedded into different aspects of our processes so that it is able to provide better feedback. One aspect is that GenAI has the capability to look at our incident data and come up with blameless postmortems. In one example we were able to automate the entire blameless postmortem, up to a certain level fully automated with GenAI. How we did it: you have a major incident, and all the data about what happened is captured as part of your ticket update data. If it is ServiceNow, the ServiceNow ticket will have all the data. What you can do is feed this data, with other information, to your large language model, and with that it can do a proper incident and recurrence-prevention analysis: it can understand the business impact, understand the workflow, what happened, the tasks, how you fixed the issue; it can go through the five whys to understand the root cause; and it can come up with the preventive measures and suggest the short-term and long-term fixes you can drive. These are some of the great use cases you can pursue, and you can see some further ones here as well. The use case I have picked is: analyze historical postmortem data to identify recurring patterns and trends in incidents. This is one of the more challenging aspects: we are very good at doing postmortems, but how do you compare and analyze a large set of postmortems to understand the recurring patterns? Here you can provide the LLM with historical postmortem data, incident reports, root cause analyses and other material, and it can identify recurring patterns and trends, incident recurrence and root causes. Here again we can include context-driven observability data and other system context so the LLMs can come up with better outputs, and we can provide a lot of feedback and build a feedback loop to get better output as well.
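A rough sketch of that last use case: assembling past postmortem summaries into a single analysis prompt. The postmortem entries are invented, and call_llm() is the same kind of placeholder as in the earlier RAG sketch, standing in for whatever LLM endpoint you use.

```python
# Sketch of the postmortem-analysis use case: feed summaries of past postmortems to an
# LLM and ask for recurring patterns. The entries and call_llm() are placeholders.

postmortems = [
    {"id": "PM-101", "summary": "Checkout outage after config change; rollback took 40 min because the runbook was outdated."},
    {"id": "PM-114", "summary": "Payment latency spike; connection pool exhausted after a deploy without load testing."},
    {"id": "PM-127", "summary": "Search degradation; config change bypassed the canary stage and was detected by customers first."},
]

def call_llm(prompt: str) -> str:
    """Placeholder for your LLM endpoint; see the earlier RAG sketch."""
    return "[recurring patterns identified by the model]"

bullet_list = "\n".join(f"- {p['id']}: {p['summary']}" for p in postmortems)
prompt = (
    "You are reviewing blameless postmortems for an SRE team.\n"
    "Identify recurring failure patterns, common contributing factors, and\n"
    "preventive actions that would address more than one incident.\n\n"
    f"Postmortems:\n{bullet_list}"
)
print(call_llm(prompt))
```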
With this we have closed out our seven pillars: we have looked at the use cases, and for each pillar we have looked at a particular high-value use case and how to implement it. So finally, what are the benefits? All of this has to translate into tangible benefits, and it will amplify your reliability targets. That is what I firmly believe: once you embrace GenAI to amplify your SRE implementation, you should be able to improve your service level objectives aligned to error budgets, increase your change frequency, reduce change failure rates, and reduce the lead time for changes, because when your developers can do faster development using GenAI, you can also manage the risk of each change, track it, de-risk it and deploy it into production. And obviously mean time to detect, mean time to repair and mean time between failures can be positively impacted by leveraging GenAI. What are the best practices? While generative AI brings a lot of opportunities to amplify your site reliability engineering, you have to align with some best practices: have clear objectives; remember that LLMs are only as good as their training data set; leverage the techniques I described earlier, like RAG or a knowledge base, LLM agents and proper prompt configuration, so that you provide better context and get better outputs; and use feedback loops, because continuous evaluation is a must. As you have seen, in almost all the use cases I have flagged the feedback loops and the need to provide more context and refinement, which helps us get better output. And you have to be very careful about the ethical considerations and ensure you understand ethical generative AI implementation as well. These are the best practices that will help you in the long run. And finally, what are the pitfalls you want to avoid? First, attend to the ethical considerations, because that is a big part. You may have seen the meme going around: don't ask a lady her age, a gentleman his salary, or an LLM about its training data set. How we have trained our LLMs is one of the challenges, so there has to be a lot of consideration of the ethical aspects so that you do not fall apart. This is something you have to be mindful of from day one when you come up with your solution and design, and you have to have a proper validation plan: whatever output you get from generative AI, validate it and build feedback loops around it. That is very important. Then, avoid treating generative AI models as static solutions. Instead, regularly update them, refine them, and adapt them to evolving requirements and environments. You have to understand that this is an evolving thing: you have to provide a lot of context and go through these feedback loops and continuous improvements, and that has to be built in. It is not enough to understand and accept that; it has to be built into your solution workflows so that you can leverage it in future.
Definitely, generative AI is already having a full-fledged impact on the way we work. I firmly believe that by leveraging generative AI smartly you can implement SRE 2.0, which will amplify your reliability targets. With that, thank you very much for listening in. If you have any feedback you can comment on this video, or you can search for Indika Wimalasuriya on LinkedIn and get in touch with me. I am happy to hear your feedback and your thoughts, and collectively let's amplify our SRE journey.
...

Indika Wimalasuriya

Associate Director, Senior Systems Engineering Manager @ Virtusa



