Conf42 DevOps 2025 - Online

- premiere 5PM GMT

DevSecOps at Scale: Multi-Agent System for Automating Build Failures with LLMs in Large Enterprise IT Ecosystems


Abstract

Build failures in large-scale DevSecOps environments cause costly delays. We developed a multi-agent system using LLMs to automate build failure resolution within Infosys, supporting 100+ apps. This led to a 45% reduction in MTTR, 50% less troubleshooting time, and a 40% productivity boost.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, my presentation is on DevSecOps at scale: we used a multi-agent system to automate build failure resolution in a very large enterprise setup. This enterprise has about 315,000 employees, 100+ applications, and a wide variety of stacks, as you see in the table above. We have different deployment models: some on-premises, some cloud native, some hybrid. There is a variety of programming languages, frameworks, and databases, some COTS products, and different kinds of containers. So you see there is quite a bit of heterogeneity in the landscape.
When we studied the distribution of what caused the build failures, using five years of logs, which was a goldmine, we realized a few things. One was about commit size: above a certain number of source lines of code per commit, the failure rate increased. Other patterns were related to when the failures happened: weekends had more issues, and so did the end of the day. Developer experience, CI/CD pipeline maturity, and project maturity level also differed, because some of these were old legacy systems and some were relatively new, which is why you see this variation.
We were also curious how these build failures were distributed across our internal platforms, for example finance, healthcare, employee engagement, and developer productivity. Finance applications had very high build failure rates, followed by those related to developer productivity, employee engagement, and human resources. That is the distribution from a domain perspective. We also looked at it from a technology perspective: Java, for instance, has the highest build failure rate, then JavaScript, and so on.
Before getting into the next level of detail, I want to step back and talk about our DevOps journey. As I said, we were very fortunate: not everybody has this, but we had five years' worth of logs, about 500,000 build logs, from a hundred enterprise applications running across 56 countries with about 300,000 users. The first step was to process these logs. All of these were build logs, and we wanted to isolate the ones related to failures; we got approximately 10,000 failure logs. Once we had those, we looked at the distribution of the error patterns: what the errors are, where they occur, and related metrics such as MTTR and cost. We also cleaned up this data, removed all PII, and used it to create an eval dataset, which formed the basis for our evaluation chassis. So these are the things we did with our logs.
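To make this pre-processing step concrete, here is a minimal sketch of filtering build logs down to failures and scrubbing PII before they become an eval dataset. The markers, regexes, and function names are illustrative assumptions, not the production pipeline described in the talk.

```python
import re

# Hypothetical failure markers; the real pipeline used patterns mined from five years of logs.
FAILURE_MARKERS = ("BUILD FAILED", "ERROR", "FATAL")

# Very rough PII patterns (emails, a made-up employee ID format) for illustration only.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\bEMP\d{6}\b"), "<EMPLOYEE_ID>"),
]

def is_failure(log_text: str) -> bool:
    """Keep only logs that contain a failure marker."""
    return any(marker in log_text for marker in FAILURE_MARKERS)

def mask_pii(log_text: str) -> str:
    """Replace anything that looks like PII with a placeholder token."""
    for pattern, placeholder in PII_PATTERNS:
        log_text = pattern.sub(placeholder, log_text)
    return log_text

def build_eval_dataset(raw_logs):
    """From the raw build logs, keep failures only and scrub PII."""
    return [mask_pii(log) for log in raw_logs if is_failure(log)]
```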
The next step was to build the agentic system, but the agentic system did not happen on day one. First, we thought: why not use the frontier LLMs for root cause analysis? So we began with Azure OpenAI GPT-3.5 as the first model, building prompts from these logs. When we got the 10k failure logs, we did a lot of post-processing to understand the domain specifics, and we used that in our prompts and prompt engineering. This was the first quick method we employed. Then we used RAG to capture the domain-specific knowledge, which improved things a little more.
The third thing we realized soon after is that these models were changing continuously. We had GPT-4, and we were also trying Claude from Anthropic. So we wanted, first of all, to support model swappability. Models are perishable; every day there is a better model, so we wanted to build the infrastructure that would allow us to swap them. The other consequence is that every time you change a model, even within the same family, the prompts may no longer work, so we also needed a way to keep the prompts working. Those are the things we started doing.
As the next step, we realized that DevOps genuinely needs a lot of reasoning ability. I will talk about how, in a DevOps situation, there is a lot of uncertainty, and you do need a reasoning system to handle it effectively. So we built a multi-agent system, both for RCA and for issue resolution.
Let me take a step back and ask the first question: when do you actually need agentic AI? You do not jump straight to agentic AI, because it is going to cost more: instead of one LLM, you have a set of LLMs doing many different things. There are two things to be aware of. One, the cost of these models is going down year on year. Two, the context length of these models is increasing. So today enterprises can use frontier LLMs much more effectively. That is the viability side. The second aspect is feasibility. When you look at the maturity graph, one axis is generality and the other is performance, and today we are operating in the phase where we can build an agentic LLM system with real reasoning abilities. And DevOps, as I said, is a partially observable domain, which means there is always uncertainty and you may have to take decisions under that uncertainty. When you have systems where the data is incomplete and there are evolving changes, whether threats or unexpected performance issues that keep coming, you do need advanced reasoning capabilities to manage these belief networks. That is also one of the reasons that motivated us to move to agentic AI.
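Before going into the agents, here is a minimal sketch of the model-swappability idea: a thin client abstraction so the root cause analysis prompt does not hard-code a single provider. The class, prompt text, and the commented-out provider name are assumptions for illustration, not the actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

# A provider is just a callable: prompt in, text out.
# In practice it would wrap Azure OpenAI, Anthropic, or an in-house SLM.
Provider = Callable[[str], str]

RCA_PROMPT = """You are a DevSecOps assistant.
Build log (PII already masked):
{log}

Relevant context retrieved from past incidents:
{context}

Explain the most likely root cause and cite the log lines that support it."""

@dataclass
class SwappableLLM:
    provider: Provider                 # swap this when a better or cheaper model ships
    prompt_template: str = RCA_PROMPT  # prompts still need re-evaluation after a swap

    def root_cause(self, log: str, context: str = "") -> str:
        prompt = self.prompt_template.format(log=log, context=context)
        return self.provider(prompt)

# Usage sketch (azure_gpt4o_call is a hypothetical wrapper function):
# rca = SwappableLLM(provider=azure_gpt4o_call)
# print(rca.root_cause(masked_log, context=retrieved_incidents))
```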
The third part: what did we end up building? When you talk about an agentic system, at the top you have the agents, which take the build failure and do a set of things; we will double-click on the various agents we had. But these agents must also work with traditional tools. For example, for static code analysis and code smells, you could use SonarQube. You could use a tool that provides insights from the Git commit changes. You could search Jira tickets and Confluence to find historic solutions that people have used. You could use tools for privacy protection, so that anything in these logs containing PII is masked. So you may have to use a variety of tools, which is what we did.
At the top, in this case, we have four different agents. The orchestrator is the master agent, which orchestrates the other agents. If you double-click on the build agent: what does it do? The build agent has many sub-agents. In this case it has a log analysis agent, which continuously monitors the logs to extract any failures. Then there is a diagnosis agent, which analyzes the extracted information and classifies or categorizes the type of problem: is it a performance issue, a security issue, why is it happening, and so on. The third is a resolution agent, which uses the log and a variety of artifacts to say: this is the kind of resolution that needs to be taken. And the fourth is a notification agent. So we built this complete multi-agent system to address the problem.
Now, what were the challenges we faced while building the system over this period? First, the choice of LLM really impacts the accuracy, so it is very important. Second, every time you change the LLM, the prompts are no longer effective. When we used GPT-3.5, our accuracy was 82% (of course, we measured other metrics too). When we moved to GPT-4, the accuracy improved, but when we moved to GPT-4o mini, which costs much less, our accuracy dropped because the prompts no longer worked. So it is very important to make sure the prompts keep working. Third, human evaluation is not scalable; you need an automated way to evaluate, and that is very important. And for all of this, reasoning is key: you cannot have a system that does not support reasoning, because you are dealing with a DevOps problem. Also remember that frontier models cannot be fine-tuned, so if you really need to fine-tune, you need to create an SLM, a small language model, that you can train with your custom data. Keep that in mind.
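Before moving on to the supporting components, here is a deliberately simplified sketch of the orchestrator chaining the build agent's sub-agents (log analysis, diagnosis, resolution, notification) described above. Class names, signatures, and the placeholder logic are assumptions for illustration, not the production framework.

```python
from dataclasses import dataclass

@dataclass
class Diagnosis:
    category: str   # e.g. "dependency", "test", "security", "performance"
    summary: str

class LogAnalysisAgent:
    def extract_failure(self, build_log: str) -> str:
        # In practice: continuously monitors logs and pulls out the failing section.
        return "\n".join(line for line in build_log.splitlines() if "ERROR" in line)

class DiagnosisAgent:
    def classify(self, failure_snippet: str) -> Diagnosis:
        # In practice: an LLM call that categorizes the failure and explains why.
        return Diagnosis(category="dependency", summary=failure_snippet[:200])

class ResolutionAgent:
    def propose_fix(self, diagnosis: Diagnosis) -> str:
        # In practice: combines the log with Git history, SonarQube findings,
        # and past Jira/Confluence resolutions to suggest a fix.
        return f"Suggested fix for {diagnosis.category} issue: {diagnosis.summary}"

class NotificationAgent:
    def notify(self, message: str) -> None:
        print(message)   # in practice: a chat/Jira update

class Orchestrator:
    """Master agent that sequences the sub-agents for one failed build."""
    def __init__(self) -> None:
        self.logs = LogAnalysisAgent()
        self.diagnose = DiagnosisAgent()
        self.resolve = ResolutionAgent()
        self.alert = NotificationAgent()

    def handle_build_failure(self, build_log: str) -> None:
        failure = self.logs.extract_failure(build_log)
        diagnosis = self.diagnose.classify(failure)
        fix = self.resolve.propose_fix(diagnosis)
        self.alert.notify(fix)
```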
So while we had the agent architecture, we also built certain components around it. The first of those is an LLM hub. What does an LLM hub do? It helps you select and swap models: you have a model today, you change it tomorrow, and the hub lets you do that. Besides this, it also lets you fine-tune your own model. That is its first purpose. The second component is prompt management, or prompt operations. Prompt evaluation is very important: you need to make sure the prompts do not fail with every change of the underlying LLM. And finally, the most important thing, which I think all of you must have to build trustworthy AI systems, is an evaluation chassis. The evaluation chassis does the model evaluation automatically, and here we introduced a concept called LLM as an evaluator, or LLM as a judge, which I will talk about soon. This evaluation chassis uses an eval benchmark which we created from the historical data after pre-processing it. These are some of the things we will talk about very soon, and I have left you some links you can always refer to.
Having seen this, let's come to LLM as an evaluator. On the left-hand side, suppose you have an input prompt, and using the first LLM as a candidate you get some output. As I said earlier, human evaluation is not a very scalable approach, so we are looking at alternative ways in which an LLM itself can be the judge. Here the second LLM comes into play and produces an evaluation. This evaluation has two parts. The first is a score: what is the probability that the output produced by the first LLM is right or wrong? The second is the supporting evidence: you don't just give a score, you always give supporting evidence.
The paper I am citing here has a benchmark, and it makes for very interesting reading. Suppose the first LLM was used to create code, and suppose you used Claude Sonnet 3.5 as the solver, since it is a coding task. Say its accuracy is 88.1. Now, if you had another Claude Sonnet as a judge to critique the way the first Claude model solved it (the critique agent is also a judge, an evaluator), you get a different score, 64.29. So the same LLM used as a solver versus as a judge can have different performance. And this performance also varies across different kinds of tasks: knowledge, reasoning, math, and coding are all very different. Claude is best for coding, and the newer GPT versions (4o, o1) may actually have much better scores. One of the key observations is that a model's ability to judge, that is, to verify problem-solution pairs, is highly correlated with its ability to solve the problem. So if something is good at coding, it will also be good at judging code. That is the key takeaway.
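As a rough sketch of the LLM-as-a-judge idea just described: a second model scores the first model's answer and must return supporting evidence alongside the score. The JSON schema and prompt wording here are assumptions for illustration, not the benchmark's protocol.

```python
import json
from typing import Callable

Provider = Callable[[str], str]   # any LLM call: prompt in, text out

JUDGE_PROMPT = """You are an evaluator.
Task given to the candidate model:
{task}

Candidate model's answer:
{answer}

Return JSON with two fields:
  "score": a number between 0 and 1 for how likely the answer is correct,
  "evidence": the specific reasons or quotes that justify the score."""

def judge(task: str, answer: str, judge_llm: Provider) -> dict:
    """Use a second LLM to evaluate the first LLM's output (score plus evidence)."""
    raw = judge_llm(JUDGE_PROMPT.format(task=task, answer=answer))
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        # Keep the raw critique if the judge did not return valid JSON.
        verdict = {"score": None, "evidence": raw}
    return verdict
```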
Let me go to the last slide for the key conclusions and results. Between January 2024 and September 2024 we measured the impact of the new DevOps with agentic AI across the hundred applications. The first metric that came down is the mean time to resolve, which had averaged four hours. It came down to 2.2 hours, which is a big deal.
The second thing is that these problems are never purely technical; they are about humans. Imagine the DevOps team: they do a lot of repetitive work, and escalations come in during Christmas and Thanksgiving, especially in retail scenarios where performance issues plague the systems. The team is always very stressed; they never know when a problem will happen. With the new system, team satisfaction went up dramatically, and that is a very important indicator for us.
The third is the quality of resolution: what is the chance that the resolution you gave actually worked? Resolution quality went up substantially, by about 30%.
And the last is cost. Without agentic AI you have a certain cost, mostly in terms of people; with agentic AI you change the cost structure and it becomes an operating expense, and that came down by 25%. This is also because, as I walked you through in the earlier slides, the cost of these models is dropping dramatically year on year. With this, I conclude. Thank you so much for this opportunity to present.
...

Rajeshwari Ganesan

Distinguished Technologist ML/AI @ Infosys

Rajeshwari Ganesan's LinkedIn account

Chidambaram Ganapathy Sundaram

Architect @ Infosys

Chidambaram Ganapathy Sundaram's LinkedIn account


