Conf42 Site Reliability Engineering 2022 - Online

Transformation & Cultural Shift using Site Reliability Engineering (SRE) & Data Science

Abstract

Transforming an elephant-style monolithic organization is a difficult task. Leading the shift to an SRE culture in my current role has been a challenge; it is a journey, and as an organization we continue to evolve. The SRE implementation was not about creating a single SRE team for the organization. Instead, a bootstrap concept was introduced: a central bootstrap group guides the SREs who sit in each and every unit (every team). This has been effective in penetrating the day-to-day working of the organization, and that is when the wheel starts turning. The outcome is that every tribe now has a designated SRE set-up. AI/ML is applied to ITSM data as well as infrastructure log data, so that SREs are armed with the best predictions for the infrastructure and its components and can take the best decisions. This has not only reduced cost but also helped in slowly phasing out legacy tooling. It has been a data-driven, systematic roll-out of the SRE set-up. An in-house tooling culture was made part of the release train to keep the demand and interest from the operations teams intact. Tooling cannot be changed overnight; it has to accommodate the old and the new to avoid service disruption. This was achieved with an architecture that combines both, ensuring the old automation keeps working while new DevOps-based tooling and API-integrated solutions are brought in. The prediction-based set-up will definitely change how ITIL looks at operations forever: there will be no static thresholds in future, and the incident-based set-up will be gone (more on that later).

Problems are plentiful, especially for an organization that provides services: revenue is defined by each big customer, and the units that cater to a particular customer sometimes start acting like individual companies, defining their own standards and ways of working.

While an SRE rollout is not unique in the industry, this set-up surely is. The uniqueness lies in the use of AI/ML on ITSM and infrastructure data: predictions built on logs of different system parameters, combined with a correlation matrix of close to 30 parameters per machine. This in-house set-up, combined with other parameters, gives SREs an edge to do what they are good at; the analysis is there, it is about applying the principles. It was also unique because monolithic organizations push for immediate benefit, and we were able to provide it through a combination of restructuring the internal services and ensuring SREs were not just introduced as a top-down concept but designated at every tribe level. This was orchestrated by a group of the best SREs (the SRE bootstrap) who provided constant guidance to the tribe-designated SREs and also led the larger tooling and AI-based initiatives.

This set-up works well because, culturally, we are changing and penetrating at every team level. It came with its own set of challenges and push-backs, which is where the learning comes from.

Modernization using SRE - The detailed set-up is an amalgamation of SRE and an AIOps-based structure. While the SRE part is about following the culture and principles, the analytics-based tech set-up is about arming SREs with the ammunition to be proactive and predictive in problem solving. This thought process covers two of the most important principles: monitoring (advanced) and simplicity (an underlying horizontal). The emphasis is on visibility engineering, ensuring that everything can be seen, tracked, predicted and controlled. This is not limited to the landscapes and the ecosystem; it is also used for ITSM components. Process simplification has been kick-started based on data analysis of why and how time can be reduced, which ties back to another principle: elimination of toil.

Toolchain Architecture: the toolchain is a unique amalgamation of the following areas

  1. ITSM automation → analytics on ITSM/ITIL-based processes
  2. Monitoring (analysis of data) → visibility engineering
  3. Advanced monitoring (AIOps)
  4. Automation modernization

Analytics is a great weapon for an ITIL-based set-up: for example, analysing where a service spends most of its time, or why a particular type of change gets stuck at a particular stage. A machine-learning set-up on incident data can provide trend patterns, correlations and automatic error categorization based on historic data; it speeds up resolution and shows what is coming and what pattern is being followed. For us this was stage one, as it gives instant benefits through the correlation and the detail. An element of visibility engineering was added so that SREs benefit from it rather than drowning in spreadsheets and reports. It also helped them in defining error budgets and embracing risk, because they have first-hand, clear-cut data analysis available. A minimal sketch of this kind of automatic categorization is shown below.
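As a rough illustration of the automatic error categorization described above, here is a minimal sketch that trains a text classifier on historic incident tickets. The file name, the `short_description` and `category` columns, and the model choice are illustrative assumptions, not the actual in-house pipeline.

```python
# Hypothetical sketch: automatic error categorization of ITSM incidents
# from historic data, using TF-IDF text features and logistic regression.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

tickets = pd.read_csv("itsm_incidents.csv")  # assumed historic incident export
X_train, X_test, y_train, y_test = train_test_split(
    tickets["short_description"], tickets["category"],
    test_size=0.2, random_state=42)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=3),  # text -> sparse features
    LogisticRegression(max_iter=1000))              # multi-class classifier
model.fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# New incidents can then be pre-categorized before a human ever reads them.
print(model.predict(["Job ABC failed with out-of-memory error on node 12"]))
```

A set-up like this is typically run as a separate pipeline per customer, since each customer's incident wording and categories differ.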

Visibility engineering takes a different turn when we move into log analytics: the power to use ARIMA (just as an example) for prediction on infrastructure data. For instance, given the memory-utilization logs of a server, captured at five-minute intervals, what will happen in the next three to five days? Around 20 parameters are then correlated into a matrix to predict failure. A sketch of such a forecast is shown below.
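To illustrate the ARIMA idea, the sketch below fits a model to five-minute memory-utilization samples and forecasts three days ahead. The file name, column name, model order and 90% saturation level are assumptions for the example, not the production configuration.

```python
# Hypothetical sketch: forecast server memory utilization a few days ahead
# from 5-minute samples using ARIMA (statsmodels). The order (2, 1, 2) is an
# arbitrary illustrative choice; in practice it would be tuned per series.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Assumed CSV layout: timestamp, mem_used_pct (one row every 5 minutes)
logs = pd.read_csv("server42_memory.csv", parse_dates=["timestamp"],
                   index_col="timestamp")
series = logs["mem_used_pct"].asfreq("5min").interpolate()

model = ARIMA(series, order=(2, 1, 2)).fit()

# 3 days ahead at 5-minute resolution = 3 * 24 * 12 = 864 steps
forecast = model.get_forecast(steps=3 * 24 * 12)
predicted = forecast.predicted_mean

# Flag the first point where the forecast crosses a saturation level,
# so an SRE can act days before the machine actually runs out of memory.
at_risk = predicted[predicted > 90.0]
if not at_risk.empty:
    print("Predicted to exceed 90% memory at", at_risk.index[0])
```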

A key element is automation, which is considered the last part of the chain, but the set-up cannot stay legacy if we want the best results. It has to be a combination: the entire old house cannot be torn down to build a new one, that would be too much disruption, so a design has to be put in place to accommodate old and new. This can be seen in our use of modern open-source tools while the set-up still supports the old tooling, which makes it highly integratable.

Data Science - AI/ML/IoT: the SRE principles revolve around monitoring, automation, toil reduction and so on, and one of the most potent weapons there is advanced monitoring using an analytics-based set-up and automated decision making (self-healing scenarios), all of which reduce toil and further streamline the service. A lot of data is generated by configuration items (CIs) in the IT ecosystem; some of it is used for finding root causes, but much of it can be used for prediction and for finding areas of improvement. The last thing we want SREs to do is aim in the dark, wading through a plethora of data and scratching through countless spreadsheets. The data they analyse can be simplified further by AI/ML and data-science-based solutions, which not only save SREs time but ensure they focus on real problem solving rather than searching for what the actual problem is.

An AIOps-based set-up gives SREs a lot of information beforehand so they can make quick decisions. While trends and patterns can certainly be derived from historic data, the key aspect is predictability. Proactiveness comes from giving SREs out-of-the-box solutions that help them predict problems and show them the trends and patterns they are looking for.

A basic example is infrastructure failure prediction, based on log-based prediction analysis combined with a correlation matrix of every parameter coming out of the data. This way, with a certain accuracy, the prediction mechanism can show whether the infrastructure will fail in the near future. This cannot be achieved by incident-based and alert-based analysis alone, because those rely on static thresholds; instead, a more dynamic, mathematical, moving-average-based threshold needs to be defined, which can predict the breakdown with greater accuracy. A sketch of such a dynamic threshold follows.
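A minimal sketch of the dynamic-threshold idea described above: instead of a fixed alert level, the threshold follows a rolling mean plus a multiple of the rolling standard deviation, and the breach signals can then be correlated across the other metrics. The window size, the 3-sigma band and the column names are illustrative assumptions.

```python
# Hypothetical sketch: dynamic, moving-average-based thresholding instead of
# a static alert level. The window (6 hours of 5-minute samples) and the
# 3-sigma band are arbitrary illustrative choices.
import pandas as pd

def dynamic_breaches(series: pd.Series, window: int = 72, k: float = 3.0) -> pd.Series:
    """Return the points where the metric breaks out of its rolling band."""
    rolling_mean = series.rolling(window, min_periods=window).mean()
    rolling_std = series.rolling(window, min_periods=window).std()
    upper_band = rolling_mean + k * rolling_std   # threshold moves with the data
    return series[series > upper_band]

# Assumed CSV layout: timestamp plus ~20 metric columns
# (cpu_iowait_pct, mem_used_pct, disk_busy_pct, ...)
metrics = pd.read_csv("node_metrics.csv", parse_dates=["timestamp"],
                      index_col="timestamp")
breaches = dynamic_breaches(metrics["cpu_iowait_pct"])
print(breaches.tail())  # candidate early-warning signals, not static alerts

# Correlating the metric columns (a correlation matrix of ~20 parameters)
# is what feeds the failure-prediction step described above.
print(metrics.corr().round(2))
```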

In the long run this will change the way operations is done, to the point that the entire ITIL process will require a change. The concept of a static threshold will be gone (the alert-to-incident mechanism), and there will be far fewer incidents because prediction takes care of many problems. Even more, with elastic and hyperscale infrastructure set-ups, moving averages and dynamic thresholds are the future. This changes the way operations is done today by moving away from an incident-based set-up. Adoption will take time because it requires changing the way we work in the kitchen, but some useful use cases, such as machine-learning-based triaging, are already being implemented.

A new addition is the usage and definition of the error budget, and how a proper, guideline-based error budget can help. A worked example is sketched below.
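As a rough illustration of the error-budget idea, this sketch turns an SLO into a monthly budget and tracks how much of it a given amount of downtime consumes. The 99.9% target, the observed downtime and the 75% policy trigger are made-up numbers for the example.

```python
# Hypothetical sketch: turning an SLO into an error budget and tracking burn.
# The 99.9% availability target and the downtime figure are illustrative.
MINUTES_PER_MONTH = 30 * 24 * 60           # 43,200 minutes in a 30-day month
slo_target = 0.999                         # 99.9% availability SLO

error_budget_minutes = MINUTES_PER_MONTH * (1 - slo_target)   # ~43.2 minutes

observed_downtime_minutes = 12.0           # e.g. from monitoring this month
budget_consumed = observed_downtime_minutes / error_budget_minutes

print(f"Monthly error budget : {error_budget_minutes:.1f} minutes")
print(f"Budget consumed      : {budget_consumed:.0%}")

# Policy hook (illustrative): once a squad burns most of its budget, risky
# changes go back through the full set of quality gates.
if budget_consumed > 0.75:
    print("Freeze risky releases; remaining budget is low.")
```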

Summary

  • Today we are going to talk about transformation and cultural shift using site reliability engineering and data science: how it acts as a transformation driver, how it tackles culture and technology, and, at the end, the challenges we faced based on our experience.
  • Site reliability engineering is not only about doing automation; it is also about resilient systems, system design and things of that sort. The measurement is no longer around SLAs or contracts, it is around improvement.
  • The most important driver for site reliability engineering is cultural shift. The other part is obviously technology refresh. We also talk about the concept of visibility engineering. The last part is process and automation.
  • There are aspects to keep in mind from an SRE standpoint: embracing risk, the first one, and error budgeting. Monitoring becomes what we call advanced monitoring, plus the concept of visibility engineering. For automation, what to automate is very important, as is how the use cases are nurtured.
  • SRE penetration is very important in the organization; we wanted to ensure there are SREs identified in each and every squad. This was the visibility engineering architecture at a very quick level. We are still implementing some of the log analysis.
  • You have to inculcate a culture of failure and take the error budget concept seriously; it is one of the key ingredients of site reliability engineering. The industry is changing and it is time to take some bold decisions.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, my name is Rohit and today we are going to talk about transformation and cultural shift using site reliability engineering and data science. We'll cover aspects mostly around site reliability engineering: how it is a transformation driver, how it tackles culture and technology, and how we can use data science and leverage a lot of DataOps and AIOps in this entire journey of doing the transformation using site reliability engineering. We are also going to talk about defining processes, risk-taking ability, and the unique processes we tried to put in place. And at the end we are going to talk about the challenges we faced, based on our experience. To start off with, a lot of people have a question around what an SRE is. The term was coined at Google, by Ben Treynor. Site reliability engineers often have a background in software engineering and systems engineering; the idea was to put a person with a software development background into ops and see how it works out. Site reliability engineering is not only about doing automation, it is also about looking at resilient systems, system design and things of that sort. Is it the same as DevOps? Yes, more or less, but there is a major difference: site reliability engineering applies to something which is post production. Is SRE about automation? Automation is one of the key drivers, and we'll talk about it when we get to the principles, and about how it is a cultural shift more than anything else. That is what we have experienced when we tried to roll it out for a services-based organization. And how do I measure the success? The measurement is no longer around SLAs, it's no longer about contracts, it's around improvement. You measure improvement, and that is what SRE is all about, right? So if we go to the transformation drivers, as I talked about, I'm going to spend some time here on the four drivers which we perceive as the most important ones for site reliability engineering. The first and foremost, and I cannot stress it enough, is cultural shift. It is the most difficult thing to achieve when we go on this transformation drive, right? It revolves around building better teams, it revolves around throwing a lot of "not my problem" out of the window. It is about pushing people to a different level. It is about re-enhancing and reskilling them up to a point. It is about making sure you change the ways of working, right? Cultural shift has a lot to do with training, a lot to do with making sure that the normal ways of doing work are going to change, because we are going to use a lot of technology and a lot of data. And there will be examples later in the presentation where we'll talk about how we tried to challenge the current set-up of the processes which were in place, including the ITIL set-up we had. The other part is obviously technology refresh. Now, if we want site reliability engineering to be successful and we want the SREs to be really working on real areas, we need a certain degree of acceptance, a certain degree of freedom from an SRE standpoint. We will also talk about the concept of visibility engineering, which is part of technology refresh. We'll talk about SREs' focus on tooling, making in-house tools or using different tools, with a little bit of a shift towards open-source tooling as well. Then there is a lot to do with data and visibility engineering.
Now, data analytics is not new. The entire world is obsessed with it, especially in the operations world; they are obsessed with ITIL data analytics, which is important, I don't deny that. But there is so much more we can do when we do analytics on top of components, analytics on top of toil dashboards, real tribe-level ones, using machine learning and DataOps to get to things which are not visible to the naked eye. We'll talk about that when we look at doing predictions based on the components, deriving outputs from them and then acting on top of them. Then the last part is process and automation. We'll touch on that again when we go to the reliability principles. But processes have to be measured. Processes sometimes have to be changed, or at least foregone at times, to bring simplicity into the set-up. Sometimes structures become shackles when we put too many processes in place, and we'll look at how SREs can help with that. Quickly to the SRE reliability principles; I will go through each one of them very quickly. There are aspects we have to keep in mind from an SRE standpoint: embracing risk, the first one, and error budgeting. Now, embracing risk is a cultural thing. We do serve customers, we do have our contractual agreements, and we have to ensure that we stay within them. But having said that, how do you manage risk? How do you bring in a culture of risk-taking ability? You bring in risk tolerance, and tolerance levels, through SLOs. That's what we tried to achieve. It's a cultural thing: you have to trust the team to take a certain degree of risk, obviously not falling through the cracks, but within the calculated risk framework we have defined. Then you have the concept of error budgeting. Error budgeting is sometimes very much confused; people tend to confuse it with the SLAs and the KPIs. No, it's not. It's a KPI for the team themselves, a KPI to reach higher, to reach better. It's about disruption. It's about making sure that an error budget violation is treated as equivalent to what we do for an SLA violation, even though the SLOs, and the error budget, will be much, much higher than what the SLAs are. Right. Managing toil: this is an area which is primarily untapped, and that's why you need SREs to be inculcated into the teams, because toil cannot always be captured in numbers. There are so many things teams do outside what is captured and what can be analyzed. We have to look at it and revisit it at different times to see how the SREs can make sure the operations teams' workload becomes much, much smaller. Okay. SREs are not a separate team; that is also a misconception. They are part of the operations team, and they ensure that at least 50% of their time is spent on automating mundane tasks, or pushing the automation to the level where they are no longer required to sit on call and do stuff. Right. Monitoring, for me, is the base of it. Now, there are traditional monitoring tools out there, but what we tried to bring in here, and that's where the data science comes into the picture, is to align this with more predictive work. A lot of log analysis happens, because the alerting system through the monitoring set-up which we already have across our applications is sometimes very reactive.
By the time it gets to us, things have already happened, and we wanted to change that. So monitoring becomes what we call advanced monitoring and the concept of visibility engineering. Automation, well, is always an output; it is probably the most overused word in the industry. Automation comes after you know what you need to do, and then it does what it does. So while automation is an output, analysis is much more important, to understand what we need to automate. We do give weightage to automation, but what to automate is very important, and how the use cases are nurtured. Release engineering: this comes from more advanced or more modern set-ups, where we have a release train which can be run from a DevOps practice. Traditional infrastructure teams are still a little behind on that, but for container-based teams mostly, or public-cloud-based teams, the release engineering principles work pretty well. A lot of automation goes in there; a lot of prediction goes in there as well. Simplicity is the core of it. It is, again, a cultural thing. You don't want to overcomplicate things. You don't want five different process stage gates to ensure that things are all going through. You have to make simple solutions; we are not here to solve world hunger in one day. Okay, so we go step by step. Again on the same slide, these are the things we have done and how the SRE set-up was built, and this brings me to the point where I want to talk about why SRE penetration is so important in the organization. Now, in a product-based organization with a central team owning the entire architecture and the technology part of it, it's fairly straightforward to have an SRE team and to inculcate the concept in each and every team member. In the case of services-based organizations, it gets slightly tricky, because we deliver each service for different customers. Each of these customer units has its own revenue, its own P&L, and they act in a different way. The same service being delivered for two different units will differ slightly: some might be relatively new to this, some might be quite advanced when it comes to tooling or SRE adoption or any of that. To handle that, we defined a concept called the bootstrap site reliability engineer: a group of ten to twelve SREs, mostly hired from the market, who have done this in the past and who were dedicatedly assigned as ideological mentors to the entire SRE community across the organization. We call them the horizontal SREs. The vertical SREs came from asking each and every team, each and every squad for each and every customer unit we work for (and we are a $5 billion company, so there are plenty), to identify SREs. We wanted to ensure that there are SREs identified in each and every squad, people who are working as SREs. Ideally we want everybody to be an SRE, but that is more of a maturity target than actual reality, right? So what we wanted to do was keep SREs in each and every squad. Obviously we have to train them, and that was the role of the horizontal, bootstrap SREs. The bootstrap SREs also slowly introduce the new tooling culture and visibility engineering, where we use AI-driven intelligence and make sure we work on predictions.
We call it ITSM data plus component data plus system data plus error budgeting, coming back with a framework in which we define an error budget for each and every squad. And obviously the error budget cannot be defined only to hit a certain number; it was defined keeping in mind what the customer wants, what the end user wants. That's when the SREs are the most effective, right? This was the visibility engineering architecture at a very quick level, if I can get through this: it was about making sure we have a stream of data and getting it all into a DataOps set-up. Okay, this is still a work in progress; we are not 100% there. We have been able to do it for some of them; we are still implementing some of the logs part, because there are so many customers, so many toolings, so many different ways of working, GDPR laws apply, and you cannot move everything to a centralized set-up. But the point was never to centralize everything; we are not here to make a data lake. What we are trying to do is have an integration layer across, so that all the data is ingested and can be used, and then we have separate pipelines. For example, we have an ITIL-based machine-learning model which gives runtime technology error categories. For every customer it's different, because their incident thresholds are different, and the incident-handling ways of working of the teams are different. So we have separate ML pipelines which run through and churn out their dashboards. At the same time there are prediction-based set-ups as well, and I'll show a little bit of that in a subsequent slide. And then, based on that AI hub, we tried to bring in new models and utilize the data hub concept. That's basically what we have from a visibility engineering standpoint. Now, this is a very cliche slide, to be very honest, because "data is the new oil"; everybody knows that. So the first stage was obviously ITSM analysis; everybody does that. We wanted to bring in a flavour of ML to show what the problem is rather than where the problem is. In a traditional set-up we usually have a flood of incidents coming in and people doing a lot of analysis around it by handing spreadsheets to our SMEs. We wanted that to happen automatically, so we wrote a machine learning model, with different pipelines for different customers, where the problem itself is shown based on the historic record, which is excellent. So ITSM analysis was our starting point. Next we get into log analysis, where we are talking about dynamic thresholds, because static thresholds are the ones which drive alerting: you put in a configuration, the alerting happens, you get an alert or an incident or whatever. But log analysis is more like a moving-averages concept, more like a dynamic threshold, and it's based on predictions. So we are no longer waiting for something to happen and only then taking action; we are looking at predictions, and this was one of the biggest paradigm shifts we wanted to make. Again, it's a work in progress; we have still not reached there. But the point was to get to a point where we can start predicting not only failures - failures are probably the first step - but how it is going to affect the business.
So we do a lot of prediction on monitoring, and then a lot of correlation, based on dependent and independent variables. I don't want to go into the mathematical details, but more into how we use these predictions to come up with an output; that was number one. The second point, which was very important, was how it changes things. It all comes down to the data science part: we have the predictions, we have the technology we are using, and the processes have to be implemented. I'll give you two instances where the processes were implemented differently. Number one: if we have a prediction set-up and we are very accurate with the predictions, static-threshold-based alerting will no longer be applicable. The point is how that changes the ways of working of the team, which comes down to how the process changes and how the mindset and cultural shift happen. If the engineers, the operators, the operations engineers are going to look at predictions, what are they supposed to do if they see something is going to happen, say, 8 hours from now? What is the standard procedure to be followed, what is the SOP? Because right now more or less the entire ITIL is based mostly around change and incident management, which is fine, but problem management is very reactive, and we wanted to make it proactive. So if my prediction says that something is going to go wrong eight to ten hours from now, what will be the next step? And that is where the SRE kicks in. The way we in our organization try to use site reliability engineering is that when the prediction is 75% certain that something will go wrong, we run a project, led and executed by the site reliability engineer, to ensure that the trend starts going down. That can only be done when we take log-based predictions seriously and have great accuracy in them, for which the organization needs to invest in data science. The other part was about change management on the application side, where we put in a concept called the failure budget. We gave each team an individual budget based on historic data on how they had been performing with changes, and based on that we gave them some credits. Those credits are exhausted if they breach the error budgets or the SLOs we have defined; as long as they don't, they are free to pass quickly through all the checks and quality gates. The moment they breach it once or twice, they're back to the full quality gates. That's how we tried to inculcate risk-taking ability in the organization across all the teams. As I said, SRE is more of a cultural and data shift than an actual technology shift, which is a big misconception, I should say. And when you dig deep, you'll see that automation is just the final thing; it is not the driver, it is what you do at the end. The drivers are all the things we talked about. Very quickly, on to the challenges: there were plenty. Transformations are always challenging, we know that. One of the most important things is that people tend to use transformation to hit numbers. Yes, it will deliver them, but the ROI comes in the long run; we don't get instant gratification. You cannot have too much heavy-duty governance, which will become an issue. Then there is inculcating the culture.
The wheel will only turn when you start looking at a culture of failure rather than consequence management, where when something goes wrong you look for the culprit and try to impose consequences. Consequence management is not going to work; you have to inculcate a culture of failure. Then there is the non-technology-centric approach: people tend to start doing everything in Excel, and that doesn't work. And we should be able to take bold decisions. It's a changing time, the industry is changing, and at some point we have to take some bold decisions. And the final thing is that we have to take the error budget concept seriously. It's like the smartwatch you wear: you wear the smartwatch, you run every day, you look at the numbers, and why do you look at the numbers? To see how you have improved. The error budget is exactly the same: you see how you want to improve, and it has to be taken very seriously. It is one of the key ingredients of site reliability engineering. Having said that, here are my blog and my LinkedIn; feel free to connect. The blog is all about site reliability engineering; that's where I've put the framework and how we arrived at the error budget concept, which could be a separate topic. So that's all I had. Thanks a lot for your time and patience in hearing me out.
...

Rohit Sinha

Director- Cloud Application Services @ T-Systems International



