Transformation & Cultural Shift using Site Reliability Engineering (SRE) & Data Science
Abstract
Transforming an elephant-style, monolithic organization is a difficult task. Leading the shift to an SRE culture in my current role has been a challenge; it is a journey, and as an organization we will continue to evolve. The SRE implementation was not about having a single SRE team for the organization. Instead, a bootstrap concept was introduced: a central group that guides the SREs embedded in each and every unit (every team). This has been effective in penetrating the day-to-day working of the organization, and that is when the wheel starts turning. The outcome is that every tribe across the organization has a designated SRE setup. AI/ML is used to analyse ITSM data as well as infrastructure log data, so that SREs are armed with the best predictions about the infrastructure and its components and can take the best decisions. This has not only reduced cost but also helped phase out legacy tooling gradually. It has been a data-driven, systematic rollout of the SRE setup. An in-house tooling culture was made part of the release train to keep the demand and interest from the operations teams intact. The tooling cannot be changed overnight; it had to accommodate both the old and the new to avoid any service disruption. This was achieved with an architecture that combines the two, so the existing automation keeps working while the new DevOps-based tooling and API-driven solutions are integrated alongside it. The prediction-based setup will change how ITIL looks at operations forever: there will be no static thresholds in future, and the incident-based setup will be gone. (More about it later ;))
Problems are plentiful, especially for an organization which provides services: revenue is defined by each big customer, and the units that cater to a particular customer sometimes start acting like individual companies, each trying to define its own standards and ways of working.
While an SRE rollout is not unique in the industry, this setup surely is. The uniqueness lies in applying AI/ML to ITSM and infrastructure data: predictions built from logs of different system parameters, combined with a correlation matrix of close to 30 parameters per machine. This in-house setup, combined with other parameters, gives SREs an edge to perform what they are good at; the analysis is already there, it is about applying the principles. It was also unique because monolithic organizations look for an immediate benefit, and we were able to provide it through a combination of restructuring the internal services and introducing SRE not just as a top-down concept but as a designated SRE culture at every tribe level. This was orchestrated by a group of our best SREs (under the SRE bootstrap) who provided constant guidance to the tribe-designated SREs and also led the bigger tooling and AI-usage initiatives.
The prediction-based setup will definitely change how ITIL looks at operations forever: there will be no static thresholds in future, and the incident-based setup will be gone. This model works well because, culturally, we are changing and penetrating at every team level. It had its own set of challenges and pushbacks, which is where the learning comes from.
Modernization using SRE - The entire detailed setup is an amalgamation of SRE plus an AIOps-based structure. While the SRE part is about following the culture and principles, the analytics-based technology setup is like giving SREs ammunition to be proactive and predictive in problem solving. This thought process covers two of the most important principles: monitoring (advanced) and simplicity (an underlying horizontal). The emphasis is on visibility engineering, ensuring everything can be seen, tracked, predicted and controlled. This is not limited to the landscapes and ecosystem alone; it is being used for ITSM components as well. Process simplification has been kick-started based on data analysis of why time is spent and how it can be reduced, which ties back to another principle, the elimination of toil.
Tool Chain Architecture: the tool chain is a unique amalgamation of the following areas
- ITSM automation -> analytics on ITSM/ITIL-based processes
- Monitoring (analysis of data) -> visibility engineering
- Advanced monitoring (AIOps)
- Automation modernization
Analytics is a great weapon for an ITIL-based setup: for example, analysing where a service spends most of its time, or why a particular type of change gets stuck at a particular stage. A machine-learning setup over incident data can provide trend patterns, correlations and automatic error categorization based on historic data; it helps with faster resolution and shows what is coming and what pattern is being followed. For us this was stage one, because the correlation and the detail give instant benefits. An element of visibility engineering was added so that SREs benefit from it and are not always drowning in Excel sheets and reports. It also helped them with error budgeting and embracing risk, since they have first-hand, clear-cut data analysis available.
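As a rough sketch of what such automatic error categorization can look like (purely illustrative, not our production pipeline; the file path and column names are assumptions), historic incident text can be used to train a simple classifier:

```python
# Minimal sketch: automatic error categorization from historic ITSM incident data.
# Assumes a CSV export with a free-text "description" column and a labelled "category"
# column; the file path and column names are illustrative assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

incidents = pd.read_csv("itsm_incidents.csv")  # historic incident dump
X_train, X_test, y_train, y_test = train_test_split(
    incidents["description"], incidents["category"], test_size=0.2, random_state=42
)

model = make_pipeline(
    TfidfVectorizer(max_features=5000, stop_words="english"),  # text -> features
    LogisticRegression(max_iter=1000),                         # multi-class classifier
)
model.fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# New incidents can then be auto-categorized on arrival instead of being triaged manually.
print(model.predict(["java heap space error on payment service node 3"]))
```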
Visibility engineering takes a different turn when we point it at log analytics: the power to use ARIMA (just as an example) for prediction on infrastructure data. For instance, given the memory-utilization logs of a server, what will happen in the next 3-5 days? Data is captured at five-minute intervals, and around 20 parameters are correlated into a matrix to predict failure.
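A minimal sketch of the ARIMA idea, assuming memory utilization is exported as a CSV of five-minute samples (the file name, column names and ARIMA order are illustrative, not our actual model):

```python
# Minimal sketch: ARIMA forecast of server memory utilization from 5-minute samples.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

mem = pd.read_csv("memory_utilization.csv", parse_dates=["timestamp"], index_col="timestamp")
series = mem["used_pct"].asfreq("5min").interpolate()    # regular 5-minute series

model = ARIMA(series, order=(2, 1, 2)).fit()             # order chosen for illustration only
horizon = (3 * 24 * 60) // 5                             # ~3 days ahead at 5-minute steps
forecast = model.get_forecast(steps=horizon)

summary = forecast.summary_frame(alpha=0.05)             # mean forecast + 95% interval
breach = summary[summary["mean"] > 90.0]                 # when do we expect >90% usage?
print(breach.head())
```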
A key element is automation, which is considered the last part of the chain, but the setup cannot stay legacy if we want the best results. It has to be a combination; the entire old house cannot be torn down to build a new one, that is too much disruption, so a design has to be put in place that accommodates old and new. This can be seen in our use of modern, open-source tools while the setup still supports the old tooling, which makes the whole thing extremely easy to integrate with.
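One way such an old-plus-new design can be sketched (the script path, endpoint and task names below are hypothetical) is a single dispatcher that can invoke either a legacy runbook script or a modern automation API:

```python
# Minimal sketch of the "old plus new" automation idea: one dispatcher that can call
# a legacy shell-script runbook or a modern REST API, so neither has to be ripped out.
# The script path, endpoint and task names are hypothetical.
import subprocess
import requests

LEGACY_RUNBOOKS = {"restart_service": "/opt/legacy/restart_service.sh"}
NEW_API = "https://automation.example.internal/api/v1/tasks"

def run_task(name: str, target: str) -> str:
    if name in LEGACY_RUNBOOKS:
        # Old automation: call the existing script exactly as operations always has.
        result = subprocess.run([LEGACY_RUNBOOKS[name], target],
                                capture_output=True, text=True, check=True)
        return result.stdout
    # New automation: hand the task to the API-driven, DevOps-style tool chain.
    response = requests.post(NEW_API, json={"task": name, "target": target}, timeout=30)
    response.raise_for_status()
    return response.json()["status"]
```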
Data Science - AI/ML/IoT: the SRE principles revolve around monitoring, automation, toil reduction and so on, and one of the most potent weapons in that respect is advanced monitoring using an analytics-based setup and automated decision-making (self-healing scenarios), all of which reduce toil and further streamline the service. A lot of data is generated by the configuration items (CIs) in the IT ecosystem; some of it is really used for finding the root cause, but a lot of it can also be used for prediction and for finding areas of improvement. The last thing we want SREs to do is aim in the dark, wading through a plethora of data and scratching through countless Excel sheets. The data they analyse can be simplified further using AI/ML and data-science-based solutions, which not only saves SREs time but ensures they focus on the real problem solving rather than searching for what the actual problem is.
An AIOps-based setup ensures SREs get a lot of information beforehand so they can make quick decisions. While trends and patterns can certainly be derived from historic data, the key aspect is predictability. Proactiveness comes when we give SREs out-of-the-box solutions that help them predict problems and show the trends and patterns they are looking for.
A basic example is infrastructure failure prediction, based on log-based prediction analysis combined with a correlation matrix across every parameter coming out of the data. This way, with a certain accuracy, the prediction mechanism can show whether the infrastructure will fail in the near future. This cannot be achieved by incident-based and alert-based analysis alone, because those rely on static thresholds; instead, a more dynamic, mathematical, moving-average-based threshold needs to be defined, which can predict the breakdown with greater accuracy.
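A minimal sketch of such a moving-average-based dynamic threshold (the window size and the multiplier k are illustrative tuning choices, not our production values):

```python
# Minimal sketch of a dynamic threshold: flag a sample when it drifts beyond a
# rolling-mean +/- k*rolling-std band instead of crossing a fixed static limit.
import pandas as pd

def dynamic_anomalies(series: pd.Series, window: int = 288, k: float = 3.0) -> pd.Series:
    """Boolean mask of points outside the moving band (288 = one day of 5-minute samples)."""
    rolling_mean = series.rolling(window, min_periods=window // 2).mean()
    rolling_std = series.rolling(window, min_periods=window // 2).std()
    upper = rolling_mean + k * rolling_std
    lower = rolling_mean - k * rolling_std
    return (series > upper) | (series < lower)

# Example usage: cpu is a pd.Series of utilization indexed by timestamp.
# alerts = cpu[dynamic_anomalies(cpu)]
```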
In the long run this will modify the way operations is done, and the entire ITIL process will require a change. The concept of a static threshold (the alert-to-incident mechanism) will be gone, and there will be no incidents, because prediction will take care of a lot of the problems. Even more, with elastic and hyperscale infrastructure, moving averages and dynamic thresholds are the future. This changes the way operations is done today towards a no-incident setup. Adoption will take time, because it requires changing the way we work in the kitchen, but some useful use cases, like machine-learning-based triaging, are definitely being implemented.
A new addition is the usage and definition of the error budget: how a proper, guideline-based error budget can help.
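As a simple illustration of how a guideline-based error budget can be derived from an SLO (the numbers are only examples):

```python
# Minimal sketch: derive an error budget from an availability SLO and track how much
# of it has been burnt in the current window. Values are illustrative.
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total allowed downtime (minutes) in the window for a given availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means the budget is blown)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

print(error_budget_minutes(0.999))       # ~43.2 minutes allowed per 30 days
print(budget_remaining(0.999, 10.0))     # ~0.77 -> roughly 77% of the budget left
```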
Summary
- Today we are going to talk about transformation and cultural shift using site reliability engineering and data science. How is it as a transformation driver? How does it tackle culture and technology? And at the end we will talk about the challenges which we faced, based on our experience.
- Site reliability engineering is not only about doing automation; it is also about looking at resilient systems, system design and things of that sort. The measurement is no longer around SLAs, it's no longer about contracts, it's around improvement.
- The most important driver for site reliability engineering is cultural shift. The other part is obviously technology refresh. We will also talk about the concept of visibility engineering. The last part is process and automation.
- There are aspects which we have to keep in mind from an SRE standpoint: embracing risk, the first one, and error budgeting. Monitoring is what we call advanced monitoring, together with the concept of visibility engineering. What to automate is very important, and so is how the use cases are nurtured.
- SRE penetration is so important in the organization. We wanted to ensure that there are SREs identified in each and every squad. We also cover the visibility engineering architecture at a very quick level; we are still implementing the log-analysis part for some teams.
- You have to inculcate the culture of failure. We have to take the error budget concept seriously; it is one of the key ingredients of site reliability engineering. The industry is changing and it's time to take some bold decisions.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, my name is Rohit and today we are
going to talk about transformation and cultural shift using
site reliability engineering and data science.
We'll cover aspects mostly around site reliability engineering. How is it as a transformation driver? How does it tackle culture and technology? And how can we use data science and leverage a lot of DataOps and AIOps in this entire journey of doing the transformation using site reliability engineering? We are also going to talk about defining processes, risk-taking ability, and the unique process which we tried
to bring in place. And at the end we are going to surely talk about
the challenges which we faced based on our
experience. To start off with,
a lot of people have a question around what an SRE is. The term was obviously coined at Google; Ben Treynor Sloss coined it. Site reliability engineers often have a background in software engineering and systems engineering, and the idea was to put a person from a software development background into an ops role and see how it works out. Site reliability engineering is not only about doing automation; it is also about looking at resilient systems, system design and things of that sort. Is it the same as DevOps? Yes, more or less, but a major difference is that site reliability engineering applies to what happens post production. Is SRE about automation? Automation is one of the key drivers, and we'll talk about it when we get to the principles, and about how it is a cultural shift more than anything else. That is what we have experienced when we tried to roll it out for a services-based organization. And how do I measure the success? The measurement is no longer around SLAs, it's no longer around contracts, it's around improvement. You measure improvement, and that is what SRE is all about, right? So if we go to
the transformation drivers, as I talked about, and I'm going to
spend some time here on the four drivers which we perceive as the most important ones for site reliability engineering. The first and foremost, and I cannot stress it enough, is cultural shift.
It is the most difficult thing to achieve when we go on
this transformation drive, right? It revolves
around building better teams, it revolves around throwing a lot of "not my problem" out of the window. It is about pushing people to a different level. It is about enhancing and reskilling them up to a point. It is about making sure you change the ways of working, right? Cultural shift has a lot to do with training, a lot to do with making sure that the normal ways of doing work are going to change, because we are going to use a lot of technology and a lot of data. And there will be examples later in the presentation where we'll talk about how we tried to challenge the current setup of the processes which were in place, including the ITIL setup
which we had. The other part is obviously technology refresh.
Now, if we want site reliability engineering to be successful and we want the SREs to be really working on real areas, we need to have a certain degree of acceptance and a certain degree of freedom from an SRE standpoint. We will also talk about the concept of visibility engineering, which is also where technology refresh comes in. We'll talk about SREs' focus on tooling, making in-house tools or using different tools; a little bit of a shift towards open-source tooling is also what we foresee. Then there is a lot to do with data and visibility engineering.
Now, data analytics is not new.
It's not something which is new. The entire world is obsessed,
especially in the operations world. They're obsessed about ITIL data
analytics, which is important. I don't deny that.
But there is so much more we can do when we do analytics on top
of components, we do analytics on top of toil
dashboards, real tribe-level ones, usage of machine learning and usage of DataOps to get to things which are not visible to the naked eye. And we'll talk about it where
we look at doing predictions based on the components
and then deriving outputs out of it and then
acting on top of it. Then the last part is the process and
the automation. We'll again touch on that when we go to the reliability principles.
But processes have to be measured.
Processes sometimes have to be changed or at least
foregone at times to bring simplicity into the setup. Sometimes structures become shackles when we put too many processes in place. And we'll try to look at how SREs can help with that. Moving quickly to the SRE reliability principles,
and I will go through each one of them very quickly. There are
aspects which we have to keep in mind from an SRE standpoint, embracing risk,
the first one, and error budgeting. Now, embracing risk is a cultural thing.
We do serve customers,
we do have our contractual agreements,
and we have to ensure that we are within that.
But having said that, how do you manage risk? How do you
bring in a culture of risk-taking ability? Right? You bring in a risk tolerance and a tolerance level through SLOs. So that's what we tried to achieve. It's a cultural thing. You have to trust the team
to take a certain degree of risk, obviously not falling through the crack, but within
the calculated risk framework which we have defined,
then you have the concept of error budgeting. Now, error budgeting is
sometimes very much confused; people tend to confuse it with all the SLAs and the KPIs. No, it's not. It's a KPI for the teams themselves. It's a KPI to reach higher, to reach better. It's about disruption. It's about making sure that an error budget violation is considered the equivalent of what we do for an SLA violation, even though the SLOs, or the error budgets, will be much, much higher than what the SLAs are. Right. Managing toil:
this is an area which is primarily untapped and
that's why you need SREs, and that concept, to be inculcated into the team members, because toil cannot always be calculated in numbers. There are so many things that teams do outside what is captured and what can be analyzed, and we have to look at it and revisit it at different times to see how the SREs can make sure that the operations teams' workload becomes much, much lighter. Okay. SREs are not a separate team; that is also a misconception. They are the operations team, who ensure that at least 50% of their time they are working only on automating mundane tasks, or ensuring that the automation gets to the level where they are no longer required to sit on call and do stuff. Right.
Monitoring, for me, is the base of it. Now, there are traditional, long-standing monitoring tools which are there. But what we try to bring in over here, and that's where the data science comes into the picture, is trying to make sure that we align this
with more predictive stuff. Okay. A lot of logs
analysis will happen because the alerting system
through the monitoring setup which we already have across our
application is sometimes very reactive.
By the time it gets to us, things have already happened and we wanted to
change that. So monitoring is what we call advanced monitoring, plus the concept of visibility engineering. Automation, well, is always an output. Probably the most overused word in the industry, automation is always an output. Automation comes after you know what you need to do, and then automation does
what it does. So while automation is an output,
analysis is much more important to understand what we need to automate.
So we do give weightage to the automation,
but what to automate is very
important, and so is how the use cases are nurtured. Release engineering: this is more from the more advanced or more modern setups, where we have the release train which can be run through a DevOps practice. Traditional infrastructure teams are still a little bit behind on that, but when it comes to container-based teams mostly, or public-cloud-based teams, the release engineering principles work pretty well. A lot of automation goes in there, and a lot of predictions also go in there. Simplicity is the core of it. It is, again, a cultural thing. You don't want to overcomplicate things. You don't want five different process stage gates to ensure that things are all going through. You have to make simple solutions; we are not here to solve world hunger in one day. Okay, so we go step by step. Again, on the same slide, this is what we have done and how the SRE setup was done. And this brings me to the
point where we want to talk about how the SRE penetration is so
important in the organization. Now,
in a product-based organization, with a central team monitoring the entire architecture and the technology part of it, it's a little bit straightforward: we have the concept of an SRE team and we inculcate it in each and every team member. Now, in the case of services-based organizations, it gets slightly tricky, because we provide each and every service for different customers. Each of these customer units has its own revenue, its own P&L, and acts in a different way. Now, the same service being done for two different units will have slight differences: some might be relatively new to this, some might be quite advanced when it comes to tooling or SRE adoption or any of that. In order to deal with that, we defined a concept called the bootstrap site reliability engineer.
A group of ten to twelve SREs, most of whom we hired from the market and who have done this in the past, were dedicatedly assigned as ideological mentors to the entire SRE community across the organization. We call them the horizontal SREs. The vertical SREs came from asking each and every team, each and every squad, for each and every customer unit we are working for, and we are a $5 billion company, so there are plenty. We wanted to ensure that there are SREs identified in each and every squad. Okay, people who are working as SREs. Ideally we want everybody to be an SRE, but that's more of a maturity target than actual reality, right? So what we wanted to do was keep SREs there in each and every squad. Obviously, we have to train them, and that was the part of the horizontals.
The bootstrap SREs would also slowly induce the new tooling culture and the visibility engineering, where we use AI-driven intelligence and make sure that we work on predictions. We call it ITSM data plus component data plus system data plus error budgeting, coming back with a framework in which we define an error budget for each and every squad. And obviously the error budget cannot be defined only to cater to a certain number; it was defined keeping in mind what the customer wants and what the end user wants. That's when the SREs are the most effective ones.
Right. This was the visibility engineering architecture at a very quick
level. If I can get through this, which was
around trying to make sure we have a stream of data,
we tried to get it all into a DataOps setup. Okay, this is still a work in progress; we are not 100% there. We have been able to do it for some of them. We are still implementing the logs part for some, because there are so many customers, so many toolings, so many different ways of doing things; GDPR laws apply, and you can't move everything to a centralized setup. But the point was never to centralize everything; we are not here to make a data lake. What we are trying to do is have an integration layer across, so that all the data is ingested and can be used. And then we have separate pipelines. For example, we have an ITIL-based machine learning model which gives runtime, technology-level error categories, right? For every customer it's different, because their incident-based thresholds are different and the incident ways of working of the teams are different. So we have separate ML pipelines which run through and churn out their dashboards. But at the same time there is a prediction-based setup as well, and I'll show a little bit of that in the subsequent slide. And then, based on that AI hub, we tried to bring in new models and to utilize the data hubs concept. That's basically what we have from a visibility engineering setup standpoint. Now, this is a very cliché slide, to be very honest, because data is the new oil; everybody knows that. So the first stage was obviously ITSM analysis. Everybody does that. We wanted to bring
in a flavor of ML of what the problem is rather than where the problem
is. In a traditional setup, we usually have a flood
of incidents coming in and people trying to do a lot of analysis around
it by handing it over in Excel to our SMEs. We wanted to make sure that that happens automatically. So we wrote a machine learning model
which has different pipelines, obviously for different customers,
where the problem itself is
shown based on the historic record, which is excellent. And the
second thing was the ITSM; this was our starting point, the ITSM analysis. Next, we get into log analysis, where we are talking more about dynamic thresholds, because static thresholds are the ones which drive alerting: you put in a configuration, the alerting happens, you get an alert or an incident or
whatever. But the log analysis is more like a
moving averages concept, more like a dynamic threshold. And that's
based on predictions. So we are not anymore waiting for
when it will happen or when it happens. Then I'll take
an action. We are looking at predictions and this was one
of the biggest paradigm shifts which we wanted to make.
Again, it's a work in progress. We have still not reached there. But the point
was we wanted to get to a point where we can start predictions, not only
failures. Failures are probably the first step, but looking at how it's going to affect
business. So we do a lot of predictions on monitoring, then we
do a lot of correlation, which is based on dependent variables,
independent variables. I don't want to get into the mathematical details, but rather into how we use these predictions to come out with an output. That was number one. But the number two point, which
was very important was how does it change? And it
all comes down to the data science part, where we have predictions,
the technology which we are using, the processes have to
be implemented. And I'll give you two instances where the processes were differently implemented.
Number one was if we have a prediction
setup, that means the static-threshold-based alerting, if we are very accurate with the prediction, will no longer be applicable.
The point was how does it change the ways of working of the team,
which brings us to the point of how the process changes and how the mindset
and the cultural shift happens. The point was
that if the engineers or the operators
or the operations engineers are going to look at predictions,
what are they supposed to do if they see something is going to happen,
say 8 hours from now, what is the standard procedure to
be followed? What is the SOP to be followed? Because right now, more or less, the entire ITIL is based mostly around change and incident management, which is right. But problem management is very reactive, and we wanted to bring that to proactiveness. So if my prediction says that something is going to go wrong eight to ten hours from now, what will be the next step? And that is where the SRE kicks in. This is how we in our organization try to use site reliability engineering: we want to run a project at the point where the prediction shows, with around 75% confidence, that it will go wrong. Okay. And that project is led and executed by the site reliability engineer and ensures that the trend starts going down. That can only be done when we have log-based predictions, we take them seriously, and we have great accuracy in them, for which organizations need to invest in data science to ensure that we get to that level. The other part was about change management
on the application side, where we put in a concept called the failure budget: we give an individual budget, based on historic data, to each and every team according to how they have been performing with their changes. Based on that, we gave them some credits, and those credits are exhausted if they breach the error budgets or the SLOs which we have defined. As long as they do not, they are free to sail through all the checks and all the quality gates; the moment they breach it once or twice, they are back to the full quality gates. So that's how we tried to inculcate the risk-taking ability in the organization across all the teams.
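As a rough sketch of those credit mechanics (the values and names here are hypothetical, not our actual tooling):

```python
# Minimal sketch of the "failure budget" credit idea: a squad keeps a small pool of
# credits seeded from its historic change record; while credits remain it can skip the
# extra quality gates, and a breach burns a credit. Values and names are hypothetical.
from dataclasses import dataclass

@dataclass
class FailureBudget:
    credits: int  # seeded from historic change performance

    def record_change(self, breached_error_budget: bool) -> str:
        if breached_error_budget:
            self.credits = max(0, self.credits - 1)  # a breach burns a credit
        if self.credits > 0:
            return "fast-track: quality gates waived"
        return "full quality gates required"

squad = FailureBudget(credits=2)
print(squad.record_change(breached_error_budget=False))  # fast-track
print(squad.record_change(breached_error_budget=True))   # one credit left, still fast-track
print(squad.record_change(breached_error_budget=True))   # credits exhausted -> full gates
```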
As I said, SRE is a more cultural and
data shift, rather than actually a technology shift, which is a big,
I should say, misconception. And also when you
dig deep, you'll see that automation is just the final thing.
It is not the driver, it is what you do at the end.
The drivers are all of the things which we talked about. Very quickly, to go to the challenges: there were plenty.
Transformations are always challenging, we know that. So one
of the most important things was that people tend to use this transformation to attain numbers. Yes, it will give you that, but the ROI is in the long run; we don't get instant gratification, right? You cannot have too much heavy-duty governance, which will become an issue while inculcating the culture.
The wheel will only turn when you start looking
at the culture of failure rather than consequence management, where something goes wrong,
then you look at who is the culprit and try to have consequences for
that. The consequence management is not going to work. You have to inculcate the culture
of failure. A non-technology-centric approach: people tend to start working everything in an Excel sheet; that doesn't work, and we should be able to take bold decisions.
It's a changing time. The industry is changing and it's time
that at some point of time we have to take some bold decisions.
And the final thing is, we have to take the error budget concept seriously. It's like the smartwatch which you wear: you run every day, you look at the numbers, and why
do you look at the numbers? To see how you have improved. Error budget is
exactly the same. You see how you want to improve and it has to
be taken very seriously. It is one of the key ingredients of site reliability engineering.
Having said that, here are my blogs and my LinkedIn. Feel free to connect. The blog is all about site reliability engineering; that's where I have put the framework in, including how we came to the error budget concept, which can be a separate topic. So that's all I had.
Thanks a lot for your time and patience
in hearing me out.