Transcript
Are you an SRE? A developer? A quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with Chaos Native. Create your free account at Chaos Native Litmus Cloud.
Hi, my name is Vishnu Vardhan Chikoti, and in this talk I am going to introduce Arctic. Arctic is a new SRE adoption framework that I recently conceptualized to help with SRE adoption at enterprises.
About me: I have about 16 years of experience, which is diverse experience across site reliability engineering, product development and business analysis. For the initial part of my career I was in product development and business analyst (tech BA) kind of roles, and then I made a career pivot towards site reliability engineering. Currently I work as a Senior Manager, SRE at Fanatics Inc. Prior to Fanatics I have worked with Broadridge, Bank of America, Tetora Consulting and DBS Bank. I also co-authored a book named Hands-On Site Reliability Engineering, which was published very recently, in July 2021. I also have a blog, xfgeek.com, with content across capital markets (which was the initial part of my career), technology and agile; you can look it up if you are interested. From a location perspective, I have been in Hyderabad almost all the time for the last 20 years; for maybe a few months I was not in Hyderabad, but otherwise I am always in Hyderabad.
Now, coming to Arctic. I always start with a question of why, and I would like to start this talk also with a question of why: why do we actually need a framework for SRE adoption? When it comes to SRE, there are different views on what SRE is. For some people SRE is about availability, for others SRE is about golden signals, SRE is about automation for operations, SRE is about infrastructure automation, or SRE is just a new title for a production support analyst. The list goes on, but I have given a few examples here. And then there are different questions about SRE.
How is SRE different from ITIL? How is SRE different from DevOps? How do we structure SRE teams? Is capacity planning taken care of by SREs when it is already done during P&V (performance and volume) testing? I have already done that capacity planning in P&V testing, so what is it that SREs are going to do at a later point? Can we have multiple SLOs for the same service? Is it fine if we just measure SLOs for our critical services? This list of questions also goes on and on; these are just a few examples.
Now, to answer some of these questions and to correct some of these views, there are books, there are videos, there are blogs. Now we will also see a framework. So what is a framework? The concept of a framework is not new. We have frameworks like Spring Boot for Java and Flask for Python, and for agile adoption there is a framework called Scrum. If you look at the definition, a framework is a basic structure, a foundation that is set, on which we can build. This is where Arctic comes in: it is basically a framework which tries to set that basic structure, a foundation for SRE adoption at enterprises. Now, hello Arctic.
We will look at what Arctic is and what its two pillars are. The two pillars of Arctic are visibility and accountability. These are the two key things that are important to look at so that we have a successful transformation. So what is the visibility required on? SRE is about practices, tooling and platforms, and policies or procedures. SRE is also about culture, and SRE is also about principles, but those are not explicitly mentioned as part of visibility in this framework. Culture is implicit: without the cultural change, it cannot happen. And the principles should also be understood.
Now, when it comes to the practices: SRE has a lot of practices under it, and there is no need to boil the ocean and start on all of these practices on day one. There can be an exercise to look at which practices are already in place. It is natural that some of them might already be practiced in the organization, either as part of the product engineering standards or as part of other frameworks like ITIL.
Now, what about monitoring? Monitoring nowadays deals with very complex architectures, where we have cloud infrastructure, VMs on that, and then platforms built on top of those. Sometimes it is not even just directly deployed services; there are other services deployed there that are consumed as SaaS, there are CDNs, there are containers, there are auto-scaling environments, there is DNS. So there are a lot of things before a request even leaves the browser or the device of the user, hits those production services and then returns a response. Now, at what level is the monitoring? Is there end-user monitoring? Is there infra monitoring? Is there APM? Is there database monitoring?
So there are a lot of things within monitoring. Then there is observability, which is the actual data that serves the purpose of monitoring. Observability by itself has three pillars: traces, logs and metrics. One of the distinctions I like between monitoring and observability is that monitoring is the microscope and observability is the slide under the microscope, which gives the clarity.
Then it is about SLOs: do we have defined SLOs, does the team actually understand what SLOs are, and at what level are they defined? Then measuring SLOs and error budgets: once the SLOs are defined, are we actually measuring them, and do we have error budgets in place that are also measured?
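To make the SLO, SLI and error budget relationship concrete, here is a minimal sketch; the 99.9% target and the request counts are illustrative assumptions, not numbers from the talk.

```python
# Minimal sketch: availability SLI and remaining error budget for one window.
# The SLO target and request counts below are illustrative assumptions.

def error_budget_report(good_requests: int, total_requests: int,
                        slo_target: float = 0.999):
    sli = good_requests / total_requests                 # measured SLI
    allowed_bad = (1 - slo_target) * total_requests      # budget, in requests
    actual_bad = total_requests - good_requests
    budget_remaining = 1 - (actual_bad / allowed_bad)    # fraction of budget left
    return sli, budget_remaining

sli, remaining = error_budget_report(good_requests=9_996_000,
                                     total_requests=10_000_000)
print(f"SLI: {sli:.3%}, error budget remaining: {remaining:.0%}")
# SLI: 99.960%, error budget remaining: 60%
```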
We will also talk about the error budget policy a little later. Next, incident response.
So when an incident happens, how is it getting notified, how are the teams coming into action, how are they triaging and how are they resolving it? All of that is incident response. Then incident management: how are the communications happening? Are we informing the stakeholders, and which groups of users are informed? And how are the severity and priority determined? So there are a lot of things from an incident management perspective. Then postmortems: postmortems probably are already being done, but from an SRE perspective it is important to do postmortems in a blameless way. It is not a blame game about who made the mistake; it is about how it happened and how we can avoid it in the future. Change management: a change is not necessarily a code change. It can be a configuration change or any other change; it can be patching or upgrades, it can be anything. How are these changes actually being done, how are they communicated, how are they being approved, how are they being validated? There is so much that goes into change management. Release management is about how the releases are happening, and then how the deployments are done: are they blue-green, canary, what are they?
And eliminating toil: toil is basically the manual, repetitive work that can be automated away. How much toil exists, is the toil being tracked, are there efforts to automate it, and at what level is it being automated? Capacity planning is about how we plan for the infrastructure needs on a normal day, and how it is going to handle a peak day or a high-volume day. If a high volume comes in unplanned, how is that going to be handled? Do we have elastic environments in there? Those kinds of things go in there.
And infrastructure automation: how is the infrastructure being provisioned? Is it manual, or is there automation in that?
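As one hedged illustration of provisioning through code rather than manual steps, the sketch below assumes AWS and the boto3 SDK; declarative tools such as Terraform are the more common choice, and the AMI ID, region and tags are placeholders.

```python
# Sketch only: provisioning a VM through code instead of console clicks.
# Assumes AWS credentials are configured; the AMI ID and tags are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "service", "Value": "checkout"}],
    }],
)
print("Provisioned:", response["Instances"][0]["InstanceId"])
```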
AIOps: nowadays, with data and machine learning and all of these modern tools available, you need not do everything from scratch; there are libraries and frameworks available even in that space that can be used. With AIOps we can do things like auto-remediation, we can do things like alert correlation, and it can be other areas as well. Then ChatOps: nowadays it is all about chat tools, whether Slack, Teams, Telegram, WhatsApp, take any chat application. Are these tools being used efficiently, where the information is being sent over to the operations teams or SREs through chat? And can SREs actually take some action directly from the chat window?
Then again, with the modern complex infrastructures, how confident are we in our own infrastructure and services? Can we actually handle failures that are unplanned or unknown to us? That is where chaos engineering helps: simulating some of those scenarios with fault injection, exploring the weaknesses and fixing them.
Security best practices: nowadays there are so many security incidents happening, and it is of utmost importance that the customer data, the company or organization data and the services are all protected, whether against DDoS or any other kind of breach. And regulatory standards: depending on the type of business, the type of market where the business is happening and the type of products, there are regulatory standards that need to be followed, and how is the compliance with those regulatory standards? From an SRE perspective it is more about technical standards, not really about business-related standards.
Then, tools and platforms. To do all these practices, there is a need to have tools and platforms in place. For monitoring we need dashboarding and visualization, we need tools that actually ship data, tools that help in transformation and tools that help in storage. So there are a lot of tools. Similarly for observability, there is a lot of tooling required, combined, to achieve both monitoring and observability. There are also frameworks and libraries like OpenTracing or OpenTelemetry that can be used for tracing. And alerting: how is the alerting being done? The same with on-call management: how is the on-call person being reached? Is it automated or is it manual? If automated, through what tool?
Then alert correlation: there can be a number of alerts caused by the same underlying problem. Are these alerts being correlated, so that you finally have one single incident out of that particular set of alerts? For example, if a data center hosting 100 VMs is not available, all of those VMs will start saying, okay, this is not reachable. Similarly, if there is a network problem in a particular area, the entire region will have problems. So how are those being correlated?
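A hedged sketch of that idea follows: alerts that share the same probable root-cause dimension (here, a datacenter) inside a short time window collapse into one incident. Real AIOps tooling uses richer topology and learning; the field names and window size below are assumptions.

```python
# Sketch: collapse 100 "VM unreachable" alerts from one datacenter into a
# single incident by grouping on datacenter and a 5-minute time bucket.
from collections import defaultdict
from datetime import datetime, timedelta

alerts = [{"name": "vm-unreachable", "datacenter": "dc1", "host": f"vm-{i:03d}",
           "at": datetime(2021, 7, 1, 10, 0, i % 60)} for i in range(100)]

def correlate(alerts, window=timedelta(minutes=5)):
    incidents = defaultdict(list)
    for alert in alerts:
        bucket = (alert["datacenter"],
                  int(alert["at"].timestamp() // window.total_seconds()))
        incidents[bucket].append(alert)
    return incidents

for (dc, _), grouped in correlate(alerts).items():
    print(f"One incident for {dc} covering {len(grouped)} alerts")
```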
Then runtime platforms: nowadays it is all about deploying services as containers, or on platforms like Kubernetes, OpenShift or Pivotal Cloud Foundry. So there are various platforms. Then there are chat applications like Slack or Teams that are actually used to communicate, as I previously stated. And ticketing: when an incident happens, or when a change actually has to happen, how are those tickets being created? Is it automated or manual? It is not always possible to automate everything, so what extent of automation is already available? And self-healing: in order to do auto-remediation or self-healing, there are many tools now available, and some of them need to be integrated with in-house monitoring or alerting tools. To what extent are they being used?
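A minimal sketch of such an integration, assuming the alerting tool can POST a JSON webhook, and using Flask (mentioned earlier in the talk); the alertname field and the runbook commands are illustrative assumptions.

```python
# Sketch: an alert webhook mapped to pre-approved remediation actions.
# The payload shape and the runbook commands below are illustrative.
import subprocess
from flask import Flask, request, jsonify

app = Flask(__name__)

# Known alerts mapped to safe, pre-approved remediation commands.
RUNBOOK = {
    "disk_usage_high": ["journalctl", "--vacuum-size=500M"],
    "service_down":    ["systemctl", "restart", "checkout.service"],
}

@app.route("/alert", methods=["POST"])
def remediate():
    alert = request.get_json(force=True)
    action = RUNBOOK.get(alert.get("alertname"))
    if action is None:
        return jsonify({"status": "no auto-remediation, paging on-call"}), 200
    subprocess.run(action, check=False)   # run the remediation step
    return jsonify({"status": "remediated", "action": action}), 200

if __name__ == "__main__":
    app.run(port=8080)
```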
Then CI/CD tools are required to take care of releases: source control merges, builds and artifacts. There are a lot of tools available; what tools are being used, and how effectively are they being used? Again, there are tools required from a change management perspective, and tools required from an infrastructure provisioning perspective. There is backup and recovery: how often are backups happening, how effective are they, how soon can a backup be restored when it is required, and again, to what extent is it automated and to what extent is it manual? Then patching: patches are always there, whether they are security patches, OS patches, other upgrades, or end-of-life and end-of-support related work. So there is a lot of patching, updating or configuration that will be required; to what extent is this also automated?
Then there are use cases around natural language understanding, for example in chat applications. Can an SRE just type in a command, "please restart this XYZ service"? It can also be said in a different way: "please reboot" or "please bounce XYZ service". The intent is the same; it is all about restarting that particular XYZ service. Can the chat application actually understand that particular command through NLU?
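A toy sketch of that intent resolution follows; a real ChatOps bot would usually call an NLU library or service, and the service-name heuristic here is an assumption.

```python
# Toy sketch: "restart", "reboot" and "bounce" all resolve to the same intent.
RESTART_SYNONYMS = {"restart", "reboot", "bounce"}

def parse_command(message: str) -> dict:
    words = message.lower().replace("please", "").split()
    if RESTART_SYNONYMS & set(words):
        # Assumption: the last token names the service, e.g. "bounce xyz-service".
        return {"intent": "restart_service", "service": words[-1]}
    return {"intent": "unknown"}

print(parse_command("Please bounce xyz-service"))  # restart_service / xyz-service
print(parse_command("reboot xyz-service"))         # same intent
```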
Then fault injection: fault injection is actually useful for chaos experiments. There are tools available to inject faults at the network level, tools to inject faults at the VM level, and at the platform level, like Kubernetes. So it can be done at various levels, depending on the type of infrastructure in an organization.
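As an application-level illustration of fault injection (tools like Litmus or network-level injectors operate lower in the stack), here is a hedged sketch that wraps a call so it sometimes fails and always gains latency; the probabilities, latency and wrapped function are made up.

```python
# Sketch: inject latency and occasional failures into a call for a chaos
# experiment. All values and the wrapped function are illustrative.
import random
import time
from functools import wraps

def inject_fault(latency_s: float = 0.5, error_rate: float = 0.2):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < error_rate:
                raise RuntimeError("chaos: injected failure")  # simulated fault
            time.sleep(latency_s)                              # injected latency
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_fault(latency_s=0.5, error_rate=0.2)
def lookup_inventory(sku: str) -> int:
    return 42  # stand-in for a real downstream call

print(lookup_inventory("ABC-123"))
```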
Again, all these tools and platforms need not be one single tool, because there is no one size fits all. Depending on the type of infrastructure, services and businesses that are there, there can be different sets of tools that are actually used. And policies and procedures: SRE has a heavy focus, as I said, on incident management, change management and error budget policies, like what happens if the error budget is exhausted. Similarly, the SRE onboarding procedure: how does a service actually get onboarded to SRE, and what is the procedure around that?
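To make the error budget policy mentioned above a little more concrete, here is a minimal sketch of a release gate; the thresholds and actions are illustrative assumptions, not a prescribed policy.

```python
# Sketch: a release gate driven by the remaining error budget.
# Thresholds and actions are illustrative, not a prescribed policy.
def release_decision(budget_remaining: float) -> str:
    if budget_remaining <= 0:
        return "freeze feature releases; prioritize reliability work"
    if budget_remaining < 0.25:
        return "allow releases with extra review; avoid risky changes"
    return "normal release cadence"

print(release_decision(0.60))   # normal release cadence
print(release_decision(-0.10))  # freeze feature releases; ...
```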
Now, that is about the visibility of the practices, tools, platforms and policies. Next we will look at metrics.
So after all of this, what is the value out of the SRE transformation? The first thing to look at is how much toil got eliminated. By eliminating toil we would have saved manual effort and improved efficiency. Efficiency may not directly translate into a dollar value, but at least the manual effort saved can translate into some blue-dollar or green-dollar value. Then a reduction in MTTA, the mean time to acknowledge: how soon an incident is actually getting acknowledged, before SRE and after SRE. The faster the acknowledgement, the faster the recovery time will be, and it all starts from each stage: how soon something is detected, how soon an incident is actually acknowledged, and how soon we are able to get to an insight into the problem.
The time taken to get to an insight is helped by having the right level of observability to triage: we need to have sufficient data to find what exactly the problem is. Then finally there is the recovery: how soon are we able to make a fix and deploy it, or perform any other recovery action? It is not always a fix and deploy; it might be a restart or a rerun of something, so it can be different things, and sometimes it is a complete rollback as well. So how soon is that recovery actually happening?
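As a small illustration of how these timing metrics can be computed from incident records, here is a hedged sketch; the timestamps and field names are made up, and real data would come from the incident management tool.

```python
# Sketch: MTTA and MTTR computed from illustrative incident timestamps.
from datetime import datetime
from statistics import mean

incidents = [
    {"detected": datetime(2021, 7, 1, 10, 0),
     "acknowledged": datetime(2021, 7, 1, 10, 4),
     "resolved": datetime(2021, 7, 1, 10, 40)},
    {"detected": datetime(2021, 7, 8, 22, 15),
     "acknowledged": datetime(2021, 7, 8, 22, 17),
     "resolved": datetime(2021, 7, 8, 23, 5)},
]

mtta_min = mean((i["acknowledged"] - i["detected"]).total_seconds()
                for i in incidents) / 60
mttr_min = mean((i["resolved"] - i["detected"]).total_seconds()
                for i in incidents) / 60
print(f"MTTA: {mtta_min:.0f} min, MTTR: {mttr_min:.0f} min")  # MTTA: 3, MTTR: 45
```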
Then mean time between failures: when we know a failure has happened, how soon does that failure happen again, and what are we actually doing to fix known failures? And a reduction in postmortem action items: with proper blameless postmortems in place, postmortem action items are actually resolved faster. And how soon are SLOs actually getting breached? We may have the best architecture, but the SLOs are getting breached; we may have the best kind of services, but they are getting breached. So what exactly is the problem, where is it going wrong? We need to look at that and then fix it. And how soon are the error budgets getting exhausted, the same thing. So that is about metrics.
And then there are benefits as well. With proper capacity planning we have better utilized and better planned infrastructure. We have an improved tech staff experience, be it developers or SREs: by eliminating toil, handling incidents effectively and avoiding repeated incidents, productivity obviously goes up. And business launches: when I was part of product development and business analysis, I was part of a number of business launches, around launching new markets, launching new products, launching new verticals, or sometimes not even a business launch, maybe the launch of a new regulatory report. There is so much nervousness on the final day: will it all work as expected? If we have SRE concepts in place and everything is built with a shift-left mindset, where we are confident that what we have built is reliable enough, the experience at business launch improves. And nowadays there are sites which show downtime messages, or sites whose poor-experience messages get posted on social media. So the reputation improves when these kinds of issues are reduced.
Then, accountability. When it comes to accountability, how do you actually structure an SRE team? Do you have a central SRE team which takes care of everything that is required from an SRE perspective? Or, if that becomes a bottleneck in a very large organization, there is an option to split SRE by function: have infrastructure SREs who focus on infrastructure, data SREs who focus on the data side of things, and an SRE tools team that focuses on building in-house tools or bringing in vendor tools and integrating between them. It is not only about bringing in tools, but also about integrating them in the right way and integrating with what is internal; however much external tooling you bring in, there is always that internal factor that you need to consider and integrate.
Then there is the concept of embedded SRE, where there can be a central SRE team which has SREs embedded into the product engineering teams. They work very closely with the product engineering teams with a shift-left mindset, where the reliability aspects are built upfront. Then federated SRE: in large organizations, when it is difficult to maintain a central SRE team, or even to maintain something like embedded SRE, they can also look at a federated model where each vertical, or maybe each department, has its own federated SRE team, building its own tooling which suits that particular vertical or department. But the recommendation would be to maintain the same set of policies and standards that are set by the central SRE team. The tooling can vary based on technology, but the culture, principles, policies and procedures need to be standardized.
Now, roles and responsibilities. Depending on the number of SRE teams, how they are split and how they are structured, in order to make sure that nothing slips through between these teams and nothing is left without a proper owner, it is important to look at the roles and responsibilities for various things. For example, during the SRE transformation, when there are existing services, who is the decision maker to decide which services or applications should onboard to SRE first, how is the actual onboarding going to be done, and who is responsible for that? Then, communication about new launches: how does SRE actually know that there is a new vertical coming in, or that there are new business launches? It is not always related to a code release. I have seen a number of cases where new business launches, new products or new flows are not tied to a release; they are simply tied to a feature flag. A user can switch that flag on from a UI, or there can be flags that are enabled behind the scenes through some configuration change, and everything starts flowing through. So it is important to make sure that this communication is sent through properly.
And conflict resolution: in larger organizations there is a possibility of priority conflicts, or any other conflicts, between SRE and another team, or between SRE teams themselves. So it is important to identify who would be the final authority to help resolve these conflicts.
That is about Arctic and its two pillars, visibility and accountability. Now, no framework can stand on its own; it needs to be combined with other concepts and frameworks for successful results. So what other frameworks are useful for SREs? The first framework is agile. Why is the agile framework important for SREs? As we just saw, there can be a tools SRE team that looks after tools; for such a team, because it is again product-development kind of work, they can look at adopting Scrum for the development of their tools. SREs with both interrupt work and engineering work can probably look at a Kanban model, where they have a Kanban queue from which they are clearing their tasks. There is extreme programming as well, and for rapid prototyping they can also use the rapid application development model. These frameworks are useful depending, again, on the way the SRE teams are structured and which framework suits which type of SRE team.
We also talked about SRE helping to improve the tech staff experience. Now, how do we actually measure that? Recently there is a framework called SPACE that has been introduced by Microsoft's Developer Velocity Lab, and it can actually be used for measuring this. Go check that out if you are interested.
And concepts of product engineering: SRE uses a lot of product engineering concepts like architecture, high availability, microservices, micro frontends, blue-green deployments, canary deployments, and the list goes on. SREs work so closely with the product engineering teams that they also need to understand the product engineering concepts, and they can help guide the engineering teams where required, if something is not being followed.
And design thinking: if we are looking to introduce a new practice or a new tool, build something new, or bring a very creative, innovative feature, practice or way of doing things into the organization, design thinking can be helpful. With design thinking, when it comes to bringing in something new, we look at what its business viability is, what its technical feasibility is, and what the human desirability is, like how much adoption there will be after it is actually brought in. Then it is about empathizing with the users when something needs to be built, and then ideating, prototyping and iterating over it. There is also this thing called sprint zero, or design sprints. If SREs are looking at building some dashboards, nice management dashboards, especially for the visibility aspect or the metrics aspect that I mentioned, they can look at doing a sprint zero or a design-sprint kind of exercise, where they build those initial prototypes before even getting into the development.
Then there is the chicken-and-egg problem. SREs do talk about building an incident knowledge base. Now, what is the chicken-and-egg problem? Do the producers build something out completely first and then bring in the consumers, or do we get the consumers in even before the producers have completely built out what they are trying to build? This is a problem that can be solved in different ways. For example, if we say there is an incident knowledge base but there is not sufficient knowledge in it, then that option will not really be useful, so build the right level of knowledge base before spreading the word further. Similarly, it can also be about common frameworks and tools that are built to be consumed by other product engineering teams; if we push for adoption before they are fully built, that adoption will not happen. So it is a very tricky situation where you need to balance out at what stage you actually bring in the users or the people who will adopt it.
Then, personalities and skills. There are various personality types that will be required for a successful SRE transformation. The SRE transformation will need evangelists who can go in and talk about SRE, and explain why SRE will benefit the organization or the product engineering teams, or even a specific practice within it. Then there are strategists who can make plans around how to do this. And then there are specialists, technical specialists or other specialists, who can help with the individual aspects. Skills-wise, as I said, SRE is a pretty broad role which includes knowledge from engineering and operations; by definition, SRE is what happens when an operations team is run with a software engineering mindset. So there is a wide range of skills required, right from understanding different types of architectures and infrastructures, testing, CI/CD tools, blue-green and canary deployments, then chaos engineering, monitoring, observability, auto-remediation, capacity planning, and some amount of machine learning. The more an SRE knows, the more that SRE can add value to an organization. Again, it is not always possible to find someone who knows everything, but it can also be a balancing act where a few SREs focus on one area; it depends on how the organization would like to structure it. There can be cross-training, and they can always upskill. SRE is always about watching out for what new is coming up in the market and then getting the organization to that level.
And what are the different things to avoid? One is avoiding bandwagon bias: use the right tools and the right platforms for the purpose we are looking at; there is no need to do something just because someone else is doing it. And no over-engineering: SREs themselves accept that failures are normal, and we measure failures and keep them under control at the level that is required. One hundred percent reliability is the wrong target; that is one of the principles of SRE. So have the right set of SLOs defined, agreed with the users, and engineer the service to the level that meets or exceeds them. And then the coexistence of traditional and SRE policies: the organization might already be using certain policies, and once it has migrated over to SRE, do not keep them together. Once it is a transformation, it is a transformation.
So yes, those are the things for my talk. If there are any further questions, please feel free to reach out to me on Discord. Thank you.