Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, this is Fabio, and I will talk with you
about why SRE is the best way to improve
efficiency in times of crisis like the ones we are living through today.
It really reminds me of that Foo Fighters song,
"Times Like These", and we are all pretty sure
that these times are not as easy as we would like them to be.
So we will dive deep into this
theme. We will talk about history, what's going on
in the financial markets, and why
companies should invest their efforts in
improving SRE practices inside their organizations and with
their teams. Right? Let's start with a little bit
of history and the things
that happened in the last decades.
We were living through a huge age of transformation,
what we call the age of digital transformation. Everybody talked
about it for a long, long time in many
conferences and articles,
and a lot of companies made a
lot of money because of it, because every
company in the world was running a
digital transformation. I
put here the things that I believe are the most
important: some things that
happened in the world, some technology advances,
and things like that. I would like to start by
talking about virtualization. As
an old guy, I remember that at the beginning, for every project we
had, we needed a server to
work on the project. So we had to
buy the server, install the server, and
then start working. And
during this period we got virtualization, which
is the mother of cloud computing. That's a
huge event whose effects we still see today, and it
changed everything around the world.
We had a complete shift
in the way we did
project management. We moved from waterfall
to agile methodologies, where
we are supposed to deliver software as soon as
we can, as often as possible.
As soon as a sprint was done, we should
deploy software to our customers. Because
of that we had the rise of
DevOps, which was the end of the
silos inside IT departments between
operations and development. After
that, and after the
launch of the iPhone, we saw a
huge adoption of mobile technologies.
And a few years later,
coming closer to our current time,
we had the Covid-19 pandemic,
which greatly accelerated
digital transformation around the world.
At the beginning we had some companies firing people,
but soon, just a few
months later, the market was really
hot. A lot of companies were working to go digital
in order to guarantee that everybody could still
work and consume from inside their
homes during the period of isolation,
and things like that. So we had a lot of opportunities
during the pandemic, and during this period
we had something really strange that was called
quiet quitting, where everybody was saying,
"I'm not so happy here today, and I will
just do what I need to do to keep my job."
That was called quiet quitting, and it was something
that a lot of companies were concerned
about. Another situation that we
had was developers who
were holding many simultaneous jobs,
developers working in two, three, four companies at the same
time, because the market was so hot, there were
so many opportunities, and everybody was trying to get their money
from that. So we had this situation of
overworking, as I like to call it. And regarding
technology, at the end of the
last few years, everybody was talking about
Web 3.0, the metaverse, and
NFTs. And now, what are we
talking about? What changed a
lot during these last few months?
We are now living in what I like to call the age
of digital efficiency. We are no longer concerned
with just being digital.
We are concerned with being efficient
in the digital way. Why is that happening?
After the pandemic, we saw a bubble
bursting around the world. In the economic scenario,
we have a world in recession in
many countries, and we have inflation that
we hadn't seen for many years. A lot
of companies are cutting their budgets.
Everybody who was working to
manage their budgets for 2023
is reviewing the numbers, how much they will invest
this year, and
then the wave of layoffs started,
as I will explore a little more in the next few
slides. And we saw a new technology emerging
that was not even in our rear-view
mirror. Nobody was expecting that we would
have this change in technology.
But ChatGPT is a
revolution that we have been seeing in the last few months, and it's changing
everything. Everybody's feeling pressured to be
more effective, and ChatGPT could also be
an opportunity moving forward.
Here we have the layoffs.
There is panic in the world.
Panic, panic, panic around the world.
Because what was a hot market is now
a market where we don't have a lot of opportunities.
So, just a few examples.
Amazon, during the last few months,
eliminated around 27,000
positions, right?
27,000 employees were let go from
Amazon. Another big tech here,
Microsoft, laid off
10,000 employees.
Salesforce
reduced its workforce by 10%.
Meta, Facebook,
eliminated around 13%
of its employees during this last year,
and they expect to do more
during this one. Accenture,
a huge consulting company,
laid off 19,000
people. So what does this
bring to us? Everybody's losing their job.
Many people, not everybody for sure, because we have
billions of people in the world, but we have a lot of people
losing their jobs. And that's something that has
put a lot of pressure on companies
to think about what we should do to guarantee
that we won't have to fire anyone. Nobody likes to fire
people. I'm an executive. I had to fire some
people in the past, and that's not a
pleasant situation. So nobody wants
to be in that position. And the subject
of my lecture here is how we can use
SRE to prevent this kind of situation.
So let's see in the next few slides.
As I said before, 2023 is the
year of efficiency. Everybody's trying to save
money, and I'd like to share something with you guys.
I was working at a huge company a few
months ago, and I was responsible for
a savings strategy.
But what did my company expect me to
do to generate savings?
Adopt Power Platform and dashboards.
How was I supposed to improve
developers' efficiency just by doing that?
It's not possible. And that's the reason why
I'm not working there anymore, because that's something that
I strongly don't believe in. I believe that you should do something
more structural. And one of my hypotheses
is investing in SRE to do so.
So let's see what I can bring to you in order to
do that.
But in the end, what does this
2023 crisis bring to us? What does
it mean in the end? Obviously, investment,
but low investment. People and
companies will only invest what they can.
And that's how you should invest,
because nobody wants to take the risk of losing money
and soon having to fire someone.
Everybody's searching for savings on operations:
okay, I will not grow, so how can I save money
inside my operations? How can I
automate my operations and generate some savings?
How can I better use my team
to work better and generate more,
right? And as we
all know, some companies are taking this
moment to eliminate some people who were not performing
well, people that they knew were working
at many companies at the same time, things like that.
Companies are taking this moment to say,
"I will do some kind of
diet here with my employees. I have somebody who
is not performing well, or maybe they are working here
and at two more consulting firms." That's not
something that should be acceptable, but people were doing that.
So companies are taking these opportunities too.
And what is on the table for us?
What should we as technologists
be thinking of? We should be thinking
about efficiency and
adopting AI. Everybody's talking
about ChatGPT, but at this very moment,
I cannot say that it is very easy for us to
adopt ChatGPT in order to
change the game and save people.
But what I can say is that if we
can invest our efforts in efficiency and
in adopting SRE the right way, I strongly believe
that we can do a good job and save some
positions, right? And we will explore more.
And why should companies invest in SRE?
First of all,
SRE will improve system reliability.
If your system is more reliable, you probably won't
have many incidents. And then we go to the second
point, faster incident resolution.
If you solve your incidents faster,
your products will be available to your customers for more
time. You won't lose
sales, things like that. So one thing
leads to the other. I have a
more reliable system, and I will resolve my
incidents faster. I will increase
my agility because, as we
all know here, SRE and DevOps
are kind of brothers, twin brothers that
share some similar missions. So our agility
will increase significantly.
That would be something really important for companies.
SRE would increase collaboration inside
the company. When operations
teams and development teams work together,
they tend to collaborate more, they build
more confidence in each other, and so they will be able
to deliver better software and
improve the whole environment
inside the IT department. And I
believe that it would really diminish
the pressure on these teams.
Everybody in technology in
these last months is feeling very pressured,
afraid of losing their jobs. And when
we work in this collaborative environment,
it brings some peace to these people, and they will see
the results of their efforts, right? That would be
really great for your company.
And of course, if we have all of this,
all of our customers will be more satisfied.
The systems that they use will be available for more
time. We won't have as many incidents, things like that.
So customer satisfaction should increase.
And that would probably help us with our NPS
and other indicators that we might have in our
company. Okay,
continuing. Why should companies invest?
Reduce downtime costs.
Every time our application is down, we are
losing money. With SRE, we won't
lose that money. We will increase our
efficiency, which is the theme of
our lecture. When you implement
it, when you are working with SRE
in a mature way, you will increase your overall
efficiency. Your whole team will work better. You will
deliver more software.
If you don't have to spend so much time fixing
bugs or working on incidents, there's a high probability
that you can spend more time developing
new features and bringing more business to
your company. You will have better resource allocation,
because people won't be spending time
fixing bugs, but developing new features
or improving your environment.
Everything is a virtuous
cycle. You will improve your scalability.
When you have a moment with a
lot of access to your platforms,
your scalability will be doing very well, because your
team will have been investing
time in improving this scalability.
And for sure you will reduce
your maintenance costs, because you will have fewer bugs, fewer problems
in your infrastructure, you will scale faster,
and so all your maintenance costs should be
reduced. And the following
is really interesting, because we always
think about SRE considering that we
would improve our internal environment
and the services that we provide to
our customers. But what about consulting
firms? Why should they invest in SRE?
They are under huge pressure,
I believe higher pressure than other companies,
because when we have any
crisis, the first thing companies
cut is the investment in technology.
And that's
the main reason we have consulting firms. So a lot
of them are under huge pressure
from customers, from clients canceling
their projects. And why should they invest
in SRE in their development teams and their practices?
They should adopt SRE because it will differentiate
them. And that's what will be on the table for
consulting firms. Because now that we have
ChatGPT and everybody is in crisis times,
the main factor in deciding between
one company and another is: paying
the same money, will I get more code,
more applications? Why would it
be better for me to invest in this company instead of
the other? A consulting
firm that adopts SRE
in its practices, in its culture, will probably
deliver more software, and not only
deliver more, but deliver it in a better way,
with better code, fewer bugs,
and with scalability in mind.
The software that it delivers
will reduce the mean time to repair, will improve
the security of the applications, and will
be more reliable from the perspective of
the application and of the architecture, right? So
everyone should invest in SRE
during this period.
When we talk about SRE, we always have a lot of
qualitative results. We talked
about many of them in the last few
slides, but I'd like to bring some quantitative
results for you. For example, Google is the
father, the creator of SRE. And Google
has reported that SRE has helped them reduce
their incident rate by 50%. That's
a lot. And it improved their
reliability to
99.95%. That's
a lot. LinkedIn, after adopting SRE,
reduced its incident rate by 85%.
That's a lot. And the company also reported
that it was able to improve its MTTR
by 75%. That's
a lot. Netflix, everybody likes Netflix,
right? Another company that's under huge pressure,
not only because of everything that's going on in the market,
but because of the subscription
crisis, because everybody was sharing their subscriptions.
Netflix has adopted SRE
since 2010, and they reported that it has helped them
achieve an availability of 99.99%.
That's a lot. And Netflix
also reports that it has reduced its downtime
by 90% after
adopting SRE.
And last but not least, Dropbox.
They reduced their outages by
90% after adopting SRE,
and they reduced the number of incidents by
75%.
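To put availability percentages like the ones quoted here in perspective, the following is a minimal sketch, not something shown in the talk, that converts an availability target into the downtime it allows per year. The function name and the 365-day year are assumptions made for illustration.

```python
# Illustrative sketch only: the helper name and the 365-day year are assumptions.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime, in minutes per year, permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for target in (99.0, 99.95, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability -> {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.95% availability leaves roughly 263 minutes of downtime per year, and 99.99% leaves roughly 53 minutes.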
When we are under attack, in a
moment where everybody needs to
cut to the bone,
this kind of reduction would help
a company save a lot of jobs,
right? So let's continue, and
here I would like to share some data from the market
related to the state of SRE and
why we have a huge opportunity.
Take a look at how the adoption of
SRE is going. We have 6%
of the market that's totally immature in this
practice,
32% that's emerging,
only emerging, and 42%
that's maturing. That's a lot.
If we sum all of these, we have 80%
of the market that has not
completely adopted
SRE.
Can you see the size of
the opportunity that we have here? 80% of the
market has the opportunity to save money
and improve their operations just by adopting SRE
practices combined with DevOps, et cetera.
It's a lot. Let's see some
other important information regarding SREs:
what they dedicate most of their time
to and how this relates to
efficiency. Almost 70%,
reducing the MTTR. It's
a lot. When you're spending
all this time, it will bring results.
67% reducing MTTR, 60%
building and maintaining automation code.
Automation generates a lot of savings
and efficiency, and frees
time to spend on more important activities for
your team and your company. Ensuring security
vulnerabilities are detected and eliminated quickly:
security is a huge problem for tech companies,
so you have the SRE team
spending more than 50% of their time on it.
Designing experiments and running tests to reduce the risk of production
failure: nobody wants to have failures
in production. And you can see here all of this information
about how you can use your team
in a better way. What are the
expectations and demands
on SREs, and what do they want
to achieve? "On which
of the following tasks do SREs in your organization dedicate
the largest amount of their time in an average week?"
The same as we saw before: reducing MTTR,
things like that. Building automation code.
It's really good, it's an investment.
You're investing a lot of time here, a lot of effort,
so that in the future you can save
a lot of time in your
operation on incidents and outages.
"How does your organization evaluate service levels for its
applications and infrastructure?"
That's something that I would like to point out here.
A lot of companies work with OKRs
and key performance indicators, but the
heart of SRE is the
SLO. 75% of the companies that
answered this State of SRE survey work
with SLOs. And why is that so
important? Because,
as everybody here
at this conference knows, the SLO is
the key indicator for us to work with SRE.
And we have a lot of difficulties working with
this. Investing
in SRE requires
investing a lot of time in defining SLOs
the right way. We have a huge challenge
here in defining them and in getting this
information, because we have too much information,
too much data inside our company, and we have
to clean this data.
As in all the data science that we know,
we have to work a lot to get this
data really good in order to
be effective in managing our SLOs.
And the SLOs are the core of the
success of SRE,
because with the SLO we will
have our error budget, and the error
budget is what will help
us provide a
safer environment where we can run some
experiments and where we can even get things
wrong sometimes. So, the difficulties
that we have creating and defining SLOs:
we have too much data, too many
data sources, too many metrics, and
monitoring tools that don't make it easy to
get that SLO. So you have to
invest a lot of time and effort in defining and really
getting your SLOs.
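As a rough illustration of how an SLO yields an error budget, here is a minimal sketch that is not taken from the talk. It assumes a request-based availability SLO; the class and function names, the 99.9% target, and the request counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    allowed_failures: float   # failed requests the SLO tolerates in the window
    consumed_fraction: float  # share of the budget already used (1.0 = exhausted)


def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> ErrorBudget:
    """Compute the error budget for a request-based SLO (illustrative only).

    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if failed_requests < 0:
        raise ValueError("failed_requests must not be negative")
    # Everything the SLO does not promise is budget that may be "spent" on failures.
    allowed = (1 - slo_target) * total_requests
    return ErrorBudget(allowed_failures=allowed,
                       consumed_fraction=failed_requests / allowed)


if __name__ == "__main__":
    # Hypothetical numbers: 1,000,000 requests in the window, 250 of them failed.
    budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"allowed failures: {budget.allowed_failures:.0f}")
    print(f"budget consumed:  {budget.consumed_fraction:.0%}")
```

While the consumed fraction stays below 1.0, the team still has room for experiments and riskier releases; once it reaches 1.0, the budget for the window is spent.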
Continuing: here are some really good SLOs
to implement in order to be
successful in your SRE practice. Here we
have some SLOs from
the business point of view, and even mobile ones.
For the business, the end-user-centric ones
(that's another buzzword, right?):
availability, always. We have to measure
the engagement, we have to measure the
user satisfaction, and the conversion of our platform,
how that is going. And of course we have our performance
SLOs. How are our utilization,
response time, traffic,
saturation, and success rate?
Every one of these SLOs is really important for us
from the technical point of view. And
for mobile applications, everyone is mobile
today, right? So: app adoption,
availability of the app, response time
(you should provide a good experience, and SRE will help
you a lot in improving this response time),
success rate,
crashes, and of course app rating.
Everybody wants to know how your company is rated
on the App Store. And
how do companies identify the
targets for each of their SLOs?
26% do that based on end-user experience,
24% based on historical data and industry
standards, and 20% based on
however the system is doing today.
And who in the company helps
in defining these SLOs?
The SRE team is responsible for 80%.
Security, which is really important,
really contributes a lot to SRE
adoption and to our success. 49% from
the business, 47
infrastructure, 45
DevOps, 41 operations, platform
36, development 33 and application
32. We
have some other opportunities here. That's not the
theme of this lecture,
but as you mature in adopting
DevOps and SRE, you should consider evolving
to an AIOps environment, using
everything that you can to automate the response to
everything that happens in your operation, and
the platforms that you use can help you with
that. And I strongly believe that with
all the advances that we have with
ChatGPT, that will be a reality.
I believe that ChatGPT and
all this new generative AI will
increase the adoption of AIOps. And last but not
least, FinOps.
We have a lot of unnecessary
expenditure on cloud costs.
So we should work to keep this
on the right track, deploying exactly what is
needed for each application and each environment. We
should work to keep FinOps working
really well alongside SRE and DevOps.
Right. We are coming to the end of this lecture, and
I would like to share with you some key
takeaways that I would really like you guys to
keep in mind for this year and probably
the next one. These should be harder years for us
working with technology, right? First,
SRE increases the commitment and morale of the team.
We are in a moment where people are feeling
afraid. They are not feeling confident, they are
afraid of losing their jobs. SRE increases their commitment
to working together and improves their morale.
That will be great for your company. Besides the obvious
qualitative benefits, as we've seen,
SRE generates quantitative results.
We believe, and we should insist inside
our companies, that we have quantitative results
that are feasible and that generate
economic and financial returns for our company.
We will increase our agility as a company.
We will be faster, and so we should invest
in SRE. Even for consulting
firms, it will be a huge differentiator.
You can sell SRE projects, and you
can adopt SRE practices in your development,
so your team will be really differentiated. And
in the end, defining and working to achieve
SLOs should be the main objective in the adoption.
You should work well on defining the SLOs.
That will be the most important thing in your
SRE adoption, right? The SLO
will provide you the error budget, and everything
that you can do will be based on that.
So that's it, folks.
That's all for this talk. Again,
my name is Fabio. Here you have the QR code for my
LinkedIn. It was a huge
pleasure to be here. I'm really thankful for the opportunity
to share my knowledge and my experience with you
guys. Let's embrace SRE and
make this world better. Thank you, and a nice
year to every one of you.