Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, welcome to Conf 42 platform engineering streaming sessions.
The topic of this talk is optimizing your cloud cost.
We'll be talking about why cloud cost is important, how to know your
cloud cost, how you can analyze it, and what quick and long-term
approaches you should take. And at the end, we'll talk about some of the
major points related to building a cost culture.
So, first: why is cloud cost important? I have quoted
here one of the Gartner reports
from April 2023, which says that
worldwide end-user spending on public cloud services will reach
nearly $600 billion in 2023.
You can see from that how important it has become to measure
your cloud cost, because it is a big part
of your operational expenses. Before jumping to
the FinOps maturity assessment, let me make
you aware of FinOps. FinOps is a foundation based
on a principle you could map
to DevOps, where developers and operations join
hands. Similarly, here the finance analysts and the operations people join
hands to work out what is best
for cloud financial management, the best practices around it,
and the different specifications to be considered.
Now, on to the FinOps maturity assessment.
In their latest report, they have shared
how the industry is moving on the maturity side:
71% are currently in crawl mode, 25% in walk
mode and 3% in run mode.
So what does that actually mean? Crawl mode
means the team is getting enough
education about cloud cost and has
started driving enablement around it.
Walk is where that education is
spreading across multiple teams and they
have started adopting FinOps processes
and practices like forecasting and budget management.
And the last one, run, is where
cloud cost education and enablement have spread across
multiple teams, or across the enterprise,
and are continuously aligned with organization-level
initiatives: the OKRs and goals set
by the senior leaders or board members of that specific organization.
So moving ahead,
step one is to know your cloud cost.
Our first target should always be cost transparency,
and only then should we think about cost optimization. And the
reason is very simple: you need to know
your baseline before you can work on optimization. Cost
transparency tells you
what you are spending and where your
cloud billing is coming from.
Then you can definitely go further and target optimization,
accuracy and all the rest. So I created a
basic flow of what you need to do, step by step.
First, you should verify that each
cloud resource getting spun up has an
accurate tag with respect to the application
it is deployed for. Domains,
applications, everything has to be metadata, and
that metadata is attached to each cloud resource, whether it is EC2,
S3, EBS, anything that actually gets
deployed on the cloud.
The second step is to enforce that:
you should have guardrails,
policies and rules which ensure that
your cloud resources are tagged correctly.
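To make the guardrail idea concrete, here is a minimal sketch of a tag-compliance check. The required tag keys and the resource records are assumptions for illustration; in practice the resources would come from your cloud provider's inventory API or a policy engine.

```python
# Minimal tagging-guardrail sketch: flag resources missing any of the
# required metadata tags. Tag keys and resources are illustrative
# assumptions, not a real provider API.

REQUIRED_TAGS = {"application", "domain", "environment", "owner"}

def untagged_resources(resources):
    """Return (resource_id, missing_tags) for each non-compliant resource."""
    violations = []
    for res in resources:
        missing = REQUIRED_TAGS - set(res.get("tags", {}))
        if missing:
            violations.append((res["id"], sorted(missing)))
    return violations

resources = [
    {"id": "i-0abc", "tags": {"application": "checkout", "domain": "payments",
                              "environment": "prod", "owner": "team-a"}},
    {"id": "vol-9xyz", "tags": {"application": "checkout"}},  # missing tags
]

for res_id, missing in untagged_resources(resources):
    print(f"{res_id} is missing tags: {missing}")
```

A check like this can run on a schedule, or as an admission gate so that untagged resources never get created in the first place.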
Once you have things tagged,
you have to identify the different layers
or stages in your organization: the
development environments, the operational environments, and
the maintenance costs as well. Sometimes you need to run
a staging environment equivalent to the production environment; some
applications might run on a need
basis, where function as a service can be used as well.
So on-demand usage and all those things have to be considered
carefully while identifying the cloud cost.
And at the end, everything needs good visualization.
Once you have all the data, use
some good dashboards which can show you
the traffic trend and the corresponding cloud cost
trend. Then you will be able to
get a clear-cut picture of your cloud cost,
which is step one. One of the quotes
I have mentioned here is from Charles Babbage:
errors using inadequate data are much less
than those using no data at all.
We always say that a little knowledge is more dangerous than none,
but with data it is the other way around:
even inadequate data beats no data at all.
Step two is to analyze and predict your cloud cost.
Now that you know your
cloud cost, which was step one, go on
and analyze it. We have to go from coarse-grained to fine-grained.
Break it down by the domains
your organization has, then by the different units
and how the cost is spread across them, then down to the applications,
the microservices, the databases, everything running.
Based on that, you can go and further
segregate by resource. Is someone using
EC2? Is someone using ECS or EKS?
I'm just quoting AWS services here. But yes,
whatever cloud you are using, you have to take
care of which resources are in use: object storage,
compute, networking resources and everything.
Then comes the question of how to
analyze and predict. I think you have
to do unit economics. And while doing
unit economics, you definitely need a focus group
working on this set of practices. And this is what
FinOps always suggests: have a focus group of FinOps members.
The people who know finance and the people who know how cloud
computing works have to be brought
together, and only then can they properly value the
cloud cost. You have to get
agreement on the cost attribution model, on how you want to bill your different
teams for their cloud cost,
and that has to be available clearly,
in a single pane of glass, for your top leadership
to consume. Now, coming to
data insights: from analysis
we move to prediction, and prediction definitely needs
regular analysis too.
Every cloud bill that comes to
you should be reconciled with
the cloud cost you computed from the resources you have used. So there
should be a reconciliation mechanism, whether it is an in-house
tool, an open source tool or an enterprise tool, but that
reconciliation is definitely needed.
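A reconciliation mechanism can start as simply as the following sketch: compare the provider's billed totals per service with your internally attributed totals and flag anything that differs by more than a tolerance. The service names, amounts and the 5% tolerance are illustrative assumptions.

```python
# Reconciliation sketch: billed vs internally attributed cost per
# service, flagging mismatches above a fractional tolerance.

def reconcile(billed, attributed, tolerance=0.05):
    """Return {service: (billed, attributed)} for services whose costs
    differ by more than `tolerance` (a fraction of the billed amount)."""
    mismatches = {}
    for service in billed.keys() | attributed.keys():
        b = billed.get(service, 0.0)
        a = attributed.get(service, 0.0)
        if b == a:
            continue
        if b == 0 or abs(b - a) / b > tolerance:
            mismatches[service] = (b, a)
    return mismatches

billed = {"ec2": 1200.0, "s3": 300.0, "nat_gateway": 80.0}
attributed = {"ec2": 1180.0, "s3": 298.0, "nat_gateway": 0.0}
print(reconcile(billed, attributed))  # only nat_gateway is off by > 5%
```

Anything flagged here is either untagged spend or a gap in your attribution model, and both are worth chasing down.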
You always have to perform chargeback to the application domain.
What I mean by chargeback is this:
let's say application A in team A incurs
some cost. Unless you
tell the team, or make them aware of the cloud
cost being used, they won't be able to own it.
The application owner needs to be very aware of
how much this application has used in compute, networking and
calls, and then they will be able to optimize it.
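As a sketch of what a chargeback report could look like, the snippet below rolls hypothetical tagged billing line items up to the owning team and application. Real line items would come from your billing export joined with the resource tags from step one.

```python
# Chargeback sketch: aggregate cost line items per (team, application).
# The line items are made-up examples of tagged billing records.

from collections import defaultdict

def chargeback(line_items):
    """Aggregate cost per (team, application)."""
    totals = defaultdict(float)
    for item in line_items:
        totals[(item["team"], item["application"])] += item["cost"]
    return dict(totals)

items = [
    {"team": "team-a", "application": "checkout", "cost": 120.50},
    {"team": "team-a", "application": "checkout", "cost": 30.25},
    {"team": "team-b", "application": "search", "cost": 75.00},
]

for (team, app), cost in chargeback(items).items():
    print(f"{team}/{app}: ${cost:.2f}")
```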
And that brings me to the third point: you
should always know which are the top idle and
underutilized resources, and as a platform engineer, a
DevOps engineer or a FinOps member, you should always be able to
take recommendations back to the different teams:
these are the idle resources, these
are the underutilized ones. "Offender"
might be too strong a term, but you can always
highlight them with data. Now, in every business,
whatever organization you may be working in,
there are always some seasonalities, and they
need to be factored into your traffic pattern.
If you are in e-commerce, maybe a festival
sale is coming; you have to account for the fact that during that
time resources will get spun up, so your
cloud cost will also increase. That seasonality
should not be a barrier
when you are calculating your cloud cost.
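Coming back to the idle and underutilized resources above, one possible shape for that recommendation, assuming you can export average utilization per resource from your monitoring stack, is the following sketch; the sample numbers are invented.

```python
# Idle-resource detection sketch: flag resources whose average CPU
# utilization falls below a threshold, most idle first. Utilization
# samples would normally come from your monitoring stack
# (e.g. Grafana/cAdvisor, as mentioned in the talk).

def idle_resources(utilization, threshold=5.0):
    """Return resource ids whose mean CPU % is below `threshold`,
    sorted with the most idle first."""
    means = {rid: sum(samples) / len(samples)
             for rid, samples in utilization.items()}
    return sorted((rid for rid, m in means.items() if m < threshold),
                  key=lambda rid: means[rid])

utilization = {
    "i-busy": [60.0, 72.5, 55.0],
    "i-idle": [1.0, 0.5, 2.0],
    "i-low":  [4.0, 3.5, 4.5],
}
print(idle_resources(utilization))  # → ['i-idle', 'i-low']
```

The resulting list is exactly the kind of data-backed recommendation you can take to each team.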
When you have all these things sorted out, you are
ready to predict your cost. And you can always
use business intelligence and different models
for that. For the baseline, I would say
do it for one quarter and see how your cloud
cost is increasing, or whether it is saturating,
based on the resources you have spent on, and then you
will be able to predict the
upcoming quarters more accurately.
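As a baseline of the kind suggested here, this minimal sketch fits a linear trend to past quarterly spend and projects the next quarter. The spend figures are made up, and a real forecast would also layer in the seasonality discussed above.

```python
# Forecast sketch: least-squares linear trend over past quarterly
# cloud spend, extrapolated one quarter ahead. Figures are hypothetical.

def forecast_next(quarterly_spend):
    """Fit y = intercept + slope * x over past quarters and return the
    projected spend for the next quarter."""
    n = len(quarterly_spend)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(quarterly_spend) / n
    slope = (sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(xs, quarterly_spend))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept + slope * n

spend = [100_000, 110_000, 120_000, 130_000]  # past four quarters
print(forecast_next(spend))  # → 140000.0
```

Once you have a forecast like this, comparing it against the actual bill each quarter tells you how trustworthy your model is becoming.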
So the main thing is what the quote says:
data really powers everything we do.
Whatever we are doing on the cloud cost side, there should
be supporting data that gives us all the
measurements to do it.
Now coming to step three. That is about optimizing
your cloud cost, and what we can do in the short term and the long
term. Based on experience, I would say
that compute optimization is going
to be your first bet, where you
can definitely go and see: okay, I am seeing a lot of
idle resources; maybe my instances are not right-sized;
I have requested a lot of CPU and memory, but
are my resources actually using it or not?
That can definitely be worked out with a lot of
open source tools right now: Grafana with
Kubernetes, cAdvisor and so on. Every such tool
gives you definite usage based on
your attribution. Then you can always
balance your spot instances and on-demand instances.
I understand that every business has some critical applications, so you need
good business judgment about where you put your spot instances,
which definitely carry the risk of getting
shut off. That's why there should be a good balance
of on-demand and spot instances. Next is on and
off: maybe your dev environments are not using resources
on your weekends, or maybe your business has
some specific festival operations where
the cloud resources are not needed that much.
So maybe you build the capability of turning
resources on and off. Similarly, consider how you could
immediately scale vertically. Scaling is the capability
to grow your resources not only horizontally,
meaning not only adding new machines, but also by
increasing the power of your existing machines, which is the first step we
always take. So how you
scale in and out is also pretty important.
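To put a rough number on the on-and-off idea for dev environments mentioned above, here is a back-of-the-envelope sketch; the hourly rate and the on-window are assumptions to plug your own figures into.

```python
# Sketch: savings from powering a dev environment off outside working
# hours. Rates and hours are illustrative assumptions.

HOURS_PER_WEEK = 7 * 24  # 168

def weekly_on_hours(weekdays=5, hours_per_day=12):
    """Hours per week the dev environment actually needs to be up."""
    return weekdays * hours_per_day

def weekly_savings(hourly_rate, weekdays=5, hours_per_day=12):
    """Cost saved per week by shutting down outside the on-window."""
    off_hours = HOURS_PER_WEEK - weekly_on_hours(weekdays, hours_per_day)
    return off_hours * hourly_rate

# A dev stack costing $0.50/hour, kept up only 5 days x 12 hours,
# is off for the remaining 108 hours each week.
print(weekly_savings(0.50))  # → 54.0
```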
And you should also look at logging and storage
costs. Usually we do not take those
into consideration, but they end up incurring
a lot of cost. So ask: what is the retention?
If an incident happens, how much of the previous
logs do you need? What is your RCA cycle? All those things are
baked into this infra layer.
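A quick way to reason about retention is to estimate the steady-state storage cost of each candidate policy. The ingestion rate and per-GB price below are placeholders; substitute your own numbers.

```python
# Sketch: steady-state monthly log storage cost for a given retention
# window. Daily volume and per-GB price are made-up assumptions.

def monthly_log_cost(gb_per_day, retention_days, price_per_gb_month):
    """Monthly cost of retaining `retention_days` worth of logs."""
    stored_gb = gb_per_day * retention_days
    return stored_gb * price_per_gb_month

# Compare retention policies at 50 GB/day and $0.03 per GB-month:
for days in (7, 30, 90):
    print(f"{days}-day retention: ${monthly_log_cost(50, days, 0.03):.2f}/month")
```

Seeing the 7-day versus 90-day figures side by side makes the RCA-cycle trade-off a concrete business decision rather than a default.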
Long-term strategic wins are quite important for building the cost culture,
which we will discuss in upcoming slides as well. But yes,
you should always ask, when you are writing an RFC, or you
are exploring some tools, or you are building some microservices:
am I doing the right thing with the right resources?
Can I switch to some other language?
Recently Rust has been doing very well in terms of
performance and resource optimization. So can my application be
built with Rust? The second step
is whenever you are doing a buy-versus-build evaluation:
definitely keep cost as a big factor,
one of the factors in your judgment for making
the choice. And I
won't say that either option has
any specific inclination. Buy means
you are directly buying an enterprise edition tool,
ingesting it, and then depending entirely on the
third-party vendor to manage it. Build means
you are building your own, which definitely incurs cost as well. And that's
why I'm saying the evaluation has to have a built-in
habit of measuring the cost.
Your DBs: how you shard your DB, which
queries are cost-optimized, all of that has to be taken into
serious consideration while deciding on database
usage as well. Which DB suits your use case best
has to be weighed very carefully from a cost optimization standpoint.
How are your networking calls being made? Are you making a lot of
NAT gateway calls? A lot of transit gateway calls?
Is your network properly isolated? Do you really need
multi-domain calls?
So all those networking calls have to be aligned with what you
are designing at the very high level itself. Your API
calls: if you are making a lot of API calls,
do they incur cost with any third-party tools,
and what are the cost implications?
And at the end, your data transfer: you are storing
some data, you are uploading data directly to
object storage or cold storage.
What is the cost of retrieving that data? What is the
cost of archiving it? All of those things have to be
considered in depth with your design and architecture
team once you are building that long-term strategy
for your cloud cost. The final
step is to build the cost culture.
I have shared two approaches here.
I won't say one is the best and the other
is not as good. I would
definitely recommend doing both in parallel,
based on whatever setup you currently have in your organization.
First is the top-down approach. This comes
from the senior leadership: in any
organization-wide strategic initiative,
cost is placed as one of the factors, one of the
metrics, and is continuously considered.
And it becomes a kind of
key performance indicator when you are building the goals for any
new feature, any new development.
I'm not talking here about
marketing and all those things; we are focused just on
cloud cost. So yes, whenever you are deploying a new feature, whatever
cloud cost it could incur has to be
baked into your organizational goals as well.
Add cost optimization to your definition-of-requirement criteria. If you are
using an agile format, working in Jira or
any task management or workflow system,
then instill a culture
where every time a new
task comes in, a new initiative gets spawned or a new strategy is built,
there is always a cost optimization consideration:
can this be done in a way where the
cost is optimized? Now on
the right side you see the bottom-up approach, which goes from
the engineers' level upward, showing with
data that cloud cost is getting saved.
A lot of IDEs now support
visualization and metrics for cloud cost as well.
With your code commits, you can use automation that
shows you what is going to happen when you run this Terraform job or
this Ansible job: which resources it is going to
create and how those resources map to cloud cost.
Now, the other thing can come from
the team level itself: automate all your
CI/CD pipelines with a kind of retention
policy, so nobody has to clean up by hand once
you have run your Spinnaker jobs, GitHub Actions jobs
or Jenkins jobs, and your PR environments get cleaned up after
a certain time. Or a good retention policy
is established: if I'm working
in a dev environment, all my cluster labs automatically
get cleaned up after seven days unless there is movement
on that specific machine. We can always monitor
and do that kind of thing with a lot of automation in place.
So make sure you have those alerts and those notifications coming
up, and then you work around them.
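The seven-day cleanup rule described above can be sketched as a simple staleness check. The timestamps and environment names are illustrative; a real version would read last-activity data from your CI/CD system or cluster audit logs.

```python
# Staleness-check sketch for the seven-day dev cleanup rule: an
# environment is eligible for teardown when it has seen no activity
# for the retention window. Timestamps are illustrative.

from datetime import datetime, timedelta

RETENTION = timedelta(days=7)

def stale_environments(envs, now):
    """Return names of environments idle longer than RETENTION."""
    return [name for name, last_activity in envs.items()
            if now - last_activity > RETENTION]

now = datetime(2023, 9, 20)
envs = {
    "dev-cluster-a": datetime(2023, 9, 19),  # active yesterday
    "dev-cluster-b": datetime(2023, 9, 1),   # idle for 19 days
}
print(stale_environments(envs, now))  # → ['dev-cluster-b']
```

A scheduled job can run this check, notify the owners first, and tear the environment down only if nobody objects.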
Draw dashboards for cost, and alert based on identified thresholds.
This is, I think, a pretty basic requirement:
any time you see that an application's CPU and memory have been idle
for a certain period, you go back and fire
an alert: for the last 15
days this application has not been using the CPU and memory
it requested, so please take a look and re-optimize
your resources. I'm just quoting 15 days as an example. Maybe your business
is too critical, so you take 30 days; or maybe your
application runs only on dev and test environments,
so you can be more stringent with seven days.
I will leave that up to your best judgment.
But yes, there should be some alerts based on your thresholds,
and those alerts should give not only CPU and memory
threshold alerts, but cost alerts as well.
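A combined utilization-plus-cost alert of the kind described might look like the following sketch; the threshold, the 15-day window and all the numbers are hypothetical.

```python
# Sketch of a combined utilization + cost alert: if average CPU stays
# below a threshold for the whole window, emit an alert that also
# states what the idle capacity has cost. All figures are hypothetical.

def idle_alert(app, daily_cpu_pct, daily_cost, threshold_pct=10.0,
               window_days=15):
    """Return an alert string if the last `window_days` of CPU are all
    below `threshold_pct`, else None."""
    window = daily_cpu_pct[-window_days:]
    if len(window) < window_days or any(c >= threshold_pct for c in window):
        return None
    wasted = sum(daily_cost[-window_days:])
    return (f"{app}: CPU below {threshold_pct}% for {window_days} days; "
            f"~${wasted:.2f} spent on idle capacity. Please right-size.")

alert = idle_alert("checkout", [3.0] * 15, [12.0] * 15)
print(alert)
```

Attaching the dollar figure to the alert is what turns a routine utilization warning into something an application owner actually acts on.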
So once people see the data that, okay,
I have requested this many cores, but I am using only this many, and it is
actually costing this much, then everyone
owning that application will keep an open eye and see
where they need to save, where
they need to do resource optimization on their
side. And the data
usually speaks a lot. So good analytics needs
to be done on what we are predicting and forecasting,
and that forecast should give an alert as well: once the
reconciliation and the forecast
do not match, all those alerts can
be raised. That is all for this talk.
I hope I have been able to provide good insights
on what we could do, and maybe you
can take away how to map this to your
specific organization's culture, and then start
doing things in the right fashion, in the right manner, to save your cloud
cost. Cool. Happy learning. Thank you.