Conf42 Platform Engineering 2023 - Online

Optimise your Platform Cost

Abstract

How to enable a deep-rooted cloud cost optimisation culture, where we don't run quarterly or monthly cost optimisation exercises but instead embed cost efficiency in each feature we develop for our platform.

Summary

  • The topic of this talk is optimizing your cloud cost. We'll talk about why cloud cost is important and, at the end, about some of the major points related to building a cost culture.
  • Cloud cost is a big part of your operational expenses. Gartner says that worldwide end-user spending on public cloud would be close to $600 billion in 2023. FinOps is a foundation based on a principle similar to DevOps. Currently 71% are in the crawl mode, 25% in the walk mode, and 3% in the run mode.
  • The first target should always be cost transparency. Only then should we think about cost optimization. Then comes the question of how to analyze and predict. And at the end, everything needs a good visualization.
  • Compute optimization is going to be your first bet. Then you can balance your spot instances and on-demand instances. Long-term strategic wins are quite important for building the cost culture.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, welcome to the Conf42 Platform Engineering streaming sessions. The topic of this talk is optimizing your cloud cost. We'll talk about why cloud cost is important, how to know your cloud cost, how to analyze it, and which quick and long-term approaches you should take. At the end, we'll cover some of the major points related to building a cost culture.
So first, why is cloud cost important? I've quoted here a Gartner report from April 2023, which says that worldwide end-user spending on public cloud would be close to $600 billion in 2023. You can see how important it has become to measure your cloud cost, because it is a big part of your operational expenses.
Before jumping to the FinOps maturity assessment, let me introduce FinOps. FinOps is a foundation built on a principle similar to DevOps. In DevOps, developers and operations join hands. Similarly, here the finance analysts and the operations people join hands to work out the best practices and specifications for cloud financial management.
Now, on the FinOps maturity assessment: this is from the latest report they have published, which shows how the industry is moving on the maturity side. They said that 71% are currently in the crawl mode, 25% in the walk mode, and 3% in the run mode. So what does that mean? Crawl mode means the team is getting acquainted with and educated about cloud cost and has started driving enablement around it. Walk is where that education is spreading across multiple teams, and they have started adopting FinOps processes and practices like forecasting, budget management, and so on.
And the last, run, is where cloud cost education and enablement have spread across multiple teams or to the enterprise level, and are continuously aligned with organization-level initiatives: the OKRs and the goals set by the senior leaders or the board members of that organization.
Moving ahead, step one is to know your cloud cost. Our first target should always be cost transparency, and only then should we think about cost optimization. The reason is very simple: you need a baseline to work from. Cost transparency tells you what you are spending and where your cloud bill is coming from. Then you can go further and target optimization, accuracy, and all the rest. So I've created a basic flow of what you need to do, step by step.
First, identify: each cloud resource that gets spun up should carry an accurate tag for the application it is deployed for. Domains, applications, everything has to be metadata, and that metadata is attached to the cloud resource, whether it is EC2, S3, EBS, or anything else deployed on the cloud. The second step is to enforce: once you have identified the tags, you should have guardrails, policies, and rules which ensure that your cloud resources are getting tagged correctly. Once the tagging is in place, identify the different layers or stages in your organization: the development environment, the operational environment, and the maintenance cost as well. Sometimes you need to run a staging environment equivalent to the production environment, and you need to account for that. A few applications might run on a need basis, where function as a service can be used as well.
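The identify and enforce steps for tagging can be sketched as a small compliance check. This is a minimal illustration, not any cloud provider's API: the `CloudResource` type, the tag keys in `REQUIRED_TAGS`, and the sample inventory are all assumptions made up for the example.

```python
from dataclasses import dataclass, field

# Hypothetical tag keys; use whatever your organization agrees on.
REQUIRED_TAGS = ("domain", "application", "environment")


@dataclass
class CloudResource:
    resource_id: str
    resource_type: str  # e.g. "ec2", "s3", "ebs"
    tags: dict = field(default_factory=dict)


def missing_tags(resource, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or blank on a resource."""
    return [key for key in required if not str(resource.tags.get(key, "")).strip()]


def tagging_report(resources, required=REQUIRED_TAGS):
    """Map resource_id -> missing tag keys, listing only non-compliant resources."""
    report = {}
    for res in resources:
        gaps = missing_tags(res, required)
        if gaps:
            report[res.resource_id] = gaps
    return report


# Made-up inventory for illustration only.
inventory = [
    CloudResource("i-001", "ec2",
                  {"domain": "payments", "application": "checkout", "environment": "prod"}),
    CloudResource("vol-002", "ebs", {"domain": "payments"}),
    CloudResource("bucket-003", "s3", {"application": "search", "environment": " "}),
]

report = tagging_report(inventory)
for resource_id, gaps in report.items():
    print(f"{resource_id}: missing {', '.join(gaps)}")
```

A check like this could run as a guardrail before deployment or as a periodic audit; which of the two fits better depends on how strictly you want to block untagged resources.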
So on-demand usage and all those things have to be considered carefully while identifying the cloud cost. And at the end, everything needs a good visualization. Once you have all the data, use some good dashboards that can visualize the traffic trend and the cloud cost trend against it. Then you will get a clear-cut picture of your cloud cost, which is step one. One of the quotes I have mentioned here is from Charles Babbage. We often say that a little knowledge is more dangerous than no knowledge, but for data the quote says the opposite: errors using inadequate data are much less than those using no data at all.
Step two is to analyze and predict your cloud cost. You now know your cloud cost from step one, so go on and analyze it. We have to go from coarse-grained to fine-grained. Break the cost down by the domains your organization has, then by the different units and how the cost is spread across them, then by applications: the microservices, the databases, and everything else that is running. Based on that, you can further segregate by resources. Is someone using EC2? Is someone using ECS or EKS? I'm quoting AWS services, but whatever cloud you are using, you have to track which resources are in use: object storage, compute, networking resources, and everything else.
Then comes the question of how to analyze and predict. I think you have to do unit economics. And for unit economics you definitely need a focus group working on a set of practices. This is what FinOps always suggests: have a focus group of FinOps members.
People who know finance and people who know how cloud computing works have to be brought together, and only then can they put a value on the cloud cost. You need agreement on the cost attribution model: how you want to bill your different teams for what the cloud is costing. That has to be clearly available in a single-pane-of-glass view for your top leadership to consume.
Now on to data insights. From analysis we have to move to prediction, and prediction requires regular analysis too. Every cloud bill that comes to you should be reconciled with the cost you incurred from the resources you used. There should be a reconciliation mechanism, whether it is an in-house tool, an open source tool, or an enterprise tool, but that reconciliation is definitely needed. You always have to perform chargeback to the application domain. What do I mean by chargeback? Let's say application A in team A incurs a certain cost. Unless you make them aware of the cloud cost being used, they won't be able to own it. The application owner needs to be well aware of how much compute, networking, and calls the application has used, and then they will be able to optimize it, because now they know what they have been doing.
That comes to the third point: you should always be aware of the top idle and underutilized resources. As a platform engineer, a DevOps engineer, or a FinOps member, you should always be able to take recommendations back to the different teams: these are the idle resources, these are the underutilized resources. "Offender" might be the term, but you can always highlight them with the data.
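The three ideas above (chargeback, reconciliation, and a list of top idle resources) can be sketched in a few functions. This is only an illustration under assumptions: the line-item fields (`team`, `application`, `cost`), the 10% idle threshold, and all the sample numbers are invented for the example and do not come from any billing export format.

```python
from collections import defaultdict


def chargeback(line_items):
    """Sum cost per (team, application) from tagged billing line items."""
    totals = defaultdict(float)
    for item in line_items:
        totals[(item["team"], item["application"])] += item["cost"]
    return dict(totals)


def reconcile(invoice_total, line_items, tolerance=0.01):
    """Compare the invoice with the attributed cost.

    Returns (is_reconciled, difference), where difference is the part of the
    invoice that could not be attributed to any line item.
    """
    attributed = sum(item["cost"] for item in line_items)
    difference = invoice_total - attributed
    return abs(difference) <= tolerance, difference


def top_idle(utilization, threshold=0.10, limit=5):
    """Return resources whose average utilization is below threshold, lowest first."""
    idle = [(name, used) for name, used in utilization.items() if used < threshold]
    return sorted(idle, key=lambda pair: pair[1])[:limit]


# Made-up sample data for illustration only.
items = [
    {"team": "team-a", "application": "checkout", "cost": 120.0},
    {"team": "team-a", "application": "checkout", "cost": 80.0},
    {"team": "team-b", "application": "search", "cost": 50.0},
]
costs = chargeback(items)
ok, gap = reconcile(260.0, items)
idle = top_idle({"vm-1": 0.02, "vm-2": 0.65, "vm-3": 0.08})

print(costs)
print("reconciled:", ok, "unattributed:", gap)
print("idle candidates:", idle)
```

In this sample the invoice is 10.0 higher than the attributed cost, which is the kind of gap a reconciliation step is meant to surface, often pointing at untagged resources.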
Now, in every business, whatever organization you work in, there is always some seasonality, which needs to be factored into your traffic pattern. If you are in ecommerce and a festival sale is coming, you have to account for the fact that more resources will get spun up during that time, so your cloud cost will also increase. That seasonality should not be a barrier when you are calculating your cloud cost. When you have all these things sorted out, you are ready to predict your cost. You can always use business intelligence and different models for that. For the baseline, I would say do it for one quarter and see how your cloud cost is increasing, or saturating, based on the number of resources you have used. Then you will be able to predict the upcoming quarters more accurately. The main thing is what the quote says: data really powers everything that we do. Whatever we do on the cloud cost side, there should be supporting data that gives us the measurements to do it.
Now coming to step three, which is about optimizing your cloud cost and what we can do in the short term and the long term. Based on experience, I would say that compute optimization is going to be your first bet. You might see a lot of idle resources. Maybe my instances are not right-sized: I have requested a lot of CPU and memory, but are my resources actually using it? That can be found out with a lot of open source tools right now: Grafana, and with Kubernetes, cAdvisor and others. These tools give you the usage for your attribution. Then you can balance your spot instances and on-demand instances. I understand that every business has some critical applications.
So you need to make a good business judgment about where you put your spot instances, which definitely carry a risk of getting shut off. That's why there should be a good balance of on-demand and spot instances.
Next, on and off. Maybe your dev environments are not using resources on weekends, or your business has some specific festival holidays or similar times when cloud resources are not needed as much. So maybe you have the capability to turn resources on and off. Similarly, consider how you scale. You can scale resources not only horizontally, meaning adding new machines, but also vertically, by increasing the power of your existing machines, which is the first step we always do. How you scale in and out is also pretty important. You should also look at logging and storage costs. Usually we do not take these into consideration, but they end up incurring a lot of cost. So think about retention: if an incident happens, how far back do you need logs, and what is your RCA cycle? All those things are baked into this infra layer.
Now, long-term strategic wins, which are quite important for building the cost culture that we will discuss in upcoming slides. Whenever you are writing an RFC, exploring some tools, or building some microservices, you should always ask: am I doing the right thing with the right resources? Can I switch to some other language? Recently Rust has been getting very popular in terms of performance and resource optimization. So can my application be built with Rust? Then the second step is whenever you are making a buy-versus-build decision.
Definitely keep cost as a big factor in your judgment, as one of the factors in making the choice. I won't say that either one has a specific advantage. With buy, you are directly buying an enterprise-edition tool, adopting it, and then relying entirely on the third-party vendor to manage it. With build, you are building your own, which definitely also incurs cost. That's why I'm saying cost should be measured as an inbuilt part of that decision.
Then your DBs: how you are sharding your DB and which queries are cost-optimized all have to be taken into consideration while deciding on database usage. Which DB suits your use case best has to be considered carefully from a cost optimization point of view. Then how your networking calls are made: are you making a lot of NAT gateway calls? A lot of transit gateway calls? Is your network specifically isolated? Do you really need multi-domain calls? All those networking calls have to be aligned with what you are designing at the very high level. Then your API calls: if you are making a lot of API calls, do they incur cost from any third-party tools, and what are the cost implications? And finally, your data transfer. You are storing data, uploading it directly to object storage or cold storage. What is the cost of retrieving that data? What is the cost of archiving it? All those things have to be considered in depth with your design and architecture team when you are building the long-term strategy for your cloud cost.
The final step is to build the cost culture. I have shared two approaches. I won't say that one is best or the other is not as good.
I would definitely recommend doing both in parallel, based on whatever setup you have in your organization right now. First is the top-down approach. This comes from the senior leadership: in any organization-wide strategic initiative, cost is placed as one of the factors, one of the metrics, and is continuously considered. It is a kind of key performance indicator when you are building your goals for any new feature or any new development. I'm not talking about things related to marketing and so on. We are focused on cloud cost here. So whenever you are deploying a new feature, ask what cloud cost it could draw. That has to be baked into your organization's goals as well. Add cost optimization to your definition of requirement criteria. If you are working in an agile format, on Jira or any other task management or workflow system, build a culture where every time a new task comes in, a new initiative gets started, or a new strategy is being built, there is always a cost optimization consideration: can this be done in a way where the cost is optimized?
Now on the right side you see the bottom-up approach, which starts from the engineers' level and goes all the way up, showing with data that cloud cost is being saved. There are now a lot of IDEs supporting visualization and metrics for cloud cost. You can use that automation so that your code commit shows you what is going to happen when you run this Terraform job or this Ansible job: which resources it is going to create, and how those resources map to cloud cost.
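The idea of mapping the resources a change will create to their cloud cost can be sketched as follows. This does not reproduce how any IDE plugin, Terraform, or Ansible works. The `planned` list is a hypothetical stand-in for whatever a plan would create, and the hourly prices are placeholders, not real cloud prices.

```python
HOURS_PER_MONTH = 730  # common approximation: 8760 hours per year / 12

# Placeholder hourly prices in USD. These are NOT real cloud prices.
HYPOTHETICAL_HOURLY_PRICE = {
    "small_vm": 0.02,
    "large_vm": 0.10,
    "nat_gateway": 0.05,
}


def estimate_monthly_cost(planned, prices=HYPOTHETICAL_HOURLY_PRICE):
    """Estimate the monthly cost of resources a change would create.

    Returns (estimate, unknown), where unknown lists the names of resources
    whose type has no price entry, so they are surfaced rather than ignored.
    """
    estimate, unknown = 0.0, []
    for res in planned:
        price = prices.get(res["type"])
        if price is None:
            unknown.append(res["name"])
            continue
        estimate += price * res["count"] * HOURS_PER_MONTH
    return round(estimate, 2), unknown


# Hypothetical resources that a commit would create.
planned = [
    {"name": "api", "type": "small_vm", "count": 3},
    {"name": "egress", "type": "nat_gateway", "count": 1},
    {"name": "cache", "type": "managed_cache", "count": 1},
]

estimate, unknown = estimate_monthly_cost(planned)
print(f"estimated monthly cost: ${estimate}")
print("no price available for:", unknown)
```

Reporting the unpriced resources separately is a deliberate choice here: a silent zero would make the estimate look lower than it is.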
The other thing can come from the team level itself: automate all your CI/CD pipelines with a retention policy, so that nobody has to take care of cleanup once you have run your Spinnaker, GitHub Actions, or Jenkins jobs, and your PRs get cleaned up after a certain time. Or establish a good retention policy so that in a dev environment, all my cluster labs are automatically cleaned up after seven days unless there is activity on that specific machine. We can always monitor and do this kind of thing with a lot of automation in place. So make sure you have those alerts and notifications coming in, and that you act on them.
Draw dashboards for cost, and alert based on identified thresholds. This is, I think, a pretty basic requirement. Any time you see that your CPU and memory have been idle for a certain period, you send an alert: for the last 15 days this application has not been using the CPU and memory it requested, so please take a look and re-optimize your resources. I'm just quoting 15 days as an example. If your business is very critical, you could take 30 days. If your application runs only on dev and test environments, you could take a more stringent approach with seven days. I will leave that up to your best judgment. But there should be alerts based on your thresholds, giving not only the CPU and memory threshold alerts but also the cost alert. Then people see the data: I requested this many cores, I am using this many, and it is costing this much.
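The threshold alert described above, combining utilization with cost, can be sketched like this. It is a minimal illustration under assumptions: the `AppUsage` fields, the 20% idle ratio, and the per-core daily rate are hypothetical, and the 15-day window is the example figure from the talk.

```python
from dataclasses import dataclass


@dataclass
class AppUsage:
    name: str
    requested_cores: float
    daily_used_cores: list    # average cores used per day, most recent last
    cost_per_core_day: float  # hypothetical internal rate in USD


def idle_alert(app, window_days=15, idle_ratio=0.2):
    """Return an alert dict if usage stayed below idle_ratio of the request
    on every day of the window, otherwise None."""
    recent = app.daily_used_cores[-window_days:]
    if len(recent) < window_days:
        return None  # not enough history to judge
    limit = app.requested_cores * idle_ratio
    if any(used >= limit for used in recent):
        return None  # the application was busy on at least one day
    avg_used = sum(recent) / window_days
    unused_cores = app.requested_cores - avg_used
    return {
        "application": app.name,
        "requested_cores": app.requested_cores,
        "avg_used_cores": round(avg_used, 2),
        "unused_cost": round(unused_cores * app.cost_per_core_day * window_days, 2),
    }


# Made-up usage data for illustration only.
sleepy = AppUsage("reports", requested_cores=8, daily_used_cores=[1.0] * 15,
                  cost_per_core_day=2.0)
busy = AppUsage("checkout", requested_cores=4, daily_used_cores=[3.0] * 15,
                cost_per_core_day=2.0)

alert = idle_alert(sleepy)
print(alert)
print(idle_alert(busy))
```

The `window_days` parameter is where the judgment call from the talk goes: a longer window such as 30 days for critical applications, a shorter one such as 7 days for dev and test.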
Then everyone owning that application will have their eyes open and will see where they need to save and where they need to do resource optimization on their side. The data usually speaks a lot. So good analytics need to be done on what we are predicting and forecasting. The forecast can drive alerts as well: when the reconciliation and the forecast do not match, those alerts can be raised.
That is all for this talk. I hope I have been able to provide good insights on what we could do, and that you can take away how to map this to your specific organization's culture and start doing things in the right manner to save your cloud cost. Cool. Happy learning. Thank you.

Vaibhav Chopra

Engineering Leader @ Expedia Group
