Beyond Connectivity: Full Life-Cycle Infrastructure Orchestration and Cost Engineering

Abstract

Join us to learn about our exciting shift from black-box infrastructure to self-managed empowerment for developers. Our IDP streamlines decentralization with Infrastructure as Code and a user-friendly UI, ensuring secure, scalable support for over 10 million connected appliances worldwide.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hello, everyone. Super happy to join Conf42. This is Long from Electrolux Group in Sweden. Today, my colleague Vladimir and I are going to share our work in the field of platform engineering, including infrastructure orchestration and cost engineering. Today's talk is divided into four parts. First, I will briefly introduce the context of the whole work. Secondly, Vladimir will share more about infrastructure orchestration with the help of our internal developer platform. Then I will talk more about cost engineering and also our open source projects.

Let's start with the context. At Electrolux, there are many categories of appliances. In our organization, we mainly work on home appliances in three categories: taste (an oven, for example), care (a washing machine), and wellbeing (an air purifier, for example). These are connected appliances. Imagine this lady bought our oven and connected it to the internet via our mobile application. She places some food in it, and the camera recognizes the food and suggests the cooking mode, temperature, and duration. Then she can interact with the oven using the voice assistant in the mobile application and set everything up with several clicks. Finally, the oven cooks the food and keeps sending updates to the mobile application.

To support this magical experience, there are many different developer teams. From our side, we work as a platform engineering team, and we want to empower our developers to focus on innovation and reduce complexity. That's why we built our own internal developer platform (IDP), which is the core of our developer experience. In this way, we provide support by adding features to the platform instead of repeating the work. Our platform today covers different domains, like infrastructure provisioning, service management, etc., and we keep adding new domains. One domain that we added recently is cloud cost, which I will share more about during this talk.
However, I have to say that an IDP is not a silver bullet for all complex scenarios. We will share the whole journey with you in the next 20 minutes.

So let's take a further look at our tech stack, which is very diverse. Many different programming languages are used by our developer teams. We also have a set of CI/CD tools like CircleCI, Jenkins, Argo CD, etc. As appliances are connected from different places all over the world, we need to run various services globally on different cloud providers like AWS, Azure, and Google Cloud. However, this diverse tech stack also brings us extra challenges. So feel free to leave a comment if you or your developers have faced, or are still facing, some issues from the left column: for example, too many different tools, multi-cloud setup, etc. Or drop another comment if you work as an SRE and are facing some challenges in the right column, like too many direct messages, not enough time for development, and so on. If you have challenges in both columns, we feel you very much because we were in a similar situation. Now I would love to invite Vladimir to share the journey of addressing these issues with the help of the IDP. Vladimir, welcome to the stage.

I can explain how it was in the past. Hello everyone, my name is Vladimir, I am a senior SRE in the platform engineering team at Electrolux. About three years ago, the SRE team handled a bunch of support requests daily: routine requests such as access control, infrastructure issues, getting data from prod databases, etc. Every day it feels like you are the most important person in the world. But if you work this way, developers are blocked by us and wait a long time for our response. We decided to automate as much as possible. We started with a Python script that automated our Terraform code. So at least we could do these things faster, but it didn't solve the challenge.
We brainstormed about what we should provide to developers so that they don't need to ask us for support and can manage infrastructure by themselves. When I joined the organization, there were around 20 developers. Now there are more than 300, and our organization is still growing. Unfortunately, the size of our platform team is not growing that much. So we decided to provide a platform instead.

Our first solution was a self-made platform. Next, like many in the industry, we moved to Backstage. When it was first released, we used only the UI from Backstage, but later we adopted the backend part too, with most of its features such as authentication and catalog, to avoid extra work on our side.

The developer journey starts from getting all the access necessary for work. We launched a user onboarding plugin for new joiners to replace the manual steps we did before. Now, engineering leads can grant every access that is required and never need to reach out to us. On this slide, you can see the different vendors that are covered by this plugin.

After they have got all the necessary access, it comes to infrastructure. You wouldn't expect developers to be able to create all resources, even if they have access to everything. Somehow, we need to control it, and it's a complicated task for them to do without knowing the infrastructure and the best practices for setting it up. So we looked into the two types of use cases for which developers usually come to us for support. The first is creating a resource like AWS Kafka, Redis, etc. The second is creating new resources that are not supported by the IDP, for example for a POC, because they need to investigate new technologies to use in IoT. We created a UI for them where they can define and provision their infrastructure.

Here is an example. Our backend runs in Kubernetes, so dev teams very frequently need to provision or destroy EKS clusters. Developers can go to the IDP, select the EKS template, and the IDP will provision different resources for them.
For example, it creates a VPC and an EKS cluster. Then it installs the autoscaler, ingress controller, Datadog agent, and our own cost exporter. Developers don't need to know how to configure the cluster to publish logs or metrics into our observability platform. Cost data will be displayed in the IDP automatically as well.

Sometimes developers need something that is not supported by the IDP. They create their own Terraform modules for provisioning the resources. For example, we don't use DNS configuration that frequently, and there is no need to develop an IDP component for that. Any Terraform code still goes via the IDP, and we keep managing and tracking everything in one place. We already have a bunch of different infrastructure templates that are defined by our team. Developer teams can use the templates to provision the infrastructure they need in the cloud.

How does it work? It has a few layers. The first one is a backend API that serves our developers and provides an interface for them. The second one is a set of Terraform modules that are ready for production usage. The third one is the cloud APIs we use for getting data from the cloud, like metadata, or even for creating resources there. We version the templates, and if we make improvements in a new version, the running infrastructure will be updated automatically.

Another very important part of the platform is the service. Developers deploy services every day, and they need a tool to see their current status. We added services to the platform as well. Now developers can use different features like CI/CD status, secret lists, and images in the Docker registry, and manage services in Kubernetes by restarting them, killing pods, etc. Everything is available on one page, with no need for direct access to the clusters. If they need to roll out a change or a specific version of their service, we have a manual deployment feature for that in the IDP.
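The template versioning mentioned above can be illustrated with a small Python sketch. It only shows the bookkeeping step of finding provisioned infrastructure that lags behind the latest release of its template; the `Stack` type, the semantic-version format, and all names are assumptions made for this example and are not taken from the actual IDP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stack:
    """A provisioned piece of infrastructure created from a template (hypothetical shape)."""
    name: str
    template: str
    version: tuple[int, int, int]


def parse_version(text: str) -> tuple[int, int, int]:
    """Parse a 'major.minor.patch' string into a comparable tuple."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def stacks_to_update(stacks: list[Stack], latest: dict[str, tuple[int, int, int]]) -> list[Stack]:
    """Return the stacks that run an older version than the latest release of their template."""
    return [
        stack
        for stack in stacks
        if stack.template in latest and stack.version < latest[stack.template]
    ]


if __name__ == "__main__":
    # Assumed data: one template with two clusters provisioned from it.
    latest_versions = {"eks": parse_version("1.3.0")}
    provisioned = [
        Stack("team-a-eks", "eks", parse_version("1.2.5")),
        Stack("team-b-eks", "eks", parse_version("1.3.0")),
    ]
    for stack in stacks_to_update(provisioned, latest_versions):
        print(f"{stack.name} should be re-applied with template {stack.template}")
```

In a setup like the one described, the stacks returned here would presumably be re-applied through the production-ready Terraform modules; that step is left out because the talk does not describe how it is triggered.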
You are getting not only provisioned infrastructure but also more information about how your infra works, and for developers it was a big change. One backend developer once shared with me that infrastructure used to be like a black box for him: something abstract, scary, and hard to understand. After starting to use the IDP, he learned much more easily, because we visualized it and ensured that it's tested and safe for developers to use. And he was able to start doing things on his own.

Internally and externally, because of public talks like this one, we started to receive feedback that it would be great to have the same solution in other companies, since everyone is facing the same challenges. Meanwhile, we were no longer that happy with the support we could provide, because we had built the IDP in a very customized way. And it supported only the AWS cloud, while people in our organization started to request Google and Azure clouds. We did another brainstorming and ended up building an open source version that is soon going to replace our current IDP. You can see what it looks like in the video. The first version of our open source project is going to be focused on infrastructure management, as I showed today, and we will add many more features for services, cost, etc. a bit later on.

In summary, we have automated a lot of daily routine tasks and gained time to focus more on developing new features that will be very helpful for everyone. In the next part, my friend Long will explain how we are going to open source it. Long, the stage is yours.

Okay. Thank you, Vladimir, for the nice presentation. Next, I would like to share more about cost engineering and our open source projects. Before going into the details, let's take a look at an example first, or maybe a question for you: does your infrastructure or service burn money? In our case, Electrolux sells appliances to consumers at a fixed price, as shown on the slide.
After buying the appliance, the consumer can onboard it to the cloud and interact with it using our mobile application. This means both the mobile application and the appliance communicate with our cloud infrastructure, which constantly costs some money. Consumers only pay the fixed price of an appliance. For Electrolux, however, it is crucial to keep track of the cloud cost.

In order to better control the cloud cost, we started by collecting the needs of different teams. The management team usually asks questions like: What is the total cost of Project X? What is the cost trend? How can we optimize cloud cost? These are usually higher-level questions and require cost aggregation from multiple clouds. Developers care more about the cost of their own services, for example: what is the correlation between service usage and cost? Are there the same kinds of needs in your organization?

In many cases, it's true at our organization. However, we also received different feedback from developers: they just don't care about cloud cost. I think I understand this well, because developers want to focus more on writing code and delivering features. The problem is that when there is a spike in cloud cost, the management team usually asks us to analyze the reason and how to fix it. Our platform engineering team can start the investigation, but we don't know how developers actually use the related cloud resources. Then we have to kick the ball back to the developer teams. At that point, they have to care about cloud cost and take proper actions to control it.

In order to address these issues, there are two directions. One is multi-cloud cost aggregation, which is helpful for answering high-level questions. Existing solutions have some limitations.
For example, users might be overwhelmed by the features, and the subscription might not be cheap, even if you only need several key features. The other direction focuses more on developers' requirements and is related to fine-grained cost breakdown. Each cloud provider usually provides nice tools for analyzing cost. But the limitation is that one provider's tools can only check its own cloud cost. Users have to switch among different tools when they have costs in multiple clouds.

So in our organization, we have implemented two different products for addressing these limitations. One is called InfraWallet, an open source multi-cloud cost aggregation solution, which I will introduce later. The other is a set of cost metric exporters, which I will introduce on the next slide.

So a cost metric exporter is a service that fetches cost information from a cloud provider and transforms it into Prometheus metrics. For example, we have the AWS cost exporter. It uses the AWS Cost Explorer APIs to get the cost data from an AWS account regularly, then exposes the data as Prometheus metrics. The data then becomes available for different observability platforms like Grafana and Datadog.

There are several benefits of having cost exporters. First, the architecture is vendor neutral, because the cloud costs from different cloud providers are finally exposed as metrics. It's extensible as well, because you can implement your own cost exporter by following the same steps. These exporters can be deployed into different clusters, which makes it scalable too. Furthermore, the ultimate benefit of having cost exporters is that we can now have cost metrics displayed on the same dashboard as the service metrics. This is very helpful for developers to understand the correlation between a service's usage of cloud resources and its total cost. Our cost exporters have been listed on the official Prometheus documentation page.
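To make the exporter pattern concrete, here is a minimal Python sketch of the transformation step: cost records go in, text in the Prometheus exposition format comes out. It is not the code of the open source AWS cost exporter. The `CostRecord` shape, the metric name `cloud_daily_cost_usd`, the USD currency, and the stubbed `fetch_costs` function are all assumptions for illustration; a real exporter would call the provider's cost API on a schedule and serve the result over HTTP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostRecord:
    """One daily cost figure for a cloud service (hypothetical shape)."""
    provider: str
    account: str
    service: str
    amount_usd: float


def escape_label(value: str) -> str:
    """Escape backslash, double quote, and newline in a Prometheus label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(records: list[CostRecord], name: str = "cloud_daily_cost_usd") -> str:
    """Render cost records as one gauge in the Prometheus text exposition format."""
    lines = [
        f"# HELP {name} Daily cloud cost in USD per service.",
        f"# TYPE {name} gauge",
    ]
    for record in records:
        pairs = (
            ("provider", record.provider),
            ("account", record.account),
            ("service", record.service),
        )
        labels = ",".join(f'{key}="{escape_label(val)}"' for key, val in pairs)
        lines.append(f"{name}{{{labels}}} {record.amount_usd}")
    return "\n".join(lines) + "\n"


def fetch_costs() -> list[CostRecord]:
    """Stand-in for a provider API call; the values below are made up."""
    return [
        CostRecord("aws", "iot-prod", "AWS IoT", 120.5),
        CostRecord("aws", "iot-prod", "Amazon EC2", 80.0),
    ]


if __name__ == "__main__":
    print(render_metrics(fetch_costs()), end="")
```

Because the output is ordinary metrics text, an exporter for another provider only needs its own `fetch_costs`; the rendering and everything downstream of it stay the same, which is the vendor-neutral property described above.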
You can find more information via the link on the slide.

As mentioned just now, we implemented another product called InfraWallet, which is a Backstage plugin for multi-cloud cost aggregation. To use InfraWallet, site admins only need to configure a set of integrations, for example, some credentials for their Azure subscriptions and AWS accounts. Then InfraWallet retrieves all the cost data from the different cloud providers and aggregates the cost based on the user's needs. This answers the previously mentioned questions, like: what is Project X's cost in the cloud computing category? I hope you become interested in it. I will show you more about using it on our demo website later on.

Before the demo, I would like to mention that we recently released a new feature in InfraWallet called Business Metrics. Why do we need such a feature? Because without context, it is hard to answer whether an increase in your cloud cost is good or not. For example, if the number of active users on your platform has doubled while the related cloud cost has increased only a little, then it may make sense, because the unit cost of one active user drops. That's why unit economics is crucial for understanding business efficiency. InfraWallet enables users to fetch metrics from different observability platforms and display them on the same graph as the cost.

Here I have a real story to share. Around one month ago, we observed that the daily cost of the IoT category had increased a lot. However, the number of connected appliances had not increased that much. So we shared the findings with our IoT team, and they quickly located the problematic environment thanks to InfraWallet. The reason was that some of the air conditioners were misconfigured and sent unnecessary messages to AWS IoT. In summary, without related business metrics, it becomes difficult to justify an increase in cloud cost. I'm proud to say that both
the cost exporters and InfraWallet have been open sourced in our GitHub organization. InfraWallet also provides a demo website where you can easily try its features. So now let's take a look together.

So this is our GitHub organization, Electrolux Open Source Software. And here is the repository of InfraWallet, where you can find all the documentation, source code, and related issues. You can also visit our demo website using this link.

So here is a sandbox environment on Backstage, where we have installed the latest version of InfraWallet. The cost data and metric data are faked, but I think it's easy to get the idea of InfraWallet. Here you can see that InfraWallet fetches costs from different clouds, and by default, it aggregates everything for you. So at first sight, you are able to check the total cost of your cloud resources. Then you can select different time ranges. It sends another set of API requests to fetch the cloud cost and also caches the data, so that you don't need to fetch the cost from the cloud every time.

All the cost data can be analyzed and grouped by different dimensions. For example, by the name of the account: here we have several AWS accounts, an Azure account, and also a Google account. It shows you a summary of each account's cost, its portion, and also the increase or decrease for each account. You can also check the total cost per provider and per cost category. These are the categories defined by FOCUS, and InfraWallet aggregates costs from different cloud providers into these standard categories. For example, the compute category can contain AWS EC2 costs plus Azure Functions costs. Of course, you can also switch between monthly cost and daily cost, and you can use filters to focus on, for example, a specific category of cost or a specific vendor's cost.
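The grouping shown in the demo can be sketched in a few lines of Python. This is not InfraWallet's implementation: the record layout, the `CATEGORY_BY_SERVICE` mapping, and the sample numbers are assumptions, and the category names only follow the style of the FOCUS service categories.

```python
from collections import defaultdict

# Hypothetical mapping from (provider, service) to a provider-neutral category.
CATEGORY_BY_SERVICE = {
    ("aws", "Amazon EC2"): "Compute",
    ("azure", "Azure Functions"): "Compute",
    ("aws", "Amazon S3"): "Storage",
}

# Made-up monthly cost records, as they might look after being fetched and cached.
SAMPLE = [
    {"provider": "aws", "account": "iot-prod", "service": "Amazon EC2", "month": "2024-07", "cost": 90.0},
    {"provider": "aws", "account": "iot-prod", "service": "Amazon EC2", "month": "2024-08", "cost": 100.0},
    {"provider": "azure", "account": "analytics", "service": "Azure Functions", "month": "2024-08", "cost": 50.0},
    {"provider": "aws", "account": "iot-prod", "service": "Amazon S3", "month": "2024-08", "cost": 20.0},
]


def aggregate(records, dimension):
    """Sum costs per month, grouped by 'provider', 'account', or 'category'."""
    totals = defaultdict(lambda: defaultdict(float))
    for rec in records:
        if dimension == "category":
            # Services without a known mapping fall into a catch-all group.
            key = CATEGORY_BY_SERVICE.get((rec["provider"], rec["service"]), "Other")
        else:
            key = rec[dimension]
        totals[key][rec["month"]] += rec["cost"]
    return {key: dict(months) for key, months in totals.items()}


def change_between(totals, previous, current):
    """Return the cost increase (positive) or decrease (negative) per group."""
    return {
        key: months.get(current, 0.0) - months.get(previous, 0.0)
        for key, months in totals.items()
    }


if __name__ == "__main__":
    by_category = aggregate(SAMPLE, "category")
    print(by_category)
    print(change_between(by_category, "2024-07", "2024-08"))
```

With the sample data, the EC2 and Azure Functions costs for August end up in the same Compute group, which mirrors the cross-provider category aggregation described in the demo.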
And here you can also switch to show or hide business metrics and check the details of each category's cost. So feel free to give it a try and leave your comments on GitHub. We will be super happy to hear your feedback.

All right, back to the slides. The tools are nice, but I have to say that they are only a good starting point. The goal is to optimize the cloud cost, and for that it is fundamental to have sufficient cost observability. With the help of these tools, developers are able to spot a service's abnormal behavior regarding cost. Here is an example of fixing a logging misconfiguration and saving unnecessary cloud cost.

We also organize cost optimization programs regularly in our organization. It works like a competition and is quite interesting for developers to participate in. We provide cost observability tools via our IDP, and then developers analyze their cloud costs and propose actions for saving cost. In the last round of this program, we were happy to see that developers made many good optimizations from different directions, and we saved more than 30,000 per month.

To summarize: for developers, it's very important to achieve cost efficiency starting from cost awareness. For platform engineers, when you design your cost management solutions, it is recommended to follow FOCUS, which is a vendor-neutral, open source specification for consistent cost and usage datasets. And for everyone, cost engineering is not only about techniques but also about culture. For example, cost optimization programs can be very helpful for your organization to build such a culture.

So far, we have shared our recent work in infrastructure orchestration and cost engineering. To summarize today's talk: we started by discussing various challenges, and then Vladimir shared more about infrastructure orchestration and service management using the IDP. Finally, I shared our work on cost engineering and open source projects. You can find all the related open source projects here.
Feel free to give them a try and drop your comments on GitHub. Thank you so much for listening to our talk. We hope you enjoyed it, and let's connect if you have any questions. See you next time at Conf42. Cheers.

Conf42 Platform Engineering 2024 - Online

September 05 2024 - premiere 5PM GMT

Long Zhang

Senior SRE @ Electrolux

Vladimir Shcherbinin

Senior SRE @ Electrolux
