Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to this session, a story of how we accidentally created a cloud on top of our cloud. My name is Mophi. I'm a software engineer and developer advocate at IBM. I mostly do container stuff, collect stickers, and write Go code. I can be found on the internet at mophicodes, so if you have any questions afterwards, or you just want to connect, please feel free to find me there. So before we start this session,
the first definition that is important for us to understand is
what is a cloud? Cloud is a very overloaded term. It means a lot of different things to a lot of different people. But for the purposes of this talk, and given I'm the one giving this talk, I get to define what cloud means in this context. In the simplest terms, a cloud is someone else's computer. It's on demand, and it has a way for users to access the service. It could be a UI, it could be a CLI. Most of the notable clouds you know have a UI where you can go and click some buttons, and they also have a CLI, so you have multiple ways of accessing the cloud. But as long as it has some way, it can be a cloud. Now let's talk about the problem we initially set out to solve. I work for IBM, in developer advocacy. We have a cloud, and we run a lot of workshops and other things on it. One of the most cost- and time-consuming kinds of workshop to run is the Kubernetes workshop, and we run quite a few of them. In a given month we do about ten to 20 of these workshops, which means about two to four a week on average. And each of these workshops requires about 30 Kubernetes or OpenShift clusters to be spun up. The highest we have done is probably 150; the lowest is probably five or ten. But on average we're looking at about 30 Kubernetes clusters per workshop, spinning cluster resources up and down.
If you are just following the UI, it's a manual process, right? You are clicking buttons, filling in forms. Or if you're using the CLI, you are typing in a command, and you can create one cluster at a time. Each cluster also requires access to a load balancer and the subnet that we have per data center, and there is a limit on how many clusters you can put on the same subnet. So if we are trying to spin up 30 Kubernetes clusters, we couldn't do all of that in a single data center, because there could be other clusters already running in the same data center, so we can hit some upper limit. We can spin up more subnets per data center, but even then we have an upper limit on how many different Kubernetes clusters we can run under a single account. That limit by default is a few hundred, about three or 400, I don't know the exact number. But if you think of the normal use case for a single customer, having three or 400 different Kubernetes clusters is not that normal. So the use case we have is not something we regularly see our customers trying to do. We also have a limit on how many volumes you can spin up per data center. Again, that's a limit we can increase, but in day-to-day operation, a regular customer's Kubernetes clusters don't need that many persistent volumes attached. Also, when you are spinning up clusters manually, or the cluster admin or account owner is doing that, we don't really have any way to collect workshop metrics: how many people attended a workshop, how many people actually used the clusters, how much of each cluster got used. We have no way of collecting any of that information.
And right now the team that owns the process is basically two people, right? Ten or 20 workshops a month doesn't seem like too much, but if you think about all the individual resources that need to be created and cleaned up, it adds up to quite a hefty amount of work. And for these two people who own the accounts, this is not their full-time job; they have other responsibilities they have to take care of. So how do you go about solving
this problem? There are many ways to solve this, right? Like we
could write a custom code, use something like terraform or
something like ansible, or write some bash script. So there is a number of different
ways we could have done this, but let's talk about the worst possible
way to try to solve this. In my opinion, the worst possible way would
be for every request that comes in, we spin
up these users in the UI, manually give access to the user
in the account, and then do that for every users,
for every workshop, for every cluster. And if we were to go about doing it
this way, between the workshop leader and the owner,
we're talking about about four hour power workshop,
spinning up and spinning down. Even if you're talking about max efficiency
in terms of clicking buttons. Also, this means if you are
trying to assign access during the workshop manually to
individual users, youll are spending a lot of time during the workshop
time to do all those user permission things.
So the actual workshop time would have to be cut short because you
spent too much time giving people access to things. So again in our cloud UI
you can go in and select things: the Kubernetes version, classic or VPC, a location. You can select a size, and you can create one cluster, then repeat the process over and over again for each individual cluster. That is not scalable, nor is it something we can do efficiently to save time. If you're spending four hours per workshop as the account admin and you have four workshops that week, you're talking about two full working days just spent on setting up workshops, and you have actual day jobs. So we can't really do that; it wouldn't work. The problem with this, again, is two workdays spent by the account owner, and we'd pretty much be at the upper limit of how many workshops we can support. Right now we support up to four, and that's pretty much all we would have been able to support doing it this way. It's also a huge cost center, because resources need to be created
earlier and deleted later. Let's say we have a workshop on Monday and we're doing this manually. Best case, the person does it at the end of day Friday; now, Saturday and Sunday, the resources are just sitting there accruing cost. We also wouldn't be able to do any higher-level workshops. If you want to do something on Istio or Knative, things like that require installation on top of Kubernetes, and we don't really have any way to handle that without manually adding more steps and more time to setting up the clusters. So that was a no-go from the get-go; we never really considered it as the way to solve this problem. So, a passable solution: let's say a solution that would work, and in many cases this is probably the solution you are using, and it is passable. For the most part it's an "if it ain't broke, don't fix it" kind of thing, so you don't really think of improving it. What I mean by that is: we use the CLI in a bash script, update a config file for each workshop request, use the same config file later to delete the resources, and use a GitHub repo for tracking cluster requests.
So we have a GitHub repository where users come in and say, I need a cluster, or I need 20 clusters on this day for this workshop. That gives us a kind of dedicated record where we can find what workshops we helped with, how many resources they needed, and a lineage of what work was done for each workshop. We also have a simple web app that handles user access. During the workshop, a user goes to a URL, puts in their email address, and their account is automatically added with the right permissions, so now they can access their Kubernetes cluster.
And this is not bad. The bash script itself is quite simple; it looks something like this. You load up the environment file, so now you have all the environment variables. From there you log in to IBM Cloud using the CLI and get access to the API key. With the API key, you could either make curl commands, like it does here, or you could talk to the CLI directly to talk to IBM Cloud and access a Kubernetes cluster. So again, we have multiple ways to handle this, and this is the way we're doing it here. So a bash script works in many cases.
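For a sense of what that flow looks like outside of bash, here is a minimal Go sketch of the token exchange. The endpoint and grant type are from IBM Cloud's documented IAM token API; the API key value is a placeholder:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// iamTokenRequest builds the POST that exchanges an IBM Cloud API key
// for an IAM bearer token; this is the same exchange the bash script
// performs via `ibmcloud login` and curl.
func iamTokenRequest(apiKey string) (*http.Request, error) {
	form := url.Values{}
	form.Set("grant_type", "urn:ibm:params:oauth:grant-type:apikey")
	form.Set("apikey", apiKey)
	req, err := http.NewRequest(http.MethodPost,
		"https://iam.cloud.ibm.com/identity/token",
		strings.NewReader(form.Encode()))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	req.Header.Set("Accept", "application/json")
	return req, nil
}

func main() {
	req, _ := iamTokenRequest("fake-api-key")
	// Sending req with http.DefaultClient.Do(req) returns JSON containing
	// an "access_token"; that bearer token then authorizes calls to the
	// cluster APIs.
	fmt.Println(req.Method, req.URL)
}
```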
This is probably as far as a lot of you will go, and depending on how much automation you're looking for, it might be good enough if you're doing something sporadically. A full-blown software solution is not always the best answer; this is good enough for most cases. Sometimes this is all you need, and you don't need to invent some new way to do it. As for the impact: immediately, just from a bash script, we saw a significant improvement over the previous solution, with cluster access now automated by the simple web app. During the workshop itself, where time is most limited, we didn't have to worry about the workshop lead, or someone helping with the workshop, wasting time getting individual users set up; that was automated by the web app. But it is still a manual process. Because you are running a bash script, your computer has to be open, and if something goes wrong, bash scripts don't really give you too many ways to handle errors. And because it's a script, someone has to go and run it; it's not running on a cron or anything, since each request is different. On the other hand, the actual workshop can now scale to a fairly large number of attendees, and that is an improvement. It's just a for loop, so the work to spin up five clusters versus 500 clusters was basically the same, except for how long it took. So this was not a bad solution, and it has served us well
for a fairly long time. And this is where our current solution comes into play. I am a big believer in Go and I love Go programming, so I thought, you know what, this is something we can improve by building an application that automates a lot of these things using Go. So, the current solution: we have a UI that pretty much replicates how our cloud UI looks and lets you select all these options yourself. We have an API to talk to the infrastructure, written entirely in Go. We have a way to spin up the resources; for our purposes we use AWX. AWX is the open source version of Ansible Tower, which is a worker, or pipeline-runner, kind of thing that can run the multiple jobs we need. And finally, we have a way to clean up the resources: we use the same API to talk to the IBM Cloud infrastructure, and we run post-provisioning tasks on each cluster, for which we also rely heavily on AWX. The UI looks
something like this. If I were to look at the same account I was looking at before: this is part of the application I created, and that's what we mean by accidentally creating a cloud on top of our cloud. It has pretty much the same look and feel as IBM Cloud, but it is not part of IBM Cloud itself. Some of the key things that differentiate what this does from what IBM Cloud offers: in this one, we can delete multiple things at the same time, and when we go to create new resources, we can select how many clusters we want to create as well as where we want to deploy those clusters, say, all of our data centers in the North America zone, right? So all of a sudden this does a round-robin distribution of clusters. If you have 300 clusters that need to be created and you don't really care exactly which data center they're in, you can spin up all 300 clusters and they will get round-robined into different data centers. Now you're not putting too much load on a single data center for volumes or network subnets. You can add multiple tags, you can select a few different services, and you can select post-provisioning tasks. We can have post-provisioning tasks like installing Knative or OpenShift, selecting a different version here; we can select things like Cloud Pak for Integration, or the Cloud Native Toolkit. These are different pieces of software we are trying to teach workshops on, but that was previously very difficult to do, because the post-provisioning task of installing that software takes a fair amount of effort. Now we can automate that using Ansible playbooks. So what's lacking in
this is that, in the UI, it's still a manual process. On the UI side you still have to click some buttons and fill in some forms, and so weekends and time zones are still a problem.
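As an aside, that round-robin placement is easy to sketch; the data center names below are just examples, not the real list:

```go
package main

import "fmt"

// roundRobin spreads n clusters across the available data centers in
// turn, so no single data center (and its subnets and volumes) absorbs
// the whole batch.
func roundRobin(n int, dataCenters []string) map[string]int {
	counts := make(map[string]int, len(dataCenters))
	for i := 0; i < n; i++ {
		counts[dataCenters[i%len(dataCenters)]]++
	}
	return counts
}

func main() {
	// Example data center names, standing in for the real list.
	fmt.Println(roundRobin(10, []string{"dal10", "wdc04", "tor01"}))
}
```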
So if you have a workshop on Monday, someone still probably needs to do this on Friday evening. The process is fairly fast, but we still need to spin things up earlier than we have to, just because we wouldn't have people working over the weekend just to spin up some resources. We still don't have any real metrics collection from the workshops themselves. We spin up a lot of resources, and we get to see how many clusters are being used while the workshop is running, but we don't have any persistent storage for this information.
Out of ten clusters, eight got used, six completed the workshop, and things like that. We also have no way to schedule creation and deletion of the resources, and this is one of the big ones that would really help us save a lot of time and money, because if you can schedule a cluster to be created three hours before the workshop, that by itself probably cuts the cost of the workshops in half. But even with what's lacking, this UI has a huge impact: it makes it easy to teach anyone to spin up resources. Now it's not only the two people who own the account who can do this; anyone can just go and click some buttons, and it'll take probably five minutes to teach them how to create resources. Because it's a custom app written by us, we can also add retry logic and rate limiting to requests. So if, for any reason, any of these requests to the underlying infrastructure fails, we can have a cooldown and retry, to make sure we are not overwhelming things with the number of clusters we request. So, in terms of architecture,
it's fairly simple; this is what it looks like. We have the Kubeadmin application, which is the cluster manager, the UI, and the provisioner, plus a notification system for sending email on error. All of that is a single giant application that talks to AWX, our job runner, or pipeline runner. And AWX spins up a single web app for each workshop, the one we talked about earlier that gives individual users access to the account. So this is a working solution, and it is being used right now for a lot of our workshops. But what does the next step, the next evolution of this, look like? The ideal solution, the one we're currently working towards, is to automate workshop requests from GitHub all the way to cluster creation, with an approval step in place.
Right now we have an internal GitHub repository where people come in and request workshops. What we want to do is process each request using something like a cloud function and automatically create a request in our Kubeadmin application, with some manual approval so we make sure that only approved workshops get resources created for them. Next is scheduled creation and deletion of resources. For every workshop we get a request for, we have a time when the workshop is supposed to start and a time when it ends, so we can automatically schedule creation of the resources a few hours prior to the workshop start time and deletion a few hours after the workshop is supposed to end.
That way we don't need the manual intervention and time from our engineers and developer advocates who are doing this manually right now. Once we reach a point where we have a lot of automation around this, we can also open it up to other teams in our org, and in other orgs, to run their resource creation through it. Although this was initially built for developer advocacy, there are other teams that work directly with clients, and other teams that do these workshop kinds of things, and they could then use this tool to create their resources and clean them up without any manual intervention. So we can basically serve a lot more people than we currently do. And finally,
the ideal solution would give us proper metrics for the workshops we run on these accounts. We want to know, if we did ten workshops, what kind of completion rate we saw from them. This will allow us to see if we can cut down on the amount of resources created by right-sizing workshop requests. If someone requests 50 clusters for a workshop and we consistently see about 30 people using those clusters, we can probably right-size the next request that comes in to 35, with some buffer, and basically cut down on cost.
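That right-sizing calculation, using the 50-requested, 30-used example above, could look something like this; the 15% buffer is an arbitrary choice:

```go
package main

import (
	"fmt"
	"math"
)

// rightSize recommends a cluster count for the next request from the
// observed peak usage plus a safety buffer, never exceeding what was
// originally asked for.
func rightSize(requested, observedPeak int, buffer float64) int {
	n := int(math.Ceil(float64(observedPeak) * (1 + buffer)))
	if n > requested {
		return requested
	}
	return n
}

func main() {
	// 50 clusters requested, ~30 actually used: suggest 35 next time.
	fmt.Println(rightSize(50, 30, 0.15))
}
```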
So, the impact. Again, anyone within the org can now use this, so there's no more dependency on our small team. People wouldn't just be asking us to spin up these resources; it would be mostly self-served, so we don't have to spend engineering hours just manually clicking buttons. We also wouldn't need to spend cycles managing resources, because a lot of that we can automate with schedules. Obviously a big part of it is cost saving; it's an internal cost center, but it's still a cost center that we have to be mindful of. And also, as I said, we can make better decisions when new workshops and events are being considered: how many resources to allocate, and is this workshop even worth the engineering time spent by the workshop runner? So, currently I
did some rearchitecting of this system, and many of these pieces are thought of as microservices, though over time we could consolidate them into bigger services. As of now, what you can see here is the AWX service, and you still have the Kubeadmin service, but we're breaking some of the other responsibilities down into smaller services: a provisioner, a scheduler, and a reclaimer that takes care of reclaiming users and deleting the clusters from the list. We also have a notifier service that right now handles sending notifications via email and via GitHub, posting information back to the GitHub issue itself, and we can update it to also send notifications to Slack; that's a service we're working on right now. We also have a cloud function in the works so that when a new issue is created, we can take that information and automatically create the request in our scheduler. All of these services are being worked on right now. They haven't gone public yet, but we are working towards getting there. So why did we make a
cloud, right? And that is a question: why not use some premade solution, or just stick with the bash scripts? One of the key reasons is that, given we are already a cloud provider, we probably could have requested that the cloud team implement some of these things for us where it would have helped our team's needs. But the features we needed are not needed by most people, right? No one really cares about scheduling the creation and deletion of 50 clusters, and most people probably don't care about spinning up hundreds of clusters and spinning them down within a couple of hours or a couple of days. It doesn't make sense to implement these things in our public cloud interface. Also, a cloud-style interface would be easier for us to use and scale, even though it's only for an internal audience. That's mainly because if you have a nice UI, it is much easier to train or educate someone else to use it, rather than handing them a script or some very custom thing they would have to figure out how to run. If it's a UI, they can just click some buttons and it just works. So should you build
your own cloud interface? Well, the first question you have to ask is: does the cloud you have lack an easy way to do what you need? Next, do you often find yourself writing custom code to do things in your cloud? Do other teams do the same things? If you find that six of the teams in your org are doing a very similar thing manually, or doing some scripting to achieve the same result, that is something you need to consider. Finally, do you struggle to keep your resources in check? You are using your cloud, you have a lot of resources being spun up and spun down by individuals, you are either the cluster owner or the cluster admin, and you are struggling to make sure you are not overspending and that your clusters and resources are right-sized. If the answers to these questions are yes, then maybe you need to build an interface to your cloud. Most cloud providers have very nice APIs that you can make use of, so you don't have to box yourself in and say, oh, I am a user of Azure cloud, and there's this one thing they don't provide in their UI or their CLI, so now I can't do it. You don't want to box yourself in that way. If you find yourself needing to do something over and over again, it might make sense for you to build an interface on top.
But infrastructure as code first, right? See if you can get a lot of this done using infrastructure-as-code tools such as Terraform, Pulumi, Ansible, Chef, or Puppet, or CI/CD things like GitHub Actions, Travis, Jenkins, Harness, Argo, or Tekton. If you can get your whole infrastructure as code, then for the most part, unless you need something very custom, you can achieve most of what you need just by codifying your entire infrastructure. If, once you have done all of that, you still find yourself needing to do manual work or run scripts to get things done, yeah, it might be worth building a cloud yourself.
So rolling out a custom solution should be towards the bottom of your list. What I mean by that is, if you roll out a custom solution like we have, that is a dependency you will have to carry forward, right? If the key people writing this code leave your company, or things change in your cloud's interface, or any number of different things happen, all of a sudden you have this dependency you have to carry. All code is technical debt: the less code you own yourself, the less technical debt you will have long term.
So why should you consider building a cloud? Sometimes reinventing the wheel is the best way. We try to keep our code and our work DRY, we do not repeat ourselves, but sometimes reinventing the wheel lets us go somewhere we couldn't go with all the wheels that already exist in the world. A small script, multiplied across different teams and orgs and needs, becomes a big dependency. Let's say your company has two orgs with three teams each, and each of them maintains a different script to do pretty much the same thing. At that point, it might be worth spending some engineering hours building a tool that solves the problem in a more general way for everyone. Most teams should not have to own cloud resources. If you're using a public cloud of any kind, and each team is responsible for understanding how different cloud resources work and how to do things in the cloud, you are creating a dependency on this cloud, and if down the line you choose to move to a different cloud provider, all of a sudden a lot of team members won't know how to translate that knowledge between clouds. If you instead create a simple interface that only lets people access the things that are approved, at the sizes that are approved, then when you move to a different cloud, you just need to change that interface to talk to the new cloud, and your teams don't have to worry about those kinds of changes themselves.
Finally, the last reason: the interface of the cloud of your choice might not have all the answers you need. Things like: how long has this resource been around? Who was the last person to use it? What other projects are using it? Is this resource approved? What size of this resource is approved? All of these answers might not be available in, or addable to, your cloud provider's interface. If that is something you're looking for, building a wrapper or interface on top of your cloud might be worth looking into.
So, some considerations if you're thinking about building a cloud, and what you should be looking at. You can always do more, right? No matter how much you do, you can always do more things. So don't do more than necessary: only do whatever gets the job done, then look back and see whether doing more would improve anything, or whether you are just doing more for its own sake. Clouds usually have a lot of APIs, and they can change, so you have to be ready to update things. As I said, the moment you own this new interface built on top of your existing interface, you have built up some dependency and code debt, technical debt, because as your cloud changes, you now have to update your interface along with it. Solving the general cloud problem is probably going to be more than you can take on; solving your problem and your adjacent teams' problem is enough. Oftentimes, once you get down this path, you might be tempted to think, okay, what if I basically build a new type of cloud that uses another cloud underneath, but make it very easy and very applicable for everyone? It is a novel idea, but oftentimes it is way more than you need to do, as well as way more than it's worth your time, unless you are actually trying to create a new product, a new cloud interface based on some other cloud. That is probably not something you want to spend your time doing. Again, the last option is always: don't start
with creating the cloud. In our case, we tried a couple of different things and found that they didn't solve our problem as well as rebuilding the cloud did. That's when we started creating the interface on top of our cloud to make things a little easier for us. If, right out of the gate, you start by creating the cloud, I think you will spend a lot more time on it than if you just start by solving the problem any way you can, using Ansible, Terraform, or any number of other things. Then, if you find yourself stuck in a loop where you are just redoing the same things, at that point it might be worth building some of these things in code. So at this point we'll take a quick look at some of the code we have written here, and that is going to be the end of this session. First of all,
we have a React front end that sits in the cloud. That is the UI you would see if you were to go to this URL. Next, we needed some way of handling user authentication. Luckily, we're already building on top of a cloud, so we didn't have to invent that wheel; we could just fall back on IBM Cloud for our authentication. We have a login with IBM ID, so as a user, if you have an IBM account, you are able to log in to the account, and you are able to access the resources you actually have access to on your IBM Cloud account. A big part of building a cloud is user management, and we didn't really have to do that, because again, we're already on top of a cloud.
So in this UI we access here, we see all the clusters and such. The back end of it is all Go. The way we went about building it, it's kind of like a monorepo: all the different applications are built under the same project as of now. The biggest part is the web piece, which is Kubeadmin. It's basically a back end REST API built using Echo, with all the different endpoints we can access; the app starts on port 9000 and also serves the React front end from the back end. For this part of the code, we mostly just had to look at our cloud docs, at cloud.ibm.com/docs, and for each of the endpoints we cared about, we looked at the docs.
I'm going to look at the Containers Kubernetes Service, and now we can see how we go about setting things up. I'm looking for the API reference; yes, the Kubernetes Service API. We have a Swagger API where we can see how to get access to all the clusters. This API endpoint gives us access to all the clusters, and our application would, for the most part, just be wrapping that endpoint. For example, this one: this API handler talks to a second function that talks to the cloud API and gets us the clusters, and we just make a fetch request to the endpoint. We basically talk to our cloud the same way our cloud's own interface talks to it. And this is similar for most of the other endpoints that we have.
Most cloud providers have decent documentation on how to talk to their different endpoints. If you don't have that available, it might be very difficult to build a wrapper UI on top. Luckily, IBM Cloud has fairly good documentation for all the different products we needed to access, so we could build that interface on top. Then there's talking to AWX, which is another dependency we have. We have a package for talking to AWX, and it is very similar to what we do for IBM Cloud. We can get a list of all the workflow job templates, we can get job templates, and we can also launch a new workflow.
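Launching a workflow boils down to one POST against AWX's REST API. A sketch, where the base URL, token, and template ID are placeholders:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// launchWorkflowRequest builds the AWX v2 API call that launches a
// workflow job template; baseURL and token would come from config.
func launchWorkflowRequest(baseURL, token string, templateID int) (*http.Request, error) {
	u := fmt.Sprintf("%s/api/v2/workflow_job_templates/%d/launch/", baseURL, templateID)
	req, err := http.NewRequest(http.MethodPost, u, strings.NewReader("{}"))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	req, _ := launchWorkflowRequest("https://awx.example.com", "token", 42)
	// http.DefaultClient.Do(req) would start the workflow; listing
	// templates is a GET on /api/v2/workflow_job_templates/.
	fmt.Println(req.Method, req.URL.Path)
}
```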
AWX is the runner for the different jobs we have for running our actual workloads. We have these playbooks that we have written here, and they can do a number of different things on top of just creating a Kubernetes cluster. You could do things like install Istio or Knative the moment a cluster has been created, or install Tekton, or any number of different applications. And if you want to define a new application, that is also possible by just writing an Ansible playbook, and we can run it after the cluster has been created. So as I said,
we have a number of different steps we would like to finish to get to our ideal solution; that's what we are working towards. And again, this project is open source. I don't know how useful it would be for your use case unless you are already using IBM Cloud and need very similar functionality, but if you're looking for a place where this kind of work has been done, this code base could be useful to you that way. With that, I'll end this session. If you have any more questions, feel free to ask on any of the social media platforms; Twitter is probably one of the easiest ways to reach me. If you want to come back and ask more questions about some of the decisions we made or some of the challenges we had while doing this, I'm more than happy to chat and answer those questions.