Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to this session, a story of how we accidentally created a cloud on top of our cloud. My name is Mophi. I'm a software engineer and developer advocate at IBM. I mostly do container stuff, collect stickers, and write Go code. I can be found on the internet at mophicodes, so if you have any questions afterwards, or you just want to connect, please feel free to find me there. So before we start this session,
the first definition that is important for us to understand is
what is a cloud? Cloud is a very overloaded term. It means a lot of different things to a lot of different people. But for the purposes of this talk, and given I'm the one giving this talk, I get to define what cloud means in this context. In the simplest terms, a cloud is someone else's computer. It's on demand, and it has a way for users to access the service. It could be a UI, it could be a CLI. Most of the notable clouds you know have a UI where you can go and click some buttons, and they also have a CLI, so you have multiple ways of accessing the cloud. But as long as it has some way, it can be a cloud. Now let's talk about the problem we initially set out to solve. I work for IBM, in developer advocacy. We have a cloud, and we run a lot of workshops and other things on it. One of the most cost- and time-consuming kinds of workshop to run is the Kubernetes workshop, and we run quite a few of them. In a given month we do about ten to 20 of these workshops, which means about two to four a week on average. And each of these workshops requires about 30 Kubernetes or OpenShift clusters to be spun up. The highest we have done is probably 150; the lowest is probably five or ten. But on average we're looking at about 30 Kubernetes clusters per workshop, spinning cluster resources up and down.
If you are just following the UI, it's a manual process, right? You are clicking buttons, filling in forms. Or if you're using the CLI, you are typing in a command, and you can create one cluster at a time. Each cluster also requires access to a load balancer and the subnet that we have per data center, and there is a limit on how many clusters you can put on the same subnet. So if we are trying to spin up 30 Kubernetes clusters, we couldn't do all of that in a single data center, because there could be other clusters already running in the same data center, so we can hit some upper limit. We can spin up more subnets per data center, but even then we have an upper limit on how many different Kubernetes clusters we can run under a single account. That limit by default is a few hundred, about three or 400, I don't know the exact number. But if you think of the normal use case for a single customer, having three or 400 different Kubernetes clusters is not that normal. So the use case we have is not something we regularly see our customers trying to do. We also have a limit on how many volumes you can spin up per data center. Again, that's a limit we can increase, but in day-to-day operation, a regular customer's Kubernetes clusters don't need that many persistent volumes attached. Also, when you are spinning up clusters manually, or the cluster admin or account owner is doing that, we don't really have any way to collect workshop metrics: how many people attended a workshop, how many people actually used the clusters, how much of each cluster got used. We have no way of collecting any of that information.
And right now the team that owns the process is basically two people, right? Ten or 20 workshops a month doesn't seem like too much, but if you think about all the individual resources that need to be created and cleaned up, it adds up to quite a hefty amount of work. And for these two people who own the accounts, this is not their full-time job; they have other responsibilities they have to take care of. So how do you go about solving
this problem? There are many ways to solve this, right? Like we
could write a custom code, use something like terraform or
something like ansible, or write some bash script. So there is a number of different
ways we could have done this, but let's talk about the worst possible
way to try to solve this. In my opinion, the worst possible way would
be for every request that comes in, we spin
up these users in the UI, manually give access to the user
in the account, and then do that for every users,
for every workshop, for every cluster. And if we were to go about doing it
this way, between the workshop leader and the owner,
we're talking about about four hour power workshop,
spinning up and spinning down. Even if you're talking about max efficiency
in terms of clicking buttons. Also, this means if you are
trying to assign access during the workshop manually to
individual users, youll are spending a lot of time during the workshop
time to do all those user permission things.
So the actual workshop time would have to be cut short because you
spent too much time giving people access to things. So again in our cloud UI
you can go in and select things: the Kubernetes version, classic or VPC, a location. You can select a size, and you can create one cluster, then repeat the process over and over again for each individual cluster. That is not scalable, nor is it something we can do efficiently to save time. If you're spending four hours per workshop as the account admin and you have four workshops that week, you're talking about two full working days just spent on setting up workshops, and you have actual day jobs. So we can't really do that; it wouldn't work. The problem with this, again, is two workdays spent by the account owner, and we'd pretty much be at the upper limit of how many workshops we can support. Right now we support up to four, and that's pretty much all we would have been able to support doing it this way. It's also a huge cost center, because resources need to be created
earlier and deleted later. Let's say we have a workshop on Monday and we're doing this manually. Best case, the person does it at the end of day Friday; now, Saturday and Sunday, the resources are just sitting there accruing cost. We also wouldn't be able to do any higher-level workshops. If you want to do something on Istio or Knative, things like that require installation on top of Kubernetes, and we don't really have any way to handle that without manually adding more steps and more time to setting up the clusters. So that was a no-go from the get-go; we never really considered it as the way to solve this problem. So, a passable solution: let's say a solution that would work, and in many cases this is probably the solution you are using, and it is passable. For the most part it's an "if it ain't broke, don't fix it" kind of thing, so you don't really think of improving it. What I mean by that is: we use the CLI in a bash script, update a config file for each workshop request, use the same config file later to delete the resources, and use a GitHub repo for tracking cluster requests.
So we have a GitHub repository where users come in and say, I need a cluster, or I need 20 clusters on this day for this workshop. That gives us a kind of dedicated record where we can find what workshops we helped with, how many resources they needed, and a lineage of what work was done for each workshop. We also have a simple web app that handles user access. During the workshop, a user goes to a URL, puts in their email address, and their account is automatically added with the right permissions, so now they can access their Kubernetes cluster.
And this is not bad. The bash script itself is quite simple; it looks something like this. You load up the environment file, so now you have all the environment variables. From there you log in to IBM Cloud using the CLI and get access to the API key. With the API key, you could either make curl commands, like it does here, or you could talk to the CLI directly to talk to IBM Cloud and access a Kubernetes cluster. So again, we have multiple ways to handle this, and this is the way we're doing it here. So a bash script works in many cases.
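For a sense of what that flow looks like outside of bash, here is a minimal Go sketch of the token exchange. The endpoint and grant type are from IBM Cloud's documented IAM token API; the API key value is a placeholder:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// iamTokenRequest builds the POST that exchanges an IBM Cloud API key
// for an IAM bearer token; this is the same exchange the bash script
// performs via `ibmcloud login` and curl.
func iamTokenRequest(apiKey string) (*http.Request, error) {
	form := url.Values{}
	form.Set("grant_type", "urn:ibm:params:oauth:grant-type:apikey")
	form.Set("apikey", apiKey)
	req, err := http.NewRequest(http.MethodPost,
		"https://iam.cloud.ibm.com/identity/token",
		strings.NewReader(form.Encode()))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	req.Header.Set("Accept", "application/json")
	return req, nil
}

func main() {
	req, _ := iamTokenRequest("fake-api-key")
	// Sending req with http.DefaultClient.Do(req) returns JSON containing
	// an "access_token"; that bearer token then authorizes calls to the
	// cluster APIs.
	fmt.Println(req.Method, req.URL)
}
```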
This is probably as far as a lot of you will go, and depending on how much automation you're looking for, it might be good enough if you're doing something sporadically. A full-blown software solution is not always the best answer; this is good enough for most cases. Sometimes this is all you need, and you don't need to invent some new way to do it. As for the impact: immediately, just from a bash script, we saw a significant improvement over the previous solution, with cluster access now automated by the simple web app. During the workshop itself, where time is most limited, we didn't have to worry about the workshop lead, or someone helping with the workshop, wasting time getting individual users set up; that was automated by the web app. But it is still a manual process. Because you are running a bash script, your computer has to be open, and if something goes wrong, bash scripts don't really give you too many ways to handle errors. And because it's a script, someone has to go and run it; it's not running on a cron or anything, since each request is different. On the other hand, the actual workshop can now scale to a fairly large number of attendees, and that is an improvement. It's just a for loop, so the work to spin up five clusters versus 500 clusters was basically the same, except for how long it took. So this was not a bad solution, and it has served us well
for a fairly long time. And this is where our current solution comes into play. I am a big believer in Go and I love Go programming, so I thought, you know what, this is something we can improve by building an application that automates a lot of these things using Go. So, the current solution: we have a UI that pretty much replicates how our cloud UI looks and lets you select all these options yourself. We have an API to talk to the infrastructure, written entirely in Go. We have a way to spin up the resources; for our purposes we use AWX. AWX is the open source version of Ansible Tower, which is a worker, or pipeline-runner, kind of thing that can run the multiple jobs we need. And finally, we have a way to clean up the resources: we use the same API to talk to the IBM Cloud infrastructure, and we run post-provisioning tasks on each cluster, for which we also rely heavily on AWX. The UI looks
something like this. If I were to look at the same account I was looking at before: this is part of the application I created, and that's what we mean by accidentally creating a cloud on top of our cloud. It has pretty much the same look and feel as IBM Cloud, but it is not part of IBM Cloud itself. Some of the key things that differentiate what this does from what IBM Cloud offers: in this one, we can delete multiple things at the same time, and when we go to create new resources, we can select how many clusters we want to create as well as where we want to deploy those clusters, say, all of our data centers in the North America zone, right? So all of a sudden this does a round-robin distribution of clusters. If you have 300 clusters that need to be created and you don't really care exactly which data center they're in, you can spin up all 300 clusters and they will get round-robined into different data centers. Now you're not putting too much load on a single data center for volumes or network subnets. You can add multiple tags, you can select a few different services, and you can select post-provisioning tasks. We can have post-provisioning tasks like installing Knative or OpenShift, selecting a different version here; we can select things like Cloud Pak for Integration, or the Cloud Native Toolkit. These are different pieces of software we are trying to teach workshops on, but that was previously very difficult to do, because the post-provisioning task of installing that software takes a fair amount of effort. Now we can automate that using Ansible playbooks. So what's lacking in
this is that, in the UI, it's still a manual process. On the UI side you still have to click some buttons and fill in some forms, and so weekends and time zones are still a problem.
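As an aside, that round-robin placement is easy to sketch; the data center names below are just examples, not the real list:

```go
package main

import "fmt"

// roundRobin spreads n clusters across the available data centers in
// turn, so no single data center (and its subnets and volumes) absorbs
// the whole batch.
func roundRobin(n int, dataCenters []string) map[string]int {
	counts := make(map[string]int, len(dataCenters))
	for i := 0; i < n; i++ {
		counts[dataCenters[i%len(dataCenters)]]++
	}
	return counts
}

func main() {
	// Example data center names, standing in for the real list.
	fmt.Println(roundRobin(10, []string{"dal10", "wdc04", "tor01"}))
}
```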
So if you have a workshop on Monday, someone still probably needs to do this on Friday evening. The process is fairly fast, but we still need to spin things up earlier than we have to, just because we wouldn't have people working over the weekend just to spin up some resources. We still don't have any real metrics collection from the workshops themselves. We spin up a lot of resources, and we get to see how many clusters are being used while the workshop is running, but we don't have any persistent storage for this information.
Out of ten clusters, eight got used, six completed the workshop, and things like that. We also have no way to schedule creation and deletion of the resources, and this is one of the big ones that would really help us save a lot of time and money, because if you can schedule a cluster to be created three hours before the workshop, that by itself probably cuts the cost of the workshops in half. But even with what's lacking, this UI has a huge impact: it makes it easy to teach anyone to spin up resources. Now it's not only the two people who own the account who can do this; anyone can just go and click some buttons, and it'll take probably five minutes to teach them how to create resources. Because it's a custom app written by us, we can also add retry logic and rate limiting to requests. So if, for any reason, any of these requests to the underlying infrastructure fails, we can have a cooldown and retry, to make sure we are not overwhelming things with the number of clusters we request. So, in terms of architecture,
it's fairly simple; this is what it looks like. We have the Kubeadmin application, which is the cluster manager, the UI, and the provisioner, plus a notification system for sending email on error. All of that is a single giant application that talks to AWX, our job runner, or pipeline runner. And AWX spins up a single web app for each workshop, the one we talked about earlier that gives individual users access to the account. So this is a working solution, and it is being used right now for a lot of our workshops. But what does the next step, the next evolution of this, look like? The ideal solution, the one we're currently working towards, is to automate workshop requests from GitHub all the way to cluster creation, with an approval step in place.
Right now we have an internal GitHub repository where people come in and request workshops. What we want to do is process each request using something like a cloud function and automatically create a request in our Kubeadmin application, with some manual approval so we make sure that only approved workshops get resources created for them. Next is scheduled creation and deletion of resources. For every workshop we get a request for, we have a time when the workshop is supposed to start and a time when it ends, so we can automatically schedule creation of the resources a few hours prior to the workshop start time and deletion a few hours after the workshop is supposed to end.
That way we don't need the manual intervention and time from our engineers and developer advocates who are doing this manually right now. Once we reach a point where we have a lot of automation around this, we can also open it up to other teams in our org, and in other orgs, to run their resource creation through it. Although this was initially built for developer advocacy, there are other teams that work directly with clients, and other teams that do these workshop kinds of things, and they could then use this tool to create their resources and clean them up without any manual intervention. So we can basically serve a lot more people than we currently do. And finally,
the ideal solution would give us proper metrics for the workshops we run on these accounts. We want to know, if we did ten workshops, what kind of completion rate we saw from them. This will allow us to see if we can cut down on the amount of resources created by right-sizing workshop requests. If someone requests 50 clusters for a workshop and we consistently see about 30 people using those clusters, we can probably right-size the next request that comes in to 35, with some buffer, and basically cut down on cost.
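That right-sizing calculation, using the 50-requested, 30-used example above, could look something like this; the 15% buffer is an arbitrary choice:

```go
package main

import (
	"fmt"
	"math"
)

// rightSize recommends a cluster count for the next request from the
// observed peak usage plus a safety buffer, never exceeding what was
// originally asked for.
func rightSize(requested, observedPeak int, buffer float64) int {
	n := int(math.Ceil(float64(observedPeak) * (1 + buffer)))
	if n > requested {
		return requested
	}
	return n
}

func main() {
	// 50 clusters requested, ~30 actually used: suggest 35 next time.
	fmt.Println(rightSize(50, 30, 0.15))
}
```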
So, the impact. Again, anyone within the org can now use this, so there's no more dependency on our small team. People wouldn't just be asking us to spin up these resources; it would be mostly self-served, so we don't have to spend engineering hours just manually clicking buttons. We also wouldn't need to spend cycles managing resources, because a lot of that we can automate with schedules. Obviously a big part of it is cost saving; it's an internal cost center, but it's still a cost center that we have to be mindful of. And also, as I said, we can make better decisions when new workshops and events are being considered: how many resources to allocate, and is this workshop even worth the engineering time spent by the workshop runner? So, currently I
did some rearchitecting of this system, and many of these pieces are thought of as microservices, though over time we could consolidate them into bigger services. As of now, what you can see here is the AWX service, and you still have the Kubeadmin service, but we're breaking some of the other responsibilities down into smaller services: a provisioner, a scheduler, and a reclaimer that takes care of reclaiming users and deleting the clusters from the list. We also have a notifier service that right now handles sending notifications via email and via GitHub, posting information back to the GitHub issue itself, and we can update it to also send notifications to Slack; that's a service we're working on right now. We also have a cloud function in the works so that when a new issue is created, we can take that information and automatically create the request in our scheduler. All of these services are being worked on right now. They haven't gone public yet, but we are working towards getting there. So why did we make a
cloud, right? And that is a question: why not use some premade solution, or just stick with the bash scripts? One of the key reasons is that, given we are already a cloud provider, we probably could have requested that the cloud team implement some of these things for us where it would have helped our team's needs. But the features we needed are not needed by most people, right? No one really cares about scheduling the creation and deletion of 50 clusters, and most people probably don't care about spinning up hundreds of clusters and spinning them down within a couple of hours or a couple of days. It doesn't make sense to implement these things in our public cloud interface. Also, a cloud-style interface would be easier for us to use and scale, even though it's only for an internal audience. That's mainly because if you have a nice UI, it is much easier to train or educate someone else to use it, rather than handing them a script or some very custom thing they would have to figure out how to run. If it's a UI, they can just click some buttons and it just works. So should you build
your own cloud interface? Well, the first question you have to ask is: does the cloud you have lack an easy way to do what you need? Next, do you often find yourself writing custom code to do things in your cloud? Do other teams do the same things? If you find that six of the teams in your org are doing a very similar thing manually, or doing some scripting to achieve the same result, that is something you need to consider. Finally, do you struggle to keep your resources in check? You are using your cloud, you have a lot of resources being spun up and spun down by individuals, you are either the cluster owner or the cluster admin, and you are struggling to make sure you are not overspending and that your clusters and resources are right-sized. If the answers to these questions are yes, then maybe you need to build an interface to your cloud. Most cloud providers have very nice APIs that you can make use of, so you don't have to box yourself in and say, oh, I am a user of Azure cloud, and there's this one thing they don't provide in their UI or their CLI, so now I can't do it. You don't want to box yourself in that way. If you find yourself needing to do something over and over again, it might make sense for you to build an interface on top.
But infrastructure as code first, right? See if you can get a lot of this done using infrastructure-as-code tools such as Terraform, Pulumi, Ansible, Chef, or Puppet, or CI/CD things like GitHub Actions, Travis, Jenkins, Harness, Argo, or Tekton. If you can get your whole infrastructure as code, then for the most part, unless you need something very custom, you can achieve most of what you need just by codifying your entire infrastructure. If, once you have done all of that, you still find yourself needing to do manual work or run scripts to get things done, yeah, it might be worth building a cloud yourself.
So rolling out a custom solution should be towards the bottom of your list. What I mean by that is, if you roll out a custom solution like we have, that is a dependency you will have to carry forward, right? If the key people writing this code leave your company, or things change in your cloud's interface, or any number of different things happen, all of a sudden you have this dependency you have to carry. All code is technical debt: the less code you own yourself, the less technical debt you will have long term.
So why should you consider building a cloud? Sometimes reinventing the wheel is the best way. We try to keep our code and our work DRY, we do not repeat ourselves, but sometimes reinventing the wheel lets us go somewhere we couldn't go with all the wheels that already exist in the world. A small script, multiplied across different teams and orgs and needs, becomes a big dependency. Let's say your company has two orgs with three teams each, and each of them maintains a different script to do pretty much the same thing. At that point, it might be worth spending some engineering hours building a tool that solves the problem in a more general way for everyone. Most teams should not have to own cloud resources. If you're using a public cloud of any kind, and each team is responsible for understanding how different cloud resources work and how to do things in the cloud, you are creating a dependency on this cloud, and if down the line you choose to move to a different cloud provider, all of a sudden a lot of team members won't know how to translate that knowledge between clouds. If you instead create a simple interface that only lets people access the things that are approved, at the sizes that are approved, then when you move to a different cloud, you just need to change that interface to talk to the new cloud, and your teams don't have to worry about those kinds of changes themselves.
Finally, the last reason: the interface of the cloud of your choice might not have all the answers you need. Things like: how long has this resource been around? Who was the last person to use it? What other projects are using it? Is this resource approved? What size of this resource is approved? All of these answers might not be available in, or addable to, your cloud provider's interface. If that is something you're looking for, building a wrapper or interface on top of your cloud might be worth looking into.
So, some considerations if you're thinking about building a cloud, and what you should be looking at. You can always do more, right? No matter how much you do, you can always do more things. So don't do more than necessary: only do whatever gets the job done, then look back and see whether doing more would improve anything, or whether you are just doing more for its own sake. Clouds usually have a lot of APIs, and they can change, so you have to be ready to update things. As I said, the moment you own this new interface built on top of your existing interface, you have built up some dependency and code debt, technical debt, because as your cloud changes, you now have to update your interface along with it. Solving the general cloud problem is probably going to be more than you can take on; solving your problem and your adjacent teams' problem is enough. Oftentimes, once you get down this path, you might be tempted to think, okay, what if I basically build a new type of cloud that uses another cloud underneath, but make it very easy and very applicable for everyone? It is a novel idea, but oftentimes it is way more than you need to do, as well as way more than it's worth your time, unless you are actually trying to create a new product, a new cloud interface based on some other cloud. That is probably not something you want to spend your time doing. Again, the last option is always: don't start
with creating the cloud. In our case, we tried a couple of different things and found that they didn't solve our problem as well as rebuilding the cloud did. That's when we started creating the interface on top of our cloud to make things a little easier for us. If, right out of the gate, you start by creating the cloud, I think you will spend a lot more time on it than if you just start by solving the problem any way you can, using Ansible, Terraform, or any number of other things. Then, if you find yourself stuck in a loop where you are just redoing the same things, at that point it might be worth building some of these things in code. So at this point we'll take a quick look at some of the code we have written here, and that is going to be the end of this session. First of all,
we have a React front end that sits in the cloud. That is the UI you would see if you were to go to this URL. Next, we needed some way of handling user authentication. Luckily, we're already building on top of a cloud, so we didn't have to invent that wheel; we could just fall back on IBM Cloud for our authentication. We have a login with IBM ID, so as a user, if you have an IBM account, you are able to log in to the account, and you are able to access the resources you actually have access to on your IBM Cloud account. A big part of building a cloud is user management, and we didn't really have to do that, because again, we're already on top of a cloud.
So in this UI we access here, we see all the clusters and such. The back end of it is all Go. The way we went about building it, it's kind of like a monorepo: all the different applications are built under the same project as of now. The biggest part is the web piece, which is Kubeadmin. It's basically a back end REST API built using Echo, with all the different endpoints we can access; the app starts on port 9000 and also serves the React front end from the back end. For this part of the code, we mostly just had to look at our cloud docs, at cloud.ibm.com/docs, and for each of the endpoints we cared about, we looked at the docs.
I'm going to look at the Containers Kubernetes Service, and now we can see how we go about setting things up. I'm looking for the API reference; yes, the Kubernetes Service API. We have a Swagger API where we can see how to get access to all the clusters. This API endpoint gives us access to all the clusters, and our application would, for the most part, just be wrapping that endpoint. For example, this one: this API handler talks to a second function that talks to the cloud API and gets us the clusters, and we just make a fetch request to the endpoint. We basically talk to our cloud the same way our cloud's own interface talks to it. And this is similar for most of the other endpoints that we have.
Most cloud providers have decent documentation on how to talk to their different endpoints. If you don't have that available, it might be very difficult to build a wrapper UI on top. Luckily, IBM Cloud has fairly good documentation for all the different products we needed to access, so we could build that interface on top. Then there's talking to AWX, which is another dependency we have. We have a package for talking to AWX, and it is very similar to what we do for IBM Cloud. We can get a list of all the workflow job templates, we can get job templates, and we can also launch a new workflow.
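Launching a workflow boils down to one POST against AWX's REST API. A sketch, where the base URL, token, and template ID are placeholders:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// launchWorkflowRequest builds the AWX v2 API call that launches a
// workflow job template; baseURL and token would come from config.
func launchWorkflowRequest(baseURL, token string, templateID int) (*http.Request, error) {
	u := fmt.Sprintf("%s/api/v2/workflow_job_templates/%d/launch/", baseURL, templateID)
	req, err := http.NewRequest(http.MethodPost, u, strings.NewReader("{}"))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	req, _ := launchWorkflowRequest("https://awx.example.com", "token", 42)
	// http.DefaultClient.Do(req) would start the workflow; listing
	// templates is a GET on /api/v2/workflow_job_templates/.
	fmt.Println(req.Method, req.URL.Path)
}
```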
AWX is the runner for the different jobs we have for running our actual workloads. We have these playbooks that we have written here, and they can do a number of different things on top of just creating a Kubernetes cluster. You could do things like install Istio or Knative the moment a cluster has been created, or install Tekton, or any number of different applications. And if you want to define a new application, that is also possible by just writing an Ansible playbook, and we can run it after the cluster has been created. So as I said,
we have a number of different steps we would like to finish to get to our ideal solution; that's what we are working towards. And again, this project is open source. I don't know how useful it would be for your use case unless you are already using IBM Cloud and need very similar functionality, but if you're looking for a place where this kind of work has been done, this code base could be useful to you that way. With that, I'll end this session. If you have any more questions, feel free to ask on any of the social media platforms; Twitter is probably one of the easiest ways to reach me. If you want to come back and ask more questions about some of the decisions we made or some of the challenges we had while doing this, I'm more than happy to chat and answer those questions.