Transcript
Hello everyone, I'm really happy to be here, really happy to talk to you all. Let's enjoy this session and this conference, and let's start.
So in this session I would like to talk with you about the following topic: how to build a low-cost CI/CD solution on top of AWS. This talk describes my own story and my own experience, and you can take it from here to wherever you want. You will see what I achieved and what my issues were along the journey.
So before we start, I would like to introduce myself. My name is Valera Bronson. As I said before, I am Head of DevOps at Memphis.dev. It's an open-source real-time data processing platform. In short, what we do: we are building a full ecosystem for in-app streaming use cases. You can read more on our website. I am 31 years old, married with two wonderful kids, and have more than ten years in the IT world. Throughout my career I started as a Linux administrator, then storage administrator, consultant, senior consultant, solution architect, and where I am today, Head of DevOps at Memphis.dev.
Okay, so enough about myself, let's start with the story. This is my Jenkins story, and I will try to explain it and tell it in the best way I can. I hope it will be clear. So how did it start? We are a startup, and when we started we wanted to do it quick and dirty. That's the exact description: quick. We needed to run fast and do as much as possible in a small amount of time.
So at some point we looked for an automation tool, some CI/CD tool, and with our specific requirements we understood that Jenkins could be the best quick-and-dirty solution. So we chose Jenkins, installed it, and started to use it without any deep dive, without any deep-dive configuration and so on. Just as it is, let's use it. And at first it worked well. But through our journey we understood that the initial setup was not working so well: builds were taking much more time, and the disk was getting full too quickly.
So we started to upgrade our Jenkins instance, and eventually we came to what I call the Jenkins ninja: a really, really powerful instance with a lot of CPU and a lot of RAM that can handle a lot of heavy workloads at the same time. But over time we ran into new issues, new obstacles that we needed to solve. Let me walk you through what we found, what was bothering me as a DevOps engineer in the company, and what I wanted to achieve in the future. So let's start.
The first issue is, of course, a single point of failure; you can understand it by yourself. I have only one machine, only one instance, and it can be as powerful as I want, but still, it's one. And when something unexpected happens during the build of one of the pipelines, it will affect all the rest. That's okay when it happens in day-to-day work, but it's less okay when it happens while you are in the middle of a release and you are pushing your commits to production, to the customers. So we don't want this scenario to happen. We don't want this single point of failure. That's it. That's the point.
And when we have only one Jenkins instance, we cannot run builds in parallel. No parallelism. And it's obvious: if we have some build that takes all our resources, there will be no resources for other builds. And you can take it to any scenario you want: one build starts a MongoDB that grabs the port, and another build that uses another MongoDB instance cannot run on the same node because the port is already in use, and so on.
And as you can understand, the first two points already lead us to the third one: we need monitoring. We need to monitor everything, and we need to keep our eyes on this product every day, every single moment, because this Jenkins instance is production for us. For me, as a DevOps engineer in the company, I need to have it up and running all the time, and I don't want to do this kind of babysitting on a daily basis. I want to be sure that my Jenkins is running all the time, no matter what. And of course, we are growing, and the bill is growing with us.
Okay, so now that we understand what the problems are, let's see how we can solve them and what we want to achieve. First of all, I want to be highly available, HA, all the time, no matter what. In case one of the pipelines is killing my node, I want to kill this node and run the pipeline on another node without any regrets, without any thought about whether I have something on this particular node that I need for my next build. I don't care. I want to destroy it and create another one. That's the point in my perspective. In this specific scenario, I want to run in parallel, I want faster builds, I want to reduce time to market. I want to run as much in parallel as I can, as much as my setup can afford. If I need to run four builds, then I want to run four builds at the same time. I don't want to wait until each one of them ends and then run the second one, the third one, the fourth one. I don't want that. I want to run in parallel.
Another goal that I wanted to achieve, and eventually we achieved it: I want to run a dedicated compute node per pipeline type, per pipeline logic. I'll explain. Imagine I have one pipeline that performs all the massive workloads, with a lot of throughput, a lot of CPU usage and RAM and so on. But on the other hand, on a daily basis I have some cron jobs that push data to databases or take backups of my GitHub repositories. I don't need the kind of high-performance machine from the first pipeline to be involved in the second pipeline. The second pipeline can run on some free-tier instance, maybe a t2.micro or t3.medium or something like that. So I want some kind of logic that will see which pipeline I run, and then trigger the proper instance for that pipeline, and you will see how we achieved this. And of course, the bill.
Okay, so let's talk a little more about the bill. You can do all the calculations I did here by yourself; you have the AWS Pricing Calculator, and you can use it and see the numbers for yourself. But let's take my scenario, scenario number one, before the optimization. My instance type eventually came to 32 vCPUs and 64 GB of RAM, and the next step was already 64 vCPUs. I needed this machine because my pipelines took all the CPU and all the RAM from the server on each run, on each build. This machine was on demand, and I will explain why: the Jenkins ninja runs all the time, so it's a full month of usage. And as you can see, this was my estimated cost at that time, before jumping to the next level, to the next instance type, the 64 vCPUs. I didn't want that. So that's why we started looking for another solution.
And you can ask, why not spot instances? Okay, these are the numbers for spot instances. But you know Murphy's law: when you least expect it, it will happen. I mean, when you are in the middle of a release and something happens, your machine will be taken away, because for some reason that spot capacity is needed for someone else. It will happen in the middle of your release. Believe me, this is how Murphy's law works, and I was there. You don't want to be there. So this is my scenario, scenario number one, before the optimization, and these are the numbers. Okay, let's go.
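By the way, if you want to sanity-check spot prices yourself, you can pull them straight from the AWS CLI. This is just an illustrative sketch; the region and instance type are examples, not my exact setup:

```bash
# Illustrative only: check current spot prices for a large build instance.
# Region and instance type are examples; use your own.
aws ec2 describe-spot-price-history \
  --region us-east-1 \
  --instance-types c5a.8xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --query 'SpotPriceHistory[*].[AvailabilityZone,SpotPrice]' \
  --output table

# On-demand rates are easiest to compare in the AWS Pricing Calculator,
# which is where the estimates in this talk come from.
```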
So now that we have some background and we understand what the issues are and what the goals were that we wanted to achieve, let's talk about the solution itself. Before we dig into the solution and the architecture, I want to show you a diagram. Let's say it like this: how I changed the Jenkins ninja into the Jenkins ninjas. So now I have only one instance, the Jenkins ninja, that coordinates all the other agents, all the other ninjas. The Jenkins master tells them what to do and which pipeline to run at each and every minute. So let's see how to get there, how it works. Imagine yourself starting your day. You log into your Jenkins UI, choose the pipeline you need to run, and click Build Now. In a regular scenario, Build Now would trigger the pipeline and it would start running on the same instance. My scenario works like this: Build Now triggers the relevant fleet group. I will show you later in the short demo how it looks and what I mean, but in two words, I have a fleet for every pipeline group that I want to keep separate, and you will see right now how it works.
The relevant fleet group triggers the EC2 Fleet plugin, and the EC2 Fleet plugin knows how to connect to AWS and how to drive the Auto Scaling group. There I have a number of Auto Scaling groups, and each one that gets triggered runs the relevant launch template. In this template we can configure the instance type, network considerations, security groups and so on. But something important to mention: you don't need to choose one specific instance type, and this is the beauty of this solution. In the Auto Scaling group you can configure a group of instance types that suit you, that can perform the workload you need, and the Auto Scaling group will choose between them automatically in case one of them is not available at that specific time. Since we are using spot instances, it can happen that some kind of instance is not available at the specific moment you need it. So the Auto Scaling group will choose another one, you will not even feel it, and your pipeline will start.
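Just to make this "group of instance types" idea concrete, here is a rough sketch of how such an Auto Scaling group could be created from the CLI. The group name, launch template name, subnets and instance types are placeholders for the example, not my real configuration:

```bash
# Sketch only: an Auto Scaling group that mixes several instance types
# and prefers spot capacity. All names, subnets and types are placeholders.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name jenkins-big-fleet-asg \
  --min-size 0 --desired-capacity 0 --max-size 5 \
  --vpc-zone-identifier "subnet-aaaa1111,subnet-bbbb2222" \
  --mixed-instances-policy '{
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "jenkins-big-lt",
        "Version": "$Latest"
      },
      "Overrides": [
        {"InstanceType": "c5a.8xlarge"},
        {"InstanceType": "c5.9xlarge"},
        {"InstanceType": "m5a.8xlarge"}
      ]
    },
    "InstancesDistribution": {
      "OnDemandPercentageAboveBaseCapacity": 0,
      "SpotAllocationStrategy": "capacity-optimized"
    }
  }'
```

Note the minimum and desired size of zero; I'll come back to that in the demo.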
So we choose the launch template, we choose the instance type, and we run the user data scripts; they are part of the launch template. The user data scripts are as simple as that: they are the prerequisites for our build. If I need libraries for Node.js during my build, or I need to install some specific version of Java, I will do all of these prerequisites in the user data script. So when the EC2 node comes up, it already has all the prerequisites I want to be there, and the pipeline can start immediately. After we finish the user data script, all the prerequisites, we raise a flag that the status is okay, and our EC2 instances, sorry, the Jenkins agents, are up and running. When our master, our coordinator, sees that this flag is raised, it can start the pipeline. So we have a somewhat longer path to get the pipeline started, but you can see all the additional value that you get from this process.
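To give you an idea, a minimal sketch of such a user data script could look like this; the packages, versions and the flag file path are just examples, not my literal script:

```bash
#!/bin/bash
# Illustrative user data sketch: install the build prerequisites, then raise
# a "ready" flag that the Jenkins side waits for before starting the agent.
# Packages, versions and the flag path are placeholders.
set -euxo pipefail

# Example prerequisites: Java for the Jenkins agent, Git and Node.js for builds.
yum install -y java-17-amazon-corretto git
curl -fsSL https://rpm.nodesource.com/setup_18.x | bash -
yum install -y nodejs

# Raise the "everything is okay" flag from the diagram.
touch /tmp/jenkins-agent-ready
```

The flag at the end is the piece the coordinator waits for; I'll show where it plugs in when we look at the plugin configuration.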
So you will ask: okay, now we are using Auto Scaling groups and launch templates and EC2 instances instead of only one Jenkins, so what are the numbers? Where is the bill? So what do we have now? We have scenario number two, after the optimization. In this particular example I will use the same instance type as before, 32 vCPUs and 64 GB of RAM, but this time it's a spot instance launched by the Auto Scaling group. And the interesting part of it: on a day-to-day basis I have zero instances up. That's important. Before this optimization we had one fat Jenkins, a huge machine with a lot of power, running all the time, twenty-four seven, all month, all year. Now we have zero instances up day to day, and you will see how that reflects in the numbers. I'm taking some assumptions here for the calculations, but they are from the real world, as you can understand. Let's say I have four peaks in a month; I have a release or a big build every week, and I have some heavy workloads at that time. Let's say every peak like this will use all of the instances in this Auto Scaling group. For this example, I chose five spot instances. Let's make this assumption for a second; in my real world I use only two or three maybe, and it's not for four hours, but I'm taking you to the limit over here.
So each instance will have four hours of intensive workload during the peak. And you can see, this is the estimated cost. You can multiply it by the number of instances, but it's much lower than the previous one, than the $1,000 for the one Jenkins machine. And yes, that's a lot. That's a lot.
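Just to make that comparison concrete, here is the back-of-the-envelope math. The hourly rates are illustrative placeholders; take the real ones from the AWS Pricing Calculator for your region and instance types:

```bash
# Back-of-the-envelope comparison; the hourly rates below are placeholders.
ON_DEMAND_RATE=1.40     # USD/hour, example rate for a 32 vCPU / 64 GB instance
SPOT_RATE=0.50          # USD/hour, example spot rate for a similar size
COORDINATOR_MONTHLY=45  # USD/month, the talk's estimate for the small coordinator

before=$(echo "$ON_DEMAND_RATE * 730" | bc)                         # 24/7, all month
after=$(echo "5 * 4 * 4 * $SPOT_RATE + $COORDINATOR_MONTHLY" | bc)  # 5 agents, 4 peaks, 4 hours

echo "Scenario 1 (always-on ninja): ~\$${before} per month"
echo "Scenario 2 (spot fleet):      ~\$${after} per month"
```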
Imagine yourself as a growing company, where every month you multiply your workload on the pipelines, on the builds. The $1,000 that we started with in the beginning, after six months, after a year, can be multiplied by two or three or five or ten, and you will understand how this number becomes so big and so important to us. Now, anyone still listening will ask: okay, it's unfair, you are talking about the spot machines, but all this time you still have the Jenkins instance up and running. And yes, you are right. But now my Jenkins coordinator is a different instance type. It's not free, but it costs me much less. It's a t3.medium, and honestly, I could take a t3.micro if I wanted, with even less CPU and RAM, because on a daily basis the only thing this machine is doing is running the plugins and being a coordinator: pointing to the right fleet and redirecting the pipeline to the right agent in that fleet. That's all. And the estimated cost is, of course, about $45 a month.
Once again, these are theoretical numbers, but I can say from my own experience that these are roughly the numbers I saw before the optimization, and after the optimization our bill was reduced significantly. So after all of this theoretical information, let's go to Jenkins itself and you will see how it works. This is my Jenkins, and for this session I prepared two pipelines. One pipeline is named big ec2, the second one is small ec2. They do the same thing: they take some GitHub repository and back it up.
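The backup logic itself is nothing special; roughly speaking, each pipeline does something along these lines. The repository URL and bucket are made up for the example, and I'm assuming the backup lands in S3:

```bash
# Rough sketch of the backup step the demo pipelines perform.
# Repository URL and bucket name are made-up placeholders.
REPO_URL="https://github.com/example-org/example-repo.git"
BUCKET="s3://example-jenkins-backups"

git clone --mirror "$REPO_URL" repo-backup.git
tar czf repo-backup.tar.gz repo-backup.git
aws s3 cp repo-backup.tar.gz "$BUCKET/repo-backup-$(date +%Y%m%d).tar.gz"
```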
But, and you can see it over here, one of the pipelines, the small one, will use the Jenkins small-footprint ASG, and the second one, the big one, will use the Jenkins fleet ASG. The names are not so well aligned, but it's important to understand: the Jenkins fleet ASG, the first one, is for my big workloads, for my massive pipelines that run all the builds, all the e2e tests, that create images, destroy images, and so on and so on. These builds use the Auto Scaling group that has the launch template with the huge instance type, as I showed before: 32 CPUs, it can be 64 CPUs, whatever I want. The second one is for the cron jobs, for the backup processing. I don't want to trigger those massive instances for these small kinds of jobs. I want to use something like a t2.micro or t3.medium, the small instances, one CPU, two CPUs; that's enough for me. I don't care if this particular job takes two minutes or three minutes. That's okay for me, but I do care how much money it costs, and what the difference in cost is between those two instance types.
You can see the difference for yourself. So now we already have the big node up, I assume, and I want to show you how it works: the big ec2 pipeline will run on this particular node, and the small one will trigger a new machine, a new node in my Auto Scaling group. So let's run them and see. As you can understand, the big one takes much more time; that's why it's already here. But the small one, I believe we can see in a minute or two how it starts. So I will run both of them; I repeat, they do the same thing, but one of them triggers the small group as a fleet and the second one triggers the big one. So while we're waiting for it to start, I want to show you how it's configured.
It's a really simple process. You need to install the plugin; this is the EC2 Fleet plugin we showed before. Then you go to Manage Jenkins and then Configure Clouds, and from there you can see your Amazon configuration. In this setup I use AWS, so you see the Amazon EC2 Fleet cloud; you can check how to create such fleets with your own cloud provider. The configuration is really simple: the name of the fleet, then in this section the credentials, and after that all the basic information you need to enter: which region I want to run in, what the name of the Auto Scaling group is. And one specific section I want to show you: if you remember, in the diagram I had this okay flag saying that everything is okay and we can start to run. This is how it's implemented: it's a prefix start agent command that runs before Jenkins starts the pipeline, and it repeats itself every five seconds, if I remember right, checking whether the flag is raised. It's as simple as that.
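For reference, a prefix start agent command like that could be as small as the following sketch; the flag path matches the placeholder from the user data sketch earlier:

```bash
# Sketch of a prefix start agent command: block until the user data script
# has raised the ready flag, then let Jenkins start the agent.
while [ ! -f /tmp/jenkins-agent-ready ]; do
  echo "waiting for node prerequisites..."
  sleep 5
done
```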
Okay, so this is the first fleet, and this is the second one, the small one: the same credentials, the same configuration, but the Auto Scaling group is different. Fine. Now I want to show you the Auto Scaling groups in AWS. Here I've already filtered the two ASGs that I use, the small one and the big one, and you can see that the configuration is different. For the small one I need a maximum of two instances, but for the big one I want to go up to five instances. On a daily basis, the desired capacity and the min capacity are zero. Remember it: zero.
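If you want to check the same thing on your own groups, one CLI call shows it; the group names here are placeholders in the same style as before:

```bash
# Sketch: confirm the fleets really scale to zero when idle.
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names jenkins-small-fleet-asg jenkins-big-fleet-asg \
  --query 'AutoScalingGroups[*].[AutoScalingGroupName,MinSize,DesiredCapacity,MaxSize]' \
  --output table
```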
That's the catch in this story. Okay, so now we can see that one instance is already up, and our build, I assume, is starting to run right now. And yes, the big one has already finished and the small one has started right now. And when it finishes, the instance that was triggered will destroy itself within two or three minutes. Just like that. I have other cron jobs and they run the same way: they trigger the Auto Scaling group, the EC2 instance comes up, processes all the logic, and goes down. That's it. Okay guys, I hope you enjoyed it
and thank you very much for attending my session.
If you have any questions or you need some additional information, feel free to contact me by email at valera@memphis.dev, or reach out to me on any social network you are using. Enjoy the conference. Thank you very much.