Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello guys, my name is Aman Parauliya.
I'm working as a senior software engineer at Infracloud Technologies.
So I will be talking about Tinkerbell, which is an open source project developed by the Packet team, and I am one of the core contributors to this project. So I will be talking about it in this webinar. So Tinkerbell is actually about bare metal provisioning. So in this cloud native world
we are moving towards cloud, right?
So there are a lot of cloud providers in the market today, like Google Cloud and AWS. But still there are a few use cases in which we would like to have a server on premises, where I need to provision a bare metal server on which I will be storing my data, instead of using AWS or any other cloud offering. So I will be talking about how Tinkerbell helps you when it comes to bare metal provisioning. So this
will be our agenda for today. We will be talking about bare metal concepts. First we'll talk about what makes a server, what network booting is, the use cases especially, and the challenges that we face. And then we will move to Tinkerbell as soon as possible.
So let's start with what is bare
metal? So when we hear the term bare metal, we think about two things. So I have provided two pictures. The left picture is a rack of servers of the kind used in data centers, right? And the right one is nothing but a Raspberry Pi board, a small Raspberry Pi board which is actually being used in many IoT devices and IoT use cases. So these are the two types of bare metal I can think of.
So let's talk about what makes a bare metal
a server. So the first and foremost thing: it should have network support. Basically it should have network interfaces which support IPMI, and it should also have a NIC, which these days might be 10 Gbps or something like that. The second thing a bare metal server should have is storage, so that it can manage things like RAID, so that if data is lost we are able to recover it using RAID and similar mechanisms. The third thing that is required for bare metal is the boot environment. The boot environment is nothing but the order in which we want to boot: for example, whether you want to boot from the actual internal device, from a USB device, or from the network. And when you talk about booting from the network, iPXE comes into the picture. There should be iPXE support if you want to do a network boot — if you want to boot a bare metal machine from scratch without having anything on it.
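To make the network-boot idea concrete, here is a minimal sketch of the kind of iPXE script a machine could chain-load; the kernel and initramfs URLs are hypothetical placeholders, not the actual files Tinkerbell serves.

```shell
# Write a minimal example iPXE script (hypothetical URLs, for illustration only).
cat > boot.ipxe <<'EOF'
#!ipxe
dhcp                                   # acquire an IP address over DHCP
kernel http://192.168.1.1/vmlinuz      # fetch a minimal kernel
initrd http://192.168.1.1/initramfs    # fetch an in-memory root filesystem
boot                                   # hand control to the fetched kernel
EOF
```

In a Tinkerbell setup, Boots answers the DHCP request and points the machine at a script along these lines.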
Okay, so let's talk about network booting, which is the main part. As we said, bare metal should support PXE or iPXE. PXE is nothing but the Preboot eXecution Environment — it runs before the machine boots into an operating system. And iPXE is nothing but an extended, open source version of PXE. DHCP actually provides an IP dynamically, and TFTP is there to provide the initial files. So when you try to provision a machine, it should have a minimal OS, a minimal platform, in which it can run and in which the provisioning can happen. And for that it requires something like an initramfs, which is an in-memory file system, and a vmlinuz, a kernel image for a very basic OS, something like Alpine. The third thing is NFS.
So NFS is a network file system. For example, suppose you don't have storage in your hardware — for example on the Raspberry Pi board which we are talking about here. The Raspberry Pi board actually gets its storage from a micro SD card. So suppose we lose the micro SD card, or the micro SD card gets corrupted, and we want to provision that Raspberry Pi system again. Then what you can do is mount a directory on your Raspberry Pi over the network itself — a directory which is nothing but one exported from the provisioning engine or a different machine, through the network, obviously. Yeah. So this is the important part
here. So we are talking about bare metal — but why are we talking about bare metal? Why, in this cloud world, are we still talking about bare metal? The first thing is existing infrastructure. There are a lot of companies which have their own data centers today, and they want to use them for their own purposes and also offer them to other companies. So with existing infrastructure, they still want to use bare metal instead of the cloud. The second important thing is data security. If you talk about very classified data — like a bank's data about how much money each account holder has in their account — that is very sensitive data which requires quite a lot of security. That's why banks, and all companies in domains where data security is a major concern, still want to use bare metal; they still want to use their own data centers instead of putting the data on the clouds.
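Coming back to the NFS root filesystem mentioned a moment ago, a minimal sketch looks like this; the paths, addresses and file names are hypothetical examples, not part of Tinkerbell itself.

```shell
# On the machine that holds the root filesystem: an example NFS export line.
# In production this line would go into /etc/exports, followed by `exportfs -ra`.
echo '/srv/rpi-root 192.168.1.0/24(rw,sync,no_root_squash,no_subtree_check)' > exports.example

# On the diskless board: a kernel command line that mounts that export as root.
echo 'root=/dev/nfs nfsroot=192.168.1.1:/srv/rpi-root,vers=3 ip=dhcp rw' > cmdline.example
```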
The third thing is latency. When it is about the cloud, whichever one it is, you will get a certain latency, and beyond a limit you cannot decrease that latency any further. Instead, if you use bare metal, you can get almost zero latency — last-mile latency or whatever, it will be almost zero. And the fourth thing is consistent and predictable performance. This is a very important point when you talk about storage, or about using any kind of storage or servers. You want consistent performance — very good performance which stays consistent — and if there is any problem, you should be able to predict it, right? In the cloud world, you don't even know where your data is being put, right? So in that case the performance is not consistent: there can be ups and downs in your performance without you even knowing why, even with the same workload. And if you increase or decrease your workload, you cannot predict how much the performance is going to be impacted. But if you have your own bare metal machine, you know that you have this much RAM and these resources, so if you want to increase the workload, you should increase your RAM, or your network device like the NIC, or whatever. So you can get consistent and predictable performance in the case of bare metal servers or bare metal storage. So, the challenges:
the thing is, you get a lot of control when you have your bare metal on your own premises, right? And that's how it is: an increase in control comes with an increase in complexity, right? For example, suppose you want to create a data center for a bank or something like that, and it should have hundreds of servers. It is difficult to provision and manage large-scale infrastructure in the case of bare metal. You also sometimes have to deal with different types of CPUs, like Intel x86 and Arm64 — and there are other CPUs as well, such as those that come with IBM servers. And you also need to deal with different distros. For example, for some use case you may want to provision half of your servers with CentOS and half of your servers with Ubuntu. Then the complexity increases again. So these are the challenges a user will face when it comes to bare metal. Right. So here comes Tinkerbell — as a complete solution, I would say. Yeah.
So Tinkerbell is actually nothing but a project which helps you provision a machine automatically. It's an automated bare metal provisioning engine, as the title says. It consists of five microservices. The first is Tink, and the others are Boots, Hegel, OSIE and PBnJ. As mentioned in the diagram, Tink is responsible for provisioning; it works as the workflow engine. So Tink is actually a kind of interface, you could say, with the help of which you deal with all of your Tinkerbell components and all the services. Boots is responsible for the DHCP and iPXE things that we talked about — the DHCP and TFTP services. And Hegel is there to provide you the metadata of a particular hardware when required. And as we talked about earlier in the slides, we need some initial file system, some initial files, to serve over the network. Here comes OSIE. OSIE is actually not a service, I would say; it's more or less a collection of files which is required when it comes to network boot. And PBnJ is a service which is there just to control the powering off and powering on of your system, and the boot control: for example, whether you want to boot from iPXE or from somewhere else, like the built-in disk, or something like USB.
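PBnJ talks to the machine's BMC for this; the same power and boot-order operations can be sketched with the stock ipmitool client (the BMC address and credentials below are hypothetical):

```shell
# Example: force a machine to PXE-boot on its next power cycle via IPMI.
cat > pxe-cycle.sh <<'EOF'
#!/bin/sh
BMC=192.168.1.20   # hypothetical BMC address
ipmitool -I lanplus -H "$BMC" -U admin -P secret chassis power off
ipmitool -I lanplus -H "$BMC" -U admin -P secret chassis bootdev pxe
ipmitool -I lanplus -H "$BMC" -U admin -P secret chassis power on
EOF
chmod +x pxe-cycle.sh
```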
So this is the diagram from which you can easily see that, to run a workflow — basically, to provision a bare metal machine — you need an already provisioned machine, a kind of control plane, which is the provisioning engine. On the provisioner, the components of Tinkerbell will be running. For example, on the provisioner there is the tink-server. Tink is a complete workflow engine, and it has three parts. One is the tink CLI, which is used by the administrator or the user to interact with the tink-server. On the provisioner there is the tink-server running, which is the actual workflow engine, responsible for maintaining the workflows and the data and executing them. And on the provisioner, Boots, Hegel and OSIE will also be present, so that they can handle DHCP, TFTP, the metadata service and the installation environment. And in the lower part of the diagram there are worker machines — there may be one, two or three; let's take any one of them. So there's one machine which wants to provision itself.
So in this diagram I will show you a kind of flowchart — the flow in which the provisioning happens if you use Tinkerbell. So what you need to do first of all is bring up your provisioner, right? When you start your provisioner, there are a lot of services which will be running on it, among them the tink-server, Boots and Hegel. As I mentioned, OSIE is not a service, it's just a collection of files, but it should be present on the provisioning engine so that it can be served once provisioning starts. Yeah. So this is the provisioner, and there are also two important things which should be there alongside the tink-server. One is the database, in which your hardware data, template data and workflow data will be stored, and the other is the private Docker registry, in which you will be hosting the images of the actions of the workflow. What are the actions? For example, your boot process requires a lot of actions: first, say, you want to do a disk wipe operation. For that you will build an image for the disk wipe, and then you will push that image into the private Docker registry. We'll talk about that in a later slide.
This is the kind of flow. So what happens is: first, the user needs to put all the hardware details into the tink-server using the tink CLI, in JSON format. The hardware details are things like the hardware id, the MAC, and the IP it wants or already has. The second thing is the template, which is the most important part of a workflow, because it defines the workflow. The user can define their own template for whatever they want to provision: for example, if a user wants to provision CentOS or Ubuntu, or wants to do something else, they can define their own template. We'll talk about the details of templates later. And the third thing is the workflow. A workflow is nothing but a combination of a template and the hardware. For example, if you have multiple templates stored in your tink-server, and multiple bare metal hardware entries stored in your tink-server, then what you can do is select any of the templates and any of the hardware to create a workflow. So I would just like to give you a quick demo of it.
How does it work? I have deployed a VirtualBox virtual machine with the help of Vagrant, which is working as the control plane. So if you will see — here are the things, if you can see them out here. There are, as I mentioned, the tink CLI, the tink-server, the registry, Hegel, Boots, the DB, and Nginx also running. Nginx is actually used to serve the TFTP files to the worker machine, the bare metal machine, right? So I will go into the CLI — I'll do a docker exec into the CLI container — and try to list the hardware. So these are the two hardware entries which I have pushed into the hardware database.
So if you want to check the details of any of the hardware: for example, you can provide the MAC address of a particular hardware here using the mac subcommand, and then you can provide the details flag, and then you will get the complete metadata, which looks like this, in JSON format. So basically this is the minimum hardware data which you are required to store for a particular hardware. In it come the metadata part and the network part: it should have at least one interface, which should be of DHCP type, and it should have a MAC address, hostname and architecture, and then in the IP field the IP it wants to get. And in the netboot part, allow_pxe should be true, and allow_workflow should also be true to get the workflow running. And there's a last field which is very important: the id of the particular hardware, which is unique across all the hardware.
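A hardware definition along these lines can be sketched as follows; the id, MAC and addresses are made-up sample values, and the exact field set may differ between Tinkerbell versions.

```shell
# Write a sample hardware definition (illustrative values only).
cat > hardware.json <<'EOF'
{
  "id": "0eba0bf8-3772-4b4a-ab9f-6ebe93b90a94",
  "metadata": {
    "facility": { "facility_code": "onprem" },
    "instance": {},
    "state": ""
  },
  "network": {
    "interfaces": [
      {
        "dhcp": {
          "arch": "x86_64",
          "hostname": "worker-1",
          "ip": {
            "address": "192.168.1.5",
            "gateway": "192.168.1.1",
            "netmask": "255.255.255.0"
          },
          "mac": "08:00:27:00:00:01"
        },
        "netboot": { "allow_pxe": true, "allow_workflow": true }
      }
    ]
  }
}
EOF
```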
So this is the hardware data. Now let's talk about the template. Just like tink hardware, we also have tink template list. And if you want to look at what this template is like, it is like this. Yeah. So this is a very basic template, the minimum required to run a workflow. The template should have a version, the name of the template, and the global timeout. The global timeout is nothing but the time in which all the tasks in the particular template should be completed; otherwise the workflow will time out. And there will be tasks: in a particular template there can be multiple tasks, and in a particular task there can be multiple actions. Here I have just mentioned one action, named server-partitioning, but I'm just using the hello-world image, so it will just say hello. The important part here is this one: the worker is actually device_1. So what will happen when I create a workflow? Right, so let's see what the workflow is.
So I have already created these workflows, which executed successfully. What I will do is go to tink workflow and delete two of them; they're not even required. So you can see, first of all, what it takes to create a workflow. As you can see, the tink workflow create command actually requires a template uid — I have a template which has this uid, this one — and then it requires a hardware string, which is nothing but the targeted hardware. So when I create a new workflow, with -t I will be providing the template which I need to execute, and with -r I will be providing the targeted device: the MAC address of a particular device.
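So the command ends up looking something like the sketch below; the UUID and MAC are placeholders for your own template id and target hardware, and it needs a live provisioner to actually run.

```shell
# Sketch of the workflow-creation command, to be run inside the tink CLI container.
cat > create-workflow.sh <<'EOF'
#!/bin/sh
# -t takes the template UUID; -r maps the template's key(s) to the target MAC.
tink workflow create \
  -t 75ab8483-6f42-42a9-a80d-a9f6196130df \
  -r '{"device_1": "08:00:27:00:00:01"}'
EOF
chmod +x create-workflow.sh
```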
As you can see, my workflow has been created. So here the important thing is that this key of the input should match the key that the template has. If you look, it will substitute the MAC address in place of device_1 in the template. So let's see what
the workflow is. Yeah, so this is how my workflow looks. It is saying that the global timeout is 600. As I mentioned, a template plus a device makes a workflow. Now my workflow also knows that it should run on the worker which has this MAC address: only when it receives a request from this MAC address will it hand out the workflow and say, okay, start this workflow. And the important part is that the hello-world image should be there in your private Docker registry, which is running here on your provisioner machine. So this is — yeah, this is the private Docker registry which is running. It should have the hello-world image, and it should also have the tink-worker image as well. As you can see, my private registry's IP is this; it has the tink-worker image, and it also has the hello-world image.
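Seeding that private registry can be sketched with plain docker commands; the registry address here is a hypothetical stand-in for your provisioner's registry.

```shell
# Sketch: retag a public image and push it into the private registry
# that the workers pull from. Requires a running registry at $REGISTRY.
cat > seed-registry.sh <<'EOF'
#!/bin/sh
REGISTRY=192.168.1.1   # hypothetical private-registry address
docker pull hello-world
docker tag hello-world "$REGISTRY/hello-world"
docker push "$REGISTRY/hello-world"
EOF
chmod +x seed-registry.sh
```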
So now what I will do is start another VM with the help of Vagrant. Okay, so we'll look into that later once we go to the demo. So this is how your provisioner looks. Yeah, so let's
talk about this. So, Tinkerbell. Basically the provisioner, as I said, is the control plane. It should have Boots for DHCP and TFTP; Hegel for providing the metadata of the machines; the tink-server; the tink CLI; and PSQL, which is nothing but a DB deployment, used to store the hardware data, the events, the templates and everything else. And the private Docker registry I talked about: it should have the image of tink-worker (I think there's a spelling mistake there in the image section) — so it should have the image of tink-worker, and the images of all the actions which are there in the template should also be stored in the registry.
Okay, so this is the hardware data that I have already given you the reference for. This is the minimal hardware data which is required — the same as I showed you: it requires an id, metadata, and the network. The network should have interfaces, at least one of which should be of DHCP type, with the architecture of the CPU, the IP — what IP I would like it to have — and the MAC address. So here it is, just a sample. And if you look here, you will see the MAC address is actually this one, and UEFI is not required now. Previously it was required, but not now; we have removed that dependency. So let's
talk about templates. I've shown you a very basic template in my example, which is nothing but this one, right? So that is a very basic template, but your template can be as complex as this one. It can have tasks — there can be multiple tasks — and each task should have a name and a worker. The worker is nothing but the VM, or the machine, or the bare metal machine you want to provision. And the volumes: all the volumes you define at the task level will be used for each and every action — we support volumes at the task level. And if there is a particular volume for a particular action, then that is also supported, as you can see in the action. And there are also the environment variables, for example MIRROR_HOST. MIRROR_HOST here is at the task level, so it will be applicable to all the actions: all the actions can use this MIRROR_HOST environment variable if required. And then there are the actions. The basic requirements of an action are the name of the action and the image — the name of the image which it will fetch from the private Docker registry that we talked about.
Yeah. So an action should have a name and an image, and an action can also have its own timeout: for example, if the disk wipe is not completed within 90 seconds — the timeout is in seconds by default — the action should time out and the workflow should return an error. The same goes for the next action, the disk partition. So these are the basic steps when you install any OS on your bare metal machine: first you do a disk wipe, then you do a disk partition, and then you install a root fs, which is provided by OSIE, as I talked about. The root fs means the very basic OS in which you will run all those things. Then you install the actual grub files for what you want to do. For example, if you want to install Ubuntu, you will install the grub files for Ubuntu.
Yeah. So you should be very careful with the image name. For example, if you basically have two templates, one for installing CentOS and one for installing Ubuntu, then in that case the first two actions will remain the same, and for the last action, install-grub, the name of the image should change: it should be something like install-grub-ubuntu or install-grub-centos. And we also support volumes at the level of a particular action; in that case, the volume which is on the action will be mounted for that action only, not for the remaining actions. So this is how a workflow definition, a template, looks.
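Putting those pieces together, a fuller template might be sketched like this; the action image names, mirror address and volumes are illustrative, and the exact schema may vary between Tinkerbell versions.

```shell
# Write an example workflow template (illustrative names and values).
cat > ubuntu-template.yml <<'EOF'
version: "0.1"
name: ubuntu_provisioning
global_timeout: 600            # seconds allowed for the whole workflow
tasks:
  - name: "os-installation"
    worker: "{{.device_1}}"    # replaced by the MAC passed at workflow creation
    volumes:
      - /dev:/dev              # task-level volumes apply to every action
    environment:
      MIRROR_HOST: 192.168.1.2
    actions:
      - name: "disk-wipe"
        image: disk-wipe
        timeout: 90            # per-action timeout, in seconds
      - name: "disk-partition"
        image: disk-partition
        timeout: 300
        volumes:
          - /statedir:/statedir   # action-level volume, for this action only
      - name: "install-root-fs"
        image: install-root-fs
        timeout: 600
      - name: "install-grub"
        image: install-grub-ubuntu
        timeout: 600
EOF
```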
Yeah, so this I have already shown. This is the way your CLI works: you have tink hardware, tink template and tink workflow. These are the three main CLI commands. So, as you can see in this slide as well, you basically exec into the tink CLI container, and then you can run tink hardware push and provide a file containing your hardware data, which is nothing but hardware data in JSON format. So I will be showing you that again. So if you can see, tink hardware actually supports these seven commands: delete; id, to get hardware by id; ip, to get hardware by an associated IP; list, to list all the known hardware; mac, to get hardware by an associated MAC; and push, to push new hardware to Tink, which I've shown you in the slide. So if you want to do it like this, what I can do is create a file which can have the data.
So for that, first we should make sure that this id is not present in your hardware. So what we will do is first list all the hardware. As you can see, this id is already present, so I will change that id for now, because every hardware entry has its own unique id — I'll make it end in two — and the MAC address as well. If you look, this particular MAC address is already there in the list, like this one, so you also need to change that as well; otherwise you will not be able to store that data in your database. So I have changed the MAC address and the IP address. So what I'm going to do is — there, it says the hardware data was pushed successfully. So what we can do is just list the hardware. Now it has three hardware entries; initially it had two, now it has three, and two of them currently have the same IP, so only one of them could run at a time — each should have a unique IP as well. So what I will do is simply override that and do the same push again: if the id remains the same, the data will be overwritten.
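The push itself boils down to a couple of commands from inside the tink CLI container; the exact flag for supplying the file may differ across tink CLI versions, so treat this as a sketch.

```shell
# Sketch: push (or overwrite, if the id already exists) a hardware entry,
# then list what the server knows. Requires a live tink-server.
cat > push-hardware.sh <<'EOF'
#!/bin/sh
tink hardware push --file ./hardware.json
tink hardware list
EOF
chmod +x push-hardware.sh
```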
Now you can see the IP has changed for the hardware. So when a machine with this MAC address boots, it will look for this IP.
So now it comes to the creation of a template. When you create a template, you need to have a template name. So let me show you. That's it. So it has the functionality of create, delete, get, list and update, right? So what you can do is tink template create. I have a sample template here, which is nothing but again a hello-world, but I need to have a unique name for it. So let me check — I have a template here whose name is 'sample', so if I create a new template, I must give it a different name; otherwise it will not be able to create the template. So you try tink template create with nothing but the path of the template file. As you can see, it has created the template. And now if you list, it will show basically two templates: one named sample and one named sample two.
And then the third one is the workflow. I have already shown you the workflow creation part: it can have tink workflow create, -t, the template uid, and then device_1 as a key. This key can be changed as per the template. For example, if your template here has the worker set as device_1 and you change it to something like worker_1, then at workflow creation you need to pass worker_1 or whatever the key is, right? Now, we have not actually updated the template in the database yet, because it was simply changed here in the editor. If we get the template which we just created — we just created this one right here — it is still device_1. You can update it if you want with the help of tink template update.
So now I will check which workflow I've created. So as you can see, I have the workflow here — yeah, this is the new workflow which I have created. So I can check the state of that workflow. It says that this is the workflow, but the progress is 0% and the current action state is pending. So what I will do is see what this workflow points to — the MAC address, basically. So this workflow is related to this MAC address. So what I'm going to do is create another VM with the help of Vagrant. This is my Vagrantfile. So what I will do is simply change the MAC address here, and here as well,
right? So now I will be giving you the demo. As we know, we have already prepared the setup. For preparing this setup, what you can do is go to tinkerbell.org, which I have opened for you. You can go to the documentation part, then go to the setup, and you can do a local setup with Vagrant. This will give you all the steps with the help of which you can bring up your provisioning machine and then start your workflow, right?
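The Vagrant-based local setup boils down to a few commands; the repository path for the Vagrant files is an assumption here and may have moved between releases, so follow the steps on tinkerbell.org if it differs.

```shell
# Sketch of the local Vagrant setup (repo layout assumed; see tinkerbell.org).
cat > local-setup.sh <<'EOF'
#!/bin/sh
git clone https://github.com/tinkerbell/tink.git
cd tink/deploy/vagrant
vagrant up provisioner   # brings up tink-server, Boots, Hegel, registry, DB
vagrant up worker        # PXE-boots the worker VM against the provisioner
EOF
chmod +x local-setup.sh
```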
And this is the source code of Tinkerbell. This is the project in which we have Tink, OSIE, Hegel, Boots and everything else. So I'll just show you the quick demo. What I will do is start a worker, basically a new virtual machine, with the help of VirtualBox. As I've shown you in the Vagrantfile, I have changed the MAC address to this one, and here as well for the worker part, and again here as well if you want to do any customization. So I have changed the MAC address to the one the workflow is pointing to. So when I start this Vagrant machine — this VirtualBox VM named worker — with the command vagrant up worker, what it will do is first import a base OS of Alpine and then start in iPXE mode. As you can see, the worker has already started, and now it is going for installation. So it is asking for an IP, and it has gotten one here. Now it is asking for a few files — give me the TFTP files — so it has fetched the files, vmlinuz and the initramfs, from the OSIE part, and now it has started the installation of the base OS which OSIE has provided, which is nothing but Alpine.
I'd just like to show you something like the logs. Once this worker boots up, it will start the tink-worker, and the worker will send requests to the tink-server. Yeah — so I think there's something wrong with it, an invalid state; I have done something wrong, so let me fix it. So what we can do is simply close this one and destroy this worker, because it has got some error — because of the tink-worker image, I guess. So what I will do is destroy this one. Yeah. So what I will do is tag the latest tink-worker image with the registry name, and then I will push that particular image into the registry itself. So I am pushing the tink-worker latest image into my local registry, and then I will again go into the exec, and then again I can just say tink workflow get and see what the state of this workflow is. So it has not yet been started. So what we can do is again destroy the current worker, which caused the error, and then we can start it up again. I hope this works fine.
It got the IP, it is waiting for the TFTP files, getting the files from OSIE, served through Nginx. I hope the workflow... here it is.
Yeah, so this works fine now. As you can see, the workflow which I created is now 100% complete and the state is success. The worker on which the workflow was running has this id. So let's take a look at this. Yeah — so if you take a look at this, I have this id in my hardware, which has the MAC address I provided as an input when I created the workflow, and this is the IP it got from Boots through DHCP, and now it is a success. You can also check the events of the particular hardware for the particular actions, if you can see that. Let me minimize that a bit. Yeah, so you can see it is saying that this action, named server-partitioning, started execution, the action progressed and took almost zero time, and it finished execution successfully with the action status success. So we'll go and verify what happened on this worker. I will simply log in as root and — okay, I see the worker has completed its task and exited with zero, which means the worker executed successfully. So I'll just go to the logs, and I will put this log into a log file so that I can go through it.
Okay. Yeah. So it is like this. What it is doing is retrying at the interval that is set — this is an environment variable that you can set as per your convenience, right? And so it got the workflow which is completed, in the completed state, and then it got some other workflow which is not in the completed state, and then it started. So it has got the task name as os-installation and the action name as server-partitioning. So what is it doing? It is pulling the hello-world image, which is in the action named server-partitioning, and it is pulling it from the private registry, not from Docker Hub. Now it executes the hello-world image, and it prints the output in the container itself. Each action runs in a separate container — the container id is like this — and the container is removed with the status 'action success'. And then it sends the action state of the particular workflow to the tink-server. So this is actually the state which is set by the worker, saying 'I have completed this task successfully'. So it says that the action status is success, and that's why here, if you can see this, it finished execution successfully with the action status success. So this workflow had only one task. If you look at this: the task name was os-installation, which is there; the action name was server-partitioning, which is there; the image was hello-world, and its output is there in this file — you can see 'Hello from Docker!', this message shows that, something like that, right? Yeah. So that is all about the demo. So let's go back to the slides.
These are the two links: tinkerbell.org and the project on GitHub.com.