Chidori - AI/ML Cluster Management Platform
Abstract
A Kubernetes cluster management platform's role in speeding up development, scaling AI infrastructure, and lowering computing costs will be discussed.
Summary
-
Ahmed Gaber and Nadine Khaled talk about Chidori, an AI and ML cluster management platform. They discuss how Spark operates in Kubernetes, the challenges involved in it, and how Chidori addresses them, and they close with a demo.
-
Nadine Khaled from the Incorta Cloud team demonstrates how to install Chidori in your environment and put it into action. Through Chidori you can submit your Spark jobs through any Spark master. You can also see the status of these jobs and share them with others.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, and welcome to Conf42 Cloud Native. Today
we are happy to talk about Chidori, an AI and ML
cluster management platform developed by Incorta. Let me introduce
myself first. I am Ahmed Gaber, but you can call me Biga.
I am a cloud engineering manager at Incorta, and
I also have with me today Nadine Khaled, cloud engineer at Incorta.
In our agenda today we will talk about how Spark operates in Kubernetes
and the challenges involved in it, and explain
how Chidori can address these challenges. We have also
prepared a good demo for you.
Let's dive into how Spark works on Kubernetes.
As you see in this diagram, the client submits the
Spark job to Kubernetes to run as a Spark driver. In
Kubernetes we have two modes. The first one is client mode,
which means the driver keeps running on
the client side. The second mode is
cluster mode, which means the driver runs
as a pod inside Kubernetes. Once the driver has
started, it requests from Kubernetes to start the
executor pods. So the Kubernetes scheduler will
start to allocate the executor pods inside Kubernetes.
After these executor pods are created, the driver
gets notified and starts to schedule tasks on these executors.
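As a rough illustration of the client-mode flow described above, here is a minimal PySpark sketch; the API server address, container image, namespace, and executor count are assumptions, not values from the talk. In cluster mode you would instead launch the same application with spark-submit and --deploy-mode cluster, so the driver itself runs as a pod.

from pyspark.sql import SparkSession

# Client mode: the driver runs in this Python process, while the
# executors are created as pods by the Kubernetes scheduler.
spark = (
    SparkSession.builder
    .appName("chidori-demo")                                     # hypothetical app name
    .master("k8s://https://my-cluster-api-server:6443")          # assumed API server URL
    .config("spark.kubernetes.container.image", "spark:3.5.1")   # assumed Spark image
    .config("spark.kubernetes.namespace", "spark-jobs")          # assumed namespace
    .config("spark.executor.instances", "2")
    .getOrCreate()
)

# A trivial job so the executors have something to do.
print(spark.range(1_000_000).count())
spark.stop()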
So as you see here, Spark benefits from
the Kubernetes scalability features, like
cluster horizontal scaling for the nodes themselves.
So if you have pods to be allocated
on the nodes and there isn't enough capacity, the cluster will
start to scale up and add new nodes, to
be more flexible with your job. And you also
have resource management to ensure that
the driver and executors run within your capacity.
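For the resource-management part, Spark exposes configuration keys for the driver and executor requests and limits, which Kubernetes then uses for scheduling. The sketch below shows a few of them; the concrete sizes are assumptions for illustration only.

from pyspark import SparkConf

conf = (
    SparkConf()
    # Memory requested for the driver pod and for each executor pod.
    .set("spark.driver.memory", "2g")                     # assumed size
    .set("spark.executor.memory", "4g")                   # assumed size
    # CPU requests/limits that the Kubernetes scheduler uses to place the pods.
    .set("spark.kubernetes.driver.request.cores", "1")
    .set("spark.kubernetes.executor.request.cores", "2")
    .set("spark.kubernetes.executor.limit.cores", "2")
)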
We figured out some good insights on that
model. The first one, as I said, is cluster auto scaling. Cluster
auto scaling gives you the flexibility to get
better performance with low cost, so you don't have to keep
nodes up and running all the
time. We also found that enabling
dynamic allocation gives you flexibility inside
the application itself to scale the executors in and out.
Also, using spot nodes for Spark workloads
will save a lot of cost.
This gives you the flexibility to get
higher performance with low cost.
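A minimal sketch of those two knobs is below. Dynamic allocation on Kubernetes needs shuffle tracking enabled, since there is no external shuffle service, and the spot-node label used here is a cloud-vendor-specific assumption (your vendor's label will differ); the executor-level node selector keys also assume a recent Spark 3.x release.

from pyspark import SparkConf

conf = (
    SparkConf()
    # Let the application scale its executors in and out on its own.
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .set("spark.dynamicAllocation.minExecutors", "1")      # assumed bounds
    .set("spark.dynamicAllocation.maxExecutors", "10")
    # Pin executors (not the driver) to cheaper spot nodes.
    # The label key/value below is a vendor-specific assumption.
    .set("spark.kubernetes.executor.node.selector.cloud.google.com/gke-spot", "true")
)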
We also noticed that most Spark bottlenecks
come from shuffling issues.
So to optimize your Spark job,
you should attach the Spark pods to fast
local SSDs, depending on your cloud vendor,
to optimize the Spark scratch space. In summary, running
Spark on Kubernetes not only optimizes resource utilization
and reduces cost, but also enhances the overall
performance of your Spark application.
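One way to do that is to mount a hostPath volume whose name starts with spark-local-dir-, which Spark then uses as scratch space for shuffle spills. The SSD mount path below is a vendor-specific assumption.

from pyspark import SparkConf

# Volumes named "spark-local-dir-*" are used by Spark as local scratch space.
conf = (
    SparkConf()
    .set("spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path",
         "/tmp/spark-local")
    .set("spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path",
         "/mnt/disks/ssd0")   # assumed local-SSD path on the node; varies by cloud vendor
)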
While running Spark on Kubernetes brings us a lot of
benefits, like better resource use and
cost savings, it is not without its challenges.
Let's dive into some of the challenges you might face.
The first one is the naked pod issue. When you
spin up a Spark job in cluster mode inside Kubernetes,
as I said before, the driver starts
running as a separate pod inside Kubernetes.
This pod is not controlled by any replica controller,
StatefulSet, or Deployment object in Kubernetes, and
this creates a kind of availability issue for
this pod: it acts as a single point of failure.
Which means that if this driver goes down for
any reason, your job will completely fail, and
Kubernetes has no controller to spin this driver up
again.
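You can see the problem directly from the Kubernetes API: driver pods created by spark-submit carry the spark-role=driver label but no ownerReferences, so no controller will recreate them. A small check with the official Python client is sketched below; the namespace is an assumption.

from kubernetes import client, config

config.load_kube_config()                      # or load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

# Spark labels driver pods with spark-role=driver.
pods = v1.list_namespaced_pod("spark-jobs",    # assumed namespace
                              label_selector="spark-role=driver")
for pod in pods.items:
    owners = pod.metadata.owner_references or []
    # An empty owner list means this is a "naked" pod: nothing will restart it.
    print(pod.metadata.name, "owners:", [o.kind for o in owners])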
The second challenge is around driver
pod distribution across nodes and its implications
for cost. Driver pods are allocated across
nodes based on resource requests and node affinities.
However, as jobs conclude, you may observe
some pods scattered across the nodes due
to the constraint imposed by the naked pod issue.
This distribution can prevent nodes from
scaling down, which hurts the running cost.
Another challenge is around startup time overhead.
This time comes from two factors. The first one
is when a new node
is required for the driver or an executor to be allocated,
so there is some latency waiting for this node
to become available. The other factor is the
startup time of the driver pod itself. If you're
using heavy Python libraries to start
your job, you will have to wait for these libraries to be installed
and for some configuration to be applied
for this job before the pod
becomes available. So this also impacts the time for the
job to start executing once it's submitted.
Another challenge is the Kubernetes scheduler itself.
To understand this issue, we
must understand how Kubernetes allocates a pod to a
node. The scheduler, called kube-scheduler,
watches the API server
in Kubernetes. Once the master gets
a request to create a pod, the scheduler will
start looking for an available node to host this pod,
based on the resource constraints defined in the
pod definition itself and also the node affinities.
Once the scheduler finds feasible nodes
to host this pod, it runs a scoring step to
find the best match among these nodes.
And if the scheduler doesn't find any feasible
node to host this pod, the pod will
remain unscheduled until the scheduler finds a
feasible node for it.
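You can observe that behavior by inspecting pending Spark pods: the PodScheduled condition carries the scheduler's reason when no feasible node exists. A short sketch with the Python client follows; again the namespace is an assumption, and the label matches Spark's default executor label.

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pending = v1.list_namespaced_pod(
    "spark-jobs",                                   # assumed namespace
    label_selector="spark-role=executor",
    field_selector="status.phase=Pending",
)
for pod in pending.items:
    for cond in pod.status.conditions or []:
        if cond.type == "PodScheduled" and cond.status == "False":
            # Typically reason=Unschedulable with a message such as
            # "0/5 nodes are available: Insufficient cpu."
            print(pod.metadata.name, cond.reason, cond.message)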
So what is missing? Kubernetes was built for running microservices
with a scale-out architecture in mind. The default Kubernetes scheduler
is not ideal for AI and ML workloads, lacking
critical high-performance scheduling components like
batch scheduling, preemption, and multiple queues for
efficiency. In addition,
Kubernetes is missing gang scheduling for scaling
up parallel-processing AI workloads across multiple
distributed nodes.
Also, most AI and ML jobs
require an array of libraries and frameworks, including
wheels, eggs, and JARs.
This diversity requires a robust tracking system to ensure everything
within our container image is up to date and functions as
expected. Moreover, the size of container images
becomes a critical consideration: as we add
more components, the images grow larger, which again
slows down deployment time and
impacts efficiency, and adds the burden of managing compatibility
and upgrades of these versions.
Another challenge to running any Spark or ML job
inside Kubernetes is related to a concept called
node awareness. Mainly, the scheduler will
allocate the pod based on the resource requests
and pod affinities which are defined by the pod
itself. However, the node state itself is managed
by the kubelet, an agent running inside each
node in your cluster that knows the state of the
node itself. So, for example, if a node has
disk pressure or some kind of throttling
on some resource, you need to have that awareness
before you allocate a job onto the node.
You must also utilize node affinities and pod affinities
together to get the best match when allocating pods to
nodes with the Kubernetes scheduler.
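Spark lets you express this kind of awareness with pod templates: a template file can carry nodeAffinity (or tolerations) that the plain configuration keys cannot. The sketch below writes a template that prefers SSD-backed nodes and points Spark at it; the label key and values are hypothetical.

import tempfile
from pyspark import SparkConf

# A pod template preferring nodes with fast local disks (hypothetical node label).
POD_TEMPLATE = """\
apiVersion: v1
kind: Pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: disk-type          # hypothetical node label
            operator: In
            values: ["local-ssd"]
"""

template = tempfile.NamedTemporaryFile(suffix=".yaml", delete=False)
template.write(POD_TEMPLATE.encode())
template.close()

conf = SparkConf().set("spark.kubernetes.executor.podTemplateFile", template.name)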
Another challenge
is related to integration and monitoring in
Spark. In Kubernetes, to monitor a job you have to use the Spark
UI or the Spark History Server.
These tools are mainly concerned with the job:
they are job focused and concerned only with the tasks
and the stages, plus some kind
of resource monitoring of the infrastructure. But they are missing
the correlation between the cluster behavior,
the Kubernetes behavior, and the job. So you will find
it difficult to troubleshoot some issues you may
face. There is also the integration with third-party tools:
the current way to submit any job, as we saw in the first slide, is to
use the spark-submit command, which is a CLI command in
Spark, so it's not friendly to integrate
with other tools.
So after addressing all of these challenges, we
started to build our beloved solution, Chidori.
We started building Chidori with a mindset to solve all the
issues that I listed in the previous slides: solving the
naked pod availability issue, providing a more stable framework that can
run Spark or ML jobs inside Kubernetes,
providing a well-integrated REST API for third
parties, and providing clearer monitoring that
correlates the different factors, so you
have good troubleshooting for the jobs running inside Kubernetes.
So in this diagram I will explain the high-level
design of Chidori. Let us start with the Chidori server.
As I said in the first slide
of the issues, we have the naked
pod availability issue when we run our driver inside Kubernetes.
So we built Chidori with the concept of
being a host for Spark drivers inside
Kubernetes. Chidori will host the Spark driver, and Chidori
itself is a Kubernetes Deployment, so it's
totally managed by Kubernetes to guarantee high
availability and stability. We also built an API
server that provides multiple APIs to deal with Spark
on Kubernetes, like create job, delete job, list jobs, and
get logs. This API will be integrated
with the spark-submit client and also integrated
with any third parties that want to work with
Spark on Kubernetes.
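To give a feel for what a well-integrated REST API means in practice, here is an illustrative client sketch. The host name, endpoint paths, and payload fields are hypothetical, chosen only to mirror the operations named in the talk (create job, list jobs, get logs); the real Chidori API may look different.

import requests

CHIDORI = "http://chidori-server:8080"   # hypothetical service address

# Submit a Spark job (hypothetical endpoint and payload shape).
job = requests.post(f"{CHIDORI}/api/v1/jobs", json={
    "name": "daily-training",
    "mainApplicationFile": "local:///opt/app/train.py",
    "driverMemory": "2g",
    "executorInstances": 3,
}).json()

# List jobs and fetch the driver logs of the one we just created.
print(requests.get(f"{CHIDORI}/api/v1/jobs").json())
print(requests.get(f"{CHIDORI}/api/v1/jobs/{job['id']}/logs").text)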
So once we receive a job inside Chidori, it will be queued.
We built the queuing because we want Chidori
to control how many drivers can run at a time.
The admin can configure the maximum
number of jobs and the maximum allowed
memory and CPU to be consumed at a time.
So once the job is received by the API server, it
will be stored in our queuing system.
We provide an interface for multiple queuing systems
like RabbitMQ, Cloud Pub/Sub, and
Azure. Once the job is queued,
if there is enough capacity for the job to run,
the scheduler will fetch this job from the queue and start
a goroutine to run this job, and
the core engine will start tracking this job
to manage the whole lifecycle of the job. All
of this metadata is stored in our backend store for
monitoring and auditing purposes.
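The queue-plus-capacity idea can be sketched roughly as follows. This is only an illustration of the pattern just described (admin-configured limits, a consumer that launches a job only when capacity allows), not Chidori's actual implementation, which is described as using goroutines and pluggable queue backends.

import queue
import threading

MAX_CONCURRENT_JOBS = 5          # admin-configured limit (illustrative value)
job_queue: "queue.Queue[dict]" = queue.Queue()
capacity = threading.Semaphore(MAX_CONCURRENT_JOBS)

def run_job(job: dict) -> None:
    try:
        # Here the real system would create the driver, track its lifecycle,
        # and persist its state to the backend store.
        print("running", job["name"])
    finally:
        capacity.release()        # free a slot so the next queued job can start

def scheduler_loop() -> None:
    while True:
        job = job_queue.get()     # take the next queued submission
        capacity.acquire()        # block until we are under the concurrency limit
        threading.Thread(target=run_job, args=(job,), daemon=True).start()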
We also
built Chidori to interface with many Spark
vendor providers like Incorta, Kubernetes, and Databricks.
So you can use Chidori to submit jobs to Incorta,
to your own cluster in Kubernetes, or to
your cluster in Databricks. We also built
monitoring, to monitor the running jobs and get
full monitoring capabilities that correlate the different factors
while you troubleshoot your jobs.
We also have a connect layer that
provides a Spark Connect interface to other
parties on the client side. We also provide our own
spark-submit Chidori version that easily integrates
with our Chidori server. So you can
consider Chidori a full AI and
ML cluster manager for Kubernetes that provides
full integration with Kubernetes and also with
different tools in the ML ecosystem,
like MLflow and Kubeflow. So you can focus on your
business logic by developing your model training,
deployment, and serving, and Chidori will take
care of the infrastructure management. So now
it's time for the demo part.

Hi everyone. As mentioned by
Biga, this is Nadine Khaled from the Incorta Cloud team, and today I'm going to demonstrate
how to install Chidori in your environment and put it into
action. As you can
see here, we provide Helm charts for easy installation into your namespace.
Once Chidori is installed, you can verify that all
the infrastructure components are created.
By infrastructure components here I mean the Spark
server deployment, which in our case is Chidori;
the RabbitMQ StatefulSet, which is responsible for
queuing the jobs; and the Chidori core deployment,
which is responsible for monitoring the jobs that you have run before.
And also, as you can see here, we have created all the necessary services
that are responsible for making the deployments communicate with each
other.
Chidori also simplifies the management of Python
packages, so you can install the Python packages that you want
for your job execution,
removing the hassle of installing them manually. As you can see
here, I have installed the Python package TensorFlow, and all these packages
are already pre-installed in Chidori, so you don't need to install them
again. Also, through Chidori
you can submit your Spark jobs through any Spark master,
whether it's Kubernetes, Databricks, or Azure
HDInsight. Chidori also offers flexibility
in specifying the driver memory that you want: you can choose the size
of the driver that you want.
So let's go back to the Chidori setup.
Once you make sure that all the pods are up and running, you can start
submitting your Spark jobs through spark-submit, and you
can open Chidori monitoring to
see the status of these jobs.
As you can see here, for all the jobs I have created before,
I can filter by status, whether failed
or succeeded. I can also filter by
the date the job was created, and I can filter
by the schema name and the table name of the job that
was created. You can also preview
the history of the jobs that you have created with the same schema
name and table name. As you can see here, these are the jobs
that were created with the same schema name and table name, and these are their
statuses and all the information about them.
You can also view this history in
a chart view, so you can
see how
long these jobs took to be
loaded or created through Chidori.
And you can also perform some actions on the jobs:
you can download the Spark
driver logs, and you can also open a job in the Spark History
Server, which will redirect you to the Spark History
Server. Chidori also provides
a shareable link feature.
So when you want to share with someone the history of a
job or any details or information about the job,
you can just give them this signed URL.
This removes the hassle of logging in to Chidori
monitoring with credentials. So as you can see here,
you just copy the shareable link, and the
person you gave this URL to can open the
link and view the history
of these jobs.
They can display it, as I said before, in the chart view
and display the details about each job that
was created. So that was Chidori,
and I hope you enjoyed the demo. Thank you
for attending our session, and have a good day.