Transcript
In this talk, we'll be talking about Kubeflow and how you can get an end-to-end machine learning platform with just a few clicks. My name is Mofi, I'm a software engineer at IBM and a contributor to the Kubeflow project. If you want to follow along and find the slides, the link is on this slide: tiny.cc/e2emlk8s. So once again, it's tiny.cc/e2emlk8s.
So not too long ago, as we can all remember, machine learning and AI were a novelty, right? Companies were spending a lot of time and money researching these machine learning ideas, and it was not necessarily a core part of your business. Some companies were doing it well, some companies were doing it a little bit more
sporadically. At IBM, where I work, even ten years ago machine learning was a huge research topic and we were spending hundreds of millions of dollars every year. But it wasn't directly making any business impact. It was more on the theoretical side of the business, where you have a lot of data, you just don't know what to do with it, and you just try different things out. You try to gather some information from the data.
Well, if you look at the whole machine learning ecosystem, there are a lot of different things that need to happen for machine learning code to become useful. But most of the effort is still spent on that small, tiny block in the middle: the machine learning code. Data scientists love their machine learning code. We write a lot of code in Python, R, TensorFlow and all of these things, spending a lot of time in the middle, when to make your machine learning code valuable you actually need to spend a lot more time on the things around it, right? You need to spend time on data verification and analysis, you need to serve the model you create, and you need to monitor the model so that you know the thing it's doing is the right thing to do. So now, more and more, machine learning is a core part of every business, and just the machine learning code is not good enough to serve the underlying business that is trying to improve by using machine learning and AI. Data is everywhere. We generate more data now than ever, and understanding what that data is and what that data means is more important than ever.
It's estimated that in the next ten to 15 years, our data ingestion and data creation will quadruple, or grow exponentially. But if we don't understand what that data means to us, we are just wasting time and money by generating more data; we're not actually getting anything of value out of it. So with that, because machine learning is becoming more and more mainstream and more and more a core part of our business, comes the rise of MLOps.
If you are not familiar with MLOps, you might ask: what is MLOps? MLOps is the ability to apply DevOps principles to machine learning applications. This is the definition given by the MLOps Special Interest Group from the CD Foundation. What MLOps is trying to solve is that right now, even to this day, in a lot of companies and a lot of organizations machine learning is a sporadic thing. It's done by a data scientist, it's almost treated like research, it's done almost in an educational capacity, and we want to make that a core part of your business.
If you're building software in 2021,
I'm hoping you have embraced some of the DevOps practices, where you have your code, your code is being continuously tested and integration tested, and then your code goes through some sort of continuous delivery model: your code goes to version control, then gets tested, gets built, gets pushed to production, gets tested again, and you have rollbacks and all those features. And MLOps is trying to bring those same principles into the lifecycle of your machine learning models as well.
Well, with that introduction: I am Mofi. I'm a software engineer and developer advocate at IBM, and I mostly do container things. So if you haven't figured it out by now, I am not from the world of machine learning; I'm not coming into this from the perspective of a data scientist. I come from the world of infrastructure and containers. Most recently, I have been contributing to the Kubeflow upstream project in the manifests and deployment special interest group. And if you need to find me on social media later, I can be found at moficodes on any of the social media above, mostly Twitter. So if you have any questions after the conference, feel free to reach out to me at moficodes on Twitter.
So the title of the talk is about an end-to-end machine learning platform, so what does that even mean, and why do we care about an end-to-end machine learning platform? Without going deeper into what it means, let's just talk about what we want an end-to-end machine learning platform to have. When we say end-to-end machine learning platform, we're talking about something that covers everything from the start of the data to actually using that model for something useful in our application, in our business. The very first step, or at least one of the earlier steps, is data ingestion, right? You have data being produced, either by you collecting the data from your application, or users just sending you data, or you're collecting some data by some other means. And this is the part a lot of companies are still stuck at; machine learning and data analysis actually stops for a lot of people at that stage.
But if you want to do anything useful with the data, if you want to build some intelligence around the data, you need to take that data and clean it up, transform it into something usable, and validate it so that you know this data makes sense within the larger structure of the whole dataset. Then you are preparing the data for training: you are building models, validating models, and training them, perhaps as distributed training, to create the model that you can use. This is where we see a lot of the gaps between the theoretical structure of machine learning and the business use cases that we have.
Oftentimes data scientists are building these models, running the training on their own machines, and then testing that it works. This is great. But the true value of machine learning can only happen when we are serving that model, rolling it out into our application, using that intelligence, and then continuously monitoring that model's performance, logging it, and making that a whole loop. Right? We're not just stopping at, oh yeah, I have built a model, great, now our problems are solved. We actually now have to say: we have built a model, we're serving this model, we are getting information from serving that tells us the data we are getting makes sense and the performance makes sense, and we're continuously improving that model.
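To make that loop a bit more concrete, here is a minimal sketch in Python of the offline half of it: ingest, validate, train, and gate on a quality bar before promoting anything to serving. The dataset, model choice, and thresholds are illustrative assumptions of mine, not something from the talk.

```python
# A toy end-to-end loop: ingest -> validate -> train -> evaluate/gate.
# Serving and live monitoring would happen outside this script.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Ingest: a toy dataset standing in for your real data source.
X, y = load_iris(return_X_y=True)

# 2. Validate: a trivial sanity check that the data has the shape we expect.
assert X.shape[1] == 4, "unexpected number of features"

# 3. Train.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 4. Evaluate and gate: only promote the model to serving if it clears a bar.
accuracy = accuracy_score(y_test, model.predict(X_test))
assert accuracy > 0.9, "model below quality bar; do not promote to serving"
print(f"validation accuracy: {accuracy:.3f}")
```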
So an end-to-end machine learning platform ideally would give us all these things, right? And some parts it will probably do better than others. You also want to make sure that we have a way to continuously improve. End-to-end doesn't just mean that we have all the features and we're good to go; I think end-to-end should give us a way to continuously improve at each of those steps. So there are many commercial end-to-end machine learning offerings out there, and almost all major cloud providers have one. Other than that, you also have a bunch of third-party companies providing a machine learning platform as a service you can use.
For example, you have Google Cloud AI Platform. I happen to work for IBM; we have a couple, like IBM Cloud Pak for Data and Watson Studio. AWS happens to have SageMaker and many other services, and you have Azure Machine Learning. And again, as I said, there are many more third-party companies providing this as a service. The pros of using an end-to-end machine learning platform from one of the cloud providers would be: it is fully managed, so you don't have to do the upkeep, and all the bells and whistles are included with it. It works well with other cloud services. So if you are already a customer of Azure, using Azure's machine learning service will work pretty well with the other Azure services that you have, like Azure Pipelines and Azure Kubernetes Service, and all the other things will kind of work together. And the same goes for any other cloud provider. And because you are already with a major cloud provider, your machine learning platform itself is also cloud scale; you have an easier time just scaling everything up. And you also have enterprise support, right? You are working with a cloud provider and you are on an enterprise plan, so you definitely have a lot more support with it. If something goes wrong, you have someone to talk to or get support from. Now, some of the cons, I would say,
of an end-to-end machine learning platform from a cloud provider: it's going to be expensive, obviously, because it's a pretty hefty cost you have to consider if you want to go down that route. You definitely have the chance of getting vendor locked in. One of the reasons that happens is that you are not necessarily doing machine learning, you're doing AWS machine learning. Whatever vendor you go with, you will have a version of "okay, we are doing this the way Google thought it should work," then you build your infrastructure around it, and at some point you find out that you can't change it without major rework if you want to move to someone else. So you have this problem of vendor lock-in. And because your environment is managed by the cloud, it is not always super clear how you could have a local environment that replicates exactly how the cloud environment works. Oftentimes some of these components are not even open source, so you have no idea exactly how they work under the hood. So you have to do pretty much everything on the cloud, or if you do something on your local machine, you run the risk of not having a matching environment between cloud and local. Obviously, many of those vendor tools are not open source, so you don't really have a way of running them on your own, looking at the source code, or improving the code; if you see anything problematic, you are dependent on the cloud provider to continuously make that improvement. And code and models are usually not portable, because you're building them in the cloud with their proprietary software, so it's going to be difficult for you to take your code and model and run it somewhere else. That's not true for every provider and every cloud, but there is a high chance in a managed environment
that it could happen. So why not DIY, right? A lot of open source tools are out there. TensorFlow is open source, and there are a bunch of other things; if you look at the whole end-to-end picture, many of the tools you need are already out there and open source. So you could definitely do something yourself using open source, or it is also possible for you to write everything from scratch. And it is not the first time the industry has done so. Companies like Uber, Netflix, Airbnb, Lyft and many more have actually rolled out their own solutions internally. Uber has a platform called Michelangelo, and it covers very Uber-specific problems in a very Uber-specific way; it solves those machine learning problems by building an entire platform internally at Uber. So if you are going down that
route, some of the pros you can look at are: you have full control over the platform, right? You are building the platform for your company and your exact needs, so everything you build would be custom made for you, because it's owned by you. Usually you would not have any vendor lock-in; you decide exactly how you run it, even if you need to switch cloud providers. Because you own the software and the platform itself, you just need to pay for the infrastructure, if you're not running in your own data center, and it's customized to your needs. So at least initially you would feel like this is the perfect solution, because you made that solution for the problem at hand. But some of the cons are:
it's expensive. Although it might be less expensive because you are not paying for the service itself, the engineering hours you would spend building something from the ground up like this would be pretty expensive to manage over time. And you are on the hook if something goes wrong; this is a service you made. You have to make sure it is up to date with the latest things, and as things change and progress, you will have to have a continuous upkeep of engineering hours to keep it current. So as time passes, the platform becomes harder and harder to manage: you built a platform to do some machine learning for your business, you end up managing the machine learning platform, and now you don't have time to maintain your business. So there are some difficulties with DIY as well. Now let's take a step back and think about what
we would want from a perfect end-to-end machine learning platform, right? Number one, it should be built on scalable infrastructure; it should be something that can scale to whatever we need it to scale to. It should use existing tools data scientists already use; we don't want to create something so new that our data scientists have a huge learning curve and have to learn new things all over again. Ideally, and at least in my opinion, it should be open source, so we know the community itself is improving and taking forward the project that we use and depend on. It should be supported by the industry; we want to make sure we're not the only ones using it, which is good for getting long-term support and also good for finding long-term talent that knows these tools. It should have enterprise support options: of course you should have the option to DIY, but you also want to make sure that when you are ready to go the route of "I want to just pay some company to deal with some of the management issues," you have that option as well. And finally, it should be portable.
Your machine learning models and code should be portable, for you to take anywhere you want to take them. Well, with all of that, and again I probably gave it away early on: is Kubeflow the tool that covers all of these end-to-end machine learning needs that we have? I would like to think that it is. Now we're going to talk about how Kubeflow fits all of these criteria that I want to have in my end-to-end machine learning platform. Kubeflow is an open source project that contains a curated set of tools and frameworks for machine learning workflows on Kubernetes. And Kubernetes is a keyword here, because running on Kubernetes is what gives us a bunch of the features that we see here.
So Kubeflow is scalable, composable, portable, open source, industry supported, and multi-tenant. You can have the same Kubeflow environment used by the entire team, give them individual spaces, and run your experiments in an environment that is going to be very close to what the final destination of that product or project is. So, scalable:
it's built on top of Kubernetes, so scalability is built in. You get scalability of the pods, and you also get scalability of the nodes. Depending on your installation of Kubernetes, whether you are doing it yourself or using a managed Kubernetes, you can scale your cluster to any reasonable number of pods. Also, because it's built on top of Kubernetes, there is an existing pool of skilled individuals who know how the infrastructure works. Either you are already using Kubernetes for other things in your company, or it is fairly easy to find folks with Kubernetes skills. So you can also think of using Kubernetes as a means of scaling the teams that need to use this platform, right? You can easily find talent, whereas if you are building something very custom in house, you would have a harder time finding people who just know the system; you have to hire someone and train them on the system. So you're spending a lot of cycles building skills up, where with this system you already have people who know Kubernetes already. And hopefully, by virtue of Kubeflow being an open source project, a lot of people would also know Kubeflow as a system. Composable: going back to this slide,
Kubeflow has ways to manage each of those steps by using different tools under the Kubeflow ecosystem. We're going to look at a few of these today. But composable basically means you can use different parts of the tools under Kubeflow to create a system that covers all of the things that we want in an end-to-end machine learning platform. Portable: you can take Kubeflow from local to a training environment, or from a training environment to the cloud, or from cloud to cloud, and your Kubeflow environment underneath stays pretty much the same. We like to think that when you're running experiments on your local machine, versus when you're running your training, versus when you're running your cloud deployment, it all kind of looks the same, right? Like this is
what we want to happen. But usually what ends up happening is that our experiment environment is much smaller in scope; we are just running maybe a Jupyter notebook or one Python file. Then in our training environment we are running that, but on a much beefier machine with GPUs and other resources. Finally we go to the cloud, and now we are dealing with a lot more things: we are dealing with IAM permissions, we are dealing with models, we are dealing with canary deployments, rollbacks, rollouts. So our environments end up looking a lot different when we are just doing it ourselves. And although we would like to think, okay, we have deployed it ourselves and we have tested it, the model works, every single minor difference in each of our environments can end up leading us into outages, right? Because we probably haven't covered the differences between our experiment stage and our staging, or between our staging and our cloud. And each of those differences ends up turning into something bigger later on,
because machine learning is no longer a novelty. Machine learning is a core part of our business, and machine learning is no longer just research, right? We have to use machine learning now to get true insight into our business, to be able to stay ahead of the curve. Machine learning used to be something that you used almost as a way of getting an advantage; now machine learning is what you have to use just to keep up with the curve, because everyone else is using it too. So it's part of the core business. We quality control our software; we make sure that our software is not regressing from version to version. Then we need to actually quality control our machine learning artifacts as well. We can't just build the model on our local laptop and deploy it into production by copying some files over. That's not how we do software, and that's not how we can do machine learning either. So with Kubeflow, you can have a local environment
of Kubeflow just on your laptop, or in a dev Kubernetes cluster somewhere. You could have a training environment with GPUs also running Kubeflow. And finally, you can have the deployment Kubeflow in the cloud as well. And now you have limited the number of differences between your experiment, training, and cloud environments, because all of them are using Kubeflow underneath. So this makes your environments pretty much identical to each other, and thus makes your environment portable.
These are some of the Kubeflow components we usually talk about. First of all, you have the platforms, your clouds: it could be on-prem Kubernetes, or local Kubernetes, or any of the cloud providers. On top of that, you have the Kubeflow applications. There are a lot of names here; we're not going to be talking about all of them today, but some of the key things are here. We have Istio as the network layer. We have Argo or Tekton for pipelines, so if you want to build Kubeflow pipelines, you use one of those. For machine learning tools, you have Jupyter notebooks, MPI, MXNet, TensorFlow, PyTorch, XGBoost, and all these other things. So, another
view of the components: the very first thing we have in Kubeflow is a dashboard that lets us look at all the things that are in our Kubeflow environment right now. Next we have Jupyter notebooks, and as of the latest version of Kubeflow we also have a bunch of other servers there as well, like code-server or RStudio. We have some of the machine learning frameworks like TensorFlow, XGBoost, and PyTorch. For pipelines we have a choice of Tekton or Argo. For serving we have Seldon or KFServing. For machine learning metadata we have MLMD. For the feature store we can use Feast. And for monitoring, because it's running on top of Istio, we can make use of Prometheus and Grafana dashboards to look at what's happening in the cluster, as well as monitor our models as the traffic routing is happening. So if
you want to deploy Kubeflow today, you can head to the manifests repository and use Kustomize and kubectl. That's the latest, most recent instruction on how to install Kubeflow, and you can use that to install Kubeflow on your Kubernetes cluster or on your local machine with minikube. The manifests repository is now structured in a way where it's easy to find what the extra apps are, what the common components of Kubeflow are, and what the contributions from the community to Kubeflow are. Up until about two weeks ago, when the 1.2 release was the main release of Kubeflow, the repository was a little bit more cluttered; we had a lot more things at the top level, so it was difficult to navigate. But with the new 1.3 release, we have improved some of the Kubeflow deployment strategy by changing the layout of the repository. So again, if you looked at Kubeflow before, this is what it used to look like, on the left, the structure. With the latest release we went through and changed much of the structure by restructuring things. It still does the same exact thing, but it's restructured to do things a little more cleanly. So how can this
help? The goal is to improve accountability for maintaining components and manifests, and to increase modularity: you can pick and choose the tools you want to install pretty easily, and we want to make sure the deployment experience is smoother. So as a first-time user, you should have a much smoother experience than if you had tried Kubeflow at an earlier time. You can also use the Kubeflow Operator. So you have the operator that you can make use of; a Kubernetes operator is built so that we have an easier time deploying Kubernetes resources. Again, I'm going to just skip through the operator part for a second, but you can use the Kubeflow Operator to deploy Kubeflow to Kubernetes or OpenShift, and you have documentation about it here. But one thing I want to mention: Kubeflow
is an open source project, and it's not all sunshine and rainbows. Some of the difficulty with Kubeflow is that, because it's open source and rapidly growing, there are definitely growing pains that we see, because a bunch of the underlying components are also open source and have their own release cycles. So you have some challenges there, right? Kubeflow has many moving parts, and each component has its own release cycle and upgrade path. So if you're maintaining Kubeflow as a distribution for your company, it is, at least at this point, quite a big challenge. We as the Kubeflow team are trying very hard to make sure that doesn't become a problem, and for the most part, as an individual end user of Kubeflow, you don't really see a lot of these problems. But as maintainers, we see it quite a lot: when an underlying component updates, we have to update everything to make sure we're on the latest version and using the latest and greatest of those underlying changes. Also, each of the Kubernetes platforms Kubeflow runs on, from Azure, from IBM, from AWS, from GCP, has small differences that add up in the overarching Kubeflow deployment. So if you are using Kubeflow on minikube and you want to move your Kubeflow deployment to AWS, say, it might not be the exact one-to-one change we would like it to be, but for the most part it is still very similar to how you would use Kubeflow on your local machine versus in the cloud. Now, the future:
Kubeflow 1.3. Well, I'm saying "will", but it is already here; it was released about one week ago. And all the distributions, like IBM and AWS and GCP, are right now testing to validate that it works on the newest release. So if you are looking at Kubeflow in about a week's time, you should be able to go and use Kubeflow 1.3 on your favorite cloud platform. And because the manifests repo has been restructured, it's much easier to navigate. Okay, so some of
the references: if you want to try out Kubeflow, please go to the Kubeflow manifests repository, and if you want to join the community, the Slack, or the mailing lists, you should go to the community page. You can also learn more about the Operator Framework and how the Kubeflow Operator can improve the experience of installing Kubeflow.
But before we finish, I want to quickly show you the Kubeflow environment. I have a Kubeflow deployment on IBM Cloud, and I'm using App ID as the authentication mechanism; by default, Kubeflow comes with Dex for authentication. So I am authenticating against the Kubeflow environment, and once I have authenticated I am here in my Kubeflow. This is the Kubeflow dashboard you would see the first time you come in, and on the left side you can see some of the tools Kubeflow has. I have some notebooks created, and in the notebook servers, as of Kubeflow 1.3, as we said, we have the VS Code (code-server) option here, as well as RStudio, as well as JupyterLab. And this is namespaced per user. So right now I am logged in with my email here; I can also log in with a different email account, with my Google account. Once I log in, it will ask me to create a new namespace, and once I do that, I am on the same cluster but in a different namespace. So if I go to the notebooks, the other notebooks were for the other user and they're not here. So on the same Kubernetes cluster you could have multiple team members working simultaneously next to each other.
I'll actually go back to the other user, because I had something else to show there. Here I also have experiments that I can run. I have run one experiment; it's a simple pipeline that does a coin flip and tests a condition based on the result. You can define your pipelines using a Python DSL, and KFP-Tekton will run that pipeline, as in the sketch below.
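As a hedged illustration of what that Python DSL looks like, here is a minimal coin-flip pipeline using the KFP v1 SDK. It is not the exact pipeline from the demo; the container images and the client host below are placeholder assumptions.

```python
import kfp
from kfp import dsl


def flip_coin_op() -> dsl.ContainerOp:
    """Flip a coin and write 'heads' or 'tails' to a file output."""
    return dsl.ContainerOp(
        name="flip-coin",
        image="python:3.9-alpine",  # placeholder image
        command=["sh", "-c"],
        arguments=[
            "python -c \"import random; "
            "print('heads' if random.random() > 0.5 else 'tails')\" "
            "| tee /tmp/result"
        ],
        file_outputs={"result": "/tmp/result"},
    )


def echo_op(message: str) -> dsl.ContainerOp:
    """Print a message in its own step."""
    return dsl.ContainerOp(
        name="echo",
        image="alpine:3.13",  # placeholder image
        command=["echo", message],
    )


@dsl.pipeline(name="coin-flip", description="Flip a coin and branch on the result.")
def coin_flip_pipeline():
    flip = flip_coin_op()
    # Branch on the output of the previous step.
    with dsl.Condition(flip.outputs["result"] == "heads"):
        echo_op("it was heads")
    with dsl.Condition(flip.outputs["result"] == "tails"):
        echo_op("it was tails")


if __name__ == "__main__":
    # Submit a run to the Kubeflow Pipelines API; the host URL is a placeholder.
    client = kfp.Client(host="http://localhost:8080/pipeline")
    client.create_run_from_pipeline_func(coin_flip_pipeline, arguments={})
```

With the Tekton backend shown in the demo, you would typically compile or submit this through the kfp-tekton SDK rather than the vanilla client, but the pipeline definition itself stays the same.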
You can use that pipeline to build your model, and then use KFServing to serve that model, and get the full pipeline: data ingestion, data validation, then creating the model, then serving the model, and then using Istio to monitor the information as your traffic gets routed to your model and served by the model.
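Once a model is served this way, calling it is just an HTTP request against the inference endpoint. A minimal sketch, assuming an InferenceService named my-model is already deployed, with placeholder hostnames:

```python
import requests

INGRESS_HOST = "http://istio-ingressgateway.example.com"    # placeholder ingress URL
MODEL_NAME = "my-model"                                      # placeholder model name
SERVICE_HOSTNAME = f"{MODEL_NAME}.my-namespace.example.com"  # placeholder Host header

# KFServing's v1 data plane uses a TensorFlow-Serving-style "instances" payload.
payload = {"instances": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]}

response = requests.post(
    f"{INGRESS_HOST}/v1/models/{MODEL_NAME}:predict",
    json=payload,
    headers={"Host": SERVICE_HOSTNAME},  # Istio routes on the Host header
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g. {"predictions": [...]}
```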
You also have Katib for hyperparameter optimization; if you want to do some of that, you have that option as well. We also have PyTorch, MXNet, and XGBoost installed here, so in pipelines we can make use of those frameworks to build our models in many ways. Thank you so much for joining me in
this session. If you have any more questions or would like to learn more, you can go to kubeflow.org to learn more about Kubeflow and get started with Kubeflow. If you have any questions for me, you can reach out to me at moficodes on any of the social media. It's @moficodes.
So thank you once again to the conference organizers for giving me this opportunity.
With that, I thank you all. And until next time.