Transcript
This transcript was autogenerated. To make changes, submit a PR.
Good day everyone. The topic for today would be machine
learning and machine learning engineering in the cloud with
Amazon SageMaker.
I am Joshua Arvin Lat. People call me Arbs.
I am the chief technology officer of Nuworks Interactive Labs.
I'm also an AWS Machine Learning Hero, and I'm the
author of the Machine Learning with Amazon SageMaker Cookbook.
So feel free to check it out. Here we
have about 80 recipes to help data scientists,
developers, and machine learning practitioners perform ML
experiments and deployments. You will see that with just
a couple of lines of code, you will be able to do a lot of
things with Amazon SageMaker. So let's start with machine
learning. No machine learning talk is complete without introducing
this quickly. So what is machine learning? Machine learning
is about creating something
which helps you make an intelligent
decision without having to be explicitly
programmed to do it. So one example of this would be,
let's say we have a picture of a cat. Your
machine learning model would then decide
if it's a cat or not a cat. So even
without human intervention, the machine learning model should be able
to know if it's a cat or not a cat. And it
can make use of a lot of training data
to help it prepare and generalize a model
which can be used to identify and process new
images, whether they're cats or not cats.
This is a very simplified example,
but you will definitely get a better understanding
once you see more examples of what machine learning
can do for us. Next, when doing machine learning, we will
definitely start with very simple examples on our local machine.
But once we start to work with teams, once we
start to work with more complex requirements,
it becomes essential that we start using
machine learning frameworks and platforms to make our lives
easier. So why is this important? So let's say that
we were to build everything from scratch.
There's a chance that the other person in your team would
have no idea what you just built, unless of course,
you document it properly.
You share these ways of
working through documents
and sample source code. But the problem there is
that you will be building everything from scratch, and that will
take time. The advantage of using machine learning frameworks
and platforms is that they are already fairly complete,
in the sense that they already have a lot of features and capabilities
built in, because a lot of people are using them. As
people around the world use these tools, the tools get updated,
even if you yourself haven't encountered a specific
requirement yet. So once you do encounter
that specific requirement, you would probably just
need to use the framework's or platform's existing
capabilities, which would save you time. Of course there will
be cases where you will build something from scratch, but try to
make sure that it's practical and it makes sense.
So this is one good example of practical
applications of machine learning, and also
possible pragmatic and practical solutions
using existing tools, services, or capabilities
of existing platforms. If we look at the left
side, we can see anomaly detection,
product recommendation, forecasting, image and video
analysis, document classification, and language translation:
just a few of what we can do with machine learning.
On the right side, we have the possible solutions.
So how can we solve an anomaly detection
requirement with just a few lines of code?
We can make use of the SageMaker Random Cut Forest algorithm,
which is already optimized for the cloud. It
makes use of the existing Random Cut Forest algorithm,
and the AWS team optimized it to work with SageMaker and
the cloud resources. For product recommendation,
we can make use of Amazon Personalize, another service in
AWS which is built to solve this type of problem.
For forecasting requirements, we can make use of the SageMaker
DeepAR algorithm. It's similar to Random Cut Forest,
where we just make use of an existing container
image that the AWS team has provided for
us, so that all we need to do is use
that container and perform training
and deployment to solve forecasting requirements.
And the same goes for the other items in this list.
So of course, you won't need one to two
weeks of learning the nitty-gritty details of
how these things work. The advantage here is that even
if you are a newbie,
you will be able to get something to work within four to eight
hours. And that's pretty cool.
So instead of spending six months to one year just
trying to get everything to work because you built something from scratch,
you can have something which is already working.
You can present proof-of-concept work to
your boss or to your clients. And then once you
have a certain budget approved, that's the time you can deep dive
and, let's say, configure the hyperparameters, prepare a complete
machine learning engineering system and workflows, and so on.
So the advantage here is that you can build something fast, and you
can also configure it into something that's production ready.
So what can SageMaker do for us, and what is SageMaker anyway?
SageMaker is the machine learning platform
of AWS, which helps you work with
more complex and custom requirements.
AWS has a lot of machine learning services, but what
makes SageMaker amazing is that it has a lot of capabilities that
help you migrate your machine learning
requirements, workflows, and code to
the cloud with very minimal changes
to your existing scripts. What it
offers and provides is a certain level of abstraction
when dealing with cloud resources. If you were to
prepare and run simple experiments
on your local machine, you may not need very
large and very powerful instances
or computers or servers. However, once you
need to deal with production requirements, and once you are
going to work with really large files and really large models,
you will start to realize how hard it can be to get this
working in the cloud, because of course your local machine wouldn't
be enough to get these requirements running.
So here's what SageMaker can do for us, which is just one
of the cool things with SageMaker: with just
a single line of code change, you will
be able to configure the infrastructure
needed to run a certain part of the ML workflow.
For example, if you look at the screen, in data
preparation and cleaning, if I need two
instances of a certain instance type,
all I need to do is change one line of code, and
then that's going to work right away; a small sketch of what that
looks like follows below. The advantage
here also is that the instances
automatically get deleted after the data preparation
and cleaning step has completed, meaning you'll
save money, because you won't pay for
anything which is not running in AWS.
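As a rough illustration of that idea, here is a minimal sketch (not taken from the slides) of how the instance count and instance type for a data preparation job are just constructor parameters in the SageMaker Python SDK; the role ARN is a placeholder assumption.

```python
from sagemaker.sklearn.processing import SKLearnProcessor

# Hypothetical IAM role ARN; in a notebook you would usually call
# sagemaker.get_execution_role() instead.
role = "arn:aws:iam::123456789012:role/AmazonSageMaker-ExecutionRole"

# The "one line" that controls the infrastructure for this step:
# two ml.m5.xlarge instances are provisioned for the job and are
# terminated automatically once the processing job finishes.
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=2,
)
```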
Let's say we're in model training and hyperparameter tuning.
You can see here that, okay, that training and hyperparameter
tuning step will take time. So there,
all I need to do is specify six instances
of a certain type. And if I need to have a
really powerful instance type there, then yeah, I can just configure it there.
And when I need to deploy something, and I'm aware
that I'm going to pay for every hour that
the instance is running, of course I would choose a small instance type,
because the instance type needed for
deployment may not necessarily be the same
as the one needed for training, and it will need fewer resources
during deployment. So there we can specify one instance, and with
just a single line of code change, we'll be able to get
this working right away, which is pretty cool. So again,
the infrastructure abstraction component of SageMaker
already solves a lot of problems for us, because it
directly maps to the cost of owning this entire
thing. So, enough of the concepts. Let's
take a look at a bit of code and see how this works.
You can see the source code in the repository here.
On GitHub you have the Amazon SageMaker Cookbook repository, so feel
free to check that out so that you can see all the other code
snippets. You will be surprised
that all it takes is a couple of lines of code to
get something working with SageMaker. Of course, you will need to prepare your
data and you will need to perform model evaluation,
but if we were to perform training, it would be very
similar to the fit function of some existing libraries.
So what happens here? First we
initialize the estimator, and then we
set the hyperparameters. We can see here that we're
dealing with a machine learning algorithm that
handles time series analysis requirements,
so we have hyperparameters like
context length, prediction length, and so on,
because we're trying to make use of the DeepAR forecasting
algorithm of SageMaker. We specify
the data channels on the right-hand side; as you can
see here, data_channels equals a dictionary with train and test entries.
And then with one line of code we can perform the training step
with the fit function, and we pass
the data channels as the argument.
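Pieced together, the flow described above looks roughly like the following sketch. It is not the exact code from the slides; the S3 paths, instance sizes, and hyperparameter values are placeholder assumptions.

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Built-in DeepAR container image for the session's region.
image_uri = sagemaker.image_uris.retrieve(
    "forecasting-deepar", session.boto_region_name
)

# The instance count and type are just parameters here, which is
# the "one line of code" infrastructure change mentioned earlier.
estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    output_path="s3://<your-bucket>/output",  # placeholder bucket
    sagemaker_session=session,
)

estimator.set_hyperparameters(
    time_freq="H",          # hourly time series (example value)
    context_length=24,
    prediction_length=24,
    epochs=100,
)

# Train/test channels pointing to prepared data in S3.
data_channels = {
    "train": "s3://<your-bucket>/train/",
    "test": "s3://<your-bucket>/test/",
}

estimator.fit(data_channels)
```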
Next, if we need to deploy it,
all we need to do is a single line of code, which is deploy.
And you can see here that it's magic.
We run the deploy function,
we just specify the instance type and the instance count,
and there you go. All we need to do is wait for probably three
to five minutes, and then that production-level
endpoint is already working. So we won't have
to worry about the DevOps side of things,
and we won't have to worry about the engineering side of things, because that's
already handled by SageMaker.
And if we need to delete that endpoint,
it takes one line of code as well.
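Continuing the sketch above, deployment and cleanup would look roughly like this; the instance type and the sample payload are illustrative assumptions.

```python
from sagemaker.serializers import JSONSerializer

# Deploy the trained model behind a real-time endpoint.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    serializer=JSONSerializer(),
)

# Example invocation; the exact payload format depends on the algorithm
# (DeepAR expects a JSON request with "instances" of time series).
result = predictor.predict(
    {"instances": [{"start": "2021-01-01 00:00:00", "target": [1, 2, 3]}]}
)
print(result)

# One line to delete the endpoint and stop paying for it.
predictor.delete_endpoint()
```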
So what's the best practice when dealing with this
type of approach? We can optimize cost by
using transient ML instances for training models, and this
is automatically being done by SageMaker.
During training and even processing,
we can select the type of instance
or server that's going to run
the processing script or scripts.
In the first example at the top, we can see
that we have a large instance;
at the bottom we have a 2xlarge instance.
Of course, the 2xlarge instance is
more expensive than the large instance, but you probably won't
feel that cost much, especially if that instance
runs for only two minutes. If you have already
been using AWS for quite some time,
you may notice that if an instance is running
24 hours per day, times seven days, times four weeks,
then of course the cost will add up and you will significantly
feel that cost when you check the bill. But if you
are running the training instance for just two minutes,
then it's not that pricey. And increasing
the size of the instance is preferred here, because it will
significantly decrease the amount of time used for
training. Given that we're
dealing with transient ML instances, you won't need
a separate program or code just to delete
the instances. The instances will
be created and then automatically deleted
after the processing or training jobs have completed, which is
pretty cool. Before, you would have to program that.
Now all you need to do is run the fit function, and
after the fit function has completed, the instance gets
deleted automatically. So your next question would
be: do I need to create everything
from scratch again, now that I've found out about this new platform?
The answer would be no. SageMaker has been designed
to help existing machine learning practitioners
migrate their existing
code, scripts, and work to
SageMaker with very minimal modifications.
And there are a lot of options and layers here.
Of course, if you're just getting started, you can make use of the
built-in algorithms, as you can see on the left side. In the
middle, you can even bring your own container or container image.
The advantage here is that you
can prepare and build your own container image with
all the prerequisites in it. And if you have something,
let's say an R package or
an R script, where your model
is going to be built using those existing
custom scripts, then yes, you can also port that to SageMaker by
bringing your own container. And on the right side, you can
even bring your own algorithm and make use of
the smooth integration with existing machine
learning frameworks like TensorFlow and PyTorch.
You can even make use of Hugging Face transformer
models there. So the advantage is that for the
different things you have worked on, there's a counterpart
in SageMaker, and you'll realize:
oh, I didn't expect it to be that smooth and that
flexible.
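For the framework integration option, a rough sketch of what "bring your own script" looks like with the SageMaker Python SDK is shown below; the entry point script, framework versions, and hyperparameters are illustrative assumptions rather than code from the talk.

```python
from sagemaker.pytorch import PyTorch

# A hypothetical training script you already have locally; SageMaker
# runs it inside a managed PyTorch container on the chosen instance.
estimator = PyTorch(
    entry_point="train.py",
    source_dir="src",
    role=role,                      # IAM role from the earlier sketch
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.8.1",
    py_version="py3",
    hyperparameters={"epochs": 10, "batch-size": 64},
)

estimator.fit({"training": "s3://<your-bucket>/train/"})
```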
So what's the best practice? The best practice here would be to choose
what's best for you. You will be given a lot of options,
and given that SageMaker is flexible, all you need to
do is be aware of the features.
And what would be a good metric for that?
The metric would definitely be time,
because the less time it takes you to
build something or to prepare something, the more likely that's
the right way to go. Of course,
you will have other things to worry about, let's say the evaluation
metrics, the cost and so on. But one of
the factors you need to take note of is time. If you can build
something in 3 hours, I would prefer that
over something which can be built in three months.
Because after three months the requirements may have changed,
your clients may have changed their mind, or maybe that
would be too expensive already. Because if you were to think about cost,
it would involve the cost of the infrastructure resources,
other overhead costs,
the cost of paying the employees, and so on.
So with less time, you'll definitely save a lot.
So make sure that you take that into account because time
will always be a multiplier.
That said, how can we save time? You can save
time by making use of existing features,
and being aware of these features is the first step.
So let's take a step back and see why
we have so many features here. The reason
is that there are a lot of different requirements other
than training and deploying your model. Of course, when you're starting
to learn about machine learning, you'll start off with training
your model, deploying your model, and then evaluating your
model. But in reality, there are a lot more things you
need to worry about once you need to work with teams,
with different requirements, and with legal
and other concerns. So first, let's look at
the upper left side, SageMaker Processing.
SageMaker Processing is there to help you process
your data with a custom script.
The advantage of using SageMaker Processing is that if your
local machine, or the machine that you're using,
is not able to process a large amount of data,
you can make use of SageMaker Processing using the same infrastructure
abstraction capabilities that you're using when training your model.
So if you have big-data-scale datasets,
then you can use SageMaker Processing and just use a large instance
to get the task completed within two to three minutes or so.
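As a hedged sketch of that idea, running a custom preprocessing script with SageMaker Processing might look like this, reusing the sklearn_processor defined in the earlier sketch; the script name and S3 paths are placeholders.

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Run your own preprocessing script on the managed instances.
sklearn_processor.run(
    code="preprocessing.py",  # hypothetical local script
    inputs=[
        ProcessingInput(
            source="s3://<your-bucket>/raw/",
            destination="/opt/ml/processing/input",
        )
    ],
    outputs=[
        ProcessingOutput(source="/opt/ml/processing/output")
    ],
)
```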
Next is SageMaker Experiments, the one just beside
SageMaker Processing here at the upper middle.
With SageMaker Experiments, we can
manage multiple experiments. Of course,
you will not be running just a single experiment; with SageMaker
Experiments you can run a lot of experiments and
not worry about the details of how to connect the
different artifacts. It will be much easier for you to audit
experiments which have been performed in the past. So you can check
it out, especially when you need to get things
working in production and at work in general.
With automatic model tuning, on the upper right-hand corner, with
just a couple of lines of code, which we will show later,
we can get the
best version of a model.
What happens here is that
we'll be able to test a lot of different hyperparameter
configurations, prepare and build different
models, and then just compare the models and get the best one.
With automatic model tuning, all you need is probably two or
three additional lines of code on top of what you
saw earlier. And there, you'll see that, oh,
that's magic again: with very minimal code changes,
you'll be able to have something which automatically
gets and prepares the best model for you. So we'll discuss
that later with a couple of examples. Next, the built-in algorithms.
We can see here that we have about
17 built-in algorithms
which can be used to solve different machine learning requirements.
Some of these algorithms can be used to deal with
numerical data, some deal with text data,
and others deal with images and even time series
analysis.
So you can already get started with the built-in algorithms so that
you won't have to use custom containers and
algorithms, especially if you're still getting started. And most
of the time, these algorithms are not
just on par with what you would probably build;
they're probably already optimized for
most of the use cases. There's also machine
learning and deep learning framework support. The
great thing here is that if you're already using TensorFlow,
PyTorch, or MXNet in your projects,
then with very minimal adjustments you can
already port that and use it with SageMaker. With SageMaker
Clarify, the sixth one, you can
detect pre-training
and post-training bias. It can also be
used to enable ML explainability.
We'll discuss that later in detail, and you'll
see that it can be used to help you manage
the other production requirements which you may encounter later
on when you have to deploy your model, especially the legal and ethical concerns
surrounding the type of problem that you're trying
to solve. SageMaker Debugger: we'll actually
discuss this in detail in the next set of slides, but SageMaker
Debugger can be used to debug your experiments
in near real time in cloud environments.
Later you'll realize that debugging experiments
locally and debugging experiments in the cloud are
quite different, because when you're working with
different instances and servers during training and
there's an error somewhere, how do you debug that, especially if you're dealing
with a distributed setup?
SageMaker Feature Store, as the name implies,
is used for feature
store requirements. You will have
the offline feature store and the online feature store.
The offline feature store can be used
for data that will be used for training, and
the online feature store can be used to get data
for the prediction part.
SageMaker Autopilot is there to help you with your AutoML
requirements. With very minimal human
intervention, probably just the initial configuration part,
you can just pass in your training data and
run, and then after a few minutes you
will have a trained model. That's pretty cool, because you
can make use of AutoML, and SageMaker has proper
support for it. SageMaker Studio
is there to give us an interface, basically
a studio which already has a lot of features and capabilities
integrated, so that things
are pretty smooth when you're dealing with experiments
and deployments in SageMaker. They're continuously
upgrading this studio to make it easy for
you to run your code, and there's an interface
for it so that it's very practical for you to work on
real-life experiments. SageMaker Ground Truth
is there to help you label and prepare your data. SageMaker Model
Monitor, from the name itself, is there to help you monitor deployed
models. Managed spot training: if
you're aware of what spot instances are, those are used to
further reduce the cost of training.
With managed spot training, you won't have to worry about the
nitty-gritty details of using spot instances,
because all you need to do is update a couple of parameters
and you'll be able to save on costs, especially when you're dealing with
large instances during training (there's a short sketch of this right after this overview).
With SageMaker Pipelines, second to the last, you will be
able to create complex machine
learning workflows with just a couple of lines of code.
And then finally, SageMaker Data Wrangler is
used to help you prepare your data using
an interface. So these are just a few of
the capabilities and features of SageMaker. You might
be overwhelmed right now, but do not worry, because we will choose
about four or five of them and discuss them in
more detail over the next couple of minutes.
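On the managed spot training item mentioned above, a minimal sketch of the "couple of parameters" would look something like this; the timeout values are illustrative assumptions, and image_uri and role come from the earlier training sketch.

```python
from sagemaker.estimator import Estimator

spot_estimator = Estimator(
    image_uri=image_uri,            # same built-in image as before
    role=role,
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    output_path="s3://<your-bucket>/output",
    # The spot-related knobs: use spot capacity, cap the training time,
    # and cap the total time including waiting for spot capacity.
    use_spot_instances=True,
    max_run=3600,
    max_wait=7200,
)
```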
What's important here is that you should have
the mindset that
maybe the problem you want to solve has
already been solved by an existing tool or framework.
And if you were to use SageMaker,
probably one of AWS's customers has already requested
it, and there's already a solution prepared
for it. So before trying to build something on your own,
check if all you need to do is add one to two lines of code
in order to solve your problem. It's not about creating
the coolest solution out there; it's about solving your problem
in the shortest time possible with the smallest
amount of expense. Because if you will get the same
output, or even better, why not use something which is already built
for you? So let's start first with SageMaker Debugger.
So here you will start to see more code, and this will help you
understand how easy it is to use SageMaker in general.
Actually, some parts of the code here are
just the same snippets you saw
in the previous slides.
Here at the bottom, this is the same estimator initialization
code, and what's happening at the top is that
we're just initializing the debugger objects
and properties before passing them to the estimator object.
So there, all it takes is probably three
additional lines of code, and SageMaker Debugger is already
enabled. What's happening here
is that every two steps we will save some
sort of snapshot data,
save that in Amazon S3,
and then we'll be able to debug that
and have more visibility on what's happening inside.
And we can specify here
a rule that watches whether the loss is decreasing.
If the loss stops decreasing, that rule
is triggered, and we'll be able to detect that during
the execution phase of the training step.
So you just specify the configuration with SageMaker Debugger,
initialize the estimator object with the debugger configuration
specified and enabled, and then you just run the experiment
normally. You won't have to worry about going
deep into the actual execution of the container inside
SageMaker; Debugger will do its magic
for you. Pretty cool, right?
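To make that concrete, here is a hedged sketch of what those few extra lines might look like with the SageMaker Python SDK's debugger helpers; the save interval and S3 path are assumptions based on the description above.

```python
from sagemaker.debugger import (
    CollectionConfig,
    DebuggerHookConfig,
    Rule,
    rule_configs,
)

# Save the "losses" collection every 2 steps to S3.
hook_config = DebuggerHookConfig(
    s3_output_path="s3://<your-bucket>/debugger",
    collection_configs=[
        CollectionConfig(name="losses", parameters={"save_interval": "2"})
    ],
)

# Built-in rule that triggers when the loss is not decreasing.
rules = [Rule.sagemaker(rule_configs.loss_not_decreasing())]

# The same estimator initialization as before, plus the debugger bits.
estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    debugger_hook_config=hook_config,
    rules=rules,
)
```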
Now let's look at automatic model tuning with SageMaker.
With model training and tuning, we can see here that all
we need is a bunch of hyperparameter
configuration ranges, and we will have
multiple training instances
running at the same time. The advantage
here is that without much
change in your code, you'll be able to
improve your existing experiments and
run them ten or a hundred times over
without having to worry about the details.
If you were to look at this slide,
you'll see that the estimator initialization step is
just the same, and the same goes for the set_hyperparameters
function call. So if you look
at the lower left section, during the initialization
of the hyperparameter ranges, we specify
the continuous and integer parameter ranges
for min_child_weight, max_depth, and eta,
and then we initialize the hyperparameter tuner object
with those configurations, and then we just call the fit function.
So the cool thing here is that we just added three to four lines
of code, and then we call the fit function, and there you go.
It's going to run for probably 15 to 20 minutes,
and after that, depending on your
configuration, you'll get the best model based
on the objective metric target.
So if the target is the validation area under
the curve, then it will select the model
with the best value for it.
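Those extra lines are roughly the following, sketched with the SageMaker Python SDK's HyperparameterTuner; the ranges, job counts, and metric name are illustrative assumptions, and estimator is assumed here to be an XGBoost built-in algorithm estimator configured like the earlier Estimator sketch.

```python
from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

hyperparameter_ranges = {
    "eta": ContinuousParameter(0.01, 0.5),
    "min_child_weight": ContinuousParameter(1, 10),
    "max_depth": IntegerParameter(2, 10),
}

tuner = HyperparameterTuner(
    estimator=estimator,                     # assumed XGBoost estimator
    objective_metric_name="validation:auc",  # built-in XGBoost metric
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=10,
    max_parallel_jobs=2,
)

# Same fit call as before, now launching multiple training jobs.
tuner.fit(data_channels)
print(tuner.best_training_job())
```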
The next one would be ML explainability.
So of course there's a
way for us to know which features are
important without having to understand the
actual algorithm. There's a difference between interpretability
and explainability, but explainability
allows us to know which
features actually contributed the most to an output.
If you look at the screen here, we have feature one and feature
zero, the first two features, contributing the most
to the actual output. Feature two and
feature three did not really contribute much to the
output, meaning that if we have new data,
there's no point changing the values for feature
two and feature three, because they don't really contribute to the final outcome.
So if there are predictor columns and then there's a target column,
we're pretty sure that feature one and feature zero contribute
the most to the final outcome. So how do we prepare
something like this? We prepare something like
this and get this type of output using SHAP
values. SHAP values help us
understand the output and the model better.
So how do you do that with SageMaker? We do that
by just configuring the
ML explainability job. You
initialize the SageMaker Clarify processor,
you configure the
data config and the SHAP config objects, and
after that you use the run explainability
function and wait for probably three to seven minutes
for it to complete, depending on the size of your data and
the type of instances that you're using. So after three
to seven minutes, you'll get something like this,
and then you'll be surprised: okay, I didn't have
to learn much about SHAP values, but with just a couple of
lines of code, I got what I needed. And you
can use that to further improve your analysis of
your experiments.
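A hedged sketch of that configuration with the SageMaker Python SDK's clarify module is shown below; the dataset location, headers, baseline, and model name are placeholders rather than values from the talk.

```python
from sagemaker import clarify

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://<your-bucket>/dataset.csv",
    s3_output_path="s3://<your-bucket>/clarify-output",
    label="target",
    headers=["feature_0", "feature_1", "feature_2", "feature_3", "target"],
    dataset_type="text/csv",
)

model_config = clarify.ModelConfig(
    model_name="<your-model-name>",   # placeholder model name
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
)

shap_config = clarify.SHAPConfig(
    baseline=[[0, 0, 0, 0]],          # placeholder baseline row
    num_samples=100,
    agg_method="mean_abs",
)

clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,
)
```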
So next, let's talk about deployments.
The advantage of using SageMaker
is that it has great integration with
the other services and features of AWS.
Of course, you may have your own tech stack
for it, but you'll be surprised that SageMaker probably has
some sort of integration, let's say with Kubernetes,
or even with Lambda and so on. Or if
you're dealing with a newer service,
let's say App Runner or something, you'll be surprised
that you can deploy SageMaker models there, and even
on EC2 instances. But let's start first with a couple of examples
and patterns which may already be applicable to you.
The first one would be deploying the model inside a Lambda function,
so you will save a lot of cost there. But of course there are trade-offs,
and you won't be able to use the other SageMaker features with the Lambda
function; still, it's really good for simplified
model deployments. We can also create
a Lambda function that triggers an existing SageMaker endpoint, so that you can
prepare and process your data first inside the Lambda function,
then trigger the SageMaker endpoint, and then process the data
again before returning it back to the user (a short sketch of this
pattern appears at the end of this deployment section). You can combine
Lambda and API Gateway to help abstract the
request and response calls before passing them to the SageMaker
endpoint. The third one in the list is API Gateway
mapping templates, where you won't need a Lambda
function at all to trigger a SageMaker endpoint.
The fourth one involves deploying the model
in Fargate, where you'll be able to use containers.
Here's the cool thing.
If you were to make the most out of SageMaker,
there are a lot of features and capabilities there
which just require probably three to four lines of code,
and you'll be able to get something like this. The first one would be
SageMaker multi-model endpoints. Of course, it would be weird
to have a setup where you have one
endpoint for each model. You'll realize
that you can actually optimize this and have, let's say, three models
deployed behind a single endpoint. Not only will it
help you reduce cost, it also enables you to perform
other cool things, let's say A/B testing, where you're
deploying two models at the same time and
trying to check which model is performing better. And you can
also deploy a model inside a Lambda function with Lambda's
container image support. So there are a lot of variations here.
Being aware of these variations is the first step,
and having the developer skills to customize the solution
is the second step, especially once you need to customize
things a bit based on your use case.
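For the second pattern above (a Lambda function in front of a SageMaker endpoint), a minimal sketch of the handler might look like this, using boto3's SageMaker runtime client; the endpoint name, payload format, and pre/post-processing are assumptions.

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")
ENDPOINT_NAME = "my-sagemaker-endpoint"  # hypothetical endpoint name


def handler(event, context):
    # Pre-process the incoming request (assumed numeric features).
    features = event["features"]         # e.g. [1.0, 2.0, 3.0]
    payload = ",".join(str(x) for x in features)

    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",
        Body=payload,
    )

    # Post-process the model output before returning it to the caller.
    prediction = response["Body"].read().decode("utf-8")
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```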
Now, let's talk about workflows. Automated workflows
are very important because you
don't want to run your experiments manually every single
time. Of course, at the start you
will be running these steps manually, because
you'll be experimenting to see if it will work or not. But once you need to,
let's say, retrain your model, it would be really tedious
to do that every month or every two weeks,
running the experiments again and again. What if there's some sort of
automated script or automated pipeline which
helps you perform these steps without you
having to do it manually? So for example,
after one month there's new data uploaded in an S3
bucket or storage, and you want your
automated workflow to run. And if the new model
is, let's say, better than your existing model,
then you replace it. And yeah,
you can do this automatically with the different options available in
SageMaker. So this is the first one. This is a very
simplified example; of course, we won't discuss the more complex examples
here, but these are the building blocks to help you prepare
those more complex examples. So here,
this is used to help you prepare a linear workflow
where you have the training step,
the build step, and then the deploy step. With just a couple of lines of
code, using the SageMaker SDK
and the Step Functions Data Science SDK,
we'll be able to make use of two services, the first one
being SageMaker and the second one being AWS Step Functions.
And with very
minimal changes to your existing SageMaker code,
you'll be able to create a pipeline like this
one. You can make use of the
features of Step Functions to help you debug
and keep track of the different steps being executed
during the execution phase.
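A hedged sketch of that train-then-deploy chain with the Step Functions Data Science SDK (the stepfunctions package) could look roughly like this; the workflow name, execution role, and resource names are placeholders, and estimator and data_channels come from the earlier sketches.

```python
from stepfunctions.steps import (
    Chain,
    EndpointConfigStep,
    EndpointStep,
    ModelStep,
    TrainingStep,
)
from stepfunctions.workflow import Workflow

training_step = TrainingStep(
    "Train Step",
    estimator=estimator,
    data=data_channels,
    job_name="demo-training-job",
)

model_step = ModelStep(
    "Create Model",
    model=training_step.get_expected_model(),
    model_name="demo-model",
)

endpoint_config_step = EndpointConfigStep(
    "Endpoint Config",
    endpoint_config_name="demo-endpoint-config",
    model_name="demo-model",
    initial_instance_count=1,
    instance_type="ml.m5.large",
)

endpoint_step = EndpointStep(
    "Deploy",
    endpoint_name="demo-endpoint",
    endpoint_config_name="demo-endpoint-config",
)

workflow = Workflow(
    name="demo-linear-workflow",
    definition=Chain(
        [training_step, model_step, endpoint_config_step, endpoint_step]
    ),
    role="arn:aws:iam::123456789012:role/StepFunctionsWorkflowExecutionRole",
)

workflow.create()
workflow.execute()
```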
The second option would be to use SageMaker
Pipelines. With SageMaker Pipelines,
you can do the same set of things as with
the SageMaker SDK and Data Science SDK combo,
but here you can make use of the dedicated
SageMaker Pipelines to help you prepare
your model. This one came in much later,
after more people had requested it. And you
can see here that, wow, you can have an
interface, a UI chart
or graph like this, and you will know what's happening
in each step. And let's say that you want to know
the details after each step has executed,
let's say the metrics during the train step: you
can just click on the train step box
and you'll see the metrics and the other details
there. So this is the source code for
it. You'll see here that with just a couple of lines of code
added to your existing SageMaker
SDK code, you'll be able to create the different steps.
Most likely two to three lines
of code for each block. So let's say that you have
the processing step and then you have the train step: then
you'll probably need about four additional
lines of code, because in addition to the original code
that you have, where you have configured the different steps, let's say
the estimator initialization and the SKLearnProcessor
initialization, you will make use of the
SageMaker Pipelines counterpart objects,
and you will link and chain
those, as you can see here, to
prepare the pipeline by combining all the other steps.
And in order to run this, all you need to do is start the
execution. So there, that's pretty cool.
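A hedged sketch of wiring a processing step and a training step into a SageMaker Pipelines definition is shown below; the pipeline name, script, and S3 locations are placeholders, sklearn_processor and estimator come from the earlier sketches, and starting the execution uses the SDK's pipeline.start() method.

```python
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

processing_step = ProcessingStep(
    name="Process",
    processor=sklearn_processor,   # from the earlier processing sketch
    code="preprocessing.py",       # hypothetical script
    inputs=[
        ProcessingInput(
            source="s3://<your-bucket>/raw/",
            destination="/opt/ml/processing/input",
        )
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train")
    ],
)

# Feed the processed output into the training step.
train_data = TrainingInput(
    s3_data=processing_step.properties.ProcessingOutputConfig
    .Outputs["train"].S3Output.S3Uri,
    content_type="text/csv",
)

training_step = TrainingStep(
    name="Train",
    estimator=estimator,           # from the earlier training sketch
    inputs={"train": train_data},
)

pipeline = Pipeline(name="demo-pipeline", steps=[processing_step, training_step])

pipeline.upsert(role_arn=role)     # create or update the pipeline definition
execution = pipeline.start()       # start one execution of the pipeline
```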
And you'll see that the more you
use a certain platform like
SageMaker, the more you'll realize that, hey, there are patterns.
If I need something like this, I won't have to worry about
changing the other parts of the code, because it's probably just
a configuration change away. So again,
I'm going to share this slide so that you can have
a quick look at the different features and capabilities of SageMaker.
But what I will tell you is that SageMaker continues to
evolve; even today,
probably every month there's a new release, a
new capability, or an upgrade to existing features,
and it's better for us to stay tuned by,
let's say, checking the AWS blog. So again,
it's not just powerful, it's also
evolving. And the great thing here is that the more
features and capabilities you're aware of,
the more you can make use of SageMaker and
further reduce cost, because the value of
a good professional lies
in his or her ability
to optimize and solve things using
knowledge and expertise with specific tools.
So again, thank you very much, and I hope you learned a lot
of things during my talk. Feel free to check out my book,
Machine Learning with Amazon SageMaker Cookbook, because that will help
you understand SageMaker better with the 80
recipes there, which are super simplified
to help you understand things even if it's your first time using SageMaker.
So there. Thank you again and have a great day ahead.