Transcript
Good day, everyone. The topic for today would be machine learning engineering with Python. The title of my talk is Machine Learning Engineering Done Right: Designing and Building Complex Intelligent Systems and Workflows with Python. So first I will introduce myself. I am Joshua Arvin Lat, and I am the chief technology officer of NuWorks Interactive Labs. I am also an AWS Machine Learning Hero, and I'm also one of the subject matter experts who has helped contribute to the AWS Certified Machine Learning - Specialty exam. So if you were to take that exam, most likely one of the questions there was probably from me. I'm also the author of a machine learning and machine learning engineering book on AWS called the Amazon SageMaker Cookbook. Amazon SageMaker is a machine learning service and platform from AWS where you can perform experiments and deployments. You can use your favorite machine learning and deep learning framework and still use it with SageMaker, and you're able to make the most out of SageMaker by using a lot of its capabilities to help make your machine learning experiments and deployments successful.
Today, we will talk about ten things. The first one would be understanding the needs of the business and the customers when dealing with machine learning and machine learning engineering requirements. The second one would be about knowing when to write production-level Python code. The third one would be enforcing practical Python coding guidelines for your team. The fourth one would be using Python design patterns and metaprogramming techniques. The fifth one would be utilizing continuous integration and deployment pipelines. The sixth one would be on making the most out of ML frameworks and ML platforms. The seventh one would be working with automated ML bias detection and ML explainability capabilities. The eighth one would be on reaping the benefits of cloud computing for automated hyperparameter optimization jobs. So of course we'll explain what HPO is when we talk about that slide. And then number nine would be optimizing cost by using transient ML instances for training models. So later we'll talk about a quick example on how to fine-tune BERT models when using SageMaker. And number ten would be securing machine learning environments. So without further ado, let's have a quick game. You can see a bunch of apples.
So in this game, if you guess it correctly, I may give you a prize. So basically, how would this game work?
So within the next ten to 15 seconds, the goal
here is for us to count the number of apples in this slide.
So again, within the next ten to 15 seconds, I want you guys to
count the number of apples in this slide. So, the timer starts now. I'll have a quick countdown: ten, nine, eight, seven, six, five, four, three, two, and one. All right, so time's up.
So again, the goal is to count the number of apples. So if you have
answered, let's say, 18. Drumroll, please. That's incorrect.
Unfortunately, that's not the correct answer. So how about 20? 20 apples? Unfortunately,
that's also incorrect. So what's the correct answer here? The correct answer here
is that it's not possible to count the number of apples in this slide.
So that's sad news for all of us. So the question is, why?
So the first thing here, if you look at the screen, is that we
cannot see the apples underneath this first layer of apples.
And the same goes for our day-to-day jobs. Sometimes when we are dealing with technical requirements, when we're using these awesome tools and frameworks to do our jobs, the problem is that we become too focused on what we're doing, and we tend to forget what the business and the customers need. So the technique here is to listen first and understand what the context is, because we may be able to provide the best solution without any coding work at all. And there may be times when we can just use a specific AI or machine learning service where, with ten lines of Python code, you would be able to solve the customers' problems. So being able to listen to the needs of the customers and
being able to listen to the needs of the business, that's the
number one priority that you have to think about as a
professional. You do not have to be a manager or a
boss to know about these things, because if you're working on something,
you need to make sure that your customers are winning and the business
is winning as well. All right, so the second topic would be on knowing when to write production-level Python code.
So, of course, for those of us who have been working with Python for the past couple of years, you are probably aware that there are different ways to use Python. Let's say that you are a data scientist and you want to explore the data and show a couple of charts presenting the properties and relationships of the data points in your dataset; that would fall under a machine learning experiment, and you may use tools like Jupyter Notebook to demonstrate and show the output of your Python code. There you may not need engineering techniques to work on your Python code, meaning even if you're not following a certain set of rules, that's okay, because it's just for demonstration purposes. But when you need to work on systems, then you definitely have to follow the engineering techniques and guidelines to get that to work. So, for example, if you were to build a machine learning prediction endpoint using Flask and Python, then you would need to follow, let's say, PEP 8 or other coding guidelines, as well as apply the engineering techniques to make sure that your website or your endpoint is always up and running and that it's going to return a response in less than 1 second, for example. So making sure that your Python code is clean is essential when you're working on engineering tasks.
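To make this concrete, here is a minimal sketch of such a Flask prediction endpoint, assuming a scikit-learn-style model; the model.pkl path and the payload shape are placeholders for illustration:

    import logging
    import pickle

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    logger = logging.getLogger(__name__)

    # Placeholder artifact: a scikit-learn-style model serialized with pickle.
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json()
        # Placeholder payload shape: {"features": [1.0, 2.0, 3.0]}
        prediction = model.predict([payload["features"]]).tolist()
        return jsonify({"prediction": prediction})

From there, PEP 8 compliance, logging, and response-time monitoring are what turn a sketch like this into production-level code.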
The third one would be on enforcing practical Python coding guidelines
for your team. So now let's talk about how to manage teams. This is very important for ML engineering managers or maybe data science leaders, right? So let's say that you have a data science team and you have a team focused on building machine learning engineering platforms or endpoints; then this is for you. What has worked for me in the past: when we were building a machine learning endpoint for a product, I realized that it's going to be a bit tricky when dealing with multiple developers and engineers. Right? So the goal there, before you can actually perform code reviews, is to set standards for the company. If you're a CTO, then this is going to be one of your roles, because having rules allows people and professionals in your team to have some sort of structured way to accomplish their work, right? So if you have rules, if you have standards, then that's going to help your people perform their jobs better. So one of the rules that definitely
has helped me in the past would be the 20-line rule. Here we have something like a maximum number of lines inside a Python function or method. So let's say that you have a function called load_model. If the number of lines in that function exceeds 20 lines, let's say it's 25 lines, then you have to divide that function into, let's say, three or four sub-functions. This allows your code to be cleaner and more organized.
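As a quick sketch of how that might look, with a hypothetical load_model split into small sub-functions (the file format and validation step are assumptions):

    import pickle

    def _read_model_file(path):
        with open(path, "rb") as f:
            return f.read()

    def _deserialize_model(raw_bytes):
        return pickle.loads(raw_bytes)

    def _validate_model(model):
        # Hypothetical sanity check before the model is used anywhere.
        if not hasattr(model, "predict"):
            raise ValueError("Loaded object has no predict() method")
        return model

    def load_model(path):
        # Each step is a small, named sub-function well under 20 lines.
        raw_bytes = _read_model_file(path)
        model = _deserialize_model(raw_bytes)
        return _validate_model(model)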
The second one would be following the PEP 8 guidelines, or a similar set of guidelines, when using Python.
So having something like that would definitely be helpful for your team, regardless of whether you're trying to build a machine learning platform or not. So if you're using Python, take a look at PEP 8.
The third one would be avoiding careless try-catch blocks (try/except in Python). So why? The goal here is to be able to detect errors as early as possible. The problem when using try-catch blocks, if you're not careful enough, is that if you just wrap a transaction with a try-catch block, the error sometimes disappears, and you may not have the ability to debug the problems when you're dealing with production endpoints and environments. So here, let's say that you have 10,000 transactions, and then for some reason in your logs, you only have 9,950 records. So what happened to the other 50 records? What went wrong? The goal is to have some way of knowing what happened to those 50 transactions. If you don't have logs and you simply used a try-catch block to prevent that endpoint from failing, then you would have no way to debug what went wrong, and those records and transactions may be lost.
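A minimal sketch of the difference, using a hypothetical process_transaction helper: the risky version swallows the error, while the safer version logs the failing record before re-raising.

    import logging

    logger = logging.getLogger(__name__)

    def process_transaction(transaction):
        # Hypothetical business logic that may raise on bad input.
        if "id" not in transaction:
            raise ValueError("transaction is missing an id")

    def handle_risky(transaction):
        try:
            process_transaction(transaction)
        except Exception:
            pass  # the failing record silently vanishes from your logs

    def handle_safer(transaction):
        try:
            process_transaction(transaction)
        except Exception:
            # Keep enough context to debug, then let the error propagate.
            logger.exception("Failed to process transaction: %r", transaction)
            raise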
The fourth one would be writing testable Python code. So when you're building systems, it's very important to know that it is an iterative process. When you're writing code, you're not just writing one big block; you want to write functions and methods and classes that allow you to easily debug the code, let's say from a console. It's not just about having a web application ready there; it's also about having a console to easily check how a function behaves. So even if you're not practicing automated testing inside your company, at least make your code testable. So try to take a look at that.
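As a small illustration of what testable means here, a sketch with a hypothetical pure function that is just as easy to exercise from a console as from an automated test:

    def clean_text(text):
        # Pure function: no globals, no I/O, so it is trivial to test.
        return " ".join(text.lower().split())

    def test_clean_text():
        assert clean_text("  Hello   WORLD ") == "hello world"

    if __name__ == "__main__":
        # Quick console check, even without a test framework.
        test_clean_text()
        print(clean_text("  Machine   Learning  Engineering "))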
The next one would be using Python design patterns and metaprogramming techniques. I won't discuss all the different Python design patterns and metaprogramming techniques here, but I would mention some of the recommended goals and techniques that you can use in your company. One example of this would be to write your own convenience library that wraps and abstracts certain operations. This is especially useful when you're working with a larger team and when you're using a lot of tools and SDKs to perform your job. So let's say that you have a senior engineer and a junior engineer. You can have your senior engineer work on this, so that the senior engineer can prepare a convenience library that works something like an ORM, let's say SQLAlchemy, where some Python classes and objects would help you perform your job better. And the junior developers or the mid-level developers would not need to care about the internal details or the abstracted automation parts when working with your convenience library. So you can make use of design patterns and metaprogramming techniques to speed up the work and also abstract the unnecessary details away from your other developers and engineers. Of course, do this only when it makes sense. If you're going to spend three weeks to work on this and your project is going to last for four weeks, then that may not be the best use of your time. But if you have a super amazing engineer in your team who can work on something like this for two days, and then you can make the most out of those two days' worth of work for the next three weeks, then that's a good use of time.
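For instance, a minimal sketch of such a convenience wrapper; the DatasetStore name and the choice of wrapping boto3's S3 client are assumptions for illustration:

    import boto3

    class DatasetStore:
        """Thin wrapper that hides the raw SDK calls from the rest of the team."""

        def __init__(self, bucket):
            self._bucket = bucket
            self._s3 = boto3.client("s3")

        def upload(self, local_path, key):
            # Clients, naming conventions, and retries stay hidden in here.
            self._s3.upload_file(local_path, self._bucket, key)

        def download(self, key, local_path):
            self._s3.download_file(self._bucket, key, local_path)

Junior developers then just call DatasetStore("my-bucket").upload("train.csv", "v1/train.csv") without touching the underlying SDK.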
The next one would be on utilizing continuous integration and deployment pipelines. Of course, at the start, you will be working on these things manually, in the sense of: okay, I need to copy my model, put it inside a container or something, and then deploy it inside AWS Lambda. AWS Lambda is a service where you can write Python code and deploy it as a function as a service; the advantage there is that you only pay for what you use with AWS Lambda. So enough about AWS Lambda; let's talk about this topic. When you're building something, it usually takes three or four steps to come up with a deployment package. Of course, you want that deployment package to be final, tested, and working. And when you're performing multiple deployments, let's say per week, and there are a lot of users already using your system, we should find a way to make sure that the deployment package is 100% stable. Or, if we detect that there's something wrong with that deployment package, then we should be able to roll back and revert to a previous deployment package. So knowing about continuous integration and deployment pipelines, and the other alternatives similar to them, would help your team work on these types of requirements better. This is going to be super helpful, especially when your team is growing and when you want to enforce standards. What happens here is that when one of your Python engineers is working on some code, that person pushes the code to a repo, and then the integration pipeline activates, performing some tests; and then maybe at some point there's going to be a manual approval step where the engineering manager can just click on yes, after reviewing the results, and then perform the deployment.
All right, so we're halfway through. We're almost done.
Number six would be making the most out of ML frameworks and ML platforms. There are actually three options here, not just two. The third option is using, let's say, existing AI and ML services where, with five to ten lines of code, you may come up with text-to-speech, or maybe extract text from images. But for now, for the sake of simplicity, let's talk about two things. The first one would be building everything from scratch, and the second one would be using frameworks and platforms. At the start, as developers and engineers, we always have that tendency to build everything from scratch. So when you are about to learn an existing framework, there's always a tendency to say, oh, that's going to take me one to two weeks to learn that framework, let's say TensorFlow, PyTorch, or MXNet. And that's probably true. Sometimes the examples on the Internet may not work right away, or sometimes you just have the tendency to enjoy coding. When you're trying to learn programming and machine learning and machine learning engineering, you can try learning these things yourself. But when you have to work with a team, and when you have to work in a company where the real things happen, let's say people resigning or people being replaced, and you have to work on existing platforms and engineering systems, then you have to know that it's more practical in the long run to use machine learning frameworks and ML platforms. Of course, it may not always be the case, but being able to do both is the first step.
And then the second step is knowing when to use what. Because if you're going to build everything from scratch, then, of course, you have to keep in mind that some of the requirements and potential hidden features may not be supported in your custom code, and it might take you longer to build it. So being familiar with one, two, three, or more ML frameworks and tools and platforms would definitely not just help you, but help your company accomplish its goals in a much faster way. If you were to use an ML platform, let's say SageMaker, then you can also make use of its existing capabilities and features. For one thing, when you're running machine learning workflows and workloads in the cloud, you will realize that some of those experiments will require bigger machines, and sometimes not just one, but two or three or more. If you were to build this yourself, it might take you two to three months to build something that's super flexible and that can easily evolve to more complex use cases. But if you were to use an ML platform, learning it would take, let's say, two to three days, and then using it would take an additional day. So that would be three days all in all, instead of trying to build everything from scratch for two to three months, only to realize, oh, there's no debugger, there's no model monitor, and all the other high-tech features that the platform or the framework would have already provided.
So here, let's say that we want to modify the number of computers, servers, or instances that we will use for training the model and performing hyperparameter tuning jobs. If you look at the screen, you can see that you just set the parameter to six, and then you have six ML instances there. And if you want just one instance for model deployment, for that inference endpoint, then you can specify that as well with just a few lines of code. The advantage there is that the infrastructure is abstracted, and you can just use Python and the objects and classes in the SDK provided to access and manage the resources.
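A minimal sketch of what that looks like with the SageMaker Python SDK; the container image, IAM role, and S3 paths are placeholders:

    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri="<training-image-uri>",  # placeholder container image
        role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
        instance_count=6,                  # six ML instances for training
        instance_type="ml.m5.xlarge",
        output_path="s3://my-bucket/output",  # placeholder bucket
    )
    estimator.fit({"train": "s3://my-bucket/train"})  # placeholder dataset

    # One instance is enough for the inference endpoint.
    predictor = estimator.deploy(initial_instance_count=1,
                                 instance_type="ml.m5.xlarge")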
Here, in addition, when using these ML platforms and frameworks, you will also find a lot of documentation online when trying to get things to work in different types of environments. If you were to build things using your own custom code, one disadvantage is that the errors are also custom, so when you try to look for the solution on, let's say, Stack Overflow, you may not find it right away unless you are very experienced. Here, you can also make the most out of, let's say, AWS Lambda and serverless computing.
So if you are just calling an endpoint four times per day, then why have an instance for it? With this approach, you can technically get it almost for free, because if you're just going to use AWS Lambda for 4 seconds a day, and it's under the free tier, then it's much, much cheaper than having an ML instance running there with your endpoint, right? With your deployed model. So you can use AWS Lambda with, let's say, scikit-learn, you can use it with TensorFlow, and you can maybe deploy a Facebook Prophet model inside a Lambda function. And you can combine that with, let's say, API Gateway, which is a service that allows users to deploy an endpoint and have that endpoint trigger the AWS Lambda function that you have prepared.
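As an illustration of that pattern, a minimal sketch of a Lambda handler sitting behind API Gateway; how the model gets into the package and the payload shape are assumptions:

    import json
    import pickle

    # Placeholder: the model ships inside the deployment package or a layer.
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    def lambda_handler(event, context):
        # API Gateway proxy integrations pass the request body as a JSON string.
        body = json.loads(event["body"])
        prediction = model.predict([body["features"]]).tolist()
        return {
            "statusCode": 200,
            "body": json.dumps({"prediction": prediction}),
        }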
There are also a lot of deployment solutions out there. Of course, it's super important to know how to build this from scratch, but there are ways to speed up solving these types of problems in just a couple of hours. So if you were to build something in four weeks, maybe you can do the same thing in two to three weeks, especially if your team is already using that platform or those tools.
The first one would be deploying a model in an EC2 instance. That's one of the most customizable options out there. So if you want to build everything from scratch, then yes, you can deploy it inside an EC2 instance, or alternatively using a different platform. The second one would be deploying the model in a container in an EC2 instance. The third one would be using a built-in algorithm for training and then deploying that in a SageMaker endpoint with, let's say, ten lines of code. This is very helpful when you're trying to show proof-of-concept work to your boss before your machine learning project gets approved. The fourth one would be using custom containers, building your own Docker container images and then deploying them in a SageMaker endpoint with just a couple of lines of code. The advantage there is that you can make the most out of an existing platform's features, let's say Model Monitor, to help you detect model drift. So what is model drift? Most machine learning practitioners are only aware of how to train, build, and deploy a model, but in reality, a model deployed in production may degrade over a couple of weeks or a couple of months. So being able to detect model drift and being able to replace that model is essential. And knowing that models really degrade over time is an essential insight for veteran machine learning practitioners.
A model can also be deployed inside a Lambda function, as shared earlier, and then we can also use Lambda to trigger a SageMaker endpoint. So if you have deployed a model in a SageMaker endpoint, then you can use AWS Lambda to perform some custom things first before triggering the SageMaker endpoint, giving you that flexibility, especially if you need to preprocess your data first before performing the prediction.
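A minimal sketch of that Lambda-in-front-of-SageMaker pattern with boto3; the endpoint name, preprocessing step, and payload format are placeholders:

    import json
    import boto3

    runtime = boto3.client("sagemaker-runtime")

    def lambda_handler(event, context):
        body = json.loads(event["body"])
        # Hypothetical preprocessing before the prediction call.
        features = [float(x) for x in body["features"]]
        response = runtime.invoke_endpoint(
            EndpointName="my-sagemaker-endpoint",  # placeholder endpoint
            ContentType="application/json",
            # The exact body format depends on your model container.
            Body=json.dumps({"instances": [features]}),
        )
        result = json.loads(response["Body"].read())
        return {"statusCode": 200, "body": json.dumps(result)}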
Also, you can use API Gateway mapping templates with SageMaker so that there's no Lambda function between those two services. Here you just use VTL to map the input data to the SageMaker endpoint directly. And then you can also deploy the model inside Fargate.
Fargate is a service which can help you work with containers and container images in AWS. So feel free to use these concepts when you're using other platforms as well, because most likely they have similar services out there, especially if your team is already using those platforms. When you're using, let's say, SageMaker, I would like to add that there are also combinations where, for more complex use cases, you can technically deploy multiple models inside a single machine learning instance. So there are different use cases there. And you can also deploy SageMaker multi-container endpoints, where that endpoint can have multiple containers with different models. So let's say that you're using a custom model built with a specific deep learning framework; then you can have that model inside its own container. And let's say you have four containers there; then you can deploy those inside a single endpoint, and you can just select which container to use when you're performing inference. That's very useful when you're trying to compare different models in a production environment.
The next one would be setting up A/B testing using production variants in SageMaker.
So let's say you have a model deployed in an endpoint. You can do, let's say, an 80-20 split, where 80% of the traffic is being handled by one model and a new model is going to handle 20% of the traffic. And then you're going to compare the performance of those two models before trying to replace the first model with the second one. So of course, if the second model is performing better than the first one, then that's the time to replace the first model.
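A minimal sketch of that 80-20 split using boto3 production variants; the model and endpoint names are placeholders, and both models are assumed to exist already:

    import boto3

    sm = boto3.client("sagemaker")

    sm.create_endpoint_config(
        EndpointConfigName="ab-test-config",  # placeholder name
        ProductionVariants=[
            {
                "VariantName": "model-a",
                "ModelName": "existing-model",  # placeholder model
                "InstanceType": "ml.m5.large",
                "InitialInstanceCount": 1,
                "InitialVariantWeight": 0.8,  # 80% of the traffic
            },
            {
                "VariantName": "model-b",
                "ModelName": "new-model",  # placeholder model
                "InstanceType": "ml.m5.large",
                "InitialInstanceCount": 1,
                "InitialVariantWeight": 0.2,  # 20% of the traffic
            },
        ],
    )
    sm.create_endpoint(EndpointName="ab-test-endpoint",
                       EndpointConfigName="ab-test-config")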
And then here you can also deploy the model inside a Lambda function with containers. Recently, I think about five months ago, AWS released this feature where, in addition to writing Python code inside a Lambda function, you can use your own custom container images to load your model. This is very helpful when you're using deep learning frameworks and trying to get them to work with AWS Lambda. Here you can make the most out of both worlds: you can use your customization capabilities, especially your DevOps skills, to prepare that custom container image loading the model, and then it's going to work with AWS Lambda, where you just pay for what you use. So if you're just using it for 3 seconds per day, then you're only going to pay for 3 seconds per day, which is super cool.
You can also use, let's say, the data science libraries with SageMaker. If you were to use those, you can easily build machine learning workflows using AWS Step Functions. So here you can see that you can automate the entire process. And if you want to perform model retraining, you can do something like this. Let's say that you have uploaded your files or your new data in a bucket or in a storage service, let's say S3; then this automation workflow would help you automatically trigger the training step, then evaluation, and then deployment. And if, let's say, that new model is performing better than your previous model, then you can automatically replace your existing model in production. So, pretty cool, because you're not just stuck with manual steps; you can just leave this running, especially if you want to work on other machine learning projects.
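A rough sketch of such a workflow with the AWS Step Functions Data Science SDK (the stepfunctions package); the estimator is assumed to be a SageMaker estimator defined elsewhere, the role and paths are placeholders, and the exact arguments are worth double-checking against the SDK docs:

    from stepfunctions.steps import Chain, ModelStep, TrainingStep
    from stepfunctions.workflow import Workflow

    training_step = TrainingStep(
        "Train Model",
        estimator=estimator,  # a SageMaker estimator defined elsewhere
        data={"train": "s3://my-bucket/train"},  # placeholder S3 path
        job_name="retraining-job",
    )
    model_step = ModelStep("Save Model",
                           model=training_step.get_expected_model())

    workflow = Workflow(
        name="ml-retraining-workflow",
        definition=Chain([training_step, model_step]),
        role="arn:aws:iam::123456789012:role/StepFunctionsRole",  # placeholder
    )
    workflow.create()
    workflow.execute()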
You can also make the most out of, let's say, SageMaker Clarify to automate bias detection. So here we can see that you have your data and maybe your model, and then you pass those as parameters to your SageMaker Clarify jobs. So, of course, what's bias? ML bias is something that you would probably be aware of if you've been working in the industry for quite some time. When you're deploying a model and using it in production, it's not just about performing the right prediction; it's also about making sure that you're following the guidelines and that your model is not biased towards certain groups. We will not talk about this in detail here, but it's important to note that there are a lot of metrics that you can check when working with bias, let's say class imbalance, or maybe DPPL, or maybe treatment equality. And here you can fix your data after reviewing these metrics and detecting that your data has issues.
So here's some sample Python code when using SageMaker Clarify. Here you just specify, let's say, the instance count and the instance type, and then you pass in your data and maybe a few configuration parameters. So instead of trying to learn how to detect bias and implementing all these formulas yourself, why not just use a tool which can provide you the metrics right away, as you can see here. So if you were to detect, let's say, class imbalance with, let's say, ten to 15 lines of code and ten minutes of waiting, then you would have something like this, where you can detect if there's class imbalance in your dataset.
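A rough sketch of what that Clarify code can look like with the SageMaker Python SDK; the dataset path, column names, and the gender facet are placeholders to adapt:

    import sagemaker
    from sagemaker import clarify

    session = sagemaker.Session()
    processor = clarify.SageMakerClarifyProcessor(
        role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
        instance_count=1,
        instance_type="ml.m5.xlarge",
        sagemaker_session=session,
    )

    data_config = clarify.DataConfig(
        s3_data_input_path="s3://my-bucket/train.csv",  # placeholder dataset
        s3_output_path="s3://my-bucket/bias-report",
        label="approved",  # placeholder label column
        headers=["age", "income", "gender", "approved"],  # placeholder columns
        dataset_type="text/csv",
    )
    bias_config = clarify.BiasConfig(
        label_values_or_threshold=[1],
        facet_name="gender",  # placeholder facet to check for bias
    )

    # Runs pre-training bias metrics such as class imbalance (CI).
    processor.run_pre_training_bias(data_config=data_config,
                                    bias_config=bias_config)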
It works the same way with ML explainability. So what is ML explainability? If you are working with more complex algorithms and models, of course, the output may not necessarily be easily explainable. And the better we are able to explain a model, the more we are able to get an organization to use that model or algorithm. So in ML explainability, let's say for this one, you can use SHAP values to explain your model. So here, how do we interpret this? We can say that out of the four features that we have in our training dataset, only two features actually contribute to the final outcome of the prediction of the model. So let's say that there's A, B, C, and D; only A and B actually contribute to the final outcome when performing the prediction. This is an example of what we will get if we were to use SageMaker Clarify to compute the SHAP values after you have passed your data.
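Continuing the hedged sketch above, the explainability side might look like this; the baseline record and model name are placeholders:

    model_config = clarify.ModelConfig(
        model_name="my-trained-model",  # placeholder SageMaker model name
        instance_type="ml.m5.xlarge",
        instance_count=1,
        accept_type="text/csv",
    )
    shap_config = clarify.SHAPConfig(
        baseline=[[35, 50000, 0]],  # placeholder baseline record
        num_samples=100,
        agg_method="mean_abs",
    )

    # Produces per-feature SHAP values showing which features (A, B, C, D)
    # actually drive the model's predictions.
    processor.run_explainability(data_config=data_config,
                                 model_config=model_config,
                                 explainability_config=shap_config)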
So the next one is a really exciting topic. It's called automated hyperparameter optimization. So what is automated hyperparameter optimization? The first step is understanding what hyperparameters are. Hyperparameters are configuration parameters that you can set before the training job. So: one training experiment, one set of hyperparameters. Of course, when you're trying to create models, it's critical that we all know that after one experiment, we are not really sure if that model is the best model for that problem. So the technique there is to change the hyperparameter values, perform the experiments again, and compare the evaluation metrics with the evaluation metrics of a previous model. Of course, this would be very time-consuming. So how do we solve this in a more practical manner? With cloud computing, you can easily spin up a lot of resources, let's say for three minutes each, and then perform one training experiment on each ML instance. So after 15 minutes, you would be able to come up with, let's say, 15 different experiments and 15 different models, and end up with a fine-tuned model that has the best metric values compared to the other models produced by the tuning job.
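A minimal sketch of such a tuning job with the SageMaker Python SDK, reusing an estimator defined elsewhere; the objective metric, regex, and ranges are placeholders:

    from sagemaker.tuner import (ContinuousParameter, HyperparameterTuner,
                                 IntegerParameter)

    tuner = HyperparameterTuner(
        estimator=estimator,  # a SageMaker estimator defined elsewhere
        objective_metric_name="validation:auc",  # placeholder metric
        metric_definitions=[{"Name": "validation:auc",
                             "Regex": "auc: ([0-9\\.]+)"}],  # placeholder regex
        hyperparameter_ranges={
            "eta": ContinuousParameter(0.01, 0.3),
            "max_depth": IntegerParameter(3, 10),
        },
        max_jobs=15,          # 15 experiments in total
        max_parallel_jobs=5,  # a few of them running at the same time
    )
    tuner.fit({"train": "s3://my-bucket/train"})  # placeholder dataset

    # Deploy the best model found by the tuning job.
    predictor = tuner.deploy(initial_instance_count=1,
                             instance_type="ml.m5.xlarge")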
In a similar fashion, you can perform automated hyperparameter tuning across different model families. So let's say that here on the screen, you have a custom algorithm using the Apache MXNet deep learning framework, and the second model family would be using the Linear Learner built-in algorithm. Then you can perform a single hyperparameter tuning job where the first model family would use a certain set of hyperparameter ranges, and the second family would use a different set of configuration parameters. Those training jobs would run, and then the best model would be used in the final model deployment step.
Next, on optimizing cost by using transient ML instances for training models: we can make the most out of transient ML instances, where the ML instances would run for, let's say, ten minutes, and then turn off automatically. This is very helpful when you're trying to train or fine-tune existing models where you would need a lot of resources. An example of this would be using BERT models. Let's say you have Hugging Face and then you have BERT, and you would need, let's say, p2.xlarge instances, which are super expensive, right? But if you were to run that in just two to three minutes, then that's better compared to running that same large instance for 3 hours. So having transient ML instances run your training jobs is super important when managing cost.
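A minimal sketch of how transient capacity looks in the SageMaker Python SDK, using managed spot training with a Hugging Face estimator; the script, role, and framework version combination are placeholders to check against the supported list:

    from sagemaker.huggingface import HuggingFace

    estimator = HuggingFace(
        entry_point="train.py",  # placeholder fine-tuning script for BERT
        role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
        instance_type="ml.p2.xlarge",  # the GPU family mentioned in the talk
        instance_count=1,
        transformers_version="4.6.1",  # placeholder version combination
        pytorch_version="1.7.1",
        py_version="py36",
        use_spot_instances=True,  # run on cheaper, interruptible capacity
        max_run=3600,             # cap on actual training time, in seconds
        max_wait=7200,            # how long to wait for capacity, in seconds
    )
    estimator.fit({"train": "s3://my-bucket/train"})  # placeholder dataset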
Finally, when securing machine learning environments, it's critical that you take care of both the process and the tech side of things. Knowing about the principle of least privilege is important because, of course, when you're preparing your environments, you have to prepare and manage the security configuration first and make sure that from the beginning this is properly set up, so that you can leave your engineers working without having to worry about security every day. So set the rules, set the guidelines, set the restrictions, so that they can only perform what they should be doing. And this does not apply only to humans.
This can also be applied when dealing with resources in the cloud. So here is an example of a potential risk when using a library. This library allows you to load and save models. But if you were to use a model from an untrusted source, and that model were to run, let's say, arbitrary malicious code when you load it, then technically your system has been compromised.
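To make the risk concrete, a short sketch of why loading a serialized model from an untrusted source is dangerous; pickle is just one example of such a load/save format:

    import pickle

    class MaliciousPayload:
        # pickle calls __reduce__ during deserialization, so an attacker can
        # make "loading the model" execute an arbitrary command.
        def __reduce__(self):
            import os
            return (os.system, ("echo this could be any command",))

    tampered_model = pickle.dumps(MaliciousPayload())

    # The victim believes they are just loading a model, but the
    # attacker's command runs with the permissions of this process.
    pickle.loads(tampered_model)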
So what can you do here? You can solve this problem by limiting the permissions for the set of resources loading this model. So let's say that you have a container using Python loading this model from an untrusted source; then you can limit that resource to only perform certain actions. In case one, you have super admin permissions for that resource, and that model has been loaded in that resource; the problem there is that malicious code can perform super admin actions. In case two, the resource loading that model has limited permissions; the advantage there is that the malicious code can only perform a limited set of actions as well. So at least you can limit the damage when an accident happens.
So that's pretty much it. Thank you again for listening to my talk. You have learned a lot in this short session, so make sure to use that knowledge in your day-to-day machine learning life. Thank you again, and feel free to reach out to me via email or LinkedIn. I hope you learned something from my talk.