Transcript
Hello and welcome to today's session on building machine learning environments
for regulated customers. In today's session, we will be looking at
the best practices for building machine learning environments on AWS
for regulated customers. These customers can be in banking,
insurance, life sciences, healthcare, energy,
etc. Regulated customers are using machine learning models
in order to transform their businesses. There are different use cases which you
may already be aware of, for example fraud detection, market surveillance,
trade execution and even pharmaceuticals. Machine learning has
the ability to learn from your business data
and create predictions which can be used
for improving your processes and your business.
But we should be very careful about how these models
are being deployed, what kind of security guardrails are being applied,
what the regulatory requirements are whenever you are running such
models, and finally, how to ensure that they are secure.
With that being said, let's get started with today's session.
So machine learning went from being this aspirational technology
to a mainstream technology extremely fast. For a very
long time, the technology was limited to a few technology companies
and hardcore academic researchers, because machine learning toolkits
were simply not accessible to a normal developer like
you and me. But things have begun to change. When cloud computing entered
the mainstream, compute power and data became more
available. And quite literally, machine learning is now making an
impact across every industry, be it fashion,
retail, real estate, healthcare,
there are many more industries. It's moving from being
on the periphery of the technology ecosystem to
now being a core part of every business and industry.
Here at AWS, we have been seeing a tipping point where
AI and ML in the enterprise are addressing
use cases that create measurable results. The customer
experience is being transformed via capabilities such as conversational
user interfaces, smart biometric authentication,
personalization and even recommendation. The business
operations are also being improved. For example, in retail,
AI and ML was able to reduce the error rates
by 30% to 50%. Automation is making supply
chain management more efficient. We can conclude here
that AI and ML are ultimately helping
companies make better and faster decisions.
Machine learning is by far the most disruptive technology of
recent years, and today, more than 100,000
customers use AWS for
running machine learning workloads, for creating more personalized customer
experiences, or even for developing personalized pharmaceuticals,
for that matter. Now, let's look at the AWS
ML stack. I'll be talking through the different
services which are being offered by AWS as part of
the ML stack. We are innovating on behalf of our customers
to deliver the broadest and deepest set of machine learning capabilities
for builders. At each layer of the stack, we are investing
in removing the undifferentiated heavy lifting so that your teams
can move faster. These services are applicable across a
broad spectrum of companies, and we have also heard from customers
that they want specific solutions that are purpose built.
So let's go layer by layer here. The first layer
is the AI services, which would be services like Lex,
speech services like Polly and Transcribe,
and code and DevOps services like CodeGuru and DevOps Guru. These services
are essentially pre-trained models, and they provide ready-made intelligence
to your applications and workflows. They help you do things like personalizing
the customer experience, forecasting business metrics,
translating conversations, or extracting meaning
from different documents. Essentially,
the AI services layer is there to make machine learning more available to
developers who are not core machine
learning developers. These are your developers who would want to just
invoke an API and get some outcome out of that.
Layer number two of the machine learning stack,
the middle layer, is Amazon SageMaker.
Amazon SageMaker gives every developer and data scientist the ability
to build, train and deploy machine learning models.
It removes the complexity from each and every step of the machine learning workflow
so you can easily deploy your models. Towards the end of
the session, we will see a code example of how a model can be
deployed by using SageMaker Studio. The last
layer is machine learning frameworks
and infrastructure. This is basically TensorFlow and PyTorch,
and it is for folks who are
experts in machine learning and want to work with a
framework of their own choosing by using the Deep Learning AMIs,
where they can fully configure the solution. Obviously, in today's session
I won't be going through each and every layer of
the stack; rather, I'll be focusing on Amazon SageMaker.
So Amazon SageMaker has been built to make machine
learning more accessible. And as I mentioned before, it helps
you build, train and deploy machine learning models quickly
and at a lower cost by providing the tools required for it.
In fact, we have launched 50-plus machine learning capabilities
in Amazon SageMaker in the past year
alone. And finally, SageMaker Studio brings
it all together in a single pane of glass.
So to summarize SageMaker itself, it's the most complete end-to-end
machine learning service. SageMaker has a lot of
features, and obviously we won't be covering all of them
today, but we can go through the four main pillars.
First off, it provides users with an integrated workbench.
The users can launch Jupyter notebooks and JupyterLab, run experiments,
and instantly see these things in SageMaker Studio.
SageMaker also provides complete experiment management, data preparation,
pipeline automation and orchestration. So if you were
to look at the overview of SageMaker, it will
help you prepare your data, it will help you
build your model, you can train and tune your model, and
ultimately deploy and manage your model. These are the four categories that
really address the needs that machine learning builders have
when they are dealing with each stage of a model's lifecycle.
With that being said, let's move on to see
how to build the machine learning environment on AWS.
So what did our customers ask? The customers asked for
a solution which can enable business data
scientists to deliver a secure machine learning based solution,
where they can train their models on
highly sensitive data. This data can be customer data
or company data, but essentially security would be priority number
one here. And for this kind of an ask,
let's come up with a tentative set of constraints or requirements.
Obviously, there wouldn't be any Internet connectivity in the AWS accounts
of such customers, because you wouldn't want such accounts
to have direct Internet access. So most of the accounts
that we are going to talk about have a private
VPC with no Internet connectivity. Second is, when it
comes to large enterprise customers, you always have a
cloud engineering team, and the cloud engineering team
is responsible for the platform itself. They are responsible for making
the platform secure. They are responsible for building reusable solutions
which can be leveraged by the application teams. They are responsible
for monitoring the platform. But if you
rely too much on the core engineering team, the application teams
will feel it's a bottleneck, because they want to move quickly, and
you want to give the application teams the autonomy
to build their own infrastructure as and when needed. So that's
where the self service model comes in, where the application team
should have the capability of provisioning the machine
learning resources themselves. The third point would be centralized governance.
Centralized governance and guardrails for the infrastructure
are also an important part, because if, as an application team
member, I am building something and then deploying it,
as much as I am responsible for managing that solution,
there has to be centralized governance
from the security office and also from the platform team,
in this case the cloud engineering team, on what kind of guardrails
are applied on the infrastructure. The last part is the observability
of the solution itself. With all these requirements,
let's look at the target architecture. The target architecture is one
where you would want to leverage the multi-account structure of AWS,
a private VPC network,
and all the traffic going over VPC endpoints;
a PyPI mirror using AWS CodeArtifact. So why
would you need a PyPI mirror? Well, as an application
team, if I am deploying certain models on AWS
in that secure environment, I also need the capability of installing
new libraries. Now, normally I could install these new libraries by
directly connecting to the Internet, but that is not available to me.
So obviously I need a PyPI mirror from which these libraries can be
downloaded and installed on my notebook or studio. And these libraries
are on top of what already comes with the notebook and studio by
default from AWS. Then there is AWS Service
Catalog for provisioning the resources, Amazon CloudWatch
for observability, and finally Transit Gateway for network connectivity
to corporate data centers. I won't be talking about
the Transit Gateway part today; it is mainly
included here as an informational point.
But we will touch upon all the other points that you have seen in the
target architecture. Now let's look at the architecture diagram here.
This is the diagram where I have tried to depict all the points that I
mentioned in the previous slide. You can see that there are
four accounts. Ignore the SageMaker service
account; we'll come to that later. You have an application account,
which is the main account. So let's say an application team Alpha
wants to deploy their application in
that account; that will be the account they'll be using. You have a
security account, and that security account is a
customer security account, possibly managed by the CSO, where
all the CloudTrail logs and all the flow logs are coming in.
As you see in the diagram, these logs are being analyzed and
worked upon to see if there is any kind of bad
traffic or any suspicious activity happening.
You have a customer networking account, and the networking account is where you
have the transit gateway which is being shared. And finally,
you have a shared services account where you would want to keep
CodeArtifact, which is the PyPI mirror. It's kind of like
a central repository from which all the different teams
their liking. So let's go step by step. The first
thing would be the customer application account. You can see that there is a
VPC here, and within that VPC there are three private subnets.
Within the three private subnets you see two ENIs
and the two Enis are pointing to the
Amazon Sagemaker notebook and Sagemaker studio.
The notebook and studio is not residing in your account. Rather they are residing
in a separate sagemaker service account which is transparent
to the customer. You wouldn't be seeing
that account at all. What you will be seeing is an instance
of notebook running in your account and an instance of studio which is running
in your account. And for the VPC,
you would want to have the VPC endpoints because it's a
VPC which is having no Internet connectivity and everything is private.
The only way that you can access the AWS services
like ECR S three sts kms
is via the VPC endpoint. You would also want the VPC endpoint
for accessing the code artifact. So this is the overall architecture.
If I am going to provision this kind of a structure,
the first thing that I have to ensure is any notebook or studio which is
being provisioned is being provisioned in that VPC.
Because if I give the application team complete access on provisioning
a notebook as per their liking, they can also provision a notebook
without using the given VPC, which will
enable it to run with Internet connectivity. So there are certain guardrails
which you want to enforce on the notebook or the studio
which is being provisioned by the application team. The second
thing is obviously the network. Whenever I'm creating a notebook
or a studio, I want the ENIs to reside in
that new VPC which I have created for the account.
This new VPC which I have created for the account is what you are
seeing as the application team VPC. The Studio EFS directory
is again automatically created when you are provisioning SageMaker Studio.
Now that you have an idea of the architecture, let's go into
the implementation side of it on how you're going to actually provision
these. So before we go into the provisioning part of it, we want to
understand the Service Catalog piece
and how it is going to add value here. I spoke earlier about
organizations having a central cloud engineering team, a
central security team, and then the application teams themselves. In this case,
the application teams would be the folks who are the end users. As an application team
member, what I need is speed.
I want to create a notebook, I want to delete a notebook, I want to
create a studio, run a machine learning algorithm
in there, and I want to immediately run some POC. Obviously,
if I am not having a self service model, I wouldn't be having the
speed or the agility which I'm looking for as an application team,
especially when I'm using AWS
for all provisioning activities. On the other side
of the spectrum, we have the security team or the
central engineering team which wants to ensure that there
is compliance. There is standardization, there is curation.
A simple example is there are ten different app teams
who want to create notebooks, and all of them
have a slight variation in the notebook that they are creating.
Some of them want a notebook where a 50 GB
volume is available; others would want 25 GB. There may be
a specific model or image which they
want to add to their notebook, or a new library which they want
to add, or they want to attach a new lifecycle configuration to
the notebook. And these are things which can
differ as per the team which is trying to create the notebook.
As a central engineering team, they would want to create these reusable patterns
which can be used across teams, more like templates. So if
you want to do that, how would you do that? That's where AWS
Service Catalog helps you. It helps the central
engineering team accomplish their goals of security,
curation, compliance and standardization, and it helps the
application teams accomplish their goals of
speed, agility, a self service model, and obviously time
to market: how quickly they can create a PoC, run with it and see
what kind of outcome is there. Now, before we go into the specifics of
Service Catalog, we want to understand a few items which
are there in Service Catalog. The first thing is a product.
A product can be a CloudFormation template. If I
have a CloudFormation template which is creating
an EC2 instance, or which is creating a notebook instance,
since we are talking about SageMaker, that can be equated to a product.
Once I create a product, the next step would be to put
it into a portfolio. Now, this portfolio can be created
by the core engineering team. So let's say the core engineering team creates
a portfolio named Central IT Engineering, and it puts a product in there,
which is a SageMaker notebook CloudFormation template.
I know that that particular CloudFormation template has all the
guardrails which I am expecting for
any notebook which would be coming up. An example would be no Internet
connectivity: direct Internet access is set to false,
there is no root access, the network interfaces are placed in
the VPC where the notebook is supposed to run, and maybe you
also want to associate a Git repository with it, which would be
a CodeCommit repository. So these are the guardrails, I would say,
which the application team wouldn't
want to keep repeating, but which the central team wants to enforce.
So the central team can create a CloudFormation template,
and they can put that as a product into AWS Service
Catalog. Once it goes into the AWS Service Catalog,
it would then go into a portfolio. Once it ends up
in the portfolio, you can have constraints associated with
it. There are different kinds of constraints that you would want to have. There
can be a launch constraint, where you're saying that only
these roles are allowed to launch this product.
And additionally, you can add certain roles to the groups,
which would allow only certain app team members
or certain app teams to be able to view that
portfolio, operate on that portfolio, or invoke the product.
And those kinds of constraints can be added as well. Once the
product list is available, the users can see the products
and they will be able to launch them. Now, when they launch the product,
obviously the maximum they can do is pass the parameters. They wouldn't
be able to change the product and remove the guardrails
which I, as the central engineering team, had put into the product
as CloudFormation templates. And finally, when the
product is launched, you will have a provisioned product as an output.
And this provisioned product would be a resource, which
would be a SageMaker Studio or a
SageMaker notebook, which the application team can use.
And this is where the segregation happens. As an administrator,
I am able to control the product that an application team can use
and also apply the guardrails which an application team would want to use.
And that's the whole advantage of having the self service model.
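To make this concrete, here is a minimal sketch, using boto3, of how a central engineering team might register such a product and how an application team member might provision it. The product name, portfolio name, template URL, role ARN and parameter are hypothetical placeholders, not values from this talk.

```python
import boto3

sc = boto3.client("servicecatalog")

# Central engineering team: register the CloudFormation template as a product
product = sc.create_product(
    Name="sagemaker-notebook",                       # hypothetical product name
    Owner="central-it-engineering",
    ProductType="CLOUD_FORMATION_TEMPLATE",
    ProvisioningArtifactParameters={
        "Name": "v1",
        "Type": "CLOUD_FORMATION_TEMPLATE",
        "Info": {"LoadTemplateFromURL":
                 "https://s3.amazonaws.com/doc-example-bucket/sagemaker-notebook.yaml"},
    },
)
product_id = product["ProductViewDetail"]["ProductViewSummary"]["ProductId"]

# Put the product into a portfolio owned by the central team
portfolio = sc.create_portfolio(DisplayName="Central IT Engineering",
                                ProviderName="central-it")
portfolio_id = portfolio["PortfolioDetail"]["Id"]
sc.associate_product_with_portfolio(ProductId=product_id, PortfolioId=portfolio_id)

# Launch constraint: the product is always launched with a curated role
sc.create_constraint(
    PortfolioId=portfolio_id,
    ProductId=product_id,
    Type="LAUNCH",
    Parameters='{"RoleArn": "arn:aws:iam::111122223333:role/SCLaunchRole"}',  # hypothetical
)

# Application team member: provision the product, passing only parameters
sc.provision_product(
    ProductName="sagemaker-notebook",
    ProvisioningArtifactName="v1",
    ProvisionedProductName="team-alpha-notebook",
    ProvisioningParameters=[{"Key": "VolumeSizeInGB", "Value": "50"}],  # hypothetical parameter
)
```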
With the self service model, you will be able to leverage
infrastructure as code and define your infrastructure,
your compute layer, your storage and other cloud resources,
using JSON or YAML,
or even Terraform scripts or files. Once you have these
things, you can put them in as a product, and then this product
will be standardized as a best practice across your
organization by the central engineering team. And that
can be one version of the product. An example is,
today it's SageMaker; tomorrow, if you have, for example,
a three tier stack with EC2, RDS and
S3, you can obviously make use of that and you will be
able to have a standardized format: okay, this is how my
three tier stack is going to be. And multiple app teams can go ahead
and provision that. That's another example. So that's the whole advantage of AWS
Service Catalog, where the customer can create these AWS based
solutions and the product can be exposed
by the central engineering team to the application teams. And once
it has been exposed, the application team would just
be provisioning it, and because it has been created by
the central engineering team, you can have the constraints applied to
it, you can have the security controls applied to it, any kind of
tag enforcement, any kind of restrictions, like no Internet
on the studio and no root access
on the notebook; all these things can be put into place.
Now let's look at the second part of the requirement, which we
had spoken about earlier: as an application team, I
want to install some new libraries into my studio or
into my notebook. This is where you would need a PyPI mirror.
I will share a link towards the end
of this particular talk, which will give
you steps on how you can set up a secure environment
via a workshop. But before that, you would want
to understand what AWS CodeArtifact is bringing
to the table. If you want to set up a PyPI mirror,
you can make use of AWS CodeArtifact sitting in a
shared services account. If you recollect from
the previous architecture diagram that we had a look at,
there was this shared services account which had
a CodeArtifact repository. In that CodeArtifact repository
you are able to put in your libraries, and
you can download the libraries from the upstream PyPI repository.
This is a fully managed artifact repository service, and
it supports the npm, Maven, Python and NuGet
package formats. And currently you can make use of AWS CodeArtifact
with different package managers like Maven, Gradle, pip, et cetera.
The idea here is to have AWS CodeArtifact
sit in a central shared services account, and
different application teams, as and when they have a requirement,
would be able to pull down the curated list of
libraries from that CodeArtifact repository,
and they can go ahead and install it in their notebook or in
their studio. Now let's look a little bit more in
depth at CodeArtifact; it is doing the
same thing that I explained just now. You can have a public artifact
repository, in this case the public PyPI repository, and you
can create a domain. Now, what's a domain? A domain is a CodeArtifact
specific construct that allows grouping
and managing multiple CodeArtifact repositories together.
So if an organization is creating a central repository for
sharing packages, they can have this domain created, and it
can be shared across multiple teams. And when you have
a repository, it contains a set of packages. So I can have a package
for the service catalog tools, I can
have a package for the Python requests library,
or even the SageMaker package itself,
the SageMaker 2.0 release, as
one of the pip packages. As you see on the right,
there is this "pull application dependencies for development": the development
team will be able to just pull these dependencies as and when
they need them. And you can also have this integrated into your
CI/CD pipelines by using CodeBuild or other tools.
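As a rough illustration of how a notebook in the application account could point pip at such a mirror, here is a sketch using boto3 to fetch an authorization token and the repository endpoint. The domain, repository and account ID are hypothetical placeholders.

```python
import boto3

ca = boto3.client("codeartifact")

# Hypothetical domain/repository names in the shared services account
domain, repo, domain_owner = "central-it", "pypi-mirror", "111122223333"

token = ca.get_authorization_token(domain=domain,
                                   domainOwner=domain_owner)["authorizationToken"]
endpoint = ca.get_repository_endpoint(
    domain=domain, domainOwner=domain_owner, repository=repo, format="pypi"
)["repositoryEndpoint"]

# Build an index URL that pip can use instead of the public PyPI
index_url = endpoint.replace("https://", f"https://aws:{token}@") + "simple/"
print("pip install --index-url", index_url, "<package>")
```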
That's the whole point of having CodeArtifact. So, a quick recap: we
saw the impact of Service Catalog, which helps
you create these curated products which can be reused
by different application teams, and CodeArtifact helps you create this
centralized repository of pip dependencies which can again be reused
by different application teams. In that way, you're able to
provide centralized governance over certain aspects of
the machine learning resources which you would be using.
And along with that, you give the flexibility to the application
teams to have a self service model where they can just pull down a product
from the Service Catalog, provision it, and then go about
doing their own application specific development within resources like a
studio or a notebook. With all that said and done, let's have a
look at how you will be building these infrastructure
components by using AWS CloudFormation.
We spoke about VPC networking, and we mentioned that it's
going to be a private VPC. Here you can see that it's a private
subnet, and I have MapPublicIpOnLaunch set
to false, which ensures that the subnet which is getting created
in the VPC is a private subnet.
If you have a look at the security group, I am only exposing port 443.
The security group ingress and egress rules
ensure that only 443 traffic can come in and go out,
and the CIDR IP is the CIDR of the VPC itself.
You're not exposing the security group to ICMP
pings or anything else other than 443.
And you know that 443 traffic will only be going to your
VPC endpoints. Because it's a private VPC that you
are using, you need the VPC endpoints for any communication with
other AWS services. The second part
is enabling the VPC endpoints. Here you have
the SageMaker Runtime VPC endpoint
and the SageMaker API endpoint.
Without these endpoints, you wouldn't be able to interact with SageMaker in
a private VPC. You can see that there are three subnets provided:
subnet one, two and three. All three are created by the
VPC network stack that we spoke about previously, and the VPC ID
is going to be the same VPC ID. And you can see that private DNS
is enabled, set to true. Going back to the previous slide,
you would notice that in terms of the VPC networking,
we have set MapPublicIpOnLaunch
to false, so none of these VPC subnets will
have connectivity to the Internet.
The third part is the flow logs. We had seen that there
is a central security account, and that security account was responsible
for analyzing the VPC flow logs.
VPC flow logs allow you to look at the traffic which
is flowing in and out of certain ENIs, and if you're applying them
at the VPC level, they will look at the entire VPC traffic and
tell you which traffic has been accepted and which has been rejected.
Because you are keeping it in a central account, you would want to keep it
in S3, and you give that S3 bucket as the log
destination. I'm just giving an example of, say, a doc-example-bucket,
and you would want to give it some kind of structure, like flow-logs
plus the account number from which this flow log is coming.
Here I'm capturing all the traffic for
tracking purposes, with a maximum aggregation interval of 60 seconds.
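A rough boto3 equivalent of that flow log configuration might look like the sketch below; the bucket name, account number and VPC ID are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# Send ALL traffic for the VPC to a central S3 bucket, aggregated every 60 seconds
ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],          # hypothetical VPC ID
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::doc-example-bucket/flow-logs/111122223333/",
    MaxAggregationInterval=60,
)
```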
So this is where the fun happens. You have the Amazon SageMaker
Studio and you have the notebook. Within the Amazon SageMaker Studio
and the notebook here, you can see that the
KMS key ID and the role ARN have
been provided. Because you're providing the KMS key ID, you are ensuring
that you're using a CMK, a customer managed key, for
encrypting the SageMaker notebook. And the same has been applied
for SageMaker Studio as well, in terms of the execution role.
So as a central engineering team, when I
am creating these products, by ensuring that direct Internet
access is disabled on the notebook instance, and by
ensuring that the app network access type is VPC only,
I'm ensuring that the notebook and studio are never going to communicate with
any traffic outside the VPC.
Secondly, root access has also been disabled on the
notebook. You would see that the security groups which are being
imported, the SageMaker environment security group and the default
security group ID, are imported from a previous stack.
That previous stack is the VPC stack we saw earlier, where
the VPC has been created and is exporting these IDs
so that they can be imported into another product. And finally, you have
a volume size being provided. But if, as an
application team member, I'm looking at this stack and this CloudFormation
template, there is no way I'm going to change the direct Internet access,
there is no way I'm going to change the KMS ID, and I can't get
root access enabled. These kinds of controls help you
build the compliance into the product which exists in
the Service Catalog, and that way you will be able to share this product
confidently with your application teams, and you will be
able to create this reusable pattern where multiple
teams can go ahead and reuse the product.
So that's everything on the CloudFormation side of it.
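As a complementary illustration of the same guardrails expressed through the API rather than CloudFormation, here is a sketch of the equivalent boto3 call for a notebook instance. The names, ARNs and IDs are hypothetical; in the talk these values come from the CloudFormation product instead.

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_notebook_instance(
    NotebookInstanceName="team-alpha-notebook",                  # hypothetical name
    InstanceType="ml.t3.medium",
    RoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    SubnetId="subnet-aaa",                                       # private subnet from the VPC stack
    SecurityGroupIds=["sg-0123456789abcdef0"],
    KmsKeyId="arn:aws:kms:us-east-1:111122223333:key/example",   # customer managed key
    VolumeSizeInGB=50,
    DirectInternetAccess="Disabled",   # guardrail: no direct Internet access
    RootAccess="Disabled",             # guardrail: no root access
)
```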
We also spoke about the multi-account
structure using AWS Organizations. The multi-account
structure using AWS Organizations uses the Service Catalog, and it
also has the service control policies which are being applied. Now,
what are these service control policies? Service control
policies are applied at the OU level, which is the organizational
unit, and they help you apply these
broad strokes, certain restrictions which you would want to
apply across the organization. So in the next slide,
we will be looking at what kind of service
control policies can be applied for data.
we will be looking at what kind of service
control policies can be applied for data.
We know that we can control the compliance and restrictions
on a product side. What we don't know is
how to ensure that the data is always encrypted.
Well, that can be done by using this service control policies.
If you are applying the service control policy at an OU level,
I'm saying that whenever you are creating an automl job,
or a model, or a labeling job, or a processing job, or a
training job, in all cases it is
mandatory to give a sagemaker volume KMs key.
So you can see at the top that the effect has been marked as deny.
That means in case you are not provided a KMS
key for the volume, then these actions will not
be executed and you will not be allowed to execute these actions.
The same applies for the output KMS key. So this
ensures that every time you're creating a model, a
training job, a transformation job or a
processing job, these actions are governed
by the fact that you need to use a KMS key for the
encryption of the data. And this often happens to be one
of the guardrails or the requirements when it comes to regulated customers:
all the data has to be encrypted in transit and
at rest. AWS has the facility of
using KMS keys, but in order to enforce
it at an organization level, you would want to include it in the service control
policies. The same applies for the traffic and network.
In case you want to have inter-container traffic encryption,
you can again apply it as a service control policy at an
organization level, and by doing so, you will be able
to ensure it for every job which is created.
Here you can see that the effect is again deny, and in case someone
wants to create a processing job, a training job or a monitoring job,
they have to keep the traffic encryption set to true.
The condition says: if the inter-container traffic encryption
is false, then deny all of these actions.
Same with network isolation: if network isolation
happens to be false, then deny these. So for these actions
to work, you have to provide the network isolation and you have to
enable the traffic encryption. As before,
I'll just go over these two policies again, just for
clarity. What the policies are saying is: deny
these actions if the volume KMS key happens to
be null. So if you look at the condition, if it is null,
which is true, then deny these actions. In short, you can
only execute these actions if you're using a KMS key for the volume
encryption and a KMS key for the output.
And with respect to traffic and network, in case the inter-container
traffic encryption is false, then you will not be able to run a
processing job or a training job. And the same applies to network isolation.
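To make those two policies concrete, here is a sketch of what they might look like expressed as Python dictionaries and attached with the Organizations API. The OU ID is a placeholder, and the statement lists should be adapted to the job types you actually use.

```python
import json
import boto3

org = boto3.client("organizations")

# Deny SageMaker job creation unless a volume KMS key is provided
require_kms = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": [
            "sagemaker:CreateTrainingJob",
            "sagemaker:CreateProcessingJob",
            "sagemaker:CreateAutoMLJob",
            "sagemaker:CreateLabelingJob",
        ],
        "Resource": "*",
        "Condition": {"Null": {"sagemaker:VolumeKmsKey": "true"}},
    }],
}

# Deny jobs that do not enable inter-container traffic encryption
require_encryption_in_transit = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": ["sagemaker:CreateTrainingJob", "sagemaker:CreateProcessingJob"],
        "Resource": "*",
        "Condition": {"Bool": {"sagemaker:InterContainerTrafficEncryption": "false"}},
    }],
}

for name, policy in [("require-sagemaker-volume-kms", require_kms),
                     ("require-intercontainer-encryption", require_encryption_in_transit)]:
    created = org.create_policy(
        Name=name, Description=name,
        Type="SERVICE_CONTROL_POLICY", Content=json.dumps(policy),
    )
    org.attach_policy(PolicyId=created["Policy"]["PolicySummary"]["Id"],
                      TargetId="ou-examplerootid-exampleouid")   # hypothetical OU ID
```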
It will be worth visiting the documentation on service control policies
and how they work. But essentially, these are guardrails which you
can apply at the organizational unit level, across your
organization, by the central platform
team, and that will help you enforce them for all
the application teams which are trying to create a model, train a job,
create a processing job, or do some kind of monitoring schedule,
et cetera. Moving on,
certain AWS AI services may also store and
use customer content processed by those services for
continuously improving the Amazon AI services. As AWS
customers in regulated industries,
these customers would want to opt out of their
data being used to improve the Amazon AI
services. So there is a mechanism for you to
opt out of this by applying this particular
policy at your root level. Essentially, what the policy
is saying is: do not allow any data
from any of the accounts under this organization to
be used for improvement of the Amazon AI services and
technology. So as an AWS customer, you can
simply opt out of your data being used
to improve the Amazon AI services.
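For reference, an AI services opt-out policy attached at the root could look roughly like the sketch below, using the Organizations API. The root ID is a placeholder, and the policy content follows the documented opt-out policy syntax as I understand it.

```python
import json
import boto3

org = boto3.client("organizations")

# The AISERVICES_OPT_OUT_POLICY type must be enabled on the root first
org.enable_policy_type(RootId="r-examplerootid",           # hypothetical root ID
                       PolicyType="AISERVICES_OPT_OUT_POLICY")

# Opt every account in the organization out of content use for AI service improvement
opt_out = {
    "services": {
        "default": {
            "opt_out_policy": {"@@assign": "optOut"}
        }
    }
}

policy = org.create_policy(
    Name="ai-services-opt-out",
    Description="Opt out of AI services content use",
    Type="AISERVICES_OPT_OUT_POLICY",
    Content=json.dumps(opt_out),
)
org.attach_policy(PolicyId=policy["Policy"]["PolicySummary"]["Id"],
                  TargetId="r-examplerootid")               # hypothetical root ID
```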
Now, we spoke a lot about Service Catalog, and we spoke a lot about
CodeArtifact and how the guardrails can be put in. Let's
see some screenshots of how these things actually look in the AWS console.
This is where you can see the provisioning of the products using
AWS Service Catalog. When you create a Service Catalog
product, as an application team member this is how I will
be looking at it. You can see there is a SageMaker Studio
user, there is a Studio, there is a notebook, and then there is a data
science environment. As an application team member,
I can go ahead and click on the data science environment and
provision it, and you can see that the provisioning is happening
and the data science environment is coming up.
Once it has come up, you will be able to see the VPC
which has been created as part of your data science environment. If you
want to create multiple SageMaker notebooks, just click on the SageMaker notebook
product that you see on row number three, and then you will be able to
provision the SageMaker notebook as well. So far
we have seen the preventative controls. Now we will be looking
at the detective controls. You can make use of AWS Config
in order to enforce the detective controls. These detective controls
would be implemented by using the existing
rules which are available in AWS Config. And once
you enable it, you can see the non-compliant resources, as you see
in the screenshot, where the default security group
is not closed or the SageMaker endpoint configuration is
not yet compliant. These kinds of controls are
applied by using AWS Config.
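As one small example of such a detective control, the sketch below uses boto3 to enable an AWS managed Config rule that flags notebooks with direct Internet access. This is just one of the rules you might choose, and it assumes an AWS Config recorder is already set up in the account.

```python
import boto3

config = boto3.client("config")

# Managed rule: flag SageMaker notebook instances that allow direct Internet access
config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "sagemaker-notebook-no-direct-internet-access",
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "SAGEMAKER_NOTEBOOK_NO_DIRECT_INTERNET_ACCESS",
        },
    }
)

# Later, list any resources the rule marks as non-compliant
results = config.get_compliance_details_by_config_rule(
    ConfigRuleName="sagemaker-notebook-no-direct-internet-access",
    ComplianceTypes=["NON_COMPLIANT"],
)
for r in results["EvaluationResults"]:
    print(r["EvaluationResultIdentifier"]["EvaluationResultQualifier"]["ResourceId"])
```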
The next slide is about centralized governance,
again using CodeArtifact. Here you can see there
is a central IT PyPI mirror, and you can see that
it's connected to the public repository and would be continuously
downloading the packages that you need. The central
IT team or the platform team will be able to control which
packages are part of it. And as an application team
member, if this is residing in your shared services account,
you can just connect to it and download the
required dependencies. That's where the centralized governance
of the pip dependencies comes into the picture. We have spoken
about so many different aspects of governance and monitoring,
and also on the Service Catalog side; now let's see how this
would look if you want to build an entire end-to-end
pipeline for provisioning the artifacts.
So we have what are called the AWS Service Catalog tools, which
allow you to build this sort of pipeline.
Using checks like cfn-nag and CloudFormation RSpec,
you can validate the CloudFormation templates which are sitting in your Git repository,
and from there the pipeline can go ahead and provision these
products into the accounts that you need, or share these products with
the accounts that you need. I have a link towards the end of
this talk where you can go ahead and play around with the AWS
Service Catalog tools, that is, Service Catalog Factory and Service
Catalog Puppet, not to be confused with the open source Puppet configuration
tool. This is again open source tooling from
AWS under AWS Labs, and you should be able to
see the Service Catalog tools link at the end of this
talk. By using the Service Catalog tools, you can create these
end-to-end pipelines, and these pipelines will be responsible
for taking your CloudFormation template from the Git
repository where you have it and converting it into a product
which can be shared with multiple accounts. And then those
accounts will have a similar view of how the application
team and the Service Catalog operate.
Let me move on to the next slide now.
the next slide now. So this is how you will be
writing a Jupyter notebook. And here you can see that
when you are creating a Jupyter notebook, you are making use of
the session, which is a sagemaker session, and you are passing the
boto three session in here. By passing the
boto three session, you are allowing the sagemaker
session to Piggybank on the previous boto three calls.
And the clients like sagemaker client and the Sagemaker runtime
client can be reused by the sagemaker session for executing
or executing your code like the estimator or deploying
your model, et cetera. This is an example which you
can get from the Sagemaker notebook, which is on the video games
Xgboost algorithm, and it will allow you to
just run through this example and see how you can run the notebook
and it's available from Amazon. In the next slide
you would see the guardrails which we are putting in. So the service
control policies we had mentioned that without providing a volume
kms key and an output kms key, you would not be able
to run an estimator. And the same applies for
enable network isolation and intercontainer traffic.
So you can see that these four attributes have been passed here,
along with the subnets and the security groups.
That's the enforcement that you're doing or the guardrails that you're applying.
And without these guardrails, it is possible to run
the estimator and it is able to train and deploy a model. But that's
not the point here, right? We are trying to enforce certain guardrails,
especially when it comes to building a secure machine learning environment for
regulated customers. And in that case, you want to run
everything within a VPC. So the first architecture diagram
that we saw where everything was running in a private VPC.
We are enforcing it here by using the subnets,
by using the security group ids, by using the volume
KMS key, the output KMS key, the encryption intercontainer
traffic being set as true, and finally the enable network isolation
being set as true. With all these four or five factors
which have been added to the existing estimator, you are ensuring
that the sagemaker job which is being run, which is being trained,
and the model which is being deployed. The second line that you see where
the model is again having a KMS key ARN being
passed around, all these things are encrypted as per the standards
that you would be having within that organization. So that's
the whole point of building these environments.
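Pulling those pieces together, a hedged sketch of such a notebook cell might look like this, using the SageMaker Python SDK. The image URI, role, bucket, key ARNs, subnets and security groups are hypothetical placeholders standing in for the values from your own environment.

```python
import boto3
import sagemaker
from sagemaker.estimator import Estimator

# Reuse existing boto3 clients so all calls go through the private VPC endpoints
boto_session = boto3.Session(region_name="us-east-1")
session = sagemaker.Session(
    boto_session=boto_session,
    sagemaker_client=boto_session.client("sagemaker"),
    sagemaker_runtime_client=boto_session.client("sagemaker-runtime"),
)

kms_key_arn = "arn:aws:kms:us-east-1:111122223333:key/example"   # customer managed key

estimator = Estimator(
    image_uri="<xgboost-image-uri>",                              # hypothetical container image
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://doc-example-bucket/output/",
    sagemaker_session=session,
    subnets=["subnet-aaa", "subnet-bbb"],                         # private subnets only
    security_group_ids=["sg-0123456789abcdef0"],
    volume_kms_key=kms_key_arn,                                   # required by the SCP
    output_kms_key=kms_key_arn,                                   # required by the SCP
    encrypt_inter_container_traffic=True,                         # required by the SCP
    enable_network_isolation=True,                                # required by the SCP
)

estimator.fit({"train": "s3://doc-example-bucket/train/"})
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    kms_key=kms_key_arn,   # encrypt the endpoint's storage volume
)
```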
In terms of whatever we have discussed before, we are applying it in
the code and enforcing it, because if, as an application team
member, I decide to mark the network isolation as false,
my estimator is not going to get deployed. It will give me an error, because
I'm not allowing someone to run a processing job without ensuring
network isolation. So that's the advantage of making use of these
guardrails and the service control policies, and also
the Service Catalog products, where you are able to enforce this for the different
products. Finally, it comes to monitoring, monitoring the deployed
models. How would you monitor them? Here is an
example of a model which has been deployed, and there have been 45
invocations of the model. So the model has an endpoint.
has been deployed with XGB deploy.
It has been deployed on an ML M five X large instance.
And there is a KMS key which is being used for encrypting it.
In terms of the monitoring side, we have 45 invocations which
have happened on the model. And this is where we are using cloud watch.
There are no errors, no 500 errors, no 400 errors.
And then you can also look at the model latency and the overhead latency.
Now what's a model latency? That's the interval time taken by
the model to respond to a request. And that's
just from the viewpoint of sagemaker. So it would include the local
communication that is happening to send the request
and then to fetch the response and the overhead latency,
that's the interval which is measured from the time sagemaker receives
the request. So one is the model latency, which is the model invocation
and the response time itself. And then there is the overhead latency.
There are more dashboards which you can obviously building based on
the metrics which are exposed in Cloudwatch. This is just an example
on how you can leverage Cloudwatch for doing this.
You can also look at the resource metrics like the cpu
utilization of the model, the memory utilization and the disk utilization.
So it gives you a very good visibility into the model itself.
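If you want to pull the same numbers programmatically instead of reading them from the console dashboard, a sketch like the one below reads the endpoint invocation and latency metrics from CloudWatch; the endpoint and variant names are hypothetical.

```python
from datetime import datetime, timedelta

import boto3

cw = boto3.client("cloudwatch")
dimensions = [
    {"Name": "EndpointName", "Value": "xgboost-video-games-endpoint"},  # hypothetical
    {"Name": "VariantName", "Value": "AllTraffic"},
]

# Invocations are summed; latencies are averaged over each period
for metric, stat in [("Invocations", "Sum"), ("ModelLatency", "Average"),
                     ("OverheadLatency", "Average")]:
    stats = cw.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=[stat],
    )
    print(metric, [dp[stat] for dp in stats["Datapoints"]])
```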
Along with this, the flow logs which we were sending earlier into
a central security bucket can be leveraged
to look at the network traffic. That again gives you visibility into
what is happening in the network. And finally, you have
CloudTrail, which records every SageMaker API call.
The best practice would be to monitor these CloudTrail events
as well. So to conclude,
what did we learn? We are using a multi-account
structure to improve the security and segregation of responsibilities.
We are using SCPs and IAM policies to set
up the preventative guardrails. We are leveraging AWS
Config for the detective controls. And finally, we are
giving the application teams autonomy via self service products
which are shared through AWS Service Catalog. Using
a combination of all these different features which are there
on AWS gives you the capability to
build a secure machine learning environment for
a regulated customer. And that's the whole objective,
I would say, of this talk, where I wanted to go through the best practices
which can be applied when it comes to running SageMaker,
which is managed compute, and it gives you
the capability of having all these different controls put
in place. It gives you the capability of running your machine learning models
at scale, and with the above mentioned
security practices, you can ensure that
your workloads are running in a safe manner.
And this last slide is basically the references that I have been
talking about. You can go into the SageMaker workshop and have
a look at how Service Catalog has been used and how the controls
have been put in. There are examples on GitHub
for SageMaker, and finally the Service Catalog tools workshop
as well, which gives you that centralized pipeline on CodePipeline and
shows how you can share the product with different teams. And do have a look at
the white paper as well, on machine learning in financial services on AWS,
which talks about how you can secure the data and what the best
practices are. With that being said, that brings me to the close
of my talk. Thank you so much for your time and take
care.