Transcript
Hello and welcome to today's session on building machine learning environments
for regulated customers. In today's session, we will be looking at
the best practices for building machine learning environments on AWS
for regulated customers. These customers can be in banking,
insurance, life sciences, healthcare, energy,
etc. Regulated customers are using machine learning models
in order to transform their businesses. There are different use cases which you
may already be aware of, for example fraud detection, market surveillance,
trade execution and even pharmaceuticals. Machine learning has
the ability to learn from your business data
and create predictions which can be used
for improving your processes and your business.
But we should be very careful about how these models
are being deployed, what kind of security guardrails are being applied,
what the regulatory requirements are whenever you are running such
models, and finally, how to ensure that they are secure.
With that being said, let's get started with today's session.
So machine learning went from being this aspirational technology
to a mainstream technology extremely fast. For a very
long time, the technology was limited to a few technology companies
and hardcore academic researchers, because machine learning toolkits
were simply not accessible to a normal developer like
you and me. But things have begun to change. When cloud computing entered
the mainstream, compute power and data became more
available. And quite literally, machine learning is now making an
impact across every industry, be it fashion,
retail, real estate, healthcare,
there are many more industries. It's moving from being
on the periphery of the technology ecosystem to
now being a core part of every business and industry.
Here at AWS, we have been seeing a tipping point where
AI and ML in the enterprise are addressing
use cases that create measurable results. The customer
experience is being transformed via capabilities such as conversational
user interfaces, smart biometric authentication,
personalization and even recommendation. The business
operations are also being improved. For example, in retail,
AI and ML was able to reduce the error rates
by 30% to 50%. Automation is making supply
chain management more efficient. We can conclude here
that AI and ML are ultimately helping
companies make better and faster decisions.
Machine learning is by far the most disruptive technology of
recent years, and today, more than 100,000
customers use AWS for
running machine learning workloads, for creating more personalized customer
experiences, or even for developing personalized pharmaceuticals,
for that matter. Now, let's look at the AWS
ML stack. I'll be talking through the different
services which are being offered by AWS as part of
the ML stack. We are innovating on behalf of our customers
to deliver the broadest and deepest set of machine learning capabilities
for builders. At each layer of the stack, we are investing
in removing the undifferentiated heavy lifting so that your teams
can move faster. These services are applicable across a
broad spectrum of companies, and we have also heard from customers
that they want specific solutions that are purpose built.
So let's go layer by layer here. The first layer
is the AI services, which would be services like Lex,
speech services like Polly and Transcribe,
and code and DevOps services like CodeGuru and DevOps Guru. These services
are essentially pre-trained models, and they provide ready-made intelligence
to your applications and workflows. They help you do things like personalizing
the customer experience, forecasting business metrics,
translating conversations, or extracting meaning
from different documents. Essentially,
the AI services layer is there to make machine learning more available to
developers who are not core machine
learning developers. These are your developers who would want to just
invoke an API and get some outcome out of that.
Layer number two of the machine learning stack,
the middle layer, is Amazon SageMaker.
Amazon SageMaker gives every developer and data scientist the ability
to build, train and deploy machine learning models.
It removes the complexity from each and every step of the machine learning workflow
so you can easily deploy your models. Towards the end of
the session, we will see a code example of how a model can be
deployed by using SageMaker Studio. The last
layer is machine learning frameworks
and infrastructure. This is basically TensorFlow and PyTorch,
and it is for folks who are
experts in machine learning and want to work with a
framework of their own choosing by using the Deep Learning AMIs,
where they can fully configure the solution. Obviously, in today's session
I won't be going through each and every layer of
the stack; rather, I'll be focusing on Amazon SageMaker.
So Amazon SageMaker has been built to make machine
learning more accessible. And as I mentioned before, it helps
you build, train and deploy machine learning models quickly
and at a lower cost by providing the tools required for it.
In fact, we have launched 50-plus machine learning capabilities
in Amazon SageMaker in the past year
alone. And finally, SageMaker Studio brings
it all together in a single pane of glass.
So to summarize SageMaker itself, it's the most complete end-to-end
machine learning service. SageMaker has a lot of
features, and obviously we won't be covering all of them
today, but we can go through the four main pillars.
First off, it provides users with an integrated workbench.
The users can launch Jupyter notebooks and JupyterLab, run experiments,
and instantly see these things in SageMaker Studio.
SageMaker also provides complete experiment management, data preparation,
pipeline automation and orchestration. So if you were
to look at the overview of SageMaker, it will
help you prepare your data, it will help you
build your model, you can train and tune your model, and
ultimately deploy and manage your model. These are the four categories that
really address the needs that machine learning builders have
when they are dealing with each stage of a model's lifecycle.
With that being said, let's move on to see
how to build the machine learning environment on AWS.
So what did our customers ask? The customers asked for
a solution which can enable business data
scientists to deliver a secure machine learning based solution,
where they can train their models on
highly sensitive data. This data can be customer data
or company data, but essentially security would be priority number
one here. And for this kind of an ask,
let's come up with a tentative set of constraints or requirements.
Obviously, there wouldn't be any Internet connectivity in the AWS accounts
of such customers, because you wouldn't want such accounts
to have direct Internet access. So most of the accounts
that we are going to talk about have a private
VPC with no Internet connectivity. Second is, when it
comes to large enterprise customers, you always have a
cloud engineering team, and the cloud engineering team
is responsible for the platform itself. They are responsible for making
the platform secure. They are responsible for building reusable solutions
which can be leveraged by the application teams. They are responsible
for monitoring the platform. But if you
rely too much on the core engineering team, the application teams
will feel it's a bottleneck, because they want to move quickly, and
you want to give the application teams the autonomy
to build their own infrastructure as and when needed. So that's
where the self service model comes in, where the application team
should have the capability of provisioning the machine
learning resources themselves. The third point would be centralized governance.
Centralized governance and guardrails for the infrastructure
are also an important part, because if, as an application team
member, I am building something and then deploying it,
as much as I am responsible for managing that solution,
there has to be centralized governance
from the security office and also from the platform team,
in this case the cloud engineering team, on what kind of guardrails
are applied on the infrastructure. The last part is the observability
of the solution itself. With all these requirements,
let's look at the target architecture. The target architecture is one
where you would want to leverage the multi-account structure of AWS,
a private VPC network,
and all the traffic going over VPC endpoints;
a PyPI mirror using AWS CodeArtifact. So why
would you need a PyPI mirror? Well, as an application
team, if I am deploying certain models on AWS
in that secure environment, I also need the capability of installing
new libraries. Now, normally I could install these new libraries by
directly connecting to the Internet, but that is not available to me.
So obviously I need a PyPI mirror from which these libraries can be
downloaded and installed on my notebook or studio. And these libraries
are on top of what already comes with the notebook and studio by
default from AWS. Then there is AWS Service
Catalog for provisioning the resources, Amazon CloudWatch
for observability, and finally Transit Gateway for network connectivity
to corporate data centers. I won't be talking about
the Transit Gateway part today; it is mainly
included here as an informational point.
But we will touch upon all the other points that you have seen in the
target architecture. Now let's look at the architecture diagram here.
This is the diagram where I have tried to depict all the points that I
mentioned in the previous slide. You can see that there are
four accounts. Ignore the SageMaker service
account; we'll come to that later. You have an application account,
which is the main account. So let's say an application team Alpha
wants to deploy their application in
that account; that will be the account they'll be using. You have a
security account, and that security account is a
customer security account, possibly managed by the CSO, where
all the CloudTrail logs and all the flow logs are coming in.
As you see in the diagram, these logs are being analyzed and
worked upon to see if there is any kind of bad
traffic or any suspicious activity happening.
You have a customer networking account, and the networking account is where you
have the transit gateway which is being shared. And finally,
you have a shared services account where you would want to keep
CodeArtifact, which is the PyPI mirror. It's kind of like
a central repository from which all the different teams
their liking. So let's go step by step. The first
thing would be the customer application account. You can see that there is a
VPC here, and within that VPC there are three private subnets.
Within the three private subnets you see two ENIs
and the two Enis are pointing to the
Amazon Sagemaker notebook and Sagemaker studio.
The notebook and studio is not residing in your account. Rather they are residing
in a separate sagemaker service account which is transparent
to the customer. You wouldn't be seeing
that account at all. What you will be seeing is an instance
of notebook running in your account and an instance of studio which is running
in your account. And for the VPC,
you would want to have the VPC endpoints because it's a
VPC which is having no Internet connectivity and everything is private.
The only way that you can access the AWS services
like ECR S three sts kms
is via the VPC endpoint. You would also want the VPC endpoint
for accessing the code artifact. So this is the overall architecture.
If I am going to provision this kind of a structure,
the first thing that I have to ensure is any notebook or studio which is
being provisioned is being provisioned in that VPC.
Because if I give the application team complete access on provisioning
a notebook as per their liking, they can also provision a notebook
without using the given VPC, which will
enable it to run with Internet connectivity. So there are certain guardrails
which you want to enforce on the notebook or the studio
which is being provisioned by the application team. The second
thing is obviously the network. Whenever I'm creating a notebook
or a studio, I want the ENIs to reside in
that new VPC which I have created for the account.
This new VPC which I have created for the account is what you are
seeing as the application team VPC. The Studio EFS directory
is again automatically created when you are provisioning SageMaker Studio.
Now that you have an idea of the architecture, let's go into
the implementation side of it on how you're going to actually provision
these. So before we go into the provisioning part of it, we want to
understand the Service Catalog piece
and how it is going to add value here. I spoke earlier about
organizations having a central cloud engineering team, a
central security team, and then the application teams themselves. In this case,
the application teams would be the folks who are the end users. As an application team
member, what I need is speed.
I want to create a notebook, I want to delete a notebook, I want to
create a studio, run a machine learning algorithm
in there, and I want to immediately run some POC. Obviously,
if I am not having a self service model, I wouldn't be having the
speed or the agility which I'm looking for as an application team,
especially when I'm using AWS
for all provisioning activities. On the other side
of the spectrum, we have the security team or the
central engineering team which wants to ensure that there
is compliance. There is standardization, there is curation.
A simple example is there are ten different app teams
who want to create notebooks, and all of them
have a slight variation in the notebook that they are creating.
Some of them want a notebook where a 50 GB
volume is available; others would want 25 GB. There may be
a specific model or image which they
want to add to their notebook, or a new library which they want
to add, or they want to attach a new lifecycle configuration to
the notebook. And these are things which can
differ as per the team which is trying to create the notebook.
As a central engineering team, they would want to create these reusable patterns
which can be used across teams, more like templates. So if
you want to do that, how would you do that? That's where AWS
Service Catalog helps you. It helps the central
engineering team accomplish their goals of security,
curation, compliance and standardization, and it helps the
application teams accomplish their goals of
speed, agility, a self service model, and obviously time
to market: how quickly they can create a PoC, run with it and see
what kind of outcome is there. Now, before we go into the specifics of
Service Catalog, we want to understand a few items which
are there in Service Catalog. The first thing is a product.
A product can be a CloudFormation template. If I
have a CloudFormation template which is creating
an EC2 instance, or which is creating a notebook instance,
since we are talking about SageMaker, that can be equated to a product.
Once I create a product, the next step would be to put
it into a portfolio. Now, this portfolio can be created
by the core engineering team. So let's say the core engineering team creates
a portfolio named Central IT Engineering, and it puts a product in there,
which is a SageMaker notebook CloudFormation template.
I know that that particular CloudFormation template has all the
guardrails which I am expecting for
any notebook which would be coming up. An example would be no Internet
connectivity: direct Internet access is set to false,
there is no root access, the network interfaces are placed in
the VPC where the notebook is supposed to run, and maybe you
also want to associate a Git repository with it, which would be
a CodeCommit repository. So these are the guardrails, I would say,
which the application team wouldn't
want to keep repeating, but which the central team wants to enforce.
So the central team can create a CloudFormation template,
and they can put that as a product into AWS Service
Catalog. Once it goes into the AWS Service Catalog,
it would then go into a portfolio. Once it ends up
in the portfolio, you can have constraints associated with
it. There are different kinds of constraints that you would want to have. There
can be a launch constraint, where you're saying that only
these roles are allowed to launch this product.
And additionally, you can add certain roles to the groups,
which would allow only certain app team members
or certain app teams to be able to view that
portfolio, operate on that portfolio, or invoke the product.
And those kinds of constraints can be added as well. Once the
product list is available, the users can see the products
and they will be able to launch them. Now, when they launch the product,
obviously the maximum they can do is pass the parameters. They wouldn't
be able to change the product and remove the guardrails
which I, as the central engineering team, had put into the product
as CloudFormation templates. And finally, when the
product is launched, you will have a provisioned product as an output.
And this provisioned product would be a resource, which
would be a SageMaker Studio or a
SageMaker notebook, which the application team can use.
And this is where the segregation happens. As an administrator,
I am able to control the product that an application team can use
and also apply the guardrails which an application team would want to use.
And that's the whole advantage of having the self service model.
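To make this concrete, here is a minimal sketch, using boto3, of how a central engineering team might register such a product and how an application team member might provision it. The product name, portfolio name, template URL, role ARN and parameter are hypothetical placeholders, not values from this talk.

```python
import boto3

sc = boto3.client("servicecatalog")

# Central engineering team: register the CloudFormation template as a product
product = sc.create_product(
    Name="sagemaker-notebook",                       # hypothetical product name
    Owner="central-it-engineering",
    ProductType="CLOUD_FORMATION_TEMPLATE",
    ProvisioningArtifactParameters={
        "Name": "v1",
        "Type": "CLOUD_FORMATION_TEMPLATE",
        "Info": {"LoadTemplateFromURL":
                 "https://s3.amazonaws.com/doc-example-bucket/sagemaker-notebook.yaml"},
    },
)
product_id = product["ProductViewDetail"]["ProductViewSummary"]["ProductId"]

# Put the product into a portfolio owned by the central team
portfolio = sc.create_portfolio(DisplayName="Central IT Engineering",
                                ProviderName="central-it")
portfolio_id = portfolio["PortfolioDetail"]["Id"]
sc.associate_product_with_portfolio(ProductId=product_id, PortfolioId=portfolio_id)

# Launch constraint: the product is always launched with a curated role
sc.create_constraint(
    PortfolioId=portfolio_id,
    ProductId=product_id,
    Type="LAUNCH",
    Parameters='{"RoleArn": "arn:aws:iam::111122223333:role/SCLaunchRole"}',  # hypothetical
)

# Application team member: provision the product, passing only parameters
sc.provision_product(
    ProductName="sagemaker-notebook",
    ProvisioningArtifactName="v1",
    ProvisionedProductName="team-alpha-notebook",
    ProvisioningParameters=[{"Key": "VolumeSizeInGB", "Value": "50"}],  # hypothetical parameter
)
```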
With the self service model, you will be able to leverage
infrastructure as code and define your infrastructure,
your compute layer, your storage and other cloud resources,
using JSON or YAML,
or even Terraform scripts or files. Once you have these
things, you can put them in as a product, and then this product
will be standardized as a best practice across your
organization by the central engineering team. And that
can be one version of the product. An example is,
today it's SageMaker; tomorrow, if you have, for example,
a three tier stack with EC2, RDS and
S3, you can obviously make use of that and you will be
able to have a standardized format: okay, this is how my
three tier stack is going to be. And multiple app teams can go ahead
and provision that. That's another example. So that's the whole advantage of AWS
Service Catalog, where the customer can create these AWS based
solutions and the product can be exposed
by the central engineering team to the application teams. And once
it has been exposed, the application team would just
be provisioning it, and because it has been created by
the central engineering team, you can have the constraints applied to
it, you can have the security controls applied to it, any kind of
tag enforcement, any kind of restrictions, like no Internet
on the studio and no root access
on the notebook; all these things can be put into place.
Now let's look at the second part of the requirement, which we
had spoken about earlier: as an application team, I
want to install some new libraries into my studio or
into my notebook. This is where you would need a PyPI mirror.
I will share a link towards the end
of this particular talk, which will give
you steps on how you can set up a secure environment
via a workshop. But before that, you would want
to understand what AWS CodeArtifact is bringing
to the table. If you want to set up a PyPI mirror,
you can make use of AWS CodeArtifact sitting in a
shared services account. If you recollect from
the previous architecture diagram that we had a look at,
there was this shared services account which had
a CodeArtifact repository. In that CodeArtifact repository
you are able to put in your libraries, and
you can download the libraries from the upstream PyPI repository.
This is a fully managed artifact repository service, and
it supports the npm, Maven, Python and NuGet
package formats. And currently you can make use of AWS CodeArtifact
with different package managers like Maven, Gradle, pip, et cetera.
The idea here is to have AWS CodeArtifact
sit in a central shared services account, and
different application teams, as and when they have a requirement,
would be able to pull down the curated list of
libraries from that CodeArtifact repository,
and they can go ahead and install it in their notebook or in
their studio. Now let's look a little bit more in
depth at CodeArtifact; it is doing the
same thing that I explained just now. You can have a public artifact
repository, in this case the public PyPI repository, and you
can create a domain. Now, what's a domain? A domain is a CodeArtifact
specific construct that allows grouping
and managing multiple CodeArtifact repositories together.
So if an organization is creating a central repository for
sharing packages, they can have this domain created, and it
can be shared across multiple teams. And when you have
a repository, it contains a set of packages. So I can have a package
for the service catalog tools, I can
have a package for the Python requests library,
or even the SageMaker package itself,
the SageMaker 2.0 release, as
one of the pip packages. As you see on the right,
there is this "pull application dependencies for development": the development
team will be able to just pull these dependencies as and when
they need them. And you can also have this integrated into your
CI/CD pipelines by using CodeBuild or other tools.
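As a rough illustration of how a notebook in the application account could point pip at such a mirror, here is a sketch using boto3 to fetch an authorization token and the repository endpoint. The domain, repository and account ID are hypothetical placeholders.

```python
import boto3

ca = boto3.client("codeartifact")

# Hypothetical domain/repository names in the shared services account
domain, repo, domain_owner = "central-it", "pypi-mirror", "111122223333"

token = ca.get_authorization_token(domain=domain,
                                   domainOwner=domain_owner)["authorizationToken"]
endpoint = ca.get_repository_endpoint(
    domain=domain, domainOwner=domain_owner, repository=repo, format="pypi"
)["repositoryEndpoint"]

# Build an index URL that pip can use instead of the public PyPI
index_url = endpoint.replace("https://", f"https://aws:{token}@") + "simple/"
print("pip install --index-url", index_url, "<package>")
```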
That's the whole point of having CodeArtifact. So, a quick recap: we
saw the impact of Service Catalog, which helps
you create these curated products which can be reused
by different application teams, and CodeArtifact helps you create this
centralized repository of pip dependencies which can again be reused
by different application teams. In that way, you're able to
provide centralized governance over certain aspects of
the machine learning resources which you would be using.
And along with that, you give the flexibility to the application
teams to have a self service model where they can just pull down a product
from the Service Catalog, provision it, and then go about
doing their own application specific development within resources like a
studio or a notebook. With all that said and done, let's have a
look at how you will be building these infrastructure
components by using AWS CloudFormation.
We spoke about VPC networking, and we mentioned that it's
going to be a private VPC. Here you can see that it's a private
subnet, and I have MapPublicIpOnLaunch set
to false, which ensures that the subnet which is getting created
in the VPC is a private subnet.
If you have a look at the security group, I am only exposing port 443.
The security group ingress and egress rules
ensure that only 443 traffic can come in and go out,
and the CIDR IP is the CIDR of the VPC itself.
You're not exposing the security group to ICMP
pings or anything else other than 443.
And you know that 443 traffic will only be going to your
VPC endpoints. Because it's a private VPC that you
are using, you need the VPC endpoints for any communication with
other AWS services. The second part
is enabling the VPC endpoints. Here you have
the SageMaker Runtime VPC endpoint
and the SageMaker API endpoint.
Without these endpoints, you wouldn't be able to interact with SageMaker in
a private VPC. You can see that there are three subnets provided:
subnet one, two and three. All three are created by the
VPC network stack that we spoke about previously, and the VPC ID
is going to be the same VPC ID. And you can see that private DNS
is enabled, set to true. Going back to the previous slide,
you would notice that in terms of the VPC networking,
we have set MapPublicIpOnLaunch
to false, so none of these VPC subnets will
have connectivity to the Internet.
The third part is the flow logs. We had seen that there
is a central security account, and that security account was responsible
for analyzing the VPC flow logs.
VPC flow logs allow you to look at the traffic which
is flowing in and out of certain ENIs, and if you're applying them
at the VPC level, they will look at the entire VPC traffic and
tell you which traffic has been accepted and which has been rejected.
Because you are keeping it in a central account, you would want to keep it
in S3, and you give that S3 bucket as the log
destination. I'm just giving an example of, say, a doc-example-bucket,
and you would want to give it some kind of structure, like flow-logs
plus the account number from which this flow log is coming.
Here I'm capturing all the traffic for
tracking purposes, with a maximum aggregation interval of 60 seconds.
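A rough boto3 equivalent of that flow log configuration might look like the sketch below; the bucket name, account number and VPC ID are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# Send ALL traffic for the VPC to a central S3 bucket, aggregated every 60 seconds
ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],          # hypothetical VPC ID
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::doc-example-bucket/flow-logs/111122223333/",
    MaxAggregationInterval=60,
)
```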
So this is where the fun happens. You have the Amazon SageMaker
Studio and you have the notebook. Within the Amazon SageMaker Studio
and the notebook here, you can see that the
KMS key ID and the role ARN have
been provided. Because you're providing the KMS key ID, you are ensuring
that you're using a CMK, a customer managed key, for
encrypting the SageMaker notebook. And the same has been applied
for SageMaker Studio as well, in terms of the execution role.
So as a central engineering team, when I
am creating these products, by ensuring that direct Internet
access is disabled on the notebook instance, and by
ensuring that the app network access type is VPC only,
I'm ensuring that the notebook and studio are never going to communicate with
any traffic outside the VPC.
Secondly, root access has also been disabled on the
notebook. You would see that the security groups which are being
imported, the SageMaker environment security group and the default
security group ID, are imported from a previous stack.
That previous stack is the VPC stack we saw earlier, where
the VPC has been created and is exporting these IDs
so that they can be imported into another product. And finally, you have
a volume size being provided. But if, as an
application team member, I'm looking at this stack and this CloudFormation
template, there is no way I'm going to change the direct Internet access,
there is no way I'm going to change the KMS ID, and I can't get
root access enabled. These kinds of controls help you
build the compliance into the product which exists in
the Service Catalog, and that way you will be able to share this product
confidently with your application teams, and you will be
able to create this reusable pattern where multiple
teams can go ahead and reuse the product.
So that's everything on the CloudFormation side of it.
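As a complementary illustration of the same guardrails expressed through the API rather than CloudFormation, here is a sketch of the equivalent boto3 call for a notebook instance. The names, ARNs and IDs are hypothetical; in the talk these values come from the CloudFormation product instead.

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_notebook_instance(
    NotebookInstanceName="team-alpha-notebook",                  # hypothetical name
    InstanceType="ml.t3.medium",
    RoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    SubnetId="subnet-aaa",                                       # private subnet from the VPC stack
    SecurityGroupIds=["sg-0123456789abcdef0"],
    KmsKeyId="arn:aws:kms:us-east-1:111122223333:key/example",   # customer managed key
    VolumeSizeInGB=50,
    DirectInternetAccess="Disabled",   # guardrail: no direct Internet access
    RootAccess="Disabled",             # guardrail: no root access
)
```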
We also spoke about the multi-account
structure using AWS Organizations. The multi-account
structure using AWS Organizations uses the Service Catalog, and it
also has the service control policies which are being applied. Now,
what are these service control policies? Service control
policies are applied at the OU level, which is the organizational
unit, and they help you apply these
broad strokes, certain restrictions which you would want to
apply across the organization. So in the next slide,
we will be looking at what kind of service
control policies can be applied for data.
we will be looking at what kind of service
control policies can be applied for data.
We know that we can control the compliance and restrictions
on a product side. What we don't know is
how to ensure that the data is always encrypted.
Well, that can be done by using this service control policies.
If you are applying the service control policy at an OU level,
I'm saying that whenever you are creating an automl job,
or a model, or a labeling job, or a processing job, or a
training job, in all cases it is
mandatory to give a sagemaker volume KMs key.
So you can see at the top that the effect has been marked as deny.
That means in case you are not provided a KMS
key for the volume, then these actions will not
be executed and you will not be allowed to execute these actions.
The same applies for the output KMS key. So this
ensures that every time you're creating a model, a
training job, a transformation job or a
processing job, these actions are governed
by the fact that you need to use a KMS key for the
encryption of the data. And this often happens to be one
of the guardrails or the requirements when it comes to regulated customers:
all the data has to be encrypted in transit and
at rest. AWS has the facility of
using KMS keys, but in order to enforce
it at an organization level, you would want to include it in the service control
policies. The same applies for the traffic and network.
In case you want to have inter-container traffic encryption,
you can again apply it as a service control policy at an
organization level, and by doing so, you will be able
to ensure it for every job which is created.
Here you can see that the effect is again deny, and in case someone
wants to create a processing job, a training job or a monitoring job,
they have to keep the traffic encryption set to true.
The condition says: if the inter-container traffic encryption
is false, then deny all of these actions.
Same with network isolation: if network isolation
happens to be false, then deny these. So for these actions
to work, you have to provide the network isolation and you have to
enable the traffic encryption. As before,
I'll just go over these two policies again, just for
clarity. What the policies are saying is: deny
these actions if the volume KMS key happens to
be null. So if you look at the condition, if it is null,
which is true, then deny these actions. In short, you can
only execute these actions if you're using a KMS key for the volume
encryption and a KMS key for the output.
And with respect to traffic and network, in case the inter-container
traffic encryption is false, then you will not be able to run a
processing job or a training job. And the same applies to network isolation.
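To make those two policies concrete, here is a sketch of what they might look like expressed as Python dictionaries and attached with the Organizations API. The OU ID is a placeholder, and the statement lists should be adapted to the job types you actually use.

```python
import json
import boto3

org = boto3.client("organizations")

# Deny SageMaker job creation unless a volume KMS key is provided
require_kms = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": [
            "sagemaker:CreateTrainingJob",
            "sagemaker:CreateProcessingJob",
            "sagemaker:CreateAutoMLJob",
            "sagemaker:CreateLabelingJob",
        ],
        "Resource": "*",
        "Condition": {"Null": {"sagemaker:VolumeKmsKey": "true"}},
    }],
}

# Deny jobs that do not enable inter-container traffic encryption
require_encryption_in_transit = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": ["sagemaker:CreateTrainingJob", "sagemaker:CreateProcessingJob"],
        "Resource": "*",
        "Condition": {"Bool": {"sagemaker:InterContainerTrafficEncryption": "false"}},
    }],
}

for name, policy in [("require-sagemaker-volume-kms", require_kms),
                     ("require-intercontainer-encryption", require_encryption_in_transit)]:
    created = org.create_policy(
        Name=name, Description=name,
        Type="SERVICE_CONTROL_POLICY", Content=json.dumps(policy),
    )
    org.attach_policy(PolicyId=created["Policy"]["PolicySummary"]["Id"],
                      TargetId="ou-examplerootid-exampleouid")   # hypothetical OU ID
```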
It will be worth visiting the documentation on service control policies
and how they work. But essentially, these are guardrails which you
can apply at the organizational unit level, across your
organization, by the central platform
team, and that will help you enforce them for all
the application teams which are trying to create a model, train a job,
create a processing job, or do some kind of monitoring schedule,
et cetera. Moving on,
certain AWS AI services may also store and
use customer content processed by those services for
continuously improving the Amazon AI services. As AWS
customers in regulated industries,
these customers would want to opt out of their
data being used to improve the Amazon AI
services. So there is a mechanism for you to
opt out of this by applying this particular
policy at your root level. Essentially, what the policy
is saying is: do not allow any data
from any of the accounts under this organization to
be used for improvement of the Amazon AI services and
technology. So as an AWS customer, you can
simply opt out of your data being used
to improve the Amazon AI services.
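For reference, an AI services opt-out policy attached at the root could look roughly like the sketch below, using the Organizations API. The root ID is a placeholder, and the policy content follows the documented opt-out policy syntax as I understand it.

```python
import json
import boto3

org = boto3.client("organizations")

# The AISERVICES_OPT_OUT_POLICY type must be enabled on the root first
org.enable_policy_type(RootId="r-examplerootid",           # hypothetical root ID
                       PolicyType="AISERVICES_OPT_OUT_POLICY")

# Opt every account in the organization out of content use for AI service improvement
opt_out = {
    "services": {
        "default": {
            "opt_out_policy": {"@@assign": "optOut"}
        }
    }
}

policy = org.create_policy(
    Name="ai-services-opt-out",
    Description="Opt out of AI services content use",
    Type="AISERVICES_OPT_OUT_POLICY",
    Content=json.dumps(opt_out),
)
org.attach_policy(PolicyId=policy["Policy"]["PolicySummary"]["Id"],
                  TargetId="r-examplerootid")               # hypothetical root ID
```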
Now, we spoke a lot about Service Catalog, and we spoke a lot about
CodeArtifact and how the guardrails can be put in. Let's
see some screenshots of how these things actually look in the AWS console.
This is where you can see the provisioning of the products using
AWS Service Catalog. When you create a Service Catalog
product, as an application team member this is how I will
be looking at it. You can see there is a SageMaker Studio
user, there is a Studio, there is a notebook, and then there is a data
science environment. As an application team member,
I can go ahead and click on the data science environment and
provision it, and you can see that the provisioning is happening
and the data science environment is coming up.
Once it has come up, you will be able to see the VPC
which has been created as part of your data science environment. If you
want to create multiple SageMaker notebooks, just click on the SageMaker notebook
product that you see on row number three, and then you will be able to
provision the SageMaker notebook as well. So far
we have seen the preventative controls. Now we will be looking
at the detective controls. You can make use of AWS Config
in order to enforce the detective controls. These detective controls
would be implemented by using the existing
rules which are available in AWS Config. And once
you enable it, you can see the non-compliant resources, as you see
in the screenshot, where the default security group
is not closed or the SageMaker endpoint configuration is
not yet compliant. These kinds of controls are
applied by using AWS Config.
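As one small example of such a detective control, the sketch below uses boto3 to enable an AWS managed Config rule that flags notebooks with direct Internet access. This is just one of the rules you might choose, and it assumes an AWS Config recorder is already set up in the account.

```python
import boto3

config = boto3.client("config")

# Managed rule: flag SageMaker notebook instances that allow direct Internet access
config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "sagemaker-notebook-no-direct-internet-access",
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "SAGEMAKER_NOTEBOOK_NO_DIRECT_INTERNET_ACCESS",
        },
    }
)

# Later, list any resources the rule marks as non-compliant
results = config.get_compliance_details_by_config_rule(
    ConfigRuleName="sagemaker-notebook-no-direct-internet-access",
    ComplianceTypes=["NON_COMPLIANT"],
)
for r in results["EvaluationResults"]:
    print(r["EvaluationResultIdentifier"]["EvaluationResultQualifier"]["ResourceId"])
```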
The next slide is about centralized governance,
again using CodeArtifact. Here you can see there
is a central IT PyPI mirror, and you can see that
it's connected to the public repository and would be continuously
downloading the packages that you need. The central
IT team or the platform team will be able to control which
packages are part of it. And as an application team
member, if this is residing in your shared services account,
you can just connect to it and download the
required dependencies. That's where the centralized governance
of the pip dependencies comes into the picture. We have spoken
about so many different aspects of governance and monitoring,
and also on the Service Catalog side; now let's see how this
would look if you want to build an entire end-to-end
pipeline for provisioning the artifacts.
So we have what are called the AWS Service Catalog tools, which
allow you to build this sort of pipeline.
Using checks like cfn-nag and CloudFormation RSpec,
you can validate the CloudFormation templates which are sitting in your Git repository,
and from there the pipeline can go ahead and provision these
products into the accounts that you need, or share these products with
the accounts that you need. I have a link towards the end of
this talk where you can go ahead and play around with the AWS
Service Catalog tools, that is, Service Catalog Factory and Service
Catalog Puppet, not to be confused with the open source Puppet configuration
tool. This is again open source tooling from
AWS under AWS Labs, and you should be able to
see the Service Catalog tools link at the end of this
talk. By using the Service Catalog tools, you can create these
end-to-end pipelines, and these pipelines will be responsible
for taking your CloudFormation template from the Git
repository where you have it and converting it into a product
which can be shared with multiple accounts. And then those
accounts will have a similar view of how the application
team and the Service Catalog operate.
Let me move on to the next slide now.
the next slide now. So this is how you will be
writing a Jupyter notebook. And here you can see that
when you are creating a Jupyter notebook, you are making use of
the session, which is a sagemaker session, and you are passing the
boto three session in here. By passing the
boto three session, you are allowing the sagemaker
session to Piggybank on the previous boto three calls.
And the clients like sagemaker client and the Sagemaker runtime
client can be reused by the sagemaker session for executing
or executing your code like the estimator or deploying
your model, et cetera. This is an example which you
can get from the Sagemaker notebook, which is on the video games
Xgboost algorithm, and it will allow you to
just run through this example and see how you can run the notebook
and it's available from Amazon. In the next slide
you would see the guardrails which we are putting in. So the service
control policies we had mentioned that without providing a volume
kms key and an output kms key, you would not be able
to run an estimator. And the same applies for
enable network isolation and intercontainer traffic.
So you can see that these four attributes have been passed here,
along with the subnets and the security groups.
That's the enforcement that you're doing or the guardrails that you're applying.
And without these guardrails, it is possible to run
the estimator and it is able to train and deploy a model. But that's
not the point here, right? We are trying to enforce certain guardrails,
especially when it comes to building a secure machine learning environment for
regulated customers. And in that case, you want to run
everything within a VPC. So the first architecture diagram
that we saw where everything was running in a private VPC.
We are enforcing it here by using the subnets,
by using the security group ids, by using the volume
KMS key, the output KMS key, the encryption intercontainer
traffic being set as true, and finally the enable network isolation
being set as true. With all these four or five factors
which have been added to the existing estimator, you are ensuring
that the sagemaker job which is being run, which is being trained,
and the model which is being deployed. The second line that you see where
the model is again having a KMS key ARN being
passed around, all these things are encrypted as per the standards
that you would be having within that organization. So that's
the whole point of building these environments.
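Pulling those pieces together, a hedged sketch of such a notebook cell might look like this, using the SageMaker Python SDK. The image URI, role, bucket, key ARNs, subnets and security groups are hypothetical placeholders standing in for the values from your own environment.

```python
import boto3
import sagemaker
from sagemaker.estimator import Estimator

# Reuse existing boto3 clients so all calls go through the private VPC endpoints
boto_session = boto3.Session(region_name="us-east-1")
session = sagemaker.Session(
    boto_session=boto_session,
    sagemaker_client=boto_session.client("sagemaker"),
    sagemaker_runtime_client=boto_session.client("sagemaker-runtime"),
)

kms_key_arn = "arn:aws:kms:us-east-1:111122223333:key/example"   # customer managed key

estimator = Estimator(
    image_uri="<xgboost-image-uri>",                              # hypothetical container image
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://doc-example-bucket/output/",
    sagemaker_session=session,
    subnets=["subnet-aaa", "subnet-bbb"],                         # private subnets only
    security_group_ids=["sg-0123456789abcdef0"],
    volume_kms_key=kms_key_arn,                                   # required by the SCP
    output_kms_key=kms_key_arn,                                   # required by the SCP
    encrypt_inter_container_traffic=True,                         # required by the SCP
    enable_network_isolation=True,                                # required by the SCP
)

estimator.fit({"train": "s3://doc-example-bucket/train/"})
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    kms_key=kms_key_arn,   # encrypt the endpoint's storage volume
)
```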
In terms of whatever we have discussed before, we are applying it in
the code and enforcing it, because if, as an application team
member, I decide to mark the network isolation as false,
my estimator is not going to get deployed. It will give me an error, because
I'm not allowing someone to run a processing job without ensuring
network isolation. So that's the advantage of making use of these
guardrails and the service control policies, and also
the Service Catalog products, where you are able to enforce this for the different
products. Finally, it comes to monitoring, monitoring the deployed
models. How would you monitor them? Here is an
example of a model which has been deployed, and there have been 45
invocations of the model. So the model has an endpoint.
has been deployed with XGB deploy.
It has been deployed on an ML M five X large instance.
And there is a KMS key which is being used for encrypting it.
In terms of the monitoring side, we have 45 invocations which
have happened on the model. And this is where we are using cloud watch.
There are no errors, no 500 errors, no 400 errors.
And then you can also look at the model latency and the overhead latency.
Now what's a model latency? That's the interval time taken by
the model to respond to a request. And that's
just from the viewpoint of sagemaker. So it would include the local
communication that is happening to send the request
and then to fetch the response and the overhead latency,
that's the interval which is measured from the time sagemaker receives
the request. So one is the model latency, which is the model invocation
and the response time itself. And then there is the overhead latency.
There are more dashboards which you can obviously building based on
the metrics which are exposed in Cloudwatch. This is just an example
on how you can leverage Cloudwatch for doing this.
You can also look at the resource metrics like the cpu
utilization of the model, the memory utilization and the disk utilization.
So it gives you a very good visibility into the model itself.
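If you want to pull the same numbers programmatically instead of reading them from the console dashboard, a sketch like the one below reads the endpoint invocation and latency metrics from CloudWatch; the endpoint and variant names are hypothetical.

```python
from datetime import datetime, timedelta

import boto3

cw = boto3.client("cloudwatch")
dimensions = [
    {"Name": "EndpointName", "Value": "xgboost-video-games-endpoint"},  # hypothetical
    {"Name": "VariantName", "Value": "AllTraffic"},
]

# Invocations are summed; latencies are averaged over each period
for metric, stat in [("Invocations", "Sum"), ("ModelLatency", "Average"),
                     ("OverheadLatency", "Average")]:
    stats = cw.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=[stat],
    )
    print(metric, [dp[stat] for dp in stats["Datapoints"]])
```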
Along with this, the flow logs which we were sending earlier into
a central security bucket can be leveraged
to look at the network traffic. That again gives you visibility into
what is happening in the network. And finally, you have
CloudTrail, which records every SageMaker API call.
The best practice would be to monitor these CloudTrail events
as well. So to conclude,
what did we learn? We are using a multi-account
structure to improve the security and segregation of responsibilities.
We are using SCPs and IAM policies to set
up the preventative guardrails. We are leveraging AWS
Config for the detective controls. And finally, we are
giving the application teams autonomy via self service products
which are shared through AWS Service Catalog. Using
a combination of all these different features which are there
on AWS gives you the capability to
build a secure machine learning environment for
a regulated customer. And that's the whole objective,
I would say, of this talk, where I wanted to go through the best practices
which can be applied when it comes to running SageMaker,
which is managed compute, and it gives you
the capability of having all these different controls put
in place. It gives you the capability of running your machine learning models
at scale, and with the above mentioned
security practices, you can ensure that
your workloads are running in a safe manner.
And this last slide is basically the references that I have been
talking about. You can go into the SageMaker workshop and have
a look at how Service Catalog has been used and how the controls
have been put in. There are examples on GitHub
for SageMaker, and finally the Service Catalog tools workshop
as well, which gives you that centralized pipeline on CodePipeline and
shows how you can share the product with different teams. And do have a look at
the white paper as well, on machine learning in financial services on AWS,
which talks about how you can secure the data and what the best
practices are. With that being said, that brings me to the close
of my talk. Thank you so much for your time and take
care.