Conf42 Cloud Native 2024 - Online

Cloud agnostic & multi tenant application challenges and solutions

Abstract

Building a cloud-agnostic multi-tenant Software as a Service (SaaS) application poses several challenges, given the diverse nature of cloud platforms and the need to cater to different tenants. In this session, I’ll share lessons learned after building a few cloud-agnostic multi-tenant SaaS applications.

Summary

  • Today's topic is cloud agnostic, multi-tenant SaaS application challenges and solutions. Cloud agnostic refers to a cloud design strategy in which applications, tools and services are designed to migrate seamlessly between multiple cloud platforms. Building a cloud strategy that meets the unique needs of your company isn't as simple as spinning up a few workloads.
  • Cloud-based technology is becoming an increasingly popular choice for businesses around the world. By 2028, cloud computing will shift from being a technology disruptor to becoming a necessary component for maintaining business competitiveness. Why did businesses move, and why are they still moving, to the cloud?
  • One of the advantages of being cloud agnostic is avoiding the risk of vendor lock-in. It also provides flexibility, because developers are not restricted to one cloud platform. There are disadvantages as well, such as the complexity of developing cloud-agnostic applications and features.
  • Stateful applications require persistent data storage, which can be difficult to manage in a containerized environment. Other challenges relate to failover, resilience and latency. Monitoring stateful applications on Kubernetes can also be a challenge.
  • SaaS means software as a service. Why are companies moving to or asking for SaaS solutions? Because SaaS provides a greater emphasis on the customer experience, rapid response to customer feedback, and active customer engagement.
  • We need to think about SaaS and multi-tenancy at every layer. For example, on the front end, how are we authenticating the users that come to access our system? All of this runs on some infrastructure, so we also need to take care of infrastructure provisioning, isolation and maintenance.
  • Tenant isolation can follow a silo model, where every tenant gets their own environment. A second possible approach is to isolate tenants within a single environment using runtime policies. Data partitioning is not the same thing as tenant isolation.
  • I started my talk with cloud agnosticism: why it is important and how to address its challenges. Then we moved into the multi-tenant SaaS part: why and how to create a multi-tenant SaaS app. That's all for today's talk, thanks for listening, have a good day.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, happy to present this topic at Conf42, and thanks for taking the time to attend. Today's topic is cloud agnostic, multi-tenant SaaS application challenges and solutions. My name is Abhay Dutt Paroha. I have around 16 years of experience in building software, and for the last three years I have been managing engineering teams where we work on real-time upstream operational data ingestion and delivery. I work with Schlumberger, the world's number one company in upstream oilfield services, and I have around 14 years of experience in various aspects of the oilfield, where we deal with operational and technical data.

Okay, so jumping to today's topic, I would like to start with the cloud agnostic part. So what is cloud agnostic? Building a cloud strategy that meets the unique needs of your company isn't as simple as spinning up a few workloads in the cloud. Today's sophisticated cloud deployments follow unique design patterns, such as a cloud-agnostic strategy, to meet a variety of unique business and technical requirements. Cloud agnostic refers to a cloud design strategy in which applications, tools and services are designed to migrate seamlessly between multiple cloud platforms, or between on-premises and cloud in a hybrid model, without disruption of services. Some of the advantages of a cloud-agnostic approach are that it supports seamless portability independent of the underlying operating system, ensures limited disruption of workloads during migration, and limits the risk of application downtime while enhancing cost efficiency.

Before going deeper into the cloud agnostic part, let's start with why we even started with the cloud. What was so special about the cloud? Cloud is not a new word. Cloud-based technology is becoming an increasingly popular choice for businesses around the world, because the cloud can help businesses streamline their processes and operations, allowing them to focus more on their core business. As per a Gartner report, by 2028 cloud computing will shift from being a technology disruptor to becoming a necessary component for maintaining business competitiveness. And why did businesses move, and why are they still moving, to the cloud? Because there are several benefits; let's quickly check the significant ones.

The first one is the pay-as-you-go model, which offers a tailored and cost-effective software solution. Cloud services and resources are cost effective because businesses only pay for the services they use and can reduce costs when demand decreases, without worrying about wasted hardware investment. Moving from the capital-heavy expense of installing, maintaining and upgrading on-premises IT infrastructure to the operational cost of a SaaS subscription provides greater clarity on the cost of using a software solution in terms of licenses, maintenance and infrastructure. The second obvious benefit is scalability. Cloud resources give businesses the flexibility to adjust to changing needs: companies can quickly scale up or down depending on their current requirements. This is especially beneficial for businesses that have unpredictable peaks and troughs in their demand, as they don't have to invest heavily in infrastructure that becomes redundant during the quieter periods. Another benefit is automated backups and disaster recovery. In general, cloud solutions include automated backups.
Out of the box, cloud vendors can perform daily, weekly and monthly backups so that you are sure your data is in safe hands. In addition to backups, cloud vendors can offer advanced disaster recovery programs to protect you from unexpected disruption: production data is synchronized regularly to a secure server in a remote location, and in the event of a disaster, the production server is restored from the latest backup on the remote server. Other benefits include up-to-date software upgrades. Cloud systems provide higher uptime compared to on-premises systems, help reduce technology complexity and rely on an enhanced and secure cloud infrastructure. There's also no need to plan for costly IT upgrades; it eliminates the hassle of managing upgrades or any other IT expansion as your business grows, and it ensures that your software solution is always up to date by letting the software vendor manage upgrades. One last advantage is enhanced data security. Cloud solutions can be even more secure than on-premises solutions, enabling you to store your strategic data on a secure infrastructure. Cloud providers are large companies with high technical expertise, and they can hire certified professionals as well. They comply with many international regulations and use the most recent security standards, and they run powerful cybersecurity software to prevent attacks and protect your data. So as we can see, there are lots of benefits of moving to the cloud.

Now let's talk about the pros and cons of being cloud agnostic. We discussed the benefits of the cloud, but why cloud agnostic? Ever since public clouds were introduced, organizations have increasingly adopted the great features that cloud solutions provide. Almost infinite scalability, cost efficiency, reduced management overhead and high flexibility are just some of the features that public clouds provide, and all of these can be used to gain an advantage over competitors. But as the IT industry goes, there has always been one thing that never seems to have changed over the years, and that is vendor lock-in. One of the advantages of being cloud agnostic is avoiding the risk of vendor lock-in. But what is this vendor lock-in? Vendor lock-in has been present in many forms ever since the first commercial software was introduced. Cloud providers practice vendor lock-in as well: they implement their infrastructure in such a way as to make it more difficult to migrate to their competitors. Software companies deploying their software on public cloud infrastructure such as Microsoft Azure, Google Cloud and AWS must keep that in mind. After all, nothing is future proof, and even the biggest companies can fail, significantly raise their prices, change licensing, or do pretty much anything that can make life more difficult. To avoid this, the concept of the cloud-agnostic application was introduced.

What are the other benefits? Performance: a wide range of features and options that customers can use to maximize performance. It provides flexibility as well, because developers are not restricted to one cloud platform's capabilities or tooling and can incorporate open source tools and libraries. It also helps increase application resiliency, provides redundancy and improves recovery speed in the event of a failure, and services can be switched to another platform if the initial platform experiences some kind of downtime. But there is no free lunch; there are some disadvantages as well.
So implementing or designing your application or services with a cloud-agnostic approach is not easy. It is challenging, because cloud agnosticism has to happen at the developer level, which makes implementation difficult. Another challenge is time to market: the complexity of developing cloud-agnostic applications and features means that it can take longer for projects to get off the ground.

How do we design a cloud-agnostic architecture? Simple answer: use Kubernetes and you are done. We can build our services as containerized workloads, our friend Docker is available, and we can deploy containers to Kubernetes, which is offered as a managed service by the major cloud providers in the form of AKS, EKS or GKE from Azure, AWS and Google Cloud. Whenever we want to add a new capability, we just add a new container, and we can switch from one public cloud to another because Kubernetes is available everywhere. So in this way we can easily design a cloud-agnostic architecture. But is it really that simple? Let's take an example. If I want to run some kind of messaging, let's say RabbitMQ or Kafka, just add a new container. I want some cache? Run another container. If I want Redis with high availability, we can run our Redis cluster in containers. We want some DBMS? Okay, fine, you can run PostgreSQL in containers. You need some object storage like AWS S3? We have a solution: MinIO in containers. You need monitoring? We can add more containers for Elasticsearch, Logstash and Kibana. So what are we doing exactly? We are adding more and more containers to our solution.

Let's take a step back. Why did we move to the cloud? One reason for companies to move to the cloud is to reduce engineering effort, but when we are adding more and more containers, we are basically increasing the engineering effort spent maintaining these containers instead of focusing on providing solutions or features to our customers. Using a SaaS database, a SaaS message broker or a SaaS Kubernetes is great for various reasons: it reduces our operational effort, the vendor takes care of patching and updating, and we can focus on our product instead of spending internal engineering effort on how to maintain infrastructure. We can move faster and more efficiently because the provider scales up and down, and new products can be used by simply calling the cloud vendor's APIs. But if we examine this cloud-agnostic approach, the implication is that we are unintentionally building a custom data center instead of leveraging the cloud provider's capabilities. Instead of using the SaaS capabilities offered by the cloud, we are creating an often worse data center or infrastructure, so we are increasing the engineering effort. As you can see, once these components are deployed, we need to patch and maintain them, and our engineering team stays busy with this stuff.

Let's talk about some other challenges that come with running stateful applications on Kubernetes. So what is a stateful application? Whenever we are running any database, cache or messaging on Kubernetes, Kubernetes provides a workload API called the StatefulSet API. So what is the StatefulSet API? A StatefulSet is a workload API object used to manage stateful applications. It manages the deployment and scaling of a set of pods and provides guarantees about the ordering and uniqueness of these pods. A StatefulSet maintains a sticky identity for each of its pods. These pods are created from the same spec, but are not interchangeable.
Each has a persistent identifier that it maintains across any rescheduling. So if you want to use storage volumes from the cloud to provide persistence for your workload, you can use a StatefulSet as part of your solution. Although individual pods in a StatefulSet are susceptible to failure, the persistent pod identifiers make it easier to match existing volumes to the new pods that replace any that have failed. But it's not easy to run your stateful application with StatefulSets, because there are several challenges.

One of the primary challenges of running a stateful application on Kubernetes is managing persistent data storage. A traditional stateless application can simply be replicated across multiple pods, but a stateful application also requires persistent data storage, which can be difficult to manage in a containerized environment. Kubernetes provides several options for data storage, including local storage, network-attached storage and cloud storage, but choosing the right storage solution can be challenging. Another challenge is networking. Because stateful applications typically require communication between nodes, it's important to ensure that the networking infrastructure is designed to support this. Kubernetes provides several networking solutions, including container networking, pod networking and service networking, but configuring these options correctly can be complex. Security is another key challenge for stateful apps. Because stateful apps often store sensitive data, it is important to ensure that the container environment is secure. Kubernetes provides several security features like role-based access control, pod security policies and network policies, but properly configuring these features can be complex. Finally, monitoring stateful applications on Kubernetes can be a challenge. Because stateful applications require persistent data storage, it is important to monitor the health and performance of the data storage system. Kubernetes works with several monitoring tools like Prometheus, Grafana and the Kubernetes dashboard, but configuring these tools to monitor stateful applications takes effort (I'll show a small sketch of such a check a little further below).

There are other challenges as well. Different cloud providers offer different capabilities. Sometimes we can compare these capabilities, but most of the time we cannot. For example, looking at the distribution of data centers, the global cloud does not seem so global after all. If we are building a system for a bank, then we have to meet GDPR regulatory requirements, which means we are not free to use any capability worldwide, so building an architecture around the available data centers is a leaky abstraction. Other challenges are related to failover, resilience and latency; it all depends on the location of the data center. If one provider offers fewer locations than another, then we are locked in. We need to be aware of this fact and consider the impact when moving from one cloud to another. If we require special hardware or dedicated servers, we will find out pretty quickly that limitless scale may be a problem too. Other problems are related to networking: unlike AWS and Azure, Google Cloud provides virtual private cloud resources that are not tied to any specific region; the VPC is a global resource. And we also need to think in terms of data cost, because ingress is free but egress can be expensive, and different cloud providers have their own egress pricing policies.
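To make the monitoring point a little more concrete, here is a minimal sketch, not from the talk, that uses the official Kubernetes Python client to check whether every StatefulSet in a namespace has all of its replicas ready. The namespace name is a placeholder; in practice you would export such signals to Prometheus or Grafana rather than print them.

```python
# A minimal sketch (not from the talk) of a StatefulSet health check using the
# official Kubernetes Python client; assumes a reachable cluster and kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a pod
apps = client.AppsV1Api()

# "data" is a hypothetical namespace where stateful workloads might live.
for sts in apps.list_namespaced_stateful_set(namespace="data").items:
    desired = sts.spec.replicas or 0
    ready = sts.status.ready_replicas or 0
    state = "OK" if ready == desired else "DEGRADED"
    print(f"{sts.metadata.name}: {ready}/{desired} replicas ready -> {state}")
```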
Last but not least, infrastructure as code. It is considered good practice to automate infrastructure environments using tools like Terraform or the AWS CDK. This helps reduce configuration drift and increases the overall quality of the infrastructure. However, the capabilities of the underlying cloud provider tend to get baked into the infrastructure code. Moving infrastructure code from GCP to Azure effectively means rewriting everything. Sure, the concepts of the high-level architecture may be similar, but for the code, this is similar to moving an application from Java to Golang. In Terraform, switching from GCP to Azure means throwing everything away.

Okay, so let's talk about some solutions, the kind of approach that, in my experience, worked well for me. It's a well-known design principle, what we call a facade. A facade is basically a structural design pattern that provides a simplified interface to a library, a framework or any other complex set of classes. We can apply the same facade principle while designing a cloud-agnostic application. As you can see, we have this blue box with the cloud-agnostic microservice, which talks to a messaging facade, and the messaging facade talks to a specific cloud provider adapter: in GCP we want to connect to Cloud Pub/Sub, in Azure to Azure Service Bus, and in AWS we want to use SNS. So whenever we change cloud providers, we just need to write another adapter and our service stays intact; only the adapter behind the messaging facade changes. As long as we have this flexibility, where we are not touching the cloud-agnostic microservice code, we can always switch easily from one cloud provider to another (a minimal sketch of this facade idea appears a little further below). The other thing that worked well: we need to identify the areas where lock-in must be kept to a minimum, and we should only focus on using products that have corresponding counterparts on other platforms. If you are choosing a relational database, GCP has Cloud SQL and Azure has Azure Database for PostgreSQL; for the runtime we have public Kubernetes offerings like GKE or EKS; for serverless we can use Knative; for a time series style database we can use GCP Bigtable or AWS DynamoDB. So what I'm trying to say is: use the cloud provider's managed capabilities whenever you are running any stateful workload for a database or messaging, and at the application level, wherever you are writing a new service, use the facade pattern. After following these two approaches, I was able to solve a lot of the problems we had when we were not using the cloud-agnostic approach.

Now I am coming to the second part of my talk, which is SaaS. SaaS means software as a service. But why SaaS? Why are companies moving to or asking for SaaS solutions? Because it provides a greater emphasis on the customer experience, rapid response to customer feedback, active customer engagement and a higher value on operational efficiency. And there are some fundamentals of any SaaS application; as you can see on my screen, I'm showing a few gray boxes, so let me quickly go through them. The first important part of any SaaS application is onboarding: how the tenant is introduced into your environment, how the infrastructure is provisioned and how tier selection happens for billing. The next important thing is authentication and authorization, to associate these onboarded tenants with some notion of identity. Another important part is how to easily flow the tenant context across the moving parts of our complex system.
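Coming back to the messaging facade described a moment ago, here is a minimal sketch of the idea in Python. The class and method names are illustrative and not from the talk; the point is simply that the business logic depends only on the facade, and each cloud gets its own adapter behind it.

```python
# Illustrative sketch of the facade idea: service code depends only on
# MessagePublisher; each cloud provider gets its own adapter behind it.
from abc import ABC, abstractmethod


class MessagePublisher(ABC):
    """Messaging facade used by the cloud-agnostic microservice."""

    @abstractmethod
    def publish(self, topic: str, payload: bytes) -> None: ...


class GcpPubSubPublisher(MessagePublisher):
    def publish(self, topic: str, payload: bytes) -> None:
        raise NotImplementedError  # would call google-cloud-pubsub here


class AzureServiceBusPublisher(MessagePublisher):
    def publish(self, topic: str, payload: bytes) -> None:
        raise NotImplementedError  # would call azure-servicebus here


class AwsSnsPublisher(MessagePublisher):
    def publish(self, topic: str, payload: bytes) -> None:
        raise NotImplementedError  # would call boto3 SNS here


def notify_order_created(publisher: MessagePublisher, order_id: str) -> None:
    # Business logic never imports a cloud SDK; switching clouds means
    # wiring in a different adapter at startup.
    publisher.publish("orders", order_id.encode())
```

Switching clouds then means instantiating a different adapter at startup, while the cloud-agnostic microservice code itself stays untouched.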
Two more important concerns follow from this. One is tenant isolation: how does your architecture ensure that one tenant can't access the resources of another tenant? The other is how you instrument your application to meter tenant activity so that you can bill each tenant correctly. And finally, we need to be aware of tenant operations.

Before going deeper into the SaaS part, let's first discuss the multi-tenant impact on our microservice application. We need to think about SaaS, or multi-tenancy, at every layer. For example, on the front end we need to think about how we are authenticating the users that come to access our system, and we need to gather the information needed to route these requests to the proper backend resources. Because it's SaaS, we will be running a single version of the software for all customers, but if some client needs a different experience, we need to support feature flags to turn some things on or off. When you go to the API gateway, you need to make decisions about authorization, throttling and caching. Then we hit our business logic, our microservices, where we need to gather data based on some tenant context and add metrics, logging and metering for that particular tenant. These microservices talk to some data persistence or data access layer, so we need to think about how we partition our data resources so that we know where the data for a specific tenant is located. Of course, all of this runs on some infrastructure, so we need to take care of infrastructure provisioning, isolation and maintenance, and the overall tenant lifecycle.

Let's take the example of a normal microservice flow. What happens in a normal microservice flow? Let's say there is a client accessing our app using a mobile device, or it could be a computer as well. That person does a login operation through the UI, and the UI talks to some identity provider, which could be an OIDC flow, to get a bearer token. The UI application then passes that bearer token and goes through some API gateway, optionally talking to the identity provider again to validate the token and do the authorization part. If everything works well, it finally calls a microservice, and the microservice makes a call to some database. So this is the typical flow of any microservice app.

Going deeper into how this SaaS aspect impacts our microservice application, let's quickly check what non-SaaS microservice code looks like. On the right-hand side I am showing a simple API; it's not working code, it's for demonstration purposes only. We create a DynamoDB instance, there is a table name, and we have a get data API. We get the key from the query parameters coming into the API, then we reference a table in DynamoDB, get the response, and at the end we do some exception handling. But one important thing is missing in this example. If we want to write this API to support a multi-tenant SaaS workflow, here is the SaaS version of the code. What is the difference from the previous code? We get a tenant id from the request header (the code highlighted in yellow), and the rest of the code looks the same.
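The slide code itself isn't reproduced in the transcript, so here is a hedged reconstruction of what such a multi-tenant handler might look like, using Flask and boto3 purely for illustration. The x-tenant-id header, the table name and the key schema are assumptions, not the speaker's actual code.

```python
# Hedged reconstruction of the multi-tenant "get data" API described above.
# Flask, boto3, the x-tenant-id header and the table/key names are assumptions,
# and AWS credentials/region are expected to be configured in the environment.
import boto3
from flask import Flask, jsonify, request

app = Flask(__name__)
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("app-data")  # hypothetical table name


@app.route("/data")
def get_data():
    tenant_id = request.headers.get("x-tenant-id")  # the SaaS-specific part
    key = request.args.get("key")                   # key from the query string
    if not tenant_id or not key:
        return jsonify({"error": "missing tenant id or key"}), 400
    try:
        # The partition key includes the tenant id, so one tenant's request
        # can never read another tenant's rows.
        item = table.get_item(Key={"tenant_id": tenant_id, "key": key}).get("Item")
        return jsonify(item or {}), 200 if item else 404
    except Exception as exc:  # the "exception handling" step from the slide
        return jsonify({"error": str(exc)}), 500
```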
But why does this tenant id matter? Because, as I talked about earlier, we need to take care of tenant isolation, data partitioning, metrics and billing. So we need to know the tenant's id, and we need to flow that tenant id, as tenant context, across all services. Otherwise there is no way to know how to serve the request for a specific tenant or how to charge the bill to that tenant. But how do we get this tenant id, and how does our typical microservice workflow change in a multi-tenant SaaS workflow? Let's go back to the workflow and talk about how we provision or onboard our tenants. We have the same example with a mobile client, but now the important part is that we need to provision our tenants first, so that we can flow these tenant ids as tenant context across our app. There could be a tenant provisioning service, which may call a user management service to create users and claims by talking to some identity provider or OIDC workflow, and we can also apply access policies, for example granting rights like admin or other kinds of tenant policies. Once these things are done, we can optionally call a tenant management service, where we create a tenant id, and from this tenant management service we can get the tenant id, tier plan or status of the tenant. We can use these things further downstream in our other microservices, and we can use the same tenant context for the metering and billing services as well.

But the question is how to flow this tenant context. There are several ways to do it. The most used option is the JWT token, also called a JSON web token. What is a JWT token? It is basically a base64-encoded string split into three parts: header, payload and signature. The header gives us the type and hashing algorithm. The payload is a list of key-value pairs, sometimes called claims, that tell us about the token: who issued it, when it is going to expire and the intended audience. We can add our own key-value pairs to these tokens, and we will use these claims to carry the tenant id. The last part is the signature, a combination of the encoded header, the payload and a secret, and we can use it to ask the identity provider to confirm that no one has modified the token. Another way is to pass the tenant id in the URL query string parameters. A third option is to pass it in the request headers as some custom header like x-tenant-id. The last option is to create a separate microservice, so whenever a request comes to our business logic, we always make a call to that other service to get the tenant id. But there is one downside to this approach: we are creating a single point of failure by depending on one service. If that service is not available, we don't have any tenant information and our workflow will not work.

Another important thing to keep in mind whenever we are building a multi-tenant SaaS application is that we can move common code into shared libraries, for example for the repeating task where we always need to get the tenant id from the JWT token. The steps involved are: read the Authorization header from the request, get the bearer token and its claims, and then get the tenant id. This is a common piece of code that we need to execute with each API call.
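As a rough illustration of that shared helper, here is a minimal sketch using the PyJWT library. The tenant_id claim name, the HS256 algorithm and the shared signing key are assumptions for illustration; a real OIDC setup would typically verify an RS256 token against the identity provider's published keys.

```python
# Hedged sketch of the shared helper described above: Authorization header ->
# bearer token -> verified claims -> tenant id. The "tenant_id" claim name,
# the HS256 algorithm and the shared key are assumptions for illustration.
import jwt  # PyJWT


def tenant_id_from_request(headers: dict, signing_key: str) -> str:
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        raise ValueError("missing bearer token")
    token = auth.removeprefix("Bearer ")
    # A real OIDC setup would usually verify an RS256 token against the
    # identity provider's published JWKS instead of a shared secret.
    claims = jwt.decode(token, signing_key, algorithms=["HS256"])
    tenant_id = claims.get("tenant_id")  # custom claim added when the token is issued
    if not tenant_id:
        raise ValueError("token carries no tenant_id claim")
    return tenant_id
```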
So it's good to have this code in a library, with other services using it from that shared library. Another important thing is to capture this tenant context in the logging: we can use common libraries for structured logging and capture the tenant context in the logs themselves. Last but not least, we need to flow the tenant context into the metrics as well. Here I'm giving the example of OpenTelemetry: as you can see on the left-hand side there are microservices and shared infrastructure, and we run an OpenTelemetry collector to do the instrumentation; we can easily flow these metrics to a time series database, a tracing backend or a column store. In this way we can capture the metrics with the tenant id and context attached.

We also need to talk about how we are going to partition our data. There are different options available. One option is to go with the silo approach, where we create a separate database for each tenant. A second option is the pool-based model, where the database is shared between all tenants: we have a single schema and one indexed column with the tenant id, and all queries go through that index before we return the response to our customer. The last option is the bridge approach, where we have a single database but multiple schemas. In my personal experience I used this approach, because it helped me get some isolation in place without sharing data from one tenant with another. But in the end it's a per-microservice decision: we need to think in terms of compliance and security, what the client wants, and how it will impact performance and data distribution, because there are always trade-offs. If we go with the pool-based or bridge-based approach, it may cause a noisy neighbor problem, where one tenant is more active than the others and takes all of your IOPS or CPU.

Let's quickly talk about the tenant isolation part. What is tenant isolation? One thing I would like to emphasize here: data partitioning is not the same thing as tenant isolation. For tenant isolation we again have two approaches. One is the silo model, where every tenant gets their own environment; for example, when we run a Kubernetes cluster and deploy our app, tenant one gets its own environment and tenant two gets its own. The second possible approach is to isolate tenants within a single environment using runtime policies. How does that work? Let's take an example. A client logs in to our application using a mobile device, goes through the OIDC workflow to get a bearer token, passes through the API gateway and hits our microservice, and now we reach our data access layer where we actually fetch data. Before making the call to the database with tenant-scoped access, we go and fetch tenant-scoped credentials from a tenant access manager service. It returns the tenant policies, and using those policies we make the call to the database. This helps us achieve tenant isolation using the runtime policy mechanism.
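The talk describes a generic tenant access manager that hands out tenant-scoped credentials. As one possible AWS-flavoured sketch of that idea (an assumption, not the speaker's implementation), an STS role can be assumed with a session policy that only allows DynamoDB items whose partition key matches the current tenant. The role ARN, table name and key layout below are placeholders.

```python
# Hedged sketch of runtime-policy isolation on AWS: assume a role with a session
# policy that restricts DynamoDB access to the caller's tenant id via the
# dynamodb:LeadingKeys condition. Role ARN and table name are assumptions.
import json
import boto3


def tenant_scoped_dynamodb(tenant_id: str, role_arn: str):
    session_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:Query"],
            "Resource": "arn:aws:dynamodb:*:*:table/app-data",
            "Condition": {
                "ForAllValues:StringEquals": {"dynamodb:LeadingKeys": [tenant_id]}
            },
        }],
    }
    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn,
        RoleSessionName=f"tenant-{tenant_id}",
        Policy=json.dumps(session_policy),
    )["Credentials"]
    # Any query made through this resource can only touch this tenant's items.
    return boto3.resource(
        "dynamodb",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
```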
Okay, so now I am at the last part of my talk, the takeaways. I started with cloud agnosticism: why it is important and how to solve the challenges we face when using Kubernetes. One approach is to use the facade pattern: whenever we write a microservice, use the facade pattern and write our code in a cloud-agnostic manner, which is also called a loosely coupled architecture. The other thing is that we can focus on strategic lock-in: we can still use managed offerings from the public cloud wherever we can, and use Kubernetes only to run our services instead of running databases or messaging. Then we moved into the multi-tenant SaaS part: why and how to create a multi-tenant SaaS app. We covered the tenant lifecycle, how to flow the tenant context using a JWT token or request header, and at last we covered the different models of data partitioning and how to achieve tenant isolation. That's all for today's talk, thanks for listening, have a good day.
...

Abhay Dutt Paroha

Software Team Leader @ Schlumberger

Abhay Dutt Paroha's LinkedIn account


