Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, happy to present this topic at Conf42, and thanks for taking the time to attend.
Today's topic is cloud-agnostic multi-tenant SaaS application challenges and solutions.
My name is Abaidat Baroha. I have around 16 years of experience in building software, and for the last three years I have been managing engineering teams where we work on real-time upstream operational data ingestion and delivery. I work with ishrambajay, which is the world's number one company in upstream oil services. I have around 14 years of experience in various aspects of the oil field, where we deal with operational and technical data. Okay, so jumping on to today's topic, first I would like to start
with the cloud-agnostic part. So what is cloud agnostic? Building a cloud strategy that meets the unique needs of your company isn't as simple as spinning up a few workloads in the cloud. Today's sophisticated cloud deployments follow design patterns such as a cloud-agnostic strategy to meet a variety of unique business and technical requirements.
So what is cloud agnostic? Cloud agnostic
refers to a cloud design strategy in which applications,
tools and services are designed to migrate seamlessly
between multiple cloud platforms or between on premises
and cloud in a hybrid model without disruption of
services. Some of the advantages of a cloud-agnostic approach: it supports seamless portability independent of the underlying operating system, ensures limited disruption of workloads during migration, and limits the risk of application downtime while enhancing cost efficiency. Before even going deeper into
the cloud-agnostic part, let's start with why we even started with the cloud. I mean, what was so special about the cloud? Cloud is not a new word. Now,
cloud based technology is becoming an increasingly popular choice
for businesses around the world because cloud
can help businesses to streamline their processes and operations,
allowing them to focus more on their core
business. As per Gartner report, by 2028,
cloud computing will shift from being a technology disruptor
to becoming a necessary component for maintaining business
competitiveness. And why did businesses move, or why are they still moving, to the cloud? Because there are several benefits; let's quickly go and check the significant ones. The first one is the pay-as-you-go model, which offers a tailored and cost-effective software solution. But how? Cloud services
and resources offer cost effective solutions as
businesses only pay for their services and can reduce costs
when demand decreases without worrying about wasted
hardware investment. Moving from the capital-heavy expense of installing, maintaining, and upgrading on-premises IT infrastructure to the operational cost of a SaaS subscription provides greater clarity on the cost of using a software solution in terms of license, maintenance, and infrastructure costs. The second obvious
benefit is scalability. Cloud resources can provide businesses
the flexibility to adjust to changing needs. With cloud
resources, companies can quickly scale up or down
depending on their current requirements. This is especially beneficial for businesses that have unpredictable peaks and troughs in their demand, as they don't have to invest heavily in infrastructure that becomes redundant during quieter periods. Another benefit is automated backups and disaster
recovery. In general, cloud solutions include
automated backups out of the box. Cloud vendors can perform daily, weekly, and monthly backups so that you are sure your data is in safe hands.
In addition to backups, cloud vendors can offer
advanced disaster recovery programs to protect you from
unexpected disruption. Production data is synchronized regularly to a secure server in
a remote location. In the event of a disaster,
the production server is updated with the latest backup of the
remote server. Other benefits include up-to-date software upgrades. Cloud systems provide higher uptime compared to on-premises systems, help reduce technology complexity, and rely on an enhanced and secure cloud infrastructure. There's also no need to plan for costly IT upgrades; it eliminates the hassle of managing upgrades or any other IT expansion as your business grows, and it ensures that your software solution is always up to date by letting the software vendor manage upgrades.
One last advantage is enhanced data security. Cloud solutions are even more secure than on-premises solutions, enabling you to store your strategic data on a secure infrastructure. Cloud providers are large companies with high technical expertise, and they can hire certified professionals as well. They comply with many international regulations and use the most recent security practices. Cloud providers also run powerful cybersecurity software to prevent attacks and protect your data. So as we can see, there are lots of benefits of moving to the cloud. Now let's
talk about the pros and cons of the cloud-agnostic approach. We discussed the benefits of cloud, but why cloud agnostic now? So ever since
public clouds were introduced, organizations have increasingly adopted the great features that cloud solutions provide.
Almost infinite scalability, cost efficiency,
reduced management overhead and high flexibility are just
some of the features that public cloud provides.
All these can be used to gain an advantage over competitors.
But as the IT industry goes, there has always been
one thing that seems never to have changed over the years.
That is vendor lock-in. One of the advantages of cloud agnostic is avoiding the risk of vendor lock-in. But what is this vendor lock-in?
So vendor lock in has been present in many forms ever since the
first commercial software was introduced. Cloud providers practice vendor lock-in as well. They implement their infrastructure in such a way as to make it more difficult to migrate to their competitors. Software companies deploying their software
on public cloud infrastructure such as Microsoft Azure,
Google Cloud and AWS must keep that in mind.
After all, nothing's future-proof, and even the biggest companies can fail, significantly raise their prices, change licenses, or do pretty much anything that can make life more difficult. To avoid this, the concept of the
cloud agnostic application was introduced. What are the other benefits? Performance: a wide range of features and options that customers can use to maximize performance. It provides flexibility as well, because developers are not restricted to one cloud platform's capabilities or tooling and can incorporate open source tools and libraries. It also helps increase application resiliency, providing redundancy and improving recovery speed in the event of a failure; services can be switched to another platform if the initial platform experiences some kind of downtime. But there is no
free lunch. There are some disadvantages as
well. Implementing or designing your applications or services with a cloud-agnostic approach is not easy. It is challenging because cloud-agnostic decisions must happen at the developer level, making implementation difficult. Another challenge is time to market: the complexity of developing cloud-agnostic applications and features means it can take longer for projects to get off the ground. How to design a cloud-agnostic
architecture? Simple answer. Use Kubernetes
and you are done. What we can do is build our services as containerized workloads. Our friend Docker is available, and we can deploy containers to Kubernetes, which is offered as a service by the major cloud providers as AKS, EKS, or GKE from Azure, AWS, and Google Cloud. Whenever we want to add a new capability, we just add a new container, and we can switch from one public cloud to another because Kubernetes is available everywhere. So in this way we can easily design a cloud-agnostic architecture. But is it really simple
that way? So let's
take one example. Okay? If I want to run some kind of messaging, let's say RabbitMQ or Kafka, just add a new container. I want some cache? Run another container. If I want Redis with high availability, we can run our Redis cluster in containers. We want some DBMS? Okay, fine, you can run PostgreSQL in containers. You need some object storage like AWS S3? We have a solution: MinIO in containers. You need monitoring? We can add more containers for Elasticsearch, Logstash, and Kibana. So what are we doing exactly? We are adding more and more containers to our solution. Let's take a step back.
Why did we move? One reason for companies to move to the cloud is to reduce engineering effort. When we are adding more and more containers, we are basically increasing the engineering effort spent maintaining these containers instead of focusing on providing solutions or features to our customers. Using a
SaaS database or a SaaS message broker or a SaaS
Kubernetes is great for various reasons, because it can reduce our operational effort and the vendor takes care of patching and updating. We can focus on our product instead of building internal engineering expertise in how to maintain, say, a load balancer. To help our business, we can move faster and more efficiently because the provider scales up and down for us.
New products can be used by triggering the cloud vendor APIs,
but if we examine this cloud-agnostic approach, the implication is that we are unintentionally building a custom data center instead of leveraging the cloud provider's capabilities. Instead of using the SaaS capabilities offered by the cloud, we are creating an often worse data center or infrastructure. So we are increasing the engineering effort. As you can see, once these components are deployed, we need to patch and maintain them, and our engineering team is always busy with this stuff.
Let's talk about some other challenges that come with running stateful applications on Kubernetes. So what is a stateful application? Whenever we are running any database, cache, or messaging on Kubernetes, Kubernetes provides a workload API called the StatefulSet API. So what is the StatefulSet API? A StatefulSet is a workload API object used to manage stateful applications. It manages the deployment and scaling of a set of pods and provides guarantees about the ordering and uniqueness of those pods. A StatefulSet maintains a sticky identity for each of its pods. These pods are created from the same spec but are not interchangeable.
Each has a persistent identifier that it maintains across any
rescheduling. So if you want to use
storage volumes from the cloud to provide the persistence
for your workload, you can use a stateful set as part of your
solution. Although individual pods in a StatefulSet are susceptible to failure, the persistent pod identifiers make it easier to match existing volumes to the new pods that replace any that have failed. But it's not easy to run your stateful application with a StatefulSet, because there are several challenges.
One of the primary challenges of running a stateful application on Kubernetes is managing persistent data storage.
A traditional stateless application can simply be replicated across multiple pods, but a stateful application also requires persistent data storage, which can be difficult to manage in a containerized environment. Kubernetes provides several options for data storage, including local storage, network-attached storage, and cloud storage, but choosing the right storage solution can be challenging. Another challenge
is about networking. Because stateful applications
typically require communication between nodes, it's important to ensure
that the networking infrastructure is designed to support this.
Kubernetes provides several networking solutions, including container networking, pod networking, and service networking, but configuring these options correctly can be complex.
Security is another key challenge for stateful apps.
Because stateful apps often store sensitive data,
it is important to ensure that the container environment is
secure. Kubernetes provides several security features like role-based access control, pod security policies, and network policies, but properly configuring these features can be complex. Finally, monitoring stateful applications on Kubernetes can be a challenge.
Because stateful applications require persistent data storage, it is
important to monitor the health and performance of the data storage system.
The Kubernetes ecosystem provides several monitoring tools like Prometheus, Grafana, and the Kubernetes dashboard, but configuring these tools to monitor stateful applications takes work. There are other challenges as well. Different cloud providers provide different capabilities. Sometimes we can compare these capabilities,
but most of the time not. For example,
looking at the distribution of data centers, the global cloud does not seem so global after all. For example, if we are building a system for a bank, then we have to meet GDPR regulatory requirements. That means we are not free to use any capability worldwide. So building an architecture around the available data centers is a leaky abstraction. Other challenges are related to
failover or resilience and latency.
It all depends on the location of the data center. If one provider offers fewer locations than another, then we are locked in.
We need to be aware of this fact and consider the impact
when moving from one cloud to another. If we require
special hardware or dedicated servers, we will find out pretty quickly that limitless scale may be a problem too.
Other problems are related to networking.
Unlike AWS and Azure, Google Cloud provides virtual private cloud resources that are not tied to any specific region. It is a global resource, and
we also need to think in terms of data cost, because ingress is free but egress can be expensive, and different cloud providers have their own egress policies.
Last but not least, infrastructure as code. It is considered good practice to automate infrastructure environments using tools like Terraform or the AWS CDK.
This helps reduce configuration drift and increases
the overall quality of the infrastructure.
However, the capabilities of the underlying cloud provider
tend to get baked into the infrastructure code.
Moving infrastructure code from GCP to Azure effectively
means rewriting everything. Sure, the concepts of the high level architecture
may be similar, but for the code, this is similar to moving an
application from Java to Golang. In Terraform, switching from GCP to Azure means throwing everything away. Okay, so let's talk about
some solutions: what kind of approach, in my experience, worked well for me. It's a well-known design principle, what we call facade. A facade is basically a structural design pattern that provides a simplified interface to a library, a framework, or any other complex set of classes. We can apply the
same facade principle while designing a cloud agnostic
application. As you can see, we have this blue box with a cloud-agnostic microservice which talks to some messaging facade, and the messaging facade talks to
a specific cloud provider adapter. In GCP we connect to Cloud Pub/Sub, in Azure to Azure Service Bus, and in AWS to SNS.
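As a rough sketch of this facade idea (all class and function names here are illustrative, and the in-memory adapter stands in for a real Pub/Sub, Service Bus, or SNS client):

```python
from abc import ABC, abstractmethod

# The facade interface is all the cloud-agnostic microservice sees.
class MessagingFacade(ABC):
    @abstractmethod
    def publish(self, topic: str, message: str) -> None: ...

class InMemoryAdapter(MessagingFacade):
    """Stand-in adapter; a GcpPubSubAdapter or AzureServiceBusAdapter
    would wrap the provider SDK behind this same interface."""
    def __init__(self):
        self.sent = []

    def publish(self, topic: str, message: str) -> None:
        self.sent.append((topic, message))

def notify_order_created(bus: MessagingFacade, order_id: str) -> None:
    # The service never imports a provider SDK, so switching clouds
    # only means writing one new adapter.
    bus.publish("orders", f"order {order_id} created")

bus = InMemoryAdapter()
notify_order_created(bus, "o-42")
```

Switching from GCP to Azure then means adding one adapter class and changing nothing in the service code.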
So whenever we are changing cloud providers, we just need to write another adapter, and our service stays intact; only the messaging facade's adapter needs to change. As long as we have this flexibility, where we are not touching the cloud-agnostic microservice code, we can always switch easily from one cloud provider to another. Another thing that worked well: we need to identify the areas
where lock-in must be kept to a minimum, and we should focus on using products that have corresponding counterparts on other platforms. If you are choosing a relational database, GCP has Cloud SQL and Azure has Azure Database for PostgreSQL. For runtime we have public offerings for Kubernetes like GKE or EKS. For serverless we can use Knative; for time-series databases we can use GCP Bigtable or AWS DynamoDB.
So what I'm trying to say is: let's use the cloud provider's capabilities whenever we are running any stateful workload for databases or messaging, and at the application level, wherever we are writing a new service, use the facade pattern. After following these two approaches, I was able to solve a lot of the problems we had when we were not using the cloud-agnostic approach.
Now I am coming to the second part of my talk,
that is, SaaS. SaaS means software as a service. But why SaaS? I mean, why are companies moving to or asking for a SaaS solution? Because it provides a greater emphasis on the customer experience, rapid response to customer feedback, active customer engagement, and a higher value on operational efficiency. There are some fundamentals of any SaaS application.
As you can see on my screen, I'm showing a few gray boxes, so let me quickly go through them.
So the first important part of any SaaS application is
onboarding: how the tenant is introduced into your environment, how the infrastructure is provisioned, and how tier selection happens for billing.
Another important thing is authentication and authorization, to associate these onboarded tenants with some notion of identity. Another important part is how to easily flow the tenant context across the moving parts of our complex system. There are two more important things. One is tenant isolation: how does your architecture ensure that one tenant can't access the resources of another tenant? And how do you instrument your application to meter tenant activities so that you can bill correctly? And at last, we need to be aware of tenant operations.
So before going deeper into
the SaaS part, let's first discuss the multi-tenant impact on our microservice application. We need to think about SaaS or multi-tenancy at every layer.
For example, on the front end we need to think about
how we are authenticating the users that are coming to access
our system. We need to gather information needed
to route these requests to the proper backend resource.
Because it's SaaS, we will be running a single version of the software for all customers. But if some client needs a different experience, then we need to support feature flags to turn some things on or off. When you get to the API gateway,
you need to make the decision about authorization or throttling
or caching. Now we are going to hit our business
logic. We are hitting our microservice where we need
to gather data based on some tenant context. We need to
add some metrics and logging and metering based on that particular tenant.
From this microservice we talk to some data persistence or data access layer, where we need to think about how we partition our data resources so that we know where the data is located for a specific tenant. Of course, all these things run on some infrastructure, so we need to take care of infrastructure provisioning, isolation, and maintenance: the overall tenant lifecycle. Let's take an example of a normal microservice flow.
So what happens in a normal microservice flow?
Let's say there is some client using our app, accessing it from a mobile client or a computer. That person does a login operation through the UI. The UI then talks to some identity provider, possibly via an OIDC flow, to get the bearer token. The third part is that we transfer that bearer token, and later the UI application goes through some API gateway, which optionally talks to the identity provider again to validate the token and do the authorization part. If everything works well, then at the end it talks to the microservice, and the microservice makes a call to some database. So this is
like a typical flow of any microservice app. Going deeper into how this SaaS model impacts our microservice application, let's quickly check what a non-SaaS microservice looks like. How does non-SaaS microservice code look? Okay, on the right-hand side I am showing a simple API. It's not working code, it's for demonstration purposes only. What we are doing: we are creating a
DynamoDB instance, there is some table name, and we have a get data API. We are getting the key from the query parameters coming into the API, then we are referencing a table in DynamoDB, and after that we are getting the response; at the end we are doing some exception handling.
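A minimal sketch of the single-tenant handler being described (the function name, event shape, and FakeTable are illustrative; the in-memory table stands in for a DynamoDB table client such as boto3's `Table.get_item`):

```python
# In-memory stand-in for a DynamoDB table, mimicking get_item's
# {"Item": ...} response shape.
class FakeTable:
    def __init__(self, items):
        self.items = items  # maps key -> item dict

    def get_item(self, Key):
        item = self.items.get(Key["id"])
        return {"Item": item} if item else {}

def get_data(table, event):
    """Read 'key' from the query parameters and look it up in the table."""
    try:
        key = event["queryStringParameters"]["key"]
        response = table.get_item(Key={"id": key})
        if "Item" not in response:
            return {"statusCode": 404, "body": "not found"}
        return {"statusCode": 200, "body": response["Item"]}
    except Exception as exc:  # the crude exception handling from the slide
        return {"statusCode": 500, "body": str(exc)}

table = FakeTable({"42": {"id": "42", "name": "drill bit"}})
result = get_data(table, {"queryStringParameters": {"key": "42"}})
```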
But one important thing is missing in this example. Here's the example rewritten as SaaS code, as if you were writing a microservice to support a multi-tenant SaaS workflow. So what is the difference from the last code? We are getting a tenant id from the request header (the code highlighted in yellow), and the rest of the stuff looks the same. But why does this tenant id matter?
It matters because, as I talked about earlier, we need to take care of tenant isolation, data partitioning, metrics, and billing. So we need to know the tenant's id and flow it as tenant context across all services. Otherwise there is no way to know how to serve the request for a specific tenant or how to bill that tenant.
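Sketched in the same style, the multi-tenant variant pulls the tenant id from a request header and scopes every data access with it (the `x-tenant-id` header name and the key scheme are assumptions for illustration):

```python
class FakeTable:
    """In-memory stand-in for a DynamoDB table."""
    def __init__(self, items):
        self.items = items

    def get_item(self, Key):
        item = self.items.get(Key["id"])
        return {"Item": item} if item else {}

def get_data(table, event):
    # The only real difference from the single-tenant version:
    # resolve the tenant id first, and refuse requests without it.
    tenant_id = event.get("headers", {}).get("x-tenant-id")
    if tenant_id is None:
        return {"statusCode": 400, "body": "missing tenant id"}
    key = event["queryStringParameters"]["key"]
    # The tenant id becomes part of the key, so one tenant's query
    # can never return another tenant's item.
    response = table.get_item(Key={"id": f"{tenant_id}#{key}"})
    if "Item" not in response:
        return {"statusCode": 404, "body": "not found"}
    return {"statusCode": 200, "body": response["Item"]}

table = FakeTable({"acme#42": {"id": "acme#42", "name": "drill bit"}})
ok = get_data(table, {"headers": {"x-tenant-id": "acme"},
                      "queryStringParameters": {"key": "42"}})
```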
But how do we get this tenant id, and how does our typical microservice workflow change when we are talking about a multi-tenant or SaaS workflow? Let's walk through how we provision or onboard our tenant. We will go to the same example: we have some mobile client, and now the important part is that we need to provision our tenants first. Why? So that we can flow these tenant ids as tenant context across our app.
So there could be a tenant provisioning service, which may call some user management service to create users and claims by talking to some identity provider or OIDC workflow. We can also apply access policies, for example if we want to grant rights like admin, or other kinds of tenant policies. Once these things are done, we can optionally call a tenant management service where we create a tenant id. From this tenant management service we can get the tenant id, tier plan, or tenant status, and we can flow these things further downstream to our other microservices. We can use the same tenant context for the metering and billing service as well.
But the question is how to flow this tenant context. There are several ways to do this. The most used option is the JWT token, which we also call a JSON Web Token.
So what is a JWT token? It is basically a base64-encoded string split into three parts: header, payload, and signature. The header gives us the type and hashing algorithm. The payload is a list of key-value pairs, sometimes called claims, that tell us about the token: who issued it, when it is going to expire, the intended audience. We can add our own key-value pairs to these tokens, and we will use this to pass along the tenant id. The last part is the signature, a combination of the encoded header, payload, and a secret, and we can use it to ask the identity provider to confirm that no one has modified the token.
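To make the three parts concrete, here is a small sketch that builds an unsigned demo token and decodes its payload with the standard library (the `tenant_id` claim is our own custom addition; in production the signature must be verified with a proper JWT library before trusting any claim):

```python
import base64
import json

def b64url(data: dict) -> str:
    """base64url-encode a JSON object, without '=' padding, as JWTs do."""
    raw = base64.urlsafe_b64encode(json.dumps(data).encode())
    return raw.rstrip(b"=").decode()

# header.payload.signature; the signature here is only a placeholder.
token = ".".join([
    b64url({"alg": "HS256", "typ": "JWT"}),
    b64url({"iss": "idp.example.com", "sub": "user-1", "tenant_id": "acme"}),
    "fake-signature",
])

def jwt_claims(token: str) -> dict:
    """Decode the payload part. Does NOT verify the signature."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

claims = jwt_claims(token)
```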
Another way to pass the token is in the URL query string: we can use a query string parameter to pass the token or tenant id information. A third option is to pass it in the request header, as some custom header like x-tenant-id. The last option is to create a separate microservice, so whenever a request comes to our business logic, we always make a call to that other service to get the tenant id. But there is one downside of this approach: we are creating a single point of failure by depending on one service. If that service is not available, then we don't have any tenant information and our workflow will not
work. Another important thing to keep in mind: whenever we are building a multi-tenant SaaS application, we can move common code into shared libraries. For example, take the repeating task where we always need to get the tenant id from the JWT token. The steps involved: read the Authorization header from the request, then get the bearer token and its claims, and after that get the tenant id. This is a common piece of code that we need to execute with each API call, so it's good to have it in a library, with the other services using the code from that shared library.
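Such a shared helper might look like this (the header shape and the `tenant_id` claim name are assumptions, and as above, signature verification is assumed to happen elsewhere):

```python
import base64
import json

def tenant_id_from_request(headers: dict) -> str:
    """Read the Authorization header, take the bearer token, decode its
    claims, and return the tenant id: the repeated steps, in one place."""
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        raise ValueError("missing bearer token")
    token = auth[len("Bearer "):]
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64url padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["tenant_id"]
```

Every service then calls this one function instead of re-implementing the token handling.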
Another important thing: it's good to capture this tenant context in the logging. We can use common libraries for structured logging and capture the tenant context in the logs themselves.
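For example, with Python's standard logging module, a small JSON formatter can make sure the tenant id appears in every log line (the field names are illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, always carrying the tenant id."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "tenant_id": getattr(record, "tenant_id", None),
        })

logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Call sites pass the tenant context via `extra`:
logger.info("order created", extra={"tenant_id": "acme"})
```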
Last but not least, we need to flow the tenant context into the metrics as well. Here I'm giving you an example with OpenTelemetry. As you can see, on the left-hand side there are microservices and shared infra, and we are running an OpenTelemetry collector for the instrumentation part. We can easily flow these metrics to some time-series database, tracing backend, or column store. In this way we can capture metrics with the tenant id and context attached. Also we need to
talk about how are we going to partition our data.
There are different ways available. One option is to go with the silo approach, where we create a separate database for each tenant. The second option is the pool-based model, where the database is shared between all tenants: we have a single schema and one column with a tenant id that is indexed, and every query filters on that index before we return the response to our customer. The last option
is the bridge approach, where we have a single database but multiple schemas. In my personal experience, I used this approach because it helped me get some isolation in place without sharing data from one tenant with another. But in the end, it's a per-microservice decision. I mean,
we need to think in terms of compliance and security, what your
client wants, how it is going to impact your performance
and data distribution, because there are always
a problem. I mean, if we go with the pool-based or bridge-based approach, it may cause a noisy neighbor problem, where one tenant is more active than the others and takes all your IOPS or CPU. Let's quickly talk about the tenant isolation part.
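As a tiny illustration of the pool model just described, using SQLite as the shared database (the schema and names are illustrative):

```python
import sqlite3

# One shared table for all tenants, with an indexed tenant_id column.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (tenant_id TEXT, order_id TEXT, total REAL)")
db.execute("CREATE INDEX idx_orders_tenant ON orders (tenant_id)")
db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("acme", "o-1", 10.0), ("acme", "o-2", 5.0), ("globex", "o-3", 7.5)],
)

def orders_for(tenant_id: str):
    # The tenant filter is mandatory on every query; forgetting it is
    # exactly how cross-tenant data leaks happen in the pool model.
    rows = db.execute(
        "SELECT order_id, total FROM orders WHERE tenant_id = ?",
        (tenant_id,),
    )
    return rows.fetchall()
```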
So what is tenant isolation? One thing I would like to emphasize here: data partitioning is not the same as tenant isolation. For tenant isolation we again have two approaches. One is the silo model, where every tenant gets their own environment. When I say environment: for example, whenever we are running some Kubernetes cluster and deploying our app, tenant one will get its own environment and tenant two will get its own. The second possible approach is to isolate within a single environment using runtime policies. How does it work? Let's take one example.
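Before walking through the flow, here is a minimal sketch of the runtime-policy idea (the tenant access manager, the policy shape, and all names are illustrative assumptions):

```python
class TenantAccessManager:
    """Stand-in for a service that mints tenant-scoped credentials,
    e.g. a short-lived IAM policy limited to one tenant's resources."""
    def scoped_policy(self, tenant_id: str) -> dict:
        return {"tenant_id": tenant_id, "allowed_prefix": f"{tenant_id}/"}

class ScopedStore:
    """Data access layer that enforces the policy at runtime."""
    def __init__(self, data: dict, policy: dict):
        self.data = data
        self.policy = policy

    def get(self, key: str):
        # Any access outside the tenant's scope is rejected, even if
        # the caller's code has a bug: isolation lives in the policy.
        if not key.startswith(self.policy["allowed_prefix"]):
            raise PermissionError("outside tenant scope")
        return self.data[key]

data = {"acme/report": "q1 numbers", "globex/report": "q1 numbers too"}
store = ScopedStore(data, TenantAccessManager().scoped_policy("acme"))
```

Here `store.get("acme/report")` succeeds, while `store.get("globex/report")` raises `PermissionError`, which is the isolation guarantee we want.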
Let's say we have some client logging in to our application using some mobile device. After that it goes through some OIDC workflow to get the bearer token, and after going through the API gateway it hits our microservice. Now we reach our data access layer, where we are basically getting data. Before making this call to our database with tenant-scoped access, what will we do? We will just go
and fetch the tenant-scoped credentials from some tenant access manager service. It will return tenant policies, and by using those policies we can then make the call to the database. This helps us achieve tenant isolation using the runtime policy mechanism.
Okay, so now we are at the last part of my talk: the takeaways. I started
my talk with cloud agnostic: why it is important, and how to solve the challenges we face when using Kubernetes. One approach is to use the facade pattern: whenever we are writing any microservice, use the facade pattern and write the code in a cloud-agnostic manner. That is also called a loosely coupled architecture. The other thing is
we can focus on strategic lock-in: we can still use offerings from the public cloud wherever possible, and use Kubernetes only to run our services instead of running databases or messaging there. Then we
moved into the multi-tenant SaaS part: why and how to create a multi-tenant SaaS app. We covered the tenant lifecycle, how to flow the tenant context using the JWT token or request headers, and at last the different models of data partitioning and how to achieve tenant isolation.
That's all for today's talk. Thanks for listening, and have a good day.