Transcript
Hi everyone, and welcome to my talk, Distributed Role-Based Access
Control with Open Policy Agent. I'm Serei Komachi,
senior technical leader at Mia-Platform, and what I will present is
the work of recent months, when my team and I had to face a
new challenge: we had to change our authorization model from
simple session validation to role-based access control.
In fact, in this talk I will present our journey of discovery
while implementing a solution for RBAC that worked for
our use case, but that is built to be easy and extensible
enough to be generic and work
well in any context and platform. We will dive into the technical
problems we faced using a stack based on CNCF technologies,
and how these problems affected our design decisions.
We will work a bit with Kubernetes, Open Policy Agent, Go,
and a few more. Of course, in this talk we may reference Mia-Platform, as
it was the primary subject for our RBAC, but the same approach and
considerations can be applied to any other context. Before going any further,
let's take a moment to review the basics of RBAC, which is
an authorization model meant to group the actions users can take in your
platform into roles representing specific job functions.
These roles are then assigned
to users. This model simplifies access control governance and makes your
platform more secure, allowing for easier
auditing and easier permission updates, since you
take action on a specific role rather than on each individual user.
Okay, that's enough introduction; with the basics of RBAC
set, let's start our journey. Even though using RBAC
seems pretty reasonable (more security, easy auditing and updates
is cool), RBAC is not something you see every day, as it
does not come for free: you either buy a ready-made solution or
invest a lot of time and resources to build your own.
So why did we need to introduce RBAC? Well, users of our
platform are capable of
performing several different actions impacting their software applications
running on Kubernetes. To name a few: users can
configure a microservice, define a new data model, deploy,
monitor and scale their services on a runtime environment. As you
can understand, it is not safe enough to let any user perform every
action available. For this reason we identified different
personas. To name a few examples: we have owners,
who can do any action they want (they are the owners); we have
developers, who may be able to perform limited actions on
their configuration or deploy on certain runtime environments;
and we have guests, who have read-only access to a specific subset
of resources. Based on the personas we identified,
we then mapped each of our APIs that required
a different access level to a specific permission. We defined our own naming
convention for permissions and eventually grouped our list of capabilities
into roles.
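To give you an idea, a minimal sketch of this model could look like the following; the permission names follow a "resource.action" convention like the project.view permission mentioned later, but the exact structures and names here are illustrative, not our production ones.

```go
package rbac

// Permission identifies a single capability, named with a
// "<resource>.<action>" convention (illustrative names).
type Permission string

// Role groups permissions into a job function.
type Role struct {
	Name        string
	Permissions []Permission
}

// Binding assigns a role to a user, scoped to a specific resource.
type Binding struct {
	UserID     string
	RoleName   string
	ResourceID string
}

// Illustrative roles for the personas described above.
var (
	Guest     = Role{Name: "guest", Permissions: []Permission{"project.view"}}
	Developer = Role{Name: "developer", Permissions: []Permission{"project.view", "project.deploy"}}
	Owner     = Role{Name: "owner", Permissions: []Permission{"project.view", "project.deploy", "project.manage"}}
)
```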
After defining our roles, we decided to take a step further, as we noticed that API access control
basically has two different behaviors. In fact, we had a
set of APIs that had to be completely blocked
from undesired access, while other APIs
were open to many different users, but the data these
users may see has to be different, either because we have
to filter some list of documents or because some
fields in the response body payload have
visibility restrictions. For these reasons, we defined three requirements
our RBAC solution had to meet in order to be usable:
we wanted to decide whether a request is allowed
or not, so grant access to it or deny access to it,
and we wanted to filter data in two different ways.
We wanted to be able to filter data before they were retrieved
from the database, in order for our services to only operate on
previously authorized data, and we wanted to modify
the response body, in order to take
much more granular control over the field visibility
for our users. Okay, so we identified our users,
our roles and our permissions. We have the requirements set,
so we start coding, right? No. Unluckily, no: opening
our editor and starting to implement our solution
has to wait. In fact, we have to address two important concerns.
First of all, where are we going to write our code, and
how are RBAC decisions written? And the second concern:
we know for sure that every single request in our platform will
run through RBAC, so how do we make it
resilient enough to sustain very high request volumes?
Let's try and answer these questions. So, where is the code?
Well, the first thing that came to our mind was the easiest one:
let's embed the RBAC code into each service. That's a
hard solution, as we know for sure that we would incur a lot of code
duplication. So, yeah, we can write some software
library to abstract the complexity, right? Of course
we can write libraries, but in an application composed of
different microservices written in different languages,
we would have to write a lot of libraries.
Also, we did not want to create a potential barrier
to the adoption of new technologies that would need the RBAC
library to be written first before they were usable.
And eventually, recoding RBAC into each service would
be extremely disruptive for our code base: we would
have to change the code of many services, with the risk of
introducing bugs and regressions in the existing code base,
and that risk is too much to be taken lightly.
For these reasons, we decided that there should be some new component
in our architecture that holds all the RBAC code.
Now for the second question, it gets a
bit trickier, because we decided to introduce a new component in our
Kubernetes architecture and we know for sure that this component will
be contacted for all the API requests.
So how do we make it resilient? Okay,
we could deploy a new,
centralized service, and horizontally scale it to sustain
high volumes. However, we would incur two problems.
The first one being that we're introducing a single point of failure:
no matter the scaling, if it goes down, everything is down.
And the second problem is: how is it invoked?
Should every service invoke it? Again, that's disruptive for the code
base. And so we took our second decision: the RBAC code
should somehow intercept requests and be distributed
among the services that need it. Okay, now that we addressed
our concerns, we can proceed with the design. Since we are
running our application as pods in a Kubernetes cluster, we decided
to adopt the sidecar container pattern. And so we deploy our
RBAC sidecar alongside all the services holding ownership of
a specific resource. To make a clear example:
the service that provides the APIs for managing the configuration of
a project is in charge of making RBAC decisions for that
resource, so it is that service that will block any attempt
to change the configuration of a project by a user that does not have
enough permissions to do so. Also, in order to operate,
the sidecar container intercepts all the incoming
requests, performs all the necessary
authorization controls and, if everything is fine,
proxies the request to the application container; otherwise the
API call is immediately rejected. Okay, but how
is the sidecar built? Creating a new service allows us to be
language agnostic and so we can adopt any language we prefer.
We decided to use Go, as it helps us keep a lower
resource consumption profile while being able to sustain a high request throughput.
Please note that these two are not the only factors that contributed
to our language choice, but for now let's accept them as they
are. And so the design is now clear: without RBAC,
any request the user makes to an API is received by the
application container and the response is immediately returned.
However, as soon as we introduce the RBAC
sidecar container, we are able to intercept each request
and decide whether it should be rejected or allowed.
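In spirit, the sidecar flow can be sketched in a few lines of Go. This is a minimal illustration with a plain reverse proxy and a stubbed authorization hook, not our actual implementation; the addresses and the isAllowed name are made up.

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// isAllowed is where the authorization controls run (the OPA
// integration shown later); stubbed here to always allow.
func isAllowed(r *http.Request) bool { return true }

func main() {
	// The application container listens on localhost inside the same pod.
	app, err := url.Parse("http://127.0.0.1:3000")
	if err != nil {
		panic(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(app)

	// Intercept every incoming request: reject immediately if the
	// policy denies it, otherwise proxy it to the application.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if !isAllowed(r) {
			http.Error(w, "Forbidden", http.StatusForbidden)
			return
		}
		proxy.ServeHTTP(w, r)
	})
	http.ListenAndServe(":8080", nil)
}
```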
Okay, at this point we were happy with the design, but we had to
ask ourselves: does this design still meet our original
requirements? Remember that we wanted to be able to assign
job-function roles such as guest, developer and maintainer to
the users; to decide whether or not to give access
to a specific API; to filter data before they
were retrieved from the database; and to manipulate the response body
in order to restrict data access even more.
Luckily, the sidecar design doesn't pose any obstacle to
these requirements. However, the sidecar itself must be implemented
from scratch: all these requirements must be mapped into
code and into some configuration that lets the sidecar know how to behave
for each API. We would have to write a service that receives
some big configuration mapping every API to an action to take,
which properties to filter, and so on and so forth. That
seemed a bit complicated, and the team was worried and asked:
do we really have to build this? Luckily, the answer was
no. In fact, we decided to adopt Open Policy
Agent (OPA), which is an open-source, general-purpose policy engine
that can be used to answer pretty much any kind of query thanks
to Rego, a declarative language designed for policy writing.
Rego is a full-fledged programming language, and so it
allows us to write security policies as code,
test them, and deploy them to production.
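To give a feel for what policies as code means, here is a trivial, illustrative Rego policy with unit tests that can be run with the `opa test` command; the package, rule names and input shape are assumptions, not our real policies.

```rego
package example

default allow = false

# Allow only if the caller holds the required permission.
allow {
	input.user.permissions[_] == "project.view"
}

# Unit tests, runnable with `opa test .` (kept in the same package for brevity).
test_allow_with_permission {
	allow with input as {"user": {"permissions": ["project.view"]}}
}

test_deny_without_permission {
	not allow with input as {"user": {"permissions": []}}
}
```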
OPA provides SDKs in several languages to
directly integrate it into your application, and it is here
that the Golang SDK shines, allowing us to take
full control of the engine. With the direct SDK,
we are able to do much more than simply run policies: we can
also isolate the engine, precompile policies to improve
evaluation performance, prepare data stores, and create dynamic input
information to be supplied to the running policies, allowing us
to make decisions with everything we need to know
about the API that we are protecting.
So how is it actually made?
As anticipated, in order to integrate with OPA and Rego
policies, our service has to gather a few pieces of information,
bundle them together, and eventually run the policy evaluation.
To do so, it collects data from the request or the response,
depending on the flow we are protecting: we take the headers,
the complete URL and its parameters, and the request body.
Then we fetch data from our user role-binding database
to be able to understand what the users can do
based on the permission mapping. And eventually,
after some data preparation, the OPA policies are loaded from a
configuration and run using all the previously bundled
data.
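As a rough sketch of this integration, here is what the flow can look like with OPA's Go package, github.com/open-policy-agent/opa/rego; the query path, input shape and bindings format below are assumptions for illustration, not our exact code.

```go
package authz

import (
	"context"
	"net/http"

	"github.com/open-policy-agent/opa/rego"
)

// prepare compiles the policies once at startup so that each
// per-request evaluation stays cheap.
func prepare(ctx context.Context) (rego.PreparedEvalQuery, error) {
	return rego.New(
		rego.Query("data.policies.allow"),
		rego.Load([]string{"./policies"}, nil), // policy files taken from configuration
	).PrepareForEval(ctx)
}

// evaluate bundles the request data and the user's role bindings
// (fetched beforehand from the role-binding storage) into the policy
// input, then runs the prepared query. The input shape is illustrative.
func evaluate(ctx context.Context, pq rego.PreparedEvalQuery, r *http.Request, bindings []map[string]interface{}) (bool, error) {
	input := map[string]interface{}{
		"request": map[string]interface{}{
			"method":  r.Method,
			"path":    r.URL.Path,
			"query":   r.URL.Query(),
			"headers": r.Header,
		},
		"user": map[string]interface{}{"bindings": bindings},
	}
	rs, err := pq.Eval(ctx, rego.EvalInput(input))
	if err != nil {
		return false, err
	}
	return rs.Allowed(), nil
}
```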
Okay, let's see some policy examples that can
be used to solve our constraints, and see if they're good enough for us
to use. The first example is a grant-or-deny access
policy. This policy takes the project
id provided in the input request path parameters,
defines an iterator, and then looks through each of the
resource bindings that are defined in our role-binding
database. If there is
a binding that is mapped to this specific project id and
holds the permission project.view, the policy evaluates
to true; if any of these assertions is false, the
policy is rejected and the API call is not proxied to
the application container.
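A policy along these lines might look like the following sketch; the exact field names (input.request.pathParams, input.user.bindings and their contents) are illustrative assumptions.

```rego
package policies

default allow = false

# Grant access if any of the caller's bindings targets the project
# from the request path and carries the project.view permission.
allow {
	project_id := input.request.pathParams.projectId
	binding := input.user.bindings[_]
	binding.resourceId == project_id
	binding.permissions[_] == "project.view"
}
```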
In this other example,
we see a query-generation policy, which is a bit trickier,
as it uses the OPA concept of unknown data to
let the policy return a set of variable definitions
in the form of assertions. These definitions
are later used by the service to generate a query
that is provided to the API underneath, so it
can filter data.
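As a sketch of the idea: with OPA partial evaluation, a policy like the following can be evaluated with data.resources marked as unknown, so that instead of a boolean, OPA returns the residual assertions on data.resources (for example, data.resources.projectId == "some-id"), which the service can then translate into, say, a database filter. The names below are illustrative assumptions.

```rego
package policies

# Evaluated with data.resources as unknown: the residual conditions
# on data.resources are rewritten by the service into a query.
filter_projects {
	binding := input.user.bindings[_]
	binding.permissions[_] == "project.view"
	data.resources.projectId == binding.resourceId
}
```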
In this last example,
we are running the policy in the response flow, and we are able to
provide Rego with the original response body received from the
application service. The policy can then manipulate it:
in this example, we are going to drop the salesForecast property
from each document in the list, using a simple
list comprehension.
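Such a response policy might look roughly like this; the rule name and the location of the response body in the input are assumptions.

```rego
package policies

# Rebuild the response list, removing the salesForecast field from
# every document with a list comprehension.
response_filter := [filtered |
	doc := input.response.body[_]
	filtered := object.remove(doc, ["salesForecast"])
]
```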
Okay, we are at the end of this talk, but before saying goodbye,
let's share some final thoughts on this design. With this solution
we were able to create a
platform that complies with many best practices. In fact,
by writing security policies as code, we are able to test
them and make sure we don't introduce any regressions in our
authorization model. Also, having all the policies centralized
in a single place allows us to define Rego functions to
abstract complex logic and to always have an overall view
of what our policies do. Also, from a scaling
and high availability perspective, even though we introduced a
new hop in the request call chain, we measured very
little added latency, and having RBAC as a sidecar
gives us the possibility to scale and boost resources on the most
requested services while keeping others on tighter resources.
Also, thanks to OPA and Rego, we can express any kind
of security policy we want. So, for real, the sky
is the limit. Okay, now the talk is over, and I leave
you here some links. The first one is OPA: it is an
amazing tool, as you may have seen, and I strongly recommend checking
it out if you don't know it. The second one is a blog post I
wrote for Mia-Platform that dives into more detail about our
use case, so if you're interested, you can check it out too.
Now, whether you liked this talk or not, I would
really appreciate it if you left some feedback, so please scan the QR
code and submit the form. Thank you for your time.
I really hope that you found this talk interesting, and
if you wish to follow me: I don't use social networks much, but you can
find me on GitHub and LinkedIn, and I am even on Discord,
so if you have any questions, you can find me there.
Thanks a lot.