Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. Today I'm going to talk about how we scaled multi-cluster Kubernetes at Teleport Cloud. So what is Teleport? If you don't know, Teleport is an open source infrastructure access platform that uses reverse tunnels, like SSH tunnels, to provide audited access to infrastructure for developers, CI/CD systems, and other use cases like Internet of Things. Teleport provides access to Kubernetes, SSH, database (Postgres, for example), web application, and Windows desktop resources. It understands the different protocols behind those resource types so that it can provide application-level auditing: it understands what Kubernetes commands look like and what SSH commands look like, so it can audit the connections to the different resources you might be running. If you look at Teleport's architecture,
the kind of key components that run on the server side are the
teleport proxy, which provides connectivity,
like for instance, managing reverse tunnels. And the Teleport
auth server, which does authentication and authorization,
manages like user roles, provides the kind of back
end logic. And so if
you imagine what a client connection to a resource managed by Teleport looks like, say you're connecting to a Kubernetes cluster: you might run kubectl get pods, but you're not connected directly to the Kubernetes cluster; instead you're pointed at a Teleport proxy over an mTLS connection. The Teleport proxy also has a reverse tunnel coming from that Kubernetes cluster, egressing out through a firewall, and that connection comes from a Teleport agent running in a pod in that cluster. So your kubectl get pods command goes through the proxy, through the reverse tunnel, and reaches the cluster on the other end to provide access. And so
Teleport Cloud is a hosted version of Teleport offered by Teleport as a company. For Teleport Cloud, we run a dedicated instance of Teleport for every customer. That means we're running many deployments of Teleport: we operate over 10,000 pods and over 100,000 reverse tunnels at any given time, across six regions to provide global availability. Any time one of those tunnels is disrupted, access to the underlying resources is disrupted, so it's really important that we provide a very stable network stack for this platform.
Another important detail here is that those proxies, which run in separate Kubernetes clusters in each region, maintain peered connections. A client might connect to a proxy in one region, a peered gRPC connection goes from that proxy to the proxy that the resource they're trying to access is connected to, and then the connection goes through the reverse tunnel to that resource. For a long time we ran Teleport Cloud on Gravity. We've recently switched to EKS, where we have an EKS cluster in each of the regions we provide connectivity for. So if you break down the major needs we had when putting this platform together, I'll focus on three really important things.
First, ingress: maintaining highly available, ultra-long-lived reverse tunnels to the resources people need access to is a top priority. Second, it was really important that we be able to do coordinated rollouts across these regional clusters. When we upgrade Teleport, we often want to update the auth servers first and then upgrade the proxies, because of how the auth servers cache proxy heartbeats, or we might not want to roll the proxies in every region at the same time; we might need more coordinated deployment strategies for different customer use cases. And finally, container networking: all these clusters have to speak to each other, the proxies need to peer, and everything needs connectivity to the auth servers, but auth servers don't run in every region. So, focusing on ingress first,
I'll go through the journey of things we tried for each of these and how we arrived at the architecture we have today. First we tried anycast, and you might imagine how that went. We really liked anycast, which we tried out through AWS Global Accelerator, because it provided really stable IP addresses: DNS never had to change, and we could give everybody in the world a single IP address for Teleport Cloud that they could reliably connect to. But as you might imagine, anycast didn't provide stable enough routing. There have been a lot of success stories with anycast and streaming video, which also involves long-lived TCP connections, but we found that even with a lot of logic to resume reverse tunnels quickly when they dropped, it just wasn't stable enough to provide consistent, highly available connectivity. So anycast didn't work out for us. We also tried
open source Nginx as an ingress routing layer in each of the clusters. We really liked Nginx because it supported ALPN routing. ALPN is similar to SNI: it's metadata in a TLS connection that lets you make decisions about how to route a connection without terminating it, without decrypting it. So the ingress stack could separate client connections from agent connections and route them to the correct proxies using that metadata. Nginx did a great job with that, but it had no in-process configuration reloading, so any time we needed to make a change to these routes, we'd have to start a new instance of Nginx. Eventually the memory usage got really high; it just didn't work well for how often we needed to change the ingress configuration. So we didn't go with Nginx either. Now I'll talk about what we did
do. At the DNS level of our ingress stack, we use Route 53 latency records with the external-dns operator: external-dns publishes latency records to Route 53 based on the separate network load balancers (NLBs) that sit in each region. We use NLBs instead of anycast because they're essentially stateless: when things change underneath them, they don't necessarily drop connections, and they let us provide really stable reverse tunnels. Instead of Nginx, we went with Envoy proxy, which also supports ALPN routing, and we configure it using the Gateway API, a newer set of Kubernetes APIs that aims to replace the Ingress resources that have existed in the past. We use our own fork of Envoy Gateway: for one, it supports using annotations to do ALPN routing on TLSRoutes, and we also made some stabilizing changes there that we're working to get upstream. And finally, we have a little hack for doing zero-downtime deployments using minReadySeconds on top of Deployments that I'll talk about in a moment.
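To give a rough idea of what the ALPN routing piece looks like, here's a sketch of a Gateway API TLSRoute built as an unstructured object in Go. The annotation key, hostnames, ports, and names below are placeholders I'm using for illustration, not the exact ones our fork uses.

```go
package alpnroutes

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// tenantTLSRoute sketches a per-tenant TLSRoute: the TLS listener passes
// connections through without terminating them, and an annotation (placeholder
// key below) tells the Envoy Gateway fork which ALPN protocols to match.
func tenantTLSRoute(tenant string) *unstructured.Unstructured {
	return &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "gateway.networking.k8s.io/v1alpha2",
		"kind":       "TLSRoute",
		"metadata": map[string]interface{}{
			"name":      tenant,
			"namespace": tenant,
			"annotations": map[string]interface{}{
				// Placeholder annotation key; the real one lives in the fork.
				"example.teleport.dev/alpn-protocols": "teleport-reversetunnel",
			},
		},
		"spec": map[string]interface{}{
			// Attach to the shared Envoy gateway's TLS passthrough listener.
			"parentRefs": []interface{}{
				map[string]interface{}{"name": "teleport-gateway", "namespace": "envoy-gateway-system"},
			},
			"hostnames": []interface{}{fmt.Sprintf("%s.teleport.example.com", tenant)},
			"rules": []interface{}{
				map[string]interface{}{
					// Matching connections go, still encrypted, to this tenant's proxy Service.
					"backendRefs": []interface{}{
						map[string]interface{}{"name": tenant + "-proxy", "port": int64(3024)},
					},
				},
			},
		},
	}}
}
```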
So to break down what this looks like, imagine you're trying to connect to your Kubernetes cluster using kubectl, and the Teleport agent is running in that Kubernetes cluster, which is behind a firewall. The relevant pieces of Teleport here are the proxy pods running in each region; in this case we'll just talk about three regions, where the client is connecting through proxies in us-east-1 to proxies in ap-southeast-1, where the Kubernetes cluster is running. All the proxy pods have streaming connections to the auth server, so they get up-to-date information on roles and whatever else they need to make authorization decisions, and they cache this information so they can make those decisions quickly without reaching out to auth every time. So we have the reverse tunnel coming from the Kubernetes cluster, and we have the client and agent connections: your kubectl command gets routed to the closest proxy, through a peered connection between the proxies, and then finally back through the reverse tunnel to the Kubernetes cluster. If we take auth out of this for a second and just look at the connectivity piece, we have the external-dns controller running in each cluster, and we have NLBs sitting in front of the nodes that run the proxy pods in each cluster. external-dns reads the addresses of those NLBs and reports them back to Route 53, so that Route 53 can return the address of the closest proxy whenever an end user, client, or agent needs to connect, and they always get the closest region.
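As a rough sketch of how that DNS piece fits together (not our actual configuration), here's what a per-region LoadBalancer Service annotated for external-dns might look like in Go, assuming external-dns's AWS routing-policy annotations (hostname, set-identifier, and aws-region for latency-based records); the hostname, names, and ports are placeholders.

```go
package ingress

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// proxyIngressService sketches the per-region Service that fronts the ingress
// pods with an NLB and asks external-dns to publish a Route 53 latency record
// for it. Each region publishes the same hostname with its own set identifier.
func proxyIngressService(region string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "teleport-proxy-ingress",
			Namespace: "teleport",
			Annotations: map[string]string{
				// Provision an NLB rather than a classic ELB.
				"service.beta.kubernetes.io/aws-load-balancer-type": "nlb",
				// One latency record per region, all sharing one hostname.
				"external-dns.alpha.kubernetes.io/hostname":       "example.teleport.example.com",
				"external-dns.alpha.kubernetes.io/set-identifier": region,
				"external-dns.alpha.kubernetes.io/aws-region":     region,
			},
		},
		Spec: corev1.ServiceSpec{
			Type:     corev1.ServiceTypeLoadBalancer,
			Selector: map[string]string{"app": "envoy-ingress"},
			Ports: []corev1.ServicePort{
				{Name: "tls", Port: 443, TargetPort: intstr.FromInt(8443)},
			},
		},
	}
}
```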
So how do we do a zero-downtime deploy on top of this architecture? This takes some coordination, because we have reverse tunnels that are open all the time, connected to proxies, but we need to update those proxies. If you look at the architecture we had before, we have Envoy running in each cluster, configured by Envoy Gateway, routing to different proxy pods for different customers. When we want to do a deploy, we use the ALPN routing feature in Envoy to control whether connections land on the old pods or the new pods. An interesting thing we did here is that we used a feature of Kubernetes Deployments you might not know about called minReadySeconds. It's a bit like a termination grace period in that you keep multiple sets of pods, the old generation and the new generation, running at the same time. But with minReadySeconds, both generations keep responding to new network requests for a period of time: the new set of proxy pods isn't just considered ready, it's only considered fully available after minReadySeconds have passed, and only then do the old pods terminate. We use this window where both generations are ready at the same time to let new tunnels come in from the existing agents and hit the newly spun-up pods, so that all the new tunnels are fully established before we make the ingress changes that start routing connections through the new set of pods. After that happens, and after old connections drain off through the old network pathway, we shut down the old set of proxy pods. The way we do the flip from one set of pods to the other is with a custom controller that changes the labels on a Service to point from the old generation to the new generation of pods.
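To sketch those two pieces in Go (a simplified illustration, not our production controller): minReadySeconds delays when the new generation counts as available, which keeps old and new proxy pods serving side by side, and the ingress flip is just a Service selector change. The proxy-generation label here is a hypothetical stand-in.

```go
package rollout

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// withOverlapWindow keeps newly created proxy pods "ready but not yet
// available" for five minutes, so the old generation isn't torn down while
// agents are still establishing tunnels to the new pods.
func withOverlapWindow(d *appsv1.Deployment) {
	d.Spec.MinReadySeconds = 300 // both generations accept connections during this window
}

// flipIngressToGeneration points the proxy Service at the new generation of
// pods by rewriting its selector. "proxy-generation" is a hypothetical label
// that each generation's pod template would carry.
func flipIngressToGeneration(ctx context.Context, c client.Client, svc *corev1.Service, generation string) error {
	orig := svc.DeepCopy()
	if svc.Spec.Selector == nil {
		svc.Spec.Selector = map[string]string{}
	}
	svc.Spec.Selector["proxy-generation"] = generation
	// Patch rather than update to avoid clobbering unrelated changes.
	return c.Patch(ctx, svc, client.MergeFrom(orig))
}
```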
So that's an overview of our ingress stack: how we maintain ultra-long-lived reverse tunnels and route client connections through those tunnels to Kubernetes clusters across six regions. The next thing
I'll talk about is deployment. I talked about how we do a zero-downtime upgrade of the proxies in an individual cluster, but when we need to upgrade Teleport, how do we upgrade auth and proxy across six regions in a coordinated way? We tried a couple of different options here. The first thing we tried was GitOps. If you're doing deployments to Kubernetes clusters, what's the first thing you think of? Right, use Flux CD or a similar tool to deploy from a git repo. We have our own CRD, of course called Tenant, that's reconciled by a controller we call the tenant controller, and we thought about storing that configuration in git for each customer and then applying it to all of the clusters. One disadvantage of this approach was that we really wanted all the data to stay in Postgres; we didn't want to start writing a bunch of customer data into a git repo and have to manage that repo over time. Another thing that didn't work about the GitOps approach with Flux is that Flux is very unidirectional. We didn't just want information synced from a git repo into clusters, we wanted to pull information out of those clusters in order to progress the deploy through more steps: the auth servers finished deploying, so now we want to update proxies, and maybe we don't want to update every region at the same time. So that approach didn't work; we didn't go with GitOps.
After that we tried cross-cluster reconcilers, where we had a controller running in each region but no CRD in those regions; they all reconciled against the Tenant custom resource in a namespace in the management cluster. So we'd have a namespace for each customer in every cluster, but the custom resource would only live in the management cluster. This didn't work very well either. If we'd gone with it, we would have created a big single point of failure for the whole platform in that one cluster, and we didn't like that; we wanted everything to be able to operate without the management cluster. There were also some practical difficulties: all of the regional clusters would have to write to the same status field of that shared Tenant CR, which leads to conflicts and other problems.
What we arrived at was neither of those things. We really liked KubeFed; we thought it was on the right track with exposing APIs from one cluster into another cluster, so you could have a controller that knows how to operate custom resources it doesn't reconcile itself, because they're reconciled somewhere else. We really like that model, but the project isn't active anymore, and it seemed like a big risk to pick up KubeFed when there wasn't much activity there. So we built something that solves this problem in a really narrow way. It's called sync controller, and we just open sourced it a couple of days ago, so you can check it out if you want to. The way
sync controller works is that it let us build an architecture where the management cluster is driven by that Tenant custom resource inside a customer namespace, but the controller for the Tenant resource is only responsible for creating additional teleport deployment resources, one for each region that Teleport needs to operate in for that customer. Each of those custom resources is then synced to an individual instance of the same resource that lives in one of the regional clusters. So if a customer is in three regions, us-west-2, us-east-1, and ap-southeast-1, the Tenant CR might create three teleport deployment resources, and each of those is picked up by sync controller running in its region, which creates a namespace, creates the resource, and then reconciles it there. To dig into what that looks like in more detail: this part isn't specific to Teleport, it's just generally how you use sync controller, and I'll show you a Teleport-specific version in a second. You have sync controller running regionally. It watches the spec of the resource in the management cluster and copies any changes it sees into the instance of the resource in the regional cluster, where that resource is then reconciled. The reconciler writes the latest status of the resource in the regional cluster, and sync controller also watches that regional status and copies it back to the management cluster. So from both the regional cluster's perspective and the management cluster's perspective you have the same resource, but the management cluster can create and operate a set of these resources from the outside, while the regional cluster does the actual reconciliation of the resource and creates the necessary pods.
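Here's a minimal sketch of that spec-down, status-up pattern using two controller-runtime clients. This isn't sync controller's actual API, just the shape of the idea; error handling, conflict retries, and ownership details are omitted.

```go
package syncsketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// SyncOnce illustrates the pattern for one resource: the management cluster
// owns spec, the regional cluster owns status.
func SyncOnce(ctx context.Context, mgmt, regional client.Client, gvk schema.GroupVersionKind, key types.NamespacedName) error {
	src := &unstructured.Unstructured{}
	src.SetGroupVersionKind(gvk)
	if err := mgmt.Get(ctx, key, src); err != nil {
		return err
	}

	dst := &unstructured.Unstructured{}
	dst.SetGroupVersionKind(gvk)
	err := regional.Get(ctx, key, dst)
	if apierrors.IsNotFound(err) {
		// First sync: create the regional copy with the management spec.
		dst.SetName(key.Name)
		dst.SetNamespace(key.Namespace)
		dst.Object["spec"] = src.Object["spec"]
		return regional.Create(ctx, dst)
	}
	if err != nil {
		return err
	}

	// Spec flows down from the management cluster...
	dst.Object["spec"] = src.Object["spec"]
	if err := regional.Update(ctx, dst); err != nil {
		return err
	}

	// ...and status flows back up from the regional reconciler.
	src.Object["status"] = dst.Object["status"]
	return mgmt.Status().Update(ctx, src)
}
```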
So here's the Teleport-specific version of what this architecture looks like. We have sync controller running in the regional cluster, copying the teleport deployment spec from the management cluster into the regional cluster, where it's reconciled by the Teleport controller that creates the auth server and proxy pods, and any changes to the status are sent back to the management cluster. In the management cluster we have a tenant controller that, for one, does any centralized work: it creates the DynamoDB tables or Athena resources for audit logging, all of the things that are shared. It also creates the set of teleport deployment resources that configure each region. The nice thing is that it can react to changes in those: it can watch for status changes in us-west-2, know when auth has finished deploying, and tell the other regions that they can update their proxies, for example. So it can make decisions based on the changing state in the different clusters. This architecture has worked really, really well for us. It's really nice because we can lose the entire management cluster and all of our regional clusters still operate. We'd have to change the teleport deployments manually in the different regions, but they can keep reconciling forever in that state.
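As a rough illustration of that gating logic (the TeleportDeployment type and its fields below are hypothetical stand-ins, not our actual CRD): the tenant controller only lets proxies roll once the region that runs auth reports the target version in its status.

```go
package tenantsketch

// TeleportDeployment is a hypothetical, simplified stand-in for the per-region
// custom resource described above.
type TeleportDeployment struct {
	Region string
	Spec   struct {
		AuthVersion  string
		ProxyVersion string
	}
	Status struct {
		AuthReadyVersion string
	}
}

// planProxyRollout returns the regions whose proxies can be bumped to the
// target version: proxies only roll after the auth region's status shows auth
// already running the target version.
func planProxyRollout(deployments []TeleportDeployment, authRegion, target string) []string {
	for _, d := range deployments {
		if d.Region == authRegion && d.Status.AuthReadyVersion != target {
			// Auth hasn't finished rolling out yet; hold the proxies back.
			return nil
		}
	}
	regions := make([]string, 0, len(deployments))
	for _, d := range deployments {
		if d.Spec.ProxyVersion != target {
			regions = append(regions, d.Region)
		}
	}
	return regions
}
```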
On top of this, I mentioned earlier that we wanted to store customer data in Postgres. So we built the configuration storage into our customer portal. When customers sign up, or when employees manage customer information (through Teleport, obviously), all the data is stored in Postgres, but the data that also needs to live on the cluster is stored as a Tenant custom resource in a JSONB column in Postgres. Whenever that data changes in Postgres, a sync service reads the change and sends the changed version of the resource to the management cluster. A really cool thing we did here is that we took the OpenAPI schema validations the cluster uses for that CRD and also apply them to validate requests to change customer data at the portal level, so we don't end up storing a resource in the database that wouldn't apply to the cluster. That was surprisingly easy to do using the open source OpenAPI tooling available on GitHub.
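Here's a small sketch of what that portal-side check can look like in Go. This isn't the exact tooling we use; gojsonschema is a stand-in here, and the CRD's openAPIV3Schema is treated as plain JSON Schema, which is close enough for this kind of structural check.

```go
package portalsketch

import (
	"fmt"

	"github.com/xeipuuv/gojsonschema"
)

// ValidateTenantSpec checks a proposed tenant spec (as it would be stored in
// the JSONB column) against the CRD's openAPIV3Schema before writing it to
// Postgres, so invalid resources never reach the database.
func ValidateTenantSpec(openAPIV3Schema []byte, spec map[string]interface{}) error {
	result, err := gojsonschema.Validate(
		gojsonschema.NewStringLoader(string(openAPIV3Schema)),
		gojsonschema.NewGoLoader(spec),
	)
	if err != nil {
		return fmt.Errorf("running schema validation: %w", err)
	}
	if !result.Valid() {
		return fmt.Errorf("tenant spec rejected: %v", result.Errors())
	}
	return nil
}
```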
So that covers the way we do coordinated rollouts across clusters. The next big piece of all this is container networking: how do we let proxies talk to auth servers, and how do we let proxies peer with each other across regions for this massive multi-cluster deployment? What we found was that Cilium global services worked really, really well. We tried some different deployment architectures with Cilium, and we found that a dedicated etcd performs best. It let us deal with a lot of pod churn, whether that's a big update for many tenants at the same time, with lots of new pods spinning up and shutting down, or an update to Cilium itself, where a lot of things get reconfigured; a dedicated etcd ended up performing the best. There are other ways of deploying Cilium global services that you can look into, but that's what worked for us. So to break down what
that looks like, take the diagram you saw earlier with the teleport deployments in each region, and focus for a second on the Services that get created by the Teleport controller there. We have an auth Service and a proxy Service, but we don't have auth running in every region. So whenever proxies in us-east-1 need to speak to auth, those connections go to a Service with the same name in us-east-1, and Cilium automatically provides forwarding connectivity from it to the auth pods in us-west-2. That lets us run our auth servers in multiple availability zones in one region without having to run them in every region, and proxies in all regions still have access to those auth server pods. In some cases we don't have proxies available in a region, and in those cases our custom controllers can also create a global service that redirects proxy connectivity to the closest region that has proxies available.
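For a rough idea of what such a global service looks like (a sketch, not our exact manifest), here's the per-region auth Service in Go. The service.cilium.io/global annotation is the one recent Cilium versions document for Cluster Mesh global services (older versions used io.cilium/global-service); names and ports are illustrative.

```go
package cnisketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// authGlobalService sketches the auth Service created in every region. With
// Cilium Cluster Mesh, identically named Services marked as global let a
// region with no local auth pods transparently reach auth in the region that
// has them.
func authGlobalService(namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "teleport-auth",
			Namespace: namespace,
			Annotations: map[string]string{
				"service.cilium.io/global": "true",
			},
		},
		Spec: corev1.ServiceSpec{
			// In regions with no auth pods the selector matches nothing, and
			// Cilium forwards connections to the remote clusters' backends.
			Selector: map[string]string{"app": "teleport-auth"},
			Ports: []corev1.ServicePort{
				{Name: "auth", Port: 3025, TargetPort: intstr.FromInt(3025)},
			},
		},
	}
}
```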
So that is our journey through the Teleport Cloud architecture, covering ingress, deployment, and container networking. Just as a reminder, a lot of the stuff you saw today is open source. Sync controller is Apache 2.0 licensed; we just open sourced it a couple of days ago, so please check it out. It's not something you deploy to a Kubernetes cluster; it's a reconciler you can import into your own controller manager that lets you build a management plane using your own custom resources. We also maintain a fork of Envoy Gateway that supports ALPN routing and has a couple of other changes we made, a lot of which we actually got upstream to help stabilize parts of Envoy Gateway. And finally, of course, Teleport is Apache 2.0 licensed, so you can check that out as well, deploy the open source version, and see what you think. Last but not least, I want to give a huge thanks to everybody on the Teleport Cloud backend team: Carson, David Tobias and Bert Bernard. You can see the parts they worked on here. This was a huge team effort, and not just by this team, to get this platform together. And that's all I got. Thanks everybody.