Abstract
How did we build a Kubernetes operator that enables 100% uptime for a high-availability, high-workload database?
Operating a distributed, high-load, high-throughput database in the cloud comes with several interesting challenges. In order to manage real-time serving of mission-critical workloads at 100% availability, we developed a Kubernetes operator that handles the operational complexities.
We needed to handle the following requirements:
- Apply live patches
- Replace a live cluster with tens of nodes
- Handle degraded/crashed nodes
Under these conditions:
- High availability - remain 100% online with no downtime
- Operate under very high workloads and traffic
- Manage replicated records across different hardware failure groups (rack awareness)
Due to its stateful nature and the type of workloads that are usually handled, cluster management and recovery are non-trivial. We are using the Operators API to handle that complexity and control the clusters from within Kubernetes.
In this talk we’ll cover the steps we took to plan and execute, the challenges we faced, and the best practices we learned.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, we'd like to invite you to this talk about the Kubernetes
operator that we built at Aerospike for a distributed database. It's the
first time that I'm speaking at a conference in this format of a YouTube
stream, and I'm excited. If you learned anything interesting, please tag me,
please tag the conference, and please use the conference hashtags, Conf
42 and Cloud Native. My name is Natalie Pistunovich.
I'm a developer advocate lead at Aerospike, a Google developer expert
for Go, and an OpenAI developer ambassador, and I
organize the Berlin user groups for the Go community
and Women Techmakers. I'm also organizing the conferences
GopherCon Europe and Cloud Nein, in
and beyond Berlin. And you're welcome to follow me on Twitter; my handle
is @NataliePis. So what's on our agenda today? We're going to talk about
what are Kubernetes operators, what is Aerospike?
Then we're going to see the high-level design of the Aerospike
Kubernetes operator, and I will tell you about some of the engineering
challenges that we faced developing it. So let's start by saying what
a Kubernetes operator is. It's designed for automation.
Kubernetes offers automation out of the box, and with it you
can automate deploying and running workloads.
You can also automate how Kubernetes does that.
The core of the Kubernetes control plane is the
API server. It exposes an HTTP API that
lets end users, different parts of the cluster, and external
components communicate with each other. We're here today to
talk about operators. So what is the definition of an operator?
Operators are clients of the Kubernetes API. They act as
controllers for a custom resource. In Kubernetes,
a controller is a control loop that watches the state of
the cluster and makes or requests changes where needed.
Each controller tries to move the current cluster state closer to the desired
state. The controller tracks at least one Kubernetes resource
type, and as we said, an operator tracks
the state of a custom resource. A custom resource is
an object that extends the Kubernetes API or allows
you to introduce your own API into a project or a cluster.
A custom resource definition, or CRD, is a file
that defines your own object kinds and it lets the API
server handle the entire lifecycle. For example, Aerospike is
a database, and it's a custom resource. Our engineers built an operator for
it because it's not part of the Kubernetes ecosystem. The operator
pattern is used for automating repeatable tasks and it combines
custom resources and custom controllers. It is used
to take the human out of the equation, because doing such
things is boring. And when you do boring things, you make mistakes.
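The control loop behind this pattern can be sketched roughly like this. It's a minimal, self-contained illustration, not real controller code; the `State` type and the one-step `reconcile` function are invented for the example:

```go
package main

import "fmt"

// State is a simplified stand-in for cluster state (e.g. number of pods).
type State struct {
	Replicas int
}

// reconcile moves the current state one step closer to the desired state,
// the way a controller's control loop does on every iteration.
func reconcile(current, desired State) State {
	switch {
	case current.Replicas < desired.Replicas:
		current.Replicas++ // scale up by one
	case current.Replicas > desired.Replicas:
		current.Replicas-- // scale down by one
	}
	return current
}

func main() {
	current, desired := State{Replicas: 1}, State{Replicas: 3}
	// The loop runs until the observed state matches the desired state.
	for current != desired {
		current = reconcile(current, desired)
		fmt.Println("replicas:", current.Replicas)
	}
}
```

A real operator does the same thing, except the "state" is read from and written to the Kubernetes API server rather than held in memory.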
One more time for the people in the back, in case you did not write
your own operator or dive into the ones that you're using,
maybe this was all a little confusing. So let's say that
we start with a user that wants to
do things. So it sends a command to the Kubernetes cluster. The API server
exposes an HTTP API that, as we said,
lets end users, different parts of the cluster, and external components
communicate with one another. Then Kubernetes creates pods to
host the application instance. Each pod is tied to a node.
A cluster is a set of worker machines called nodes that run
containerized apps. Every cluster has at least one
worker node. And let's say in our example we have n pods
that are tied to n nodes. They all are
created by a deployment object. Deployments are usually
used for stateless applications like web servers.
Pods that are deployed by a deployment are identical and interchangeable,
and they're created in a random order with random hashes
in their pod names. On the other hand, stateful sets are
used for stateful applications when dealing with databases.
You definitely want to have a stateful set because we do want to store the
data. Pods that are deployed by a stateful set component are
not identical. They each have their own identity, which they keep
between restarts, and each can be addressed individually.
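To make the identity point concrete, here's a toy sketch of the naming difference. The random suffix is simulated here; real Deployments generate theirs server-side, and this is only an illustration of the convention:

```go
package main

import (
	"fmt"
	"math/rand"
)

// deploymentPodName mimics how Deployment pods get random, interchangeable
// names: the suffix changes on every rollout.
func deploymentPodName(app string) string {
	const chars = "abcdefghijklmnopqrstuvwxyz0123456789"
	suffix := make([]byte, 5)
	for i := range suffix {
		suffix[i] = chars[rand.Intn(len(chars))]
	}
	return fmt.Sprintf("%s-%s", app, suffix)
}

// statefulSetPodName mimics how StatefulSet pods get stable ordinal names,
// so pod db-0 keeps the identity (and its storage) of db-0 across restarts.
func statefulSetPodName(app string, ordinal int) string {
	return fmt.Sprintf("%s-%d", app, ordinal)
}

func main() {
	fmt.Println(deploymentPodName("db"))     // e.g. db-x7k2q, different every time
	fmt.Println(statefulSetPodName("db", 0)) // always db-0
}
```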
A Service is an object that provides an abstract way
to expose an application running on a set of pods as a network
service. A config map is an API object used to store
non-confidential data in key-value pairs.
Pods can consume config maps as environment variables,
as command line arguments, or as configuration files in a volume.
A config map allows you to decouple environment specific configurations
from your container images so that your applications are easily
portable. In kubernetes, controllers are control loops
that watch the state of your cluster and then, as we said,
make or request changes where needed,
in order to bring the cluster state closer
to the desired state. A controller tracks at least one
resource type, and here that resource is a custom resource,
for example our database, and it is handled
by the operator. And both concepts,
controller and operator, they represent patterns.
They don't imply a language-specific implementation or framework,
which means that in order to write a controller or an operator,
you'll need to follow the convention, but you don't need to use any specific language.
So to put this in writing, Kubernetes operators do the following
things: minimize the manual deploying and
lifecycle management, so handle things like resource management,
or complex resource management; scale the
size of the cluster up or down; and upgrade or downgrade the version.
They manage your configuration. And of course they do
monitoring. Basically, they take the human out of the boring part
of the work, to make sure that no mistakes are made.
So in our example, what does the operator manage?
A little bit about Aerospike: it is a NoSQL database
that implements a hybrid memory architecture, where the index
is purely in memory, so not persisted, and the data is
stored on persistent storage and read directly from
the disk. Disk I/O is not required to access the
index, which enables predictable performance. There are
however, strict SLAs: petabytes of data handled in
sub-milliseconds, and there are transactional guarantees,
which basically means that the database transactions provide ACID guarantees.
If needed, you can also have strong consistency,
which is a term you probably know from the CAP theorem for distributed databases.
All those reasons are why clients such as big financial
institutions, for example banks, as well as clients from other industries,
are using Aerospike. Some other features that are a little more technical but
relevant for the rest of this presentation would be rack
awareness, which is a feature that allows you to store different replicas
of records on different hardware failure groups for resilience, and
a multi-cluster XDR, or cross-datacenter replication,
setup. Multi-site clustering is when the nodes
that comprise a single cluster are distributed across different sites:
a physical rack in a data center, an entire data center,
or an availability zone in a cloud region.
Basically, the cluster is stretched across regions and cloud providers,
and it expands horizontally. This uses synchronous replication
to deliver a global distributed transaction capability.
The update speed is only limited by things like the speed of light.
You can also go asynchronous. For that, you'll use the cross-data
center replication setup, which uses asynchronous replication
to connect clusters that are located at geographically distributed
sites. It can easily extend the data infrastructure to any number of
clusters. So we talked a little bit about
the database. Let's talk about the operator. The Kubernetes
operator is driven by a single custom resource (CR),
and it conforms with the operator's custom resource definition (CRD).
The cluster spec includes things like the size,
so the number of nodes per cluster, and the resource allocation requests,
for example the CPU per node. It holds the complete Aerospike
configuration in YAML form, matching the Aerospike
server version, and the operator converts the YAML-based configuration to the Aerospike
format. And it handles the security configuration: TLS, user
management, and so on. Because of some of the special features
that we covered, it has some special considerations
in what it does and how. So deploying
the database clusters is a pretty obvious feature.
It handles lifecycle management, which means database
cluster scale up and down, server version upgrade and downgrade,
aerospike configuration management, rack awareness management,
and cluster access control management. Also it handles all
the fine details of the multi cluster cross data center replication setup.
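As a hedged sketch, the kind of cluster spec described above might look something like the following Go types. The field names here are illustrative only, not the actual Aerospike operator CRD schema:

```go
package main

import "fmt"

// AerospikeClusterSpec is an illustrative sketch of the kind of fields a
// database cluster CR spec carries; the real operator's schema differs.
type AerospikeClusterSpec struct {
	Size       int               // number of nodes in the cluster
	Image      string            // server version to deploy
	CPUPerNode string            // resource allocation request per node
	RackIDs    []int             // hardware failure groups for rack awareness
	Config     map[string]string // Aerospike server configuration
	TLSEnabled bool              // part of the security configuration
}

func main() {
	spec := AerospikeClusterSpec{
		Size:       3,
		Image:      "aerospike/aerospike-server:5.0.0", // hypothetical tag
		CPUPerNode: "4",
		RackIDs:    []int{1, 2, 3},
		Config:     map[string]string{"replication-factor": "2"},
		TLSEnabled: true,
	}
	fmt.Printf("deploying a cluster of %d nodes across %d racks\n",
		spec.Size, len(spec.RackIDs))
}
```

The operator watches a single object of this kind and derives everything else (stateful sets, config maps, services) from it.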
So remember how we said it spreads across the different availability zones,
different cloud providers, and even a combination of cloud and
bare metal? That's a lot of configuration to figure out and keep
up. And it monitors everything. So here are some
of the engineering challenges we faced when we were developing it,
with the first one being the persistent data. Each pod has a dedicated
storage, and as we said, it must be persistent. We are using a database
here. The logic is: if it's new storage, it probably
has old data on it, because just like with computer memory, you cannot assume whether
the storage that you were allocated is empty or
just filled with garbage. But if you're
restarting a pod, it means you probably have the relevant
data there, so you definitely don't want to touch that. When you change the
configuration, when the pod restarts, or when maybe something went wrong,
you do want to save the storage, you do want to reuse that data.
So a restarted pod has no marker, and there's
kind of no Kubernetes-native way of telling
whether this is a restart or a new pod. It also does not help
that you probably have a new image because you did something like a version update.
So how do you do this? The answer is flags. Add a flag
using the init containers. An init container is what you run before
your containers run, and that's how you initialize the devices. This is where
you do the wiping of the data. How do you not wipe data twice?
The operator makes a CR for each resource, in which we create a
singleton instance upon initialization. So basically, when
the wiping happens, we're adding a flag in the file.
And then the next time a pod restarts, it checks the config file,
it sees that the flag exists, so it knows not to wipe the data.
Smart. The next challenge is changes during
a rolling update. Say something happened, and the solution
is to update the server version on all ten nodes:
update node one, complete; node two, complete; node three,
complete; abort! Suddenly you realize this is
not the right thing for you to do, and you want to abort the server
version update. But if the command that you issued is update
on all ten nodes, how are you going to stop that? The way
that we implemented this is that after every operation it re-queues
the reconciliation request. Basically, the operator
is asking the API after every node: now what?
This way, after it completed updating node three to the new version,
the next step that it will receive as a command, in the response
to the question "now what?", would be: roll back, or
update to the old version, node one, then node two, then node three.
And to make things even more efficient, the operator
requests a delay in the response. Let's say that it
knows that the migration of this specific node, which it cannot
abort, will take this amount of time. In a few words, it says
to the API: please respond to me, but not right
away, in a few seconds, because it can be that in those few seconds,
until the migration is over, you will receive yet another change.
So just make sure that you save resources and tell me
what is the most up-to-date thing that I should do. And while it
sounds very trivial to you now, go check out different operators;
I think this is a cool idea. The third challenge is
what happens when you reach a really large scale. And well,
we know that cloud is great for prototyping, but it can get pricey
at a very large scale. And we have customers with
half a trillion objects. To give you a sense of the scale, imagine that $1
buys you three and a half objects. This is a little bit of a
funny thing to say, because an object is kind of a row in the database,
and, well, $1 does not buy you three and a half rows. But let's just
say in this case half a trillion is how much Jeff Bezos
can afford. And remember, we have those SLAs for clients that require
petabytes of traffic in sub-milliseconds. Next,
there is the issue that cloud hardware is not homogeneous.
It gives you a promise of a minimum CPU, but it doesn't commit to
more than that. It means that some of the machines are in the minimal setup,
but others can have a higher setup. And Aerospike, due to
its architecture, is disk-heavy. Sharing the I/O means
you definitely get a slice, but it's hard to cap the size of
it. This means that your machine might respond slowly to messages, and distributed
databases send around a lot of messages to make sure that they're in
sync, to make sure that they know where the most up-to-date replica
of the data is right now, and also to endpoints like the heartbeat.
Then there is also networking. The network is not private,
but you do get a slice of it. However, if you have noisy neighbors,
they can drive your performance down. Also, let's not
even start the conversation about the cascading effects of any
of those interruptions. It's something that is really hard to predict, and,
well, spoiler: an operator alone will not solve this.
So how do our clients solve that? Private cloud. When you get
to a really, really large scale, get a whole host,
split the resources internally, and budget to have
some overcapacity. When your client is, for example, Snap, at their
scale you do want to read from the replica, not from the master.
Be rack aware and keep your communication
local, because latency matters a lot at that scale.
Of course Kubernetes will work great in such a setup. Think about it,
it started inside Google in their private cloud. And yes,
of course the operator will work great in this setup as well.
If you want to read the operator source code, it's open
source, available on GitHub. And here's a recap of what we saw today.
We talked about the Kubernetes operator, how it
controls a custom resource, what that means, and what a
custom resource is. Then we discussed some of the
challenges that we faced when we built our distributed database operator
at Aerospike. And the recommendations that you should take home
are: keep the data upon pod restart, because
it is a database; be able to revert a rolling update immediately, because
that's definitely important; it doesn't happen often, but the one time it does, you
want it to work. And at a very large scale
you can have all sorts of new problems, and you need more than one solution.
And automate, automate, automate. Thank you very much for attending the
talk. Please tweet, please share, and I am
looking forward to all your feedback. Thank you.