Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, my name is Dhiraj Hegde and I'm from Salesforce. Today I'll be
talking about how we manage HBase in the public cloud.
HBase is a distributed key value store that is
horizontally scalable and runs on top of the Hadoop file system. For many
years, Salesforce has been running HBase in its own private data centers,
with a very large footprint of clusters: thousands of machines,
petabytes of data storage and billions
of queries per day. But it had been using very traditional mechanisms
to manage these clusters on bare metal hosts,
using open source tools like Puppet and Ambari.
When we started moving some of our clusters into the public cloud,
we decided to take a very different approach, using Kubernetes to manage these clusters.
We'll explain in this talk why we chose Kubernetes, what challenges
we ran into with our choice, and how we overcame those challenges.
So why did we pick Kubernetes? There are a number of reasons.
One of the main problems that we ran into with our old deployment
mechanism was that it modified hosts in place,
meaning you would go into a host that already had software
installed and try to modify it. The problem with that is that when
you are trying to deploy on thousands of hosts, the installation could
fail on some of them, leaving them in an uncertain state,
with some binaries or configs present and others missing. And if you
happened to miss these failures, those hosts would remain in this
inconsistent state for a very long time. The other thing
we noticed is that there's a temptation, when you're dealing
with emergency issues, maybe a site issue, for people
to just log into hosts and modify configurations locally
and completely forget that they did that. And the problem there again
is that once the change is forgotten, the config looks a little different on that one
host compared to many of the other hosts. With containers, a lot of
these problems go away because of immutability:
a container is created from an image, and that image already contains
all the configuration and all the binaries. As that
container comes up, it is an exact replica of what that image has.
And even if you attempt to make a change after it's running, the next
time the container is restarted, whether from a crash or a deliberate
restart, it will once again start from the image, so whatever
changes you made locally are totally lost. So in that
kind of immutable environment, it is pretty easy to make sure
that, over a period of time, the containers stay consistent with their images,
no matter what people try to do to them or what failures
happen. The second reason why we felt Kubernetes
and containerization was a good thing was availability and scalability.
Kubernetes is really good about making sure
that if you are running your containers on a particular host or
set of hosts, and something happens to those hosts, it can monitor
and catch those issues and immediately restart those containers
on different hosts. By the way, just to be clear:
in Kubernetes, containers are managed by
a construct called pods. But in this talk we are going to treat
the terms containers and pods as one and the same to
keep it simple. They are not exactly the same, but we are not going to
get into that, and for the purposes of this talk it doesn't
really matter much. So coming back to the question of availability,
when pods fail on a particular host, Kubernetes will find
another healthy host, if it can, and move those pods there.
Similarly, when you need to scale up your application, it's very easy in
Kubernetes to specify a higher number of pods to meet
the requirements of a particular traffic pattern. And once the traffic
spike is gone, you can equally quickly reduce the number of pods
that run to serve the traffic. This ability to scale up and down
is very valuable, especially when you are moving to the cloud.
One of the big claims of cloud is that you get elasticity,
and using Kubernetes and its elastic management of pods is one
great way of taking advantage of that elasticity. The third
reason is actually one of the most important reasons why we went with
Kubernetes. We had a goal to make sure that whatever we built
would apply to other clouds. We started, by the way, with AWS,
which is Amazon's cloud offering, but it is very likely that we will be
running our software in other clouds as well; examples of other clouds
are Azure from Microsoft and GCP from Google.
So we wanted to be able to run our software in
all those different environments. The problem is that when
you build your software deployment processes for one cloud, you are so intimately
tied to its APIs and the way it manages compute, storage and network
that whatever you build doesn't naturally apply to
the other clouds. Fortunately, with Kubernetes, things turn
upside down. Kubernetes is actually very opinionated: it
specifies exactly how compute should be managed,
exactly how storage should be managed, and how network should be
managed. At that point, it becomes incumbent upon
the cloud providers to make sure that they manage storage,
network and compute in the way Kubernetes
expects them to be managed. So when you build your software
deployment processes around Kubernetes on any one of those cloud
providers, you automatically get the same deployment
processes working in all the other clouds, because Kubernetes
enforces that behavior. So for those three reasons,
we found Kubernetes to be very interesting to us. Okay,
now we'll get into a little bit about some of the challenges that we had
with Kubernetes once we started using it. To understand them a little
better, you need to know the difference between stateful and stateless applications.
Here we show a typical stateless application:
basically HTTP servers that run a website. Each one of these
server instances has nothing unique in it; they all serve the same content.
If you ever make a request that changes the content,
it usually goes to some back end database
that is shared across all those instances, or maybe even a
shared file system. So essentially each
one of those running servers, which in the Kubernetes world would be pods,
is stateless. And when a client tries to access the service,
it typically goes to a load balancer, as you can see here,
and the load balancer then forwards those requests to
any one of those HTTP servers in the back end. Now the interesting
thing to note here is that the client only needs to know the hostname and
IP address of the load balancer. It doesn't even need to know the
hostnames of the HTTP servers running behind it,
because they are just getting traffic forwarded to them by the load balancer.
Kubernetes usually manages these kinds of environments through
a manifest, which is really a document
describing how many instances of these HTTP servers to
run as pods. It can also describe the configuration of
the load balancer, which is called a Service in Kubernetes, but again it
is specified using a document called a manifest. With that you can
nicely set up all of this. In the early years, this is what Kubernetes
was really known for: the ability to very quickly spin up
a set of stateless applications and provide some sort of service.
But when you look at something like HBase on Hadoop, which is a very
stateful application (it's a database), the behavior is quite different.
Typically you've got a client, and when it needs to access data,
whether for reading and querying or for modification
and deletion, it needs access
to the data servers, which you can see lined up at the bottom,
a number of these data servers.
Each one of those data servers holds very different data.
They typically do not all have the same data,
because then every data server would need a huge amount of storage.
Instead you break the data up into pieces and spread it across a large
number of data servers, and that's how you scale: as you need to store
more data, you just add more data servers. On the left hand side you
typically have something called a metadata server, which is responsible
for knowing where all this data is present. It basically knows the
geography of things. So when a client is
trying to access data, it goes to the metadata server
with a key saying, I want to access this. The metadata
server then gives the location where it can find the data,
and the client goes directly to the data server based
on that information and accesses the data. Now in all of this, you will
notice that the client actually needs to know the identity,
the hostname, of all the elements. It needs to
know the hostnames of the metadata servers that can provide this
information, and once a metadata server provides the hostname
of a data server, the client needs to deal directly with that
particular data server. There's no magical load balancer
hiding things from you, which is why the DNS that you see in
the top right corner is very important: it has to have all the
information about hostnames and IP addresses.
And by the way, the clients usually have a cache,
because having to rediscover this information
every time for a request is very inefficient.
So you have to cache this information and
hope that the stuff you cache doesn't change too often, because every
change is a little bit of a disturbance the client has to deal with, and you
try to minimize that disturbance. All of this makes it
very important that the identity of
the servers you're accessing is stable, that it does not change
very much, and that, given a hostname, the location of what it
contains stays fixed. That association between the name of the
server and the state it contains is very,
very important to the client for good performance. This
is a very good example of what a stateful application is, and you can
see how it's a little different from stateless applications. So the
Kubernetes community, when it decided to support use
cases like Hadoop, HBase or Cassandra,
introduced a feature called StatefulSet.
What a StatefulSet provides is the same ability to create
pods, but when it creates pods, it gives them
unique names. And the names are kind of easy to guess:
they all begin with the same prefix, in this example
"pod", though it could just as well be
"http" or "hadoop" or whatever,
and then each instance of the pod gets
a unique number, like zero, one, two,
depending upon how many of them you want. So each pod has
a unique name. And as you can see in the top right corner,
DNS is also updated to give these pods a
proper hostname and IP address. So you've got everything:
unique names, hostnames and IP addresses associated
with each one of these pods. In addition, since this is a stateful application,
you can also define how much storage you want to associate
with each one of these pods: the size,
and also the class, whether you want SSD or HDD. All of this is specified
using a construct called a persistent volume claim.
It's basically a claim for storage, not the actual storage; you are requesting
storage. And it is embedded in each one of these pod definitions
where a PV claim is specified. When this is defined,
the cloud providers who run Kubernetes, like AWS, Azure or GCP,
will notice the claim and immediately carve out a disk in the cloud
with the size and class of storage being requested,
and make it available to Kubernetes. Kubernetes then mounts that
disk in each of these pods so that it becomes locally available storage
in each of those pods. At that point
you've got a unique name, well defined in DNS,
associated with storage, and this is a one to one mapping
between the two.
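To make that concrete, here is a minimal sketch of what such a StatefulSet could look like in the Go API types. The names, replica count, storage class and size are hypothetical placeholders, and the exact PVC field names can vary slightly between client-go versions, so treat it as an illustration only.

```go
package manifests

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

var dsReplicas int32 = 3 // hypothetical number of data servers
var ssdClass = "ssd"     // hypothetical StorageClass name

// DataServerSet gives each pod a stable name (dataserver-0, dataserver-1, ...)
// and a per-pod volume created from the claim template below.
var DataServerSet = appsv1.StatefulSet{
	ObjectMeta: metav1.ObjectMeta{Name: "dataserver"},
	Spec: appsv1.StatefulSetSpec{
		ServiceName: "dataserver", // headless Service that provides per-pod DNS
		Replicas:    &dsReplicas,
		Selector:    &metav1.LabelSelector{MatchLabels: map[string]string{"app": "dataserver"}},
		Template: corev1.PodTemplateSpec{
			ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "dataserver"}},
			Spec: corev1.PodSpec{
				Containers: []corev1.Container{{
					Name:  "dataserver",
					Image: "example/hbase-dataserver:1.0", // placeholder image
					VolumeMounts: []corev1.VolumeMount{{
						Name:      "data",
						MountPath: "/data",
					}},
				}},
			},
		},
		// Each pod gets its own PersistentVolumeClaim named data-dataserver-<n>;
		// the cloud provider carves out a matching disk for each claim.
		VolumeClaimTemplates: []corev1.PersistentVolumeClaim{{
			ObjectMeta: metav1.ObjectMeta{Name: "data"},
			Spec: corev1.PersistentVolumeClaimSpec{
				AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
				StorageClassName: &ssdClass,
				// In older k8s.io/api versions this field is ResourceRequirements.
				Resources: corev1.VolumeResourceRequirements{
					Requests: corev1.ResourceList{
						corev1.ResourceStorage: resource.MustParse("500Gi"),
					},
				},
			},
		}},
	},
}
```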
Now, what's interesting to note here is this: let's take an example
of any one of these pods, which is running on host A right now.
It has a claim and it is accessing PV zero, which is mounted as a disk
in it. Let's say for some reason host A has a problem and goes up in smoke.
Kubernetes will notice that, decide the pod is gone and remove it from
its system, and you'll notice in the top right corner that DNS
is also modified to remove any DNS entries related to it.
You can still see the storage is present, because the claim that created
the storage is still present; the pod is gone but the claim is still there.
Eventually Kubernetes will find another free
host, like host D in this example, and recreate
the same pod. So pod zero,
the one that got destroyed, is recreated with the same hostname.
And because it has the same claim embedded inside it,
the same storage is again associated with it, and
DNS is updated to have the record again.
One thing that is different is that when the DNS record is
recreated, it does not get the same IP address.
It has the same hostname, but the IP address had to change.
That's just the nature of networking: when you move from one compute
unit to another, the IP address associated with that compute unit has
to change. It's just how the network is managed in Kubernetes. But otherwise
you achieve something quite interesting, which is that a
given hostname is always associated with the same volume,
no matter where your compute moves around inside the Kubernetes
cluster. That stickiness between hostname and volume is a very
interesting and useful property. You'll also notice that the IP address
can change even if the hostname doesn't, and this is important because
we'll get into some of the issues we have because of this a few minutes
from now. So using StatefulSets we were able
to deploy Hadoop and HBase. Going back to
the same slide that I showed you a while back, you can see that each
one of the data servers is deployed as part of a stateful set,
and every one of them has a unique name, like DS-0, DS-1, DS-2,
and so on. Similarly the metadata server, which is also
stateful, has a unique name and disks associated
with it. So we were able to model and deploy our software
using StatefulSets pretty well. StatefulSets are managed
by a controller called the StatefulSet controller, or the STS
controller, in Kubernetes. And while we found
many of its features around managing compute, storage,
failover, et cetera, very useful, we also
had some challenges with it. One area where there
was a problem was its rolling upgrade process, where you're trying to upgrade
the software. The way it does an upgrade is that it starts with the
pod with the highest number and goes one by one,
upgrading each of them in
strict order all the way down to zero. And while this is a very
nice and careful way of upgrading software,
it is also very slow. You can imagine
that in the world of Hadoop and HBase
you've got hundreds of these pods, and each one of
them is a heavy server that takes around five minutes to boot up,
initialize and set up its security credentials,
such as Kerberos keytabs, et cetera. So going
through them one by one would take a very long time and
almost make it impractical for us to use such a valuable feature.
Fortunately, Kubernetes is also very extensible,
so you can go in and modify behavior,
or introduce new behavior, by providing your own controllers in certain
areas. In this particular case, we were able to
build a new controller, which we call the custom controller,
which works in cooperation with the
default StatefulSet controller. The StatefulSet
controller continues to create pods, create storage
and coordinate the mounting and all of
that, whereas the custom controller that we built is
in charge of deciding which pods get
deleted next in order to be replaced. So the deletion
is the custom controller's job, and the rest of
it remains the existing StatefulSet controller's job. This is
enabled by the OnDelete update strategy in StatefulSets,
if you're interested in looking it up.
With this ability, the custom controller can do batching:
it goes after a batch of pods and deletes them first,
and then the StatefulSet controller notices that these pods
are missing and recreates them with the new configuration.
The custom controller then moves to the next batch,
of three in this case, deletes them, and the StatefulSet controller
does the remaining work of bringing up the new ones.
In this manner, by coordinating with the existing behavior,
we were able to get batched upgrades in Kubernetes,
which solved a very big problem we initially faced.
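A minimal sketch of that batching loop, assuming client-go and made-up names (namespace, label selector, target revision, batch size); the real controller naturally does far more bookkeeping than this:

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// needsUpgrade is a stand-in for "this pod still runs the old spec"; a real
// controller would compare the pod's controller-revision-hash label with the
// StatefulSet's updateRevision.
func needsUpgrade(pod corev1.Pod, targetRevision string) bool {
	return pod.Labels["controller-revision-hash"] != targetRevision
}

func upgradeInBatches(ctx context.Context, cs kubernetes.Interface,
	ns, selector, targetRevision string, batchSize int) error {
	for {
		pods, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return err
		}
		// Collect up to batchSize pods that still run the old configuration.
		var batch []string
		for _, p := range pods.Items {
			if needsUpgrade(p, targetRevision) && len(batch) < batchSize {
				batch = append(batch, p.Name)
			}
		}
		if len(batch) == 0 {
			return nil // everything is on the new revision
		}
		// Delete the batch; with updateStrategy OnDelete, the StatefulSet
		// controller recreates each pod with the new configuration.
		for _, name := range batch {
			if err := cs.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
				return err
			}
			fmt.Println("deleted", name)
		}
		// Wait for the replacements to come back before the next batch
		// (a real controller would watch pod readiness instead of sleeping).
		time.Sleep(2 * time.Minute)
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	_ = upgradeInBatches(context.Background(), cs, "hbase", "app=dataserver", "example-target-revision", 3)
}
```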
Another limitation we had was that in Kubernetes,
when you are deploying your services, you also define what is called
a pod disruption budget. This is important to make sure that
whatever operations you perform in your cluster, you put a limit
on how many pods are unhealthy, or disrupted, to use
that terminology, at any one time. In this example, let's say that
the pod disruption budget is one. What you're saying is that at
any given time in your cluster, you will disrupt at most one pod
and not more than that when you're doing any of your administrative tasks.
Now, the problem is that if more than one of your pods
is unhealthy, in this case pod three and pod one are in an unhealthy state
because of some issue, and you are trying to
upgrade that particular StatefulSet,
maybe because you want to fix the issue by deploying new code,
then, since the upgrade always starts with the highest number,
pod five in this case,
Kubernetes will prevent the upgrade,
because it would increase the number
of unhealthy pods: you have to destroy a healthy pod to create a
new one, and that would push you past the disruption budget.
So in this case, again, the custom controller is really useful.
What it does is go after the unhealthy pods first while
doing upgrades, instead of following the strict ordering:
delete the first unhealthy pod and replace it with
a healthy one, then go after the next unhealthy pod
and replace it with a new one, and finally move on to the healthy
pods, which can now be replaced because there are no unhealthy pods left.
In this way, we were able to overcome any blockage
due to the pod disruption budget and move the rolling upgrade forward.
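As a rough sketch, that ordering logic amounts to sorting the candidate pods so that unready ones are replaced first. The readiness check below uses the standard PodReady condition; everything else is a simplified assumption.

```go
package controller

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// isReady reports whether the pod's PodReady condition is True.
func isReady(pod corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// orderForUpgrade puts unhealthy pods at the front of the upgrade queue so
// that replacing them never spends the disruption budget on healthy pods.
func orderForUpgrade(pods []corev1.Pod) []corev1.Pod {
	out := append([]corev1.Pod(nil), pods...)
	sort.SliceStable(out, func(i, j int) bool {
		return !isReady(out[i]) && isReady(out[j])
	})
	return out
}
```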
Another interesting problem we had is kind of unique to
stateful applications, especially things like Zookeeper
and many other such services. You have a number of
instances, but one of them is elected leader
of the group, and it has certain responsibilities as
a result. And to create a leader, you have to go through an
election process, so there is some activity and delay involved.
Unfortunately, Kubernetes at its level
knows nothing about this leader business. So the
controller would typically just go after the highest-numbered
pod, and if that pod is disrupted and a new one
is created, the leader might be reelected onto one of the older pods.
The next upgrade would then hit that leader again, and once again you would have
an election. And if you're really unlucky, the third pod would also be from
the older set of pods, so you end up disrupting the
leader three times in this case. You can imagine that in a real cluster
this repeated leader election would be very disruptive
to the cluster. To avoid this,
once again, the custom controller came to the rescue.
We built sidecar containers, basically logic
that runs inside each one of these pods, which checks
to see if the pod is the leader and makes that information available through
labels on the pod. The custom controller monitors
all these pods to see which one of them has this leader label,
and it avoids that particular leader and updates
all the other pods first, then finally goes and updates the leader.
So you end up disrupting the leader pod only once
throughout this process, which was a nice capability we got
thanks to this custom controller.
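A minimal sketch of the sidecar side of this, assuming client-go, a hypothetical isLeader() check against the application, and a made-up "leader" label key; the real implementation details differ:

```go
package sidecar

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// isLeader would ask the application (for example, ZooKeeper's admin
// interface) whether this instance currently holds leadership; stubbed here.
func isLeader() bool { return false }

// publishLeaderLabel periodically patches this pod's "leader" label so the
// custom controller can see which pod to upgrade last.
func publishLeaderLabel(ctx context.Context, cs kubernetes.Interface, ns, podName string) {
	for {
		patch := fmt.Sprintf(`{"metadata":{"labels":{"leader":"%t"}}}`, isLeader())
		_, err := cs.CoreV1().Pods(ns).Patch(ctx, podName,
			types.StrategicMergePatchType, []byte(patch), metav1.PatchOptions{})
		if err != nil {
			fmt.Fprintln(os.Stderr, "label patch failed:", err)
		}
		time.Sleep(30 * time.Second)
	}
}
```

On the controller side, the upgrade ordering then simply becomes "pods without the leader label first, leader last", much like the unhealthy-first sorting shown earlier.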
Another area of problems that we experienced was around DNS.
As you can imagine, Kubernetes is a very dynamic world:
as pods move from one host to another, even though they keep the same
hostnames, the IP addresses keep changing, and I went over that earlier.
This creates a strange problem, because
traditional software like HBase, the Hadoop file
system and so on was largely developed in an environment where
DNS did not change much. As a result,
there were a lot of bugs in this code base where
it would resolve a DNS hostname to an IP address and
cache that information literally forever.
You can imagine that with that kind of code,
you would have invalid information in the software pretty quickly.
In particular, what we noticed is that when the metadata
servers had their IP addresses change,
the large number of data servers that had to
talk to them were losing their connections
to the metadata servers as the IP addresses changed.
Now obviously the proper fix is to go into
the open source code, find all the places where it is
holding on to these addresses, and fix
those bugs. But with a code base as large as the
Hadoop file system and HBase, it's
challenging to find all the places where this issue exists, and especially
when we had to get our software out,
depending on our eyeballing capabilities, or on a test matrix,
to find all these issues seemed a little risky.
So what we ended up doing, even as we went about fixing
these bugs, was to come up with a solution where for
each one of our metadata server pods we put a load balancer in front,
which is called a Service in Kubernetes. It's actually not a physical
load balancer; it's a virtual one that works using network
magic, really, with no physical load balancer involved. So we
created this virtual load balancer in front of each one of
these metadata server instances. What that does is that
when you create such a load balancer, it gets not only
a hostname but also an IP address, and that IP address is
very static in nature: it doesn't change as long as you don't delete the
load balancer. So even though your pods may be changing their
IP addresses, the load balancer does not. When
the client tries to contact the pod, it first goes to the load
balancer, and the load balancer forwards the request to the pod.
So we sort of recreated the stateless application methodology,
at least for the metadata servers, so that we
could protect ourselves from IP address related issues.
In the meantime we also went about finding
all these bugs using various testing mechanisms and eliminating
them, but this gave us some breathing room.
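Here is a minimal sketch of what such a per-pod Service could look like in the Go API types. It relies on the statefulset.kubernetes.io/pod-name label that the StatefulSet controller puts on each pod, which makes it easy to target exactly one pod; the names and port are placeholders.

```go
package manifests

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// metadataService builds a virtual load balancer (a ClusterIP Service) that
// fronts exactly one metadata-server pod, e.g. "metadataserver-0". The
// Service's IP stays stable even as the pod's own IP changes across restarts.
func metadataService(podName string) corev1.Service {
	return corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("%s-svc", podName)},
		Spec: corev1.ServiceSpec{
			Type: corev1.ServiceTypeClusterIP,
			// This label is set by the StatefulSet controller on every pod,
			// so the selector matches one specific pod only.
			Selector: map[string]string{"statefulset.kubernetes.io/pod-name": podName},
			Ports: []corev1.ServicePort{{
				Port: 8020, // placeholder port for the metadata server RPC
			}},
		},
	}
}
```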
Another interesting issue you have with Kubernetes is how DNS actually
works inside it. There's a dedicated DNS server
providing all this support for the changing IP
addresses. It's called CoreDNS, and it runs inside the
Kubernetes cluster. As you create and delete pods of various
types, CoreDNS is the one
that keeps track of when to create a DNS record and when to delete
it. The problem with this approach is that while it all works
great on the server side, there's no guarantee that your clients are
actually running in the same Kubernetes cluster. In the real
world your client is typically outside the Kubernetes cluster;
it's probably running on some external host or VM,
not inside your Kubernetes cluster. And that
client typically depends on a different DNS server,
the global DNS server that is visible across a
large number of environments, not CoreDNS, which is
visible only inside the Kubernetes cluster. So to deal
with this issue, you have to find a way of getting your
records from CoreDNS into the global DNS;
otherwise your clients would not know how to contact
your services. In our case we used an open source
tool called external-dns. It's what most people use
when they're dealing with this particular scenario.
What external-dns does is
transfer the DNS records that are within the Kubernetes cluster
into the global DNS server. I've simplified the picture here by showing
it as moving data from CoreDNS to global
DNS; in reality that's not exactly how it does it,
but in effect it has the same impact:
it makes sure that those DNS records are available in global
DNS. Once they are available there, the client is able
to contact your data servers
and communicate with them effectively. Now, one challenge with
this approach is that external-dns only runs periodically,
every minute or so, so your DNS
records are not immediately available in global DNS.
For example, if data server four here is just booting
up, it should not go online until it's absolutely
certain that its DNS records are available in global DNS.
So we actually had to build logic to validate
that global DNS has received its DNS records.
Only once that's confirmed does the data server declare
itself available for traffic. This is one of those steps
you might have to deal with in the real world when you're trying
to use Kubernetes and stateful applications in general.
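A minimal sketch of that gate, assuming the pod knows its own fully qualified hostname and the address of the global DNS server (both placeholders here); it simply retries a lookup against that server until the record appears.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// waitForGlobalDNS resolves fqdn against a specific (global) DNS server and
// returns once a record exists, so the server can safely declare itself ready.
func waitForGlobalDNS(ctx context.Context, fqdn, dnsServer string) error {
	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			// Ignore the default resolver address and ask the global DNS server.
			return d.DialContext(ctx, network, dnsServer)
		},
	}
	for {
		addrs, err := r.LookupHost(ctx, fqdn)
		if err == nil && len(addrs) > 0 {
			fmt.Println(fqdn, "is visible in global DNS as", addrs)
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(5 * time.Second):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()
	// Placeholder names: the pod's externally visible hostname and the
	// global DNS server's address.
	if err := waitForGlobalDNS(ctx, "dataserver-4.example.internal", "10.0.0.2:53"); err != nil {
		panic(err) // stay not-ready; do not serve traffic
	}
	// ...only now start serving / pass the readiness probe.
}
```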
Now, finally, I want to talk a little bit about scalability and architecture in
the public cloud. Typically you deploy software in a
certain region. You can deploy it across multiple
regions, but if you are building high performance software that needs
very low latency, you deploy it
in a particular region, which is really a geographical region like
US East or US West.
Within that region you can also spread your software
across different availability zones. Availability zones can
mean different things for different cloud providers, but typically
each is a separate building, or even a separate
campus, close enough to the others that
the network latency between the different availability zones is
not too high. So you can spread your software across them without
experiencing too much of a performance issue.
I'll be calling availability zones AZs for
short here. The goal is typically
to take a few AZs, in our case
we took a three AZ approach, and make sure that the
instances of your software are spread across each
one of these AZs.
To achieve this, fortunately, there has been significant
effort in Kubernetes to support this kind of deployment.
It has something called affinity and anti-affinity rules,
where you can tell the Kubernetes scheduler
to spread the pods across different AZs.
The way it works is that the hosts running in each AZ
carry a label indicating which AZ the host is
running in, and you tell Kubernetes to
make sure that when it deploys these pods, they run on
hosts with different label values as much as possible.
Obviously you will have more than three pods, so you're not
going to be able to put them all in different AZs, but it makes a
best effort to balance them equally across the AZs.
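A minimal sketch of such a rule on a pod template, using the standard zone topology key; the app label selector is a placeholder.

```go
package manifests

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// zoneSpreadAffinity asks the scheduler to avoid co-locating pods that carry
// the given app label in the same availability zone, on a best-effort basis.
func zoneSpreadAffinity(appLabel string) *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.WeightedPodAffinityTerm{{
				Weight: 100,
				PodAffinityTerm: corev1.PodAffinityTerm{
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"app": appLabel},
					},
					// Hosts are labeled with the AZ they run in; spreading is
					// done across distinct values of this key.
					TopologyKey: "topology.kubernetes.io/zone",
				},
			}},
		},
	}
}
```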
Now that takes care of making sure that your software
is running in different AZs, but what about the data inside that
software? A good example is the Hadoop file system,
which keeps three copies of the data for high availability reasons.
You want to make sure that each one of those
copies lives in a different AZ for safety reasons.
Fortunately, when Hadoop itself was designed,
it introduced a concept called rack topology, which comes from
traditional data center terminology: you tell
the Hadoop file system, in particular its metadata server,
what the topology of your servers is,
that is, which racks they run in, and Hadoop will make sure that
the replicas are distributed across
different racks, so that if one whole rack goes down,
you still have other racks that can serve the data.
We were then able to convince Hadoop, through its
script based interface, that each AZ is a different rack,
and thereby Hadoop was able to spread the replicas
across different AZs using that mapping.
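Here is a sketch of a rack-awareness helper in the spirit of Hadoop's script-based interface (net.topology.script.file.name): the script receives data server hostnames or IPs as arguments and prints one "rack" per argument. How the AZ is looked up is deployment-specific; this version reads a hypothetical mapping file with lines of the form "<host-or-ip> <zone>".

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// loadZones reads a simple "host zone" mapping; missing or unreadable files
// just mean every host falls back to the default rack.
func loadZones(path string) map[string]string {
	zones := map[string]string{}
	f, err := os.Open(path)
	if err != nil {
		return zones
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 {
			zones[fields[0]] = fields[1]
		}
	}
	return zones
}

func main() {
	zones := loadZones("/etc/hadoop/host-zones.map") // hypothetical path
	for _, host := range os.Args[1:] {
		if zone, ok := zones[host]; ok {
			fmt.Printf("/%s\n", zone) // e.g. /us-east-1a: one "rack" per AZ
		} else {
			fmt.Println("/default-rack")
		}
	}
}
```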
The metadata server itself had multiple instances, which are
again spread across different AZs
using the same affinity and anti-affinity rules that Kubernetes
supports. What you achieve with all this spreading is
that if an entire AZ goes away due to a power
outage or a network outage, you still have the software and
its data available in the other AZs, still
serving traffic in spite of the outage. So it's really useful
for resilience in general. So that's
pretty much all I had for today's talk. Thank you
so much for listening.