Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, welcome to Conf42 Cloud Native. We're going to talk about
Service Mesh to Service Meshes. Do you
need a service mesh? How do you get started? We'll see demos of both Linkerd
and Istio in this process. So let's jump in. Here's the part where
I tell you, I am definitely going to post the slides on my
site tonight. I've been in enough talks that have done
similar things. The slides are online, right? Right now let's
head to robrich.org. We'll click on presentations here at the top
and we can see Service Mesh to Service Meshes right there.
While we're here on robrich.org, let's click on About Me and see some
of the things that I've done recently.
Both Docker and Microsoft have given me some awards and
AZ GiveCamp is really fun. AZ GiveCamp brings volunteer developers
together with charities to build free software. We start building software
Friday after work; Sunday afternoon we deliver the completed software to
the charities. Sleep is optional, caffeine provided. If you're in
Phoenix, come join us at the next AZ GiveCamp. Or if you'd like a
GiveCamp closer to you, hit me up here at the event or
on any of the socials here, and let's get a GiveCamp
in your neighborhood too. Some of the other things that I've done, I was
awarded a tip of the captain's hat award by Docker last year.
That was a lot of fun and one of the things I'm particularly proud of,
I replied to a .NET Rocks podcast episode. They read my comments
on the air and they sent me a mug. So there's my claim
to fame, my coveted .NET Rocks mug.
So let's dig into Service Mesh to Service Meshes.
Let's start with an analogy. Let's imagine
that we're learning how to drive. Do you remember when
you learned how to drive? Do you remember the fun it was to be able
to hit that open road? You know, the excitement of going beyond
just the current neighborhood into the next town,
maybe even across the country? You know how
fun it was to drive. Well, let's imagine a small town
and yeah, we can drive anywhere. We can drive as fast as
we want to. We can really enjoy the road.
Well, over time the town starts to grow up,
the traffic gets a little bit more congested, and now
we've got traffic. So how do
we solve traffic here in our small town? Well, I know
it's those people coming into town that shouldn't be here so
let's set up a traffic cop right at the edge of town. Anyone going faster
or slower than we want, we'll penalize them and enforce conformity
across our town. Now the traffic is flowing. Yeah,
we enforced conformity, but we didn't really optimize
travel. We optimized mediocrity.
Really what we want to do is something like this. We want the cars to
be able to communicate with each other, prioritize the traffic.
Those cars that want to go fast or that need urgent access,
they can go in one lane, and other cars that might go slower can go
in other lanes. And we can coordinate this traffic to ensure
that everyone reaches their destination with as much fun
and expediency as is comfortable for their system.
Yeah, if we could prioritize the traffic and communicate
together, we wouldn't have to aim for mediocrity.
We could excel at defining the system.
Yeah, we'll take a similar analogy as we start to
look at service mesh. Do we just want to aim for conformity,
or do we want to do something excellent? So we'll
take a look at what is a service mesh? Why would I use it?
How do I get started? What are the benefits of it? We'll see a demo
of both Istio and Linkerd in this process.
And finally, we'll talk about best practices.
First up, a service mesh. A service mesh manages traffic
between services in a graceful and scalable way.
Or, said differently, a service mesh is the answer to the question,
how do I observe, control, and secure the
communication between my microservices? Now, if you have the
need to observe, control, or secure traffic between your microservices,
a service mesh may be a great solution. If you just have one microservice,
it might be a little bit overkill. Let's dive into each
of those. Observe: we want to be able to watch the
traffic flowing between our containers in
our Kubernetes cluster and get a feel for how they behave.
Are we getting microservices calling into places that they shouldn't?
Are we getting rogue traffic coming through our system?
Are services online? Are they behaving as expected?
These are all things that we can observe as we get a service mesh in
place. Next, let's upgrade to control.
Let's create policies within our cluster that
say this service can speak to this service, this service can accept traffic
from this service. But all this other traffic that we really don't
understand, we're just going to shut it down. We don't want rogue services
calling into our project just because they happen to start
up a pod there. Now, we do need to work carefully with developers
to ensure the applications work as designed,
but we can also stop rogue applications that happen to pop up
within our cluster. They just can't get
to our services. We've walled off the services to
match the needs of those particular applications.
Next, we can secure. Now, the beauty of securing our
applications is by default, within Kubernetes, all of our
services communicate over HTTP unencrypted.
Now, maybe they're doing gRPC or other
forms of communication, REST or GraphQL,
but at the end of the day they're doing HTTP, and they're doing it unencrypted.
Well, if we have the need to encrypt traffic within our cluster,
we can use a service mesh's mutual TLS to create
encrypted tunnels, where services communicating with each other
go through secure channels without needing to
change our application. Now, back in the day when we had monoliths,
it was really easy. We deployed all of the pieces of our application holistically
together. As containers came about, we were able to split our
application into lots of different services. Now, we love this
because now we can deploy little pieces, scale them independently,
replace them independently, maybe even develop them independently.
We can build and deploy and scale our services much easier than
we could in a monolithic system. But now
our application's internal pieces have IP addresses.
Each microservice owns its own
data, and we've contained that mechanism.
The user interface is able to call the microservices that they need to,
and everything is fine. As we talk about traffic
within our cluster, we'll talk about both north south traffic
and east west traffic. North south
traffic is traffic flowing into or out of our cluster.
By comparison, east west traffic is traffic flowing between our microservices
inside of our cluster. And the beauty here is that a
service mesh can secure both. Well, what came before this?
Back in the day, we had an API gateway. We could think of this
as like a fence around our cluster. Now that's great.
We had a traffic cop at the edge of town, and we were making sure
that anyone that came into town was behaving as expected. But what about the people
who are already in town? What about the traffic already
in our cluster? We can see that the API gateway
has no visibility into microservices calling each other's
data stores, or microservices calling other microservices that
it shouldn't. The API gateway is merely a fence
around our system. Now, it's a great fence.
We can use it for monitoring inbound traffic. We can use it for
counting usage and billing back to those systems
that need it. But it can only see
traffic at the boundary of our cluster. It can't see traffic within
our cluster. It can see north south traffic, it can't see east west
traffic. So now what?
Well, let's take a look at how service mesh works.
Now, what's really cool is if service A needs to call service B without
a service mesh, it just calls it. But if service A needs to
call service B within a service mesh, it works like this.
We start out with service A inside of its own pod,
reaching out to this proxy. Now, this proxy was deployed as part
of this pod to ensure that service a can communicate securely with all the
things. This proxy reaches out to the service mesh
control plane, different from the cluster control plane, and the service mesh
control plane can validate that traffic. Am I allowed to talk to
service B? In this case, the service mesh says yes.
Now, this proxy connects to service B's proxy, and service
B's proxy again reaches out to the service mesh. Am I allowed to
accept traffic from service A? In this case, the service mesh says
yes, and the traffic is forwarded on to service B. Service B
replies, and across that proxy, the response goes to Service A.
Now, the beauty here is inside the pod, all the traffic
can communicate between the service and its proxy just across
localhost. But anytime it leaves that pod boundary,
it's going to run through this proxy connection. And the
beauty here is that we can secure this connection with mutual TLS. So this
side has a certificate, that side has a certificate. It's bound to
the trust chain within the service mesh. And now we have a great communication
pattern that is secure anytime traffic leaves the pod.
And we did all that without needing to modify service A or
service B. Service A talks to its proxy; the proxy talks to the
service mesh; the service mesh says yes, let's create a mutual TLS
tunnel. Service B's proxy reaches out to the mesh, then
forwards the traffic off to service B. And all this happens transparently to
the two services, which don't need to know anything about the service mesh.
Now, that's great. We could also replace
this with ingress, or replace this with egress so
that traffic going into or out of our cluster is also
secured with mutual TLS and validated by the service mesh.
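As a concrete sketch of what "encrypted without changing the application" can look like, here's how you might require strict mutual TLS in Istio. This is a hypothetical example, not part of the demo; the namespace is a placeholder, and Linkerd turns on mTLS by default without a policy like this.

```shell
# Hypothetical sketch: require mutual TLS for every workload in the
# "default" namespace; plaintext connections to sidecarred pods are refused.
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: default
spec:
  mtls:
    mode: STRICT
EOF
```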
So with service meshes we can observe, control, and
secure the traffic going through our cluster, because we're
proxying all the traffic through these Envoy proxies.
Now that's great. Now that
all the traffic is flowing between these proxies, we can observe it,
we can visualize it, we can understand the system.
We can also control it: no, service A is not allowed to
connect to service B, or rogue service X is not allowed to connect to
service B. And then finally we can secure it with mutual TLS,
mutual TLS through a trust chain to the
service mesh, which may also have a trust chain into our PKI
system. Now it's more than just a proxy.
Let's take a look at the other features that the service mesh might give us.
Because all the traffic is flowing through this proxy, we can
start to build a network topology. Now what's interesting here, this is
not the way the architect designed the system, but what we've observed
from actual traffic flowing through the system. We can build these graphs
that will have really impactful, meaningful details.
Let's compare it to the architect's version and see if maybe we
didn't deploy all the pieces, or maybe we accidentally turned off a service
with a feature flag. Next we can take a
look at service health. Now the beauty here with monitoring
service health is that we can capture 500s or high-latency
responses and start to report that back to the controllers.
Now here we can take a look at the traffic flowing through our
cluster. We can compare it to known good things. We can
understand when our cluster is starting to misbehave. This is perfect.
And we can also log. Let's log all the traffic between all
the services. Let's log the HTTP status codes, the results,
and the call chains between the services.
We have a really great mechanism of being able to capture
the network traffic going between these systems.
Let's level up again and take a look at additional features that
a service mesh can bring us. We can do A/B testing
now, because we're routing through this Envoy proxy,
the service mesh could redirect traffic to two different versions.
Let's create a version a and a version b and see how they perform,
and then lean into the one that performs best.
We can also create a beta channel. Let's create a new
version of our software that maybe we don't have as much confidence in,
or maybe has advanced features that we want to get early feedback on
and enroll certain users in that beta channel or canary release.
Once we validate that the system works as expected, now we
can roll it out to the rest of the users as well. Some users may
really enjoy being part of that early feedback cycle and get
access to features as soon as they're available,
and we can create circuit breakers. If a service
becomes overloaded, it's really easy for us to accidentally topple
over that service. Well, all of the clients noticing that they didn't
get a response and presuming that it's just intermittent network
traffic might say, well, let me just retry it.
As soon as the service comes back online, it gets overwhelmed with all
of the requests coming in from all of those services that are retrying
and promptly falls over again.
So we can put in a circuit breaker that says, hey, this
service is not doing well, I'm just going to fail all these requests right now
and let the service start back up gently, reach a
healthy state. Now we'll send in a little traffic, and unlike the circuit breakers
in our house, the machinery can automatically turn this back on once
the service is healthy. These are features that we get out of a service mesh
because we're proxying all the traffic between all of our services
within the Kubernetes cluster. We also get some
really great dashboards that allow us to visualize the traffic and
understand the health of our system.
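As an illustration, a circuit breaker like the one just described might be sketched in Istio as a DestinationRule with outlier detection. The service name and the thresholds here are hypothetical, chosen only to show the shape of the configuration.

```shell
# Hypothetical sketch: eject a backend that returns five consecutive
# 5xx errors, keep it out for a while, then re-admit it automatically
# once it's healthy again.
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-circuit-breaker
spec:
  host: reviews
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5     # trip after 5 straight failures
      interval: 10s               # how often to evaluate each host
      baseEjectionTime: 30s       # keep the host out at least this long
      maxEjectionPercent: 100     # allow ejecting every unhealthy host
EOF
```

Unlike the breakers in our house, the mesh closes this circuit on its own as hosts recover.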
Yeah, we started out with a system where everything
was calling everything, and we really don't like that mechanism. We grabbed
a service mesh to be able to control, observe and secure
the traffic within our cluster to ensure that our microservices
are calling the appropriate endpoints and rogue microservices aren't
able to exfiltrate data from our system. Let's take a
look at some service meshes. Now, as we look at service meshes,
we'll compare quite a few examples of this. Now,
service meshes are getting built really fast right now, and their features
are evolving quickly. So we're not going to compare feature sets,
but rather methodologies of these systems. We'll look in particular
at Istio and Linkerd, but there's many more service meshes that you
may choose from. For the longest time, Linkerd was
the only one in CNCF, and so it became quite popular.
Istio was amazingly popular, but had some governance
restrictions that are now no longer the case. So do
you need Istio or Linkerd? Those are good places to start, and as
you search for those, you may find others that best match your needs.
First up, Linkerd. Now, Linkerd's methodology
is a very simple install that's really
easy to use. They focus on having everything that
you need to get started in the box.
Now, that's great. You can get started really easily, but it does mean
that if you want to stray beyond their initial set of features,
that you'll probably need to look to third parties to be able to augment Linkerd.
Linkerd is great at contributing back to the Rust
community; a lot of the Rust networking stack was actually
built to facilitate Linkerd.
Next up, Istio. Now, Istio's methodology
is very different. It tries to include the best of open
source projects to ensure that you have all of the features that you
need. Then you can turn on and off features based on
profiles or based on just turning features on and off, and then
you can tune Istio to be exactly the thing that you need.
Now we'll dig in deep with Istio's virtual services to
see how we might choose to host some traffic in one service version
and some traffic in another, an A/B test. This is
a feature of all service meshes, but we'll get to see it here in Istio.
So let's take a look at these. First up,
let's fire up Linkerd. Oh, let's not
fire up that one, let's fire up this one. Let's use Linkerd.
And what Linkerd focuses on is a really elegant and smooth
install experience. So let's head off to the Linkerd
docs and take a look at getting started. Well, I start off by downloading
the Linkerd CLI, and then I can
do a linkerd check --pre. It says, I know
that Linkerd isn't installed yet, but let's just validate that the cluster is ready.
Then I'll install the CRDs, then I'll install Linkerd, and
then I can run linkerd check. I've already done these just to
speed up this presentation. Next we can take a look at
the dashboards. So we'll
say linkerd viz install, and we'll install that.
I've already done it, but let's do it again just in case.
Let's get Linkerd viz installed, and then next up we can
check to see if Linkerd is running.
So, linkerd check. And what I like about this is that
not only will it validate that Linkerd is running, but it'll also wait for
it if it isn't. So let's double check that the
viz extension is in place and once we get the green
light now we know that Linkerd is ready to go.
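The full sequence I ran ahead of time looks roughly like this, following the Linkerd getting-started docs; the exact version downloaded and the flags may differ on your cluster.

```shell
# Download the Linkerd CLI (adds it under ~/.linkerd2/bin).
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh

linkerd check --pre                          # validate the cluster before installing
linkerd install --crds | kubectl apply -f -  # install the CRDs
linkerd install | kubectl apply -f -         # install the control plane
linkerd check                                # wait until everything reports ready
linkerd viz install | kubectl apply -f -     # install the dashboard extension
linkerd viz check                            # confirm the viz extension is healthy
```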
Now Linkerd will annotate namespaces
to show which namespaces should have that sidecar
applied. We can see the annotation here on our default namespace:
linkerd.io/inject: enabled.
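If you're following along, marking a namespace for injection is one command; this sketch assumes the default namespace, as in the demo.

```shell
# Linkerd's proxy injector watches for this annotation on namespaces
# (or on individual workloads) and adds the sidecar to newly created pods.
kubectl annotate namespace default linkerd.io/inject=enabled
```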
And so now anything that we start in the default namespace will get that
sidecar applied. Well, let's take a look at the
dashboard inside Linkerd:
linkerd viz dashboard.
And now we've started the built in dashboard for Linkerd,
we can take a look at the various namespaces in our system, and take a
look at the automatic discovery
of the service integration, because we've got them injected
through Linkerd. Yep, Linkerd is running for Linkerd.
And then we can take a look at all the deployments and the
health of those services. Picking a particular service.
We can take a look at the details of that service. Well, it looks like
we're up 100% of the time now, and we have the references of what calls
what on the way past. That's really elegant. Now if we
don't want to view it through a UI, we can definitely do it from the
command line as well: linkerd viz stat.
And I'll take a look at the Linkerd namespace and take
a look at deployments. Here's that same output from the command line.
And I could also grab it from the Prometheus metrics and pipe
it off to Grafana or Splunk or another system.
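The command-line version of those golden metrics might look like this, assuming the viz extension is installed as above.

```shell
# Success rate, requests per second, and latency percentiles for each
# deployment in the linkerd namespace.
linkerd viz stat deployments -n linkerd
```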
So that was great to be able to take a look at Linkerd. The install
experience is super fast and allows us to get going really easily.
It is a bit bare-bones; they put everything you need to start in the
box, so if we want to go farther we may need to reach out to
third parties. Next, let's take a look at Istio.
Now with Istio we have a similar setup for getting started.
We can start by downloading the
Istio CLI, and then once we've got
that in our path, we'll install Istio, picking the profile that
we want. In this case we'll use demo, which turns everything on.
Next we can enable namespace
injection. So let's take a look at the namespace,
and we can see that we've got Istio set
up to automatically inject the
sidecar into each of the pods launched in this namespace.
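A rough getting-started sequence, per the Istio docs; the version you download and the profile you pick may differ for your environment.

```shell
# Fetch istioctl and the bundled samples (including Bookinfo).
curl -L https://istio.io/downloadIstio | sh

istioctl install --set profile=demo -y       # "demo" profile turns everything on

# Auto-inject the Envoy sidecar into pods launched in this namespace.
kubectl label namespace default istio-injection=enabled
```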
Next we can launch a sample application. Now this sample application is
a really good way to look at Istio and Istio's virtual
routing. So we have an ingress that
might route to a product page. Our product page shows some details and
then also gets reviews. We have three different review services.
Now, we can think of this as developing the various review versions,
and we'll walk through the upgrades
across those versions. Now, you probably wouldn't run all three at the same time,
but we're going to do that for this demo. And version
two and version three show stars reaching into another service.
Each of these gray boxes is an envoy proxy that allows us
to be able to virtually route traffic as we need to.
So here's our Bookinfo app, and right now we're going equally between
the three systems. So you'll see, sometimes I have no stars.
Sometimes I have stars in black color. Sometimes I have stars in red color.
This is great to be able to show the various versions.
Version one has no stars. Version two has stars in black
color, and version three has stars in red color.
How did we get that? Well, here's that service that allows
us to be able to look at all three. It has
this virtual service that routes traffic evenly
between them. Well, almost. So now that we've got
traffic flowing evenly between them, let's take a look at an
upgrade cycle, how we might use Istio to route traffic
without downtime, taking advantage of A/B channels and canary
deploys. Let's start by sending it all to version one.
So let's kubectl apply -f
virtual-service-reviews-v1.
And now all of our traffic will go to version one.
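The rule being applied looks roughly like this, a sketch modeled on the Bookinfo sample; it assumes a DestinationRule elsewhere already defines the v1/v2/v3 subsets.

```shell
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1      # every request goes to version one
EOF
```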
We'll see that we now have no stars no matter how many times we
refresh it. Excellent. Now we
want to start routing traffic to version two, but we only
want to grab, say, 20% of the traffic.
Let's make sure that version two works as expected. Okay,
so let's go grab this one and we'll apply this
rule: kubectl apply -f
that one. Now, 80% of the time we'll get no stars.
And 20% of the time we'll get stars in black color.
Yeah, it looks like that was working as expected.
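That 80/20 split is just weights on the same VirtualService, again sketched after the Bookinfo sample.

```shell
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 80        # most users stay on version one
    - destination:
        host: reviews
        subset: v2
      weight: 20        # 20% of requests try version two
EOF
```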
My system is behaving. So let's flip over to
go completely to version two.
Okay, here's version two. And now
with version two, we have 100% of the traffic going to
version two. We were able to migrate without downtime,
giving some users access to the early features. Well,
let's take that a little further and let's create a canary release.
Well, here we want to say if the user is
jason, then we'll give them version three. Otherwise we'll
give them the original version, version two. Okay, so let's
kubectl apply -f
virtual-service-reviews-jason.
Oops, typos. Let's try that again. There we go.
Oh, kubectl apply
-f that one.
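The canary rule matches on a request header to pick the version; this is a sketch, assuming (as the Bookinfo sample does) that the app forwards the logged-in user in an end-user header.

```shell
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - match:
    - headers:
        end-user:
          exact: jason   # only this user gets the canary
    route:
    - destination:
        host: reviews
        subset: v3
  - route:
    - destination:
        host: reviews
        subset: v2       # everyone else stays on version two
EOF
```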
Nice. Now that we've got that one in place,
let's refresh our app and we'll see that most of
the time, while unauthenticated,
we get version two. Stars in black color. Well, let's sign
in to the canary release. I'll log in as
jason. And now we can see that we get version three.
Jason is really excited for these new features. It looks like it's working
well. And if we log back out, we'll see that we get back to version
two consistently. Our regular users are not impacted by
this test. Now that we've gotten version
three ready to go, let's flip over exclusively to version three.
kubectl apply -f
virtual-service-reviews-v3.
And now that we're exclusively in version three, we always
have the stars in red color. Now, we were able to upgrade through
these versions with no downtime. That's excellent.
Let's flip back to the one that routes a third, a third,
and a third for our next demo.
And so now we can see that we have all three versions of the stars:
version one, version three, version two.
Now let's take a look at the Istio dashboards:
istioctl dashboard prometheus.
The Prometheus dashboard is great at being able to look
deep into the istio system. So let's look for
istio requests total,
and we can see those Prometheus metrics flowing
in. Now, that may not be the best way to visualize it. So instead
of visualizing it through Prometheus, let's visualize it
through Grafana. Now,
Grafana is an industry-standard dashboard, and with Istio
you get some Grafana dashboards. So let's take a look
at the Istio control plane dashboard. We can see all kinds
of interesting metrics associated with our cluster and
the various traffic within it. That looks pretty neat.
Let's dig into the next dashboard that comes built in with Istio that
we might choose to enable. I'm going to use Jaeger.
Now, Jaeger is really great for open telemetry. It allows
us to grab traces across our system.
So let's take a look at this one. We'll take a look at traces,
and we can see the various calls to this system.
Ooh, this one looks interesting. Let's pop open this one.
We can see the request came into the Istio ingress gateway.
It was forwarded off to the product page microservice.
The product page microservice called the details page and it ran
for this long. It also called the product reviews service.
Now we can see the details service didn't run very long,
but the reviews service ran a little bit longer and
the product page did a whole lot of processing after that. So if
we were to optimize this system, working on the
details page is probably not going to optimize our use
case. Now, it's great to be able to then dig into each of those things
and understand those distributed traces so that we have context
across our system. The next dashboard we'll look at:
let's take a look at Kiali. Now,
Kiali is great for visualizing who calls what. We'll log
into Kiali. We'll take a look at the graphs and we'll change
this from 1 minute to 30 minutes to take a look at the calls
through our system. Now what's beautiful here is that we get
a network diagram of our system. We called the product
page and it called the details page V one. We also
called the product page that called the review system. And over the
course of our experience, we ended up with all three versions
getting called. We saw two and three called the rating service.
Now what's interesting here is this is what's actually happening within
our system. That's great, but what if we notice
that v two isn't calling the rating system? Did we have a feature
flag that disabled the system and we forgot to turn it back on?
We can get a feel for how our system is actually behaving.
Compare that to what the architect expected and make
some different choices. Oh, it looks like we haven't used v one in a while
and so that one started going gray. That's excellent.
So we were able to look at both Istio and Linkerd.
Istio was great at showing all of the different details,
having features that we could turn on and off to get deep into our system.
It includes the best of open source projects. By comparison,
Linkerd is super easy to get started with and includes
pretty much everything in the box that we need to start. But if we want
to go farther, we need to reach outside of Linkerd.
Now that was great. We got to see both systems, compare and contrast them.
If one of those is a great fit for you, that's great. If you want
to look at other things, perhaps searching for these two
will help you find the one that exactly matches your needs.
Now we got to see, as we were looking at service meshes, that
when we first start crawling, we get monitoring,
logging, and service health. These are all features that we get as
we proxy through our service mesh. Upgrading from
crawl to walk, we get intelligent routing: we were
able to create A/B tests, we were able to create canary
releases, we were able to virtually route between versions while
both of them were running simultaneously within our cluster.
And when we upgrade from walk to run, we get a live
network topology diagram that shows us exactly what's happening in
our cluster. Distributed traces, live network diagrams.
We get great monitoring and diagnostics from our
system, because we're proxying between each of those microservices.
Now, a service mesh is not without its costs. On the
left is a typical architecture diagram for kubernetes. We can
see the control plane and the worker nodes, and then we also have a
control plane and Envoy proxies with
a service mesh. Now that means that we're running
more containers. Now, granted, an Envoy proxy is a
lot leaner than a Java Tomcat app, so maybe
we're not running twice the workload, but we're probably running
twice the containers, maybe one and a half the workload, or one and a
third the workload. We will run more stuff,
and that does mean additional hosting costs. So how
do we know when a service mesh is right? Is it worth the investment to
have that level of observability, control and security?
The benefits of a service mesh? We get to observe,
control and secure the system. And if we have these needs, a service
mesh is a really elegant tool. We can watch the traffic
flowing through our cluster. We can create network policies
that route it to beta channels, or just
discard it if it's not coming in the right way. And we
get mutual TLS between all of our services, ensuring that the
services are not attacked by rogue containers
running in our cluster. So when should we use this?
Well, a service mesh is really great if we have
a mix of trusted and untrusted workloads.
So for example, maybe we have very highly sensitive workloads,
PII or PCI workloads, and we need to ensure that they
are completely separate. We'll build a virtual cage for those services,
so that only the pieces that need to can communicate with them.
Or running untrusted workloads:
maybe we have a multi-tenant system, or we're running
things on behalf of others and we're not quite sure what they are. We definitely
need to be able to segregate those out so they don't impact the
majority of our workloads. Maybe I'm running a multi-tenant workload
and I need to be able to segregate different lanes for different environments.
And so now I can create mechanisms where each tenant can
get their own bounded mechanism and
not interfere with other clients running elsewhere in the cluster.
Now, by default, Kubernetes has namespaces, but namespaces
are an organizational boundary, not a security boundary.
By comparison, when I add a service mesh,
I'm able to create those hard boundaries between services to
ensure that only those things that need to are able to reach it.
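For instance, a hard boundary like that might be sketched in Istio as an AuthorizationPolicy; the names here are hypothetical, and Linkerd offers analogous Server and ServerAuthorization resources.

```shell
# Hypothetical sketch: only the product page's service account may call
# the reviews workload. Once an ALLOW policy selects a workload, the
# mesh denies every caller the rules don't match.
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: reviews-allow-productpage
  namespace: default
spec:
  selector:
    matchLabels:
      app: reviews
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/default/sa/bookinfo-productpage"]
EOF
```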
If I need security in depth, if I need HTTPS within
my cluster, not just to the front door of my cluster,
then a service mesh can be a great way to get mutual TLS.
If all I need is mutual TLS, I might find a lighter-weight solution.
But if I need mutual TLS together with observability and control,
then perhaps service mesh is great. If I need
additional features like A/B routing or a beta channel,
a service mesh can be a great opportunity to get that. Now there are
other ways to get multiple versions running at the
same time and virtually route between them. But if that's one of
my needs together with the other needs in this list, then a service mesh might
be a great fit. This has been a lot of fun getting to introduce
you to service meshes and show you when it makes sense,
and maybe when it doesn't make sense. If you're watching this on
demand, find me on Twitter @rob_rich or on Mastodon
@robrich@hachyderm.io. Or find all the other socials
on robrich.org, and you can download this presentation right
now from robrich.org; click on Presentations. If
you're watching this live, I'll see you in a minute at the spot the
conference has designated for live Q&A.
Thanks for joining us for Service Mesh to Service Meshes
here at Conf42 Cloud Native. Thanks for coming.