Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, welcome to Conf42 Cloud Native. We're going to talk about
Service Mesh to Service Meshes. Do you
need a service mesh? How do you get started? We'll see demos of both Linkerd
and Istio in this process. So let's jump in. Here's the part where
I tell you, I am definitely going to post the slides on my
site tonight. I've been in enough talks that have done
similar things. The slides are online, right? Right now let's
head to robrich.org. We'll click on presentations here at the top
and we can see Service Mesh to Service Meshes right there.
While we're here on robrich.org, let's click on About Me and see some
of the things that I've done recently.
Both Docker and Microsoft have given me some awards and
AZ GiveCamp is really fun. AZ GiveCamp brings volunteer developers
together with charities to build free software. We start building software
Friday after work; Sunday afternoon we deliver the completed software to
the charities. Sleep is optional, caffeine provided. If you're in
Phoenix, come join us at the next AZ GiveCamp. Or if you'd like a
GiveCamp closer to you, hit me up here at the event or
on any of the socials here, and let's get a GiveCamp
in your neighborhood too. Some of the other things that I've done, I was
awarded a tip of the captain's hat award by Docker last year.
That was a lot of fun and one of the things I'm particularly proud of,
I replied to a .NET Rocks podcast episode. They read my comments
on the air and they sent me a mug. So there's my claim
to fame, my coveted .NET Rocks mug.
So let's dig into Service Mesh to Service Meshes.
Let's start with an analogy. Let's imagine
that we're learning how to drive. Do you remember when
you learned how to drive? Do you remember the fun it was to be able
to hit that open road? You know, the excitement of going beyond
just the current neighborhood into the next town,
maybe even across the country? You know how
fun it was to drive. Well, let's imagine a small town
and yeah, we can drive anywhere. We can drive as fast as
we want to. We can really enjoy the road.
Well, over time the town starts to grow up,
the traffic gets a little bit more congested, and now
we've got traffic. So how do
we solve traffic here in our small town? Well, I know
it's those people coming into town that shouldn't be here so
let's set up a traffic cop right at the edge of town. Anyone going faster
or slower than we want, we'll penalize them and enforce conformity
across our town. Now the traffic is flowing. Yeah,
we enforced conformity, but we didn't really optimize
travel. We optimized mediocrity.
Really what we want to do is something like this. We want the cars to
be able to communicate with each other, prioritize the traffic.
Those cars that want to go fast or that need urgent access,
they can go in one lane, and other cars that might go slower can go
in other lanes. And we can coordinate this traffic to ensure
that everyone reaches their destination with as much fun
and expediency as is comfortable for their system.
Yeah, if we could prioritize the traffic and communicate
together, we wouldn't have to aim for mediocrity.
We could excel at defining the system.
Yeah, we'll take a similar analogy as we start to
look at service mesh. Do we just want to aim for conformity,
or do we want to do something excellent? So we'll
take a look at what is a service mesh? Why would I use it?
How do I get started? What are the benefits of it? We'll see a demo
of both Istio and Linkerd in this process.
And finally, we'll talk about best practices.
First up, a service mesh. A service mesh manages traffic
between services in a graceful and scalable way.
Or, said differently, a service mesh is the answer to the question,
how do I observe, control, and secure the
communication between my microservices? Now, if you have the
need to observe, control, or secure traffic between your microservices,
a service mesh may be a great solution. If you just have one microservice,
it might be a little bit overkill. Let's dive into each
of those. Observe: we want to be able to watch the
traffic flowing between our containers in
our Kubernetes cluster and get a feel for how they behave.
Are we getting microservices calling into places that they shouldn't?
Are we getting rogue traffic coming through our system?
Are services online? Are they behaving as expected?
These are all things that we can observe as we get a service mesh in
place. Next, let's upgrade to control.
Let's create policies within our cluster that
say this service can speak to this service, this service can accept traffic
from this service. But all this other traffic that we really don't
understand, we're just going to shut it down. We don't want rogue services
calling into our project just because they happen to start
up a pod there. Now, we do need to work carefully with developers
to ensure the applications work as designed,
but we can also stop rogue applications that happen to pop up
within our cluster. They just can't get
to our services. We've walled off the services to
match the needs of those particular applications.
Next, we can secure. Now, the beauty of securing our
applications is by default, within Kubernetes, all of our
services communicate over HTTP unencrypted.
Now, maybe they're doing gRPC or other
forms of communication, REST or GraphQL,
but at the end of the day they're doing HTTP, and they're doing it unencrypted.
Well, if we have the need to encrypt traffic within our cluster,
we can use a service mesh's mutual TLS to create
encrypted tunnels, where services communicating with each other
go through secure channels without needing to
change our application. Now, back in the day when we had monoliths,
it was really easy. We deployed all of the pieces of our application holistically
together. As containers came about, we were able to split our
application into lots of different services. Now, we love this
because now we can deploy little pieces, scale them independently,
replace them independently, maybe even develop them independently.
We can build and deploy and scale our services much easier than
we could in a monolithic system. But now
our application's internal pieces have IP addresses.
Each microservice owns its own
data, and we've contained that mechanism.
The user interface is able to call the microservices that they need to,
and everything is fine. As we talk about traffic
within our cluster, we'll talk about both north south traffic
and east west traffic. North south
traffic is traffic flowing into or out of our cluster.
By comparison, east west traffic is traffic flowing between our microservices
inside of our cluster. And the beauty here is that a
service mesh can secure both. Well, what came before this?
Back in the day, we had an API gateway. We could think of this
as like a fence around our cluster. Now that's great.
We had a traffic cop at the edge of town, and we were making sure
that anyone that came into town was behaving as expected. But what about the people
who are already in town? What about the traffic already
in our cluster? We can see that the API gateway
has no visibility into microservices calling each other's
data stores, or microservices calling other microservices that
it shouldn't. The API gateway is merely a fence
around our system. Now, it's a great fence.
We can use it for monitoring inbound traffic. We can use it for
counting usage and billing back to those systems
that need it. But it can only see
traffic at the boundary of our cluster. It can't see traffic within
our cluster. It can see north south traffic, it can't see east west
traffic. So now what?
Well, let's take a look at how service mesh works.
Now, what's really cool is if service A needs to call service B without
a service mesh, it just calls it. But if service A needs to
call service B within a service mesh, it works like this.
We start out with service A inside of its own pod,
reaching out to this proxy. Now, this proxy was deployed as part
of this pod to ensure that service a can communicate securely with all the
things. This proxy reaches out to the service mesh
control plane, different from the cluster control plane, and the service mesh
control plane can validate that traffic. Am I allowed to talk to
service B? In this case, the service mesh says yes.
Now, this proxy connects to service B's proxy, and service
B's proxy again reaches out to the service mesh. Am I allowed to
accept traffic from service A? In this case, the service mesh says
yes, and the traffic is forwarded on to service B. Service B
replies, and across that proxy, the response goes to Service A.
Now, the beauty here is inside the pod, all the traffic
can communicate between the service and its proxy just across
localhost. But anytime it leaves that pod boundary,
it's going to run through this proxy connection. And the
beauty here is that we can secure this connection with mutual TLS. So this
side has a certificate, that side has a certificate. It's bound to
the trust chain within the service mesh. And now we have a great communication
pattern that is secure anytime traffic leaves the pod.
And we did all that without needing to modify service A or
service B. Service A talks to its proxy; the proxy talks to the
service mesh; the service mesh says yes, let's create a mutual TLS
tunnel. Service B's proxy reaches out to the mesh, then
forwards the traffic off to service B. And all this happens transparently to
the two services, which don't need to know anything about the service mesh.
Now, that's great. We could also replace
this with ingress, or replace this with egress so
that traffic going into or out of our cluster is also
secured with mutual TLS and validated by the service mesh.
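As a concrete sketch of what "encrypted without changing the application" can look like, here's how you might require strict mutual TLS in Istio. This is a hypothetical example, not part of the demo; the namespace is a placeholder, and Linkerd turns on mTLS by default without a policy like this.

```shell
# Hypothetical sketch: require mutual TLS for every workload in the
# "default" namespace; plaintext connections to sidecarred pods are refused.
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: default
spec:
  mtls:
    mode: STRICT
EOF
```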
So with service meshes we can observe, control, and
secure the traffic going through our cluster, because we're
proxying all the traffic through these Envoy proxies.
Now that's great. Now that
all the traffic is flowing between these proxies, we can observe it,
we can visualize it, we can understand the system.
We can also control it: no, service A is not allowed to
connect to service B, or rogue service X is not allowed to connect to
service B. And then finally we can secure it with mutual TLS,
mutual TLS through a trust chain to the
service mesh, which may also have a trust chain into our PKI
system. Now it's more than just a proxy.
Let's take a look at the other features that the service mesh might give us.
Because all the traffic is flowing through this proxy, we can
start to build a network topology. Now what's interesting here, this is
not the way the architect designed the system, but what we've observed
from actual traffic flowing through the system. We can build these graphs
that will have really impactful, meaningful details.
Let's compare it to the architect's version and see if maybe we
didn't deploy all the pieces, or maybe we accidentally turned off a service
with a feature flag. Next we can take a
look at service health. Now the beauty here with monitoring
service health is that we can capture 500s or high-latency
responses and start to report that back to the controllers.
Now here we can take a look at the traffic flowing through our
cluster. We can compare it to known good things. We can
understand when our cluster is starting to misbehave. This is perfect.
And we can also log. Let's log all the traffic between all
the services. Let's log the HTTP status codes, the results,
and the call chains between the services.
We have a really great mechanism of being able to capture
the network traffic going between these systems.
Let's level up again and take a look at additional features that
a service mesh can bring us. We can do A/B testing
now, because we're routing through this Envoy proxy,
the service mesh could redirect traffic to two different versions.
Let's create a version a and a version b and see how they perform,
and then lean into the one that performs best.
We can also create a beta channel. Let's create a new
version of our software that maybe we don't have as much confidence in,
or maybe has advanced features that we want to get early feedback on
and enroll certain users in that beta channel or canary release.
Once we validate that the system works as expected, now we
can roll it out to the rest of the users as well. Some users may
really enjoy being part of that early feedback cycle and get
access to features as soon as they're available,
and we can create circuit breakers. If a service
becomes overloaded, it's really easy for us to accidentally topple
over that service. Well, all of the clients noticing that they didn't
get a response and presuming that it's just intermittent network
traffic might say, well, let me just retry it.
As soon as the service comes back online, it gets overwhelmed with all
of the requests coming in from all of those services that are retrying
and promptly falls over again.
So we can put in a circuit breaker that says, hey, this
service is not doing well, I'm just going to fail all these requests right now
and let the service start back up gently, reach a
healthy state. Now we'll send in a little traffic, and unlike the circuit breakers
in our house, the machinery can automatically turn this back on once
the service is healthy. These are features that we get out of a service mesh
because we're proxying all the traffic between all of our services
within the Kubernetes cluster. We also get some
really great dashboards that allow us to visualize the traffic and
understand the health of our system.
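As an illustration, a circuit breaker like the one just described might be sketched in Istio as a DestinationRule with outlier detection. The service name and the thresholds here are hypothetical, chosen only to show the shape of the configuration.

```shell
# Hypothetical sketch: eject a backend that returns five consecutive
# 5xx errors, keep it out for a while, then re-admit it automatically
# once it's healthy again.
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-circuit-breaker
spec:
  host: reviews
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5     # trip after 5 straight failures
      interval: 10s               # how often to evaluate each host
      baseEjectionTime: 30s       # keep the host out at least this long
      maxEjectionPercent: 100     # allow ejecting every unhealthy host
EOF
```

Unlike the breakers in our house, the mesh closes this circuit on its own as hosts recover.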
Yeah, we started out with a system where everything
was calling everything, and we really don't like that mechanism. We grabbed
a service mesh to be able to control, observe and secure
the traffic within our cluster to ensure that our microservices
are calling the appropriate endpoints and rogue microservices aren't
able to exfiltrate data from our system. Let's take a
look at some service meshes. Now, as we look at service meshes,
we'll compare quite a few examples of this. Now,
service meshes are getting built really fast right now, and their features
are evolving quickly. So we're not going to compare feature sets,
but rather methodologies of these systems. We'll look in particular
at Istio and Linkerd, but there's many more service meshes that you
may choose from. For the longest time, Linkerd was
the only one in CNCF, and so it became quite popular.
Istio was amazingly popular, but had some governance
restrictions that are now no longer the case. So do
you need Istio or Linkerd? Those are good places to start, and as
you search for those, you may find others that best match your needs.
First up, Linkerd. Now, Linkerd's methodology
is a very simple install that's really
easy to use. They focus on having everything that
you need to get started in the box.
Now, that's great. You can get started really easily, but it does mean
that if you want to stray beyond their initial set of features,
that you'll probably need to look to third parties to be able to augment Linkerd.
Linkerd is great at contributing back to the Rust
community; a lot of the Rust networking stack was actually
built to facilitate Linkerd.
Next up, Istio. Now, Istio's methodology
is very different. It tries to include the best of open
source projects to ensure that you have all of the features that you
need. Then you can turn on and off features based on
profiles or based on just turning features on and off, and then
you can tune Istio to be exactly the thing that you need.
Now we'll dig in deep with Istio's virtual services to
see how we might choose to host some traffic in one service version
and some traffic in another, an A/B test. This is
a feature of all service meshes, but we'll get to see it here in Istio.
So let's take a look at these. First up,
let's fire up Linkerd. Oh, let's not
fire up that one, let's fire up this one. Let's use Linkerd.
And what Linkerd focuses on is a really elegant and smooth
install experience. So let's head off to the Linkerd
docs and take a look at getting started. Well, I start off by downloading
the Linkerd CLI, and then I can
do a linkerd check --pre. It says, I know
that Linkerd isn't installed yet, but let's just validate that the cluster is ready.
Then I'll install the CRDs, then I'll install Linkerd, and
then I can run linkerd check. I've already done these just to
speed up this presentation. Next we can take a look at
the dashboards. So we'll
say linkerd viz install, and we'll install that.
I've already done it, but let's do it again just in case.
Let's get Linkerd viz installed, and then next up we can
check to see if Linkerd is running.
So, linkerd check. And what I like about this is that
not only will it validate that Linkerd is running, but it'll also wait for
it if it isn't. So let's double check that the
viz extension is in place and once we get the green
light now we know that Linkerd is ready to go.
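The full sequence I ran ahead of time looks roughly like this, following the Linkerd getting-started docs; the exact version downloaded and the flags may differ on your cluster.

```shell
# Download the Linkerd CLI (adds it under ~/.linkerd2/bin).
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh

linkerd check --pre                          # validate the cluster before installing
linkerd install --crds | kubectl apply -f -  # install the CRDs
linkerd install | kubectl apply -f -         # install the control plane
linkerd check                                # wait until everything reports ready
linkerd viz install | kubectl apply -f -     # install the dashboard extension
linkerd viz check                            # confirm the viz extension is healthy
```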
Now Linkerd will annotate namespaces
to show which namespaces should have that sidecar
applied. We can see the annotation here on our default namespace:
linkerd.io/inject: enabled.
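If you're following along, marking a namespace for injection is one command; this sketch assumes the default namespace, as in the demo.

```shell
# Linkerd's proxy injector watches for this annotation on namespaces
# (or on individual workloads) and adds the sidecar to newly created pods.
kubectl annotate namespace default linkerd.io/inject=enabled
```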
And so now anything that we start in the default namespace will get that
sidecar applied. Well, let's take a look at the
dashboard inside Linkerd:
linkerd viz dashboard.
And now we've started the built in dashboard for Linkerd,
we can take a look at the various namespaces in our system, and take a
look at the automatic discovery
of the service integration, because we've got them injected
through Linkerd. Yep, Linkerd is running for Linkerd.
And then we can take a look at all the deployments and the
health of those services. Picking a particular service.
We can take a look at the details of that service. Well, it looks like
we're up 100% of the time now, and we have the references of what calls
what on the way past. That's really elegant. Now if we
don't want to view it through a UI, we can definitely do it from the
command line as well: linkerd viz stat.
And I'll take a look at the Linkerd namespace and take
a look at deployments. Here's that same output from the command line.
And I could also grab it from the Prometheus metrics and pipe
it off to Grafana or Splunk or another system.
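The command-line version of those golden metrics might look like this, assuming the viz extension is installed as above.

```shell
# Success rate, requests per second, and latency percentiles for each
# deployment in the linkerd namespace.
linkerd viz stat deployments -n linkerd
```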
So that was great to be able to take a look at Linkerd. The install
experience is super fast and allows us to get going really easily.
It is a bit bare-bones; they put everything you need to start in the
box, so if we want to go farther we may need to reach out to
third parties. Next, let's take a look at Istio.
Now with Istio we have a similar setup for getting started.
We can start by downloading the
Istio CLI, and then once we've got
that in our path, we'll install Istio, picking the profile that
we want. In this case we'll use demo, which turns everything on.
Next we can enable namespace
injection. So let's take a look at the namespace,
and we can see that we've got Istio set
up to automatically inject the
sidecar into each of the pods launched in this namespace.
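A rough getting-started sequence, per the Istio docs; the version you download and the profile you pick may differ for your environment.

```shell
# Fetch istioctl and the bundled samples (including Bookinfo).
curl -L https://istio.io/downloadIstio | sh

istioctl install --set profile=demo -y       # "demo" profile turns everything on

# Auto-inject the Envoy sidecar into pods launched in this namespace.
kubectl label namespace default istio-injection=enabled
```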
Next we can launch a sample application. Now this sample application is
a really good way to look at Istio and Istio's virtual
routing. So we have an ingress that
might route to a product page. Our product page shows some details and
then also gets reviews. We have three different review services.
Now, we can think of this as developing the various review versions,
and we'll walk through the upgrades
across those versions. Now, you probably wouldn't run all three at the same time,
but we're going to do that for this demo. And version
two and version three show stars reaching into another service.
Each of these gray boxes is an envoy proxy that allows us
to be able to virtually route traffic as we need to.
So here's our Bookinfo app, and right now we're going equally between
the three systems. So you'll see, sometimes I have no stars.
Sometimes I have stars in black color. Sometimes I have stars in red color.
This is great to be able to show the various versions.
Version one has no stars. Version two has stars in black
color, and version three has stars in red color.
How did we get that? Well, here's that service that allows
us to be able to look at all three. It has
this virtual service that routes traffic evenly
between them. Well, almost. So now that we've got
traffic flowing evenly between them, let's take a look at an
upgrade cycle, how we might use Istio to route traffic
without downtime, taking advantage of A/B channels and canary
deploys. Let's start by sending it all to version one.
So let's kubectl apply -f
virtual-service-reviews-v1.
And now all of our traffic will go to version one.
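The rule being applied looks roughly like this, a sketch modeled on the Bookinfo sample; it assumes a DestinationRule elsewhere already defines the v1/v2/v3 subsets.

```shell
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1      # every request goes to version one
EOF
```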
We'll see that we now have no stars no matter how many times we
refresh it. Excellent. Now we
want to start routing traffic to version two, but we only
want to grab, say, 20% of the traffic.
Let's make sure that version two works as expected. Okay,
so let's go grab this one and we'll apply this
rule: kubectl apply -f
that one. Now, 80% of the time we'll get no stars.
And 20% of the time we'll get stars in black color.
Yeah, it looks like that was working as expected.
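That 80/20 split is just weights on the same VirtualService, again sketched after the Bookinfo sample.

```shell
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 80        # most users stay on version one
    - destination:
        host: reviews
        subset: v2
      weight: 20        # 20% of requests try version two
EOF
```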
My system is behaving. So let's flip over to
go completely to version two.
Okay, here's version two. And now
with version two, we have 100% of the traffic going to
version two. We were able to migrate without downtime,
giving some users access to the early features. Well,
let's take that a little further and let's create a canary release.
Well, here we want to say if the user is
jason, then we'll give them version three. Otherwise we'll
give them the original version, version two. Okay, so let's
kubectl apply -f
virtual-service-reviews-jason.
Oops, typos. Let's try that again. There we go.
Oh, kubectl apply
-f that one.
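The canary rule matches on a request header to pick the version; this is a sketch, assuming (as the Bookinfo sample does) that the app forwards the logged-in user in an end-user header.

```shell
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - match:
    - headers:
        end-user:
          exact: jason   # only this user gets the canary
    route:
    - destination:
        host: reviews
        subset: v3
  - route:
    - destination:
        host: reviews
        subset: v2       # everyone else stays on version two
EOF
```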
Nice. Now that we've got that one in place,
let's refresh our app and we'll see that most of
the time, while unauthenticated,
we get version two. Stars in black color. Well, let's sign
in to the canary release. I'll log in as
jason. And now we can see that we get version three.
Jason is really excited for these new features. It looks like it's working
well. And if we log back out, we'll see that we get back to version
two consistently. Our regular users are not impacted by
this test. Now that we've gotten version
three ready to go, let's flip over exclusively to version three.
kubectl apply -f
virtual-service-reviews-v3.
And now that we're exclusively in version three, we always
have the stars in red color. Now, we were able to upgrade through
these versions with no downtime. That's excellent.
Let's flip back to the one that routes a third, a third,
and a third for our next demo.
And so now we can see that we have all three versions of the stars:
version one, version three, version two.
Now let's take a look at the Istio dashboards:
istioctl dashboard prometheus.
The Prometheus dashboard is great at being able to look
deep into the istio system. So let's look for
istio requests total,
and we can see those Prometheus metrics flowing
in. Now, that may not be the best way to visualize it. So instead
of visualizing it through Prometheus, let's visualize it
through Grafana. Now,
Grafana is an industry-standard dashboard, and with Istio
you get some Grafana dashboards. So let's take a look
at the Istio control plane dashboard. We can see all kinds
of interesting metrics associated with our cluster and
the various traffic within it. That looks pretty neat.
Let's dig into the next dashboard that comes built in with Istio that
we might choose to enable. I'm going to use Jaeger.
Now, Jaeger is really great for open telemetry. It allows
us to grab traces across our system.
So let's take a look at this one. We'll take a look at traces,
and we can see the various calls to this system.
Ooh, this one looks interesting. Let's pop open this one.
We can see the request came into the Istio ingress gateway.
It was forwarded off to the product page microservice.
The product page microservice called the details page and it ran
for this long. It also called the product reviews service.
Now we can see the details service didn't run very long,
but the reviews service ran a little bit longer and
the product page did a whole lot of processing after that. So if
we were to optimize this system, working on the
details page is probably not going to optimize our use
case. Now, it's great to be able to then dig into each of those things
and understand those distributed traces so that we have context
across our system. The next dashboard we'll look at:
let's take a look at Kiali. Now,
Kiali is great for visualizing who calls what. We'll log
into Kiali. We'll take a look at the graphs and we'll change
this from 1 minute to 30 minutes to take a look at the calls
through our system. Now what's beautiful here is that we get
a network diagram of our system. We called the product
page and it called the details page V one. We also
called the product page that called the review system. And over the
course of our experience, we ended up with all three versions
getting called. We saw two and three called the rating service.
Now what's interesting here is this is what's actually happening within
our system. That's great, but what if we notice
that v two isn't calling the rating system? Did we have a feature
flag that disabled the system and we forgot to turn it back on?
We can get a feel for how our system is actually behaving.
Compare that to what the architect expected and make
some different choices. Oh, it looks like we haven't used v one in a while
and so that one started going gray. That's excellent.
So we were able to look at both Istio and Linkerd.
Istio was great at showing all of the different details,
having features that we could turn on and off to get deep into our system.
It includes the best of open source projects. By comparison,
Linkerd is super easy to get started with and includes
pretty much everything in the box that we need to start. But if we want
to go farther, we need to reach outside of Linkerd.
Now that was great. We got to see both systems, compare and contrast them.
If one of those is a great fit for you, that's great. If you want
to look at other things, perhaps searching for these two
will help you find the one that exactly matches your needs.
Now we got to see, as we were looking at service meshes, that
when we first start crawling, we get monitoring,
logging, and service health. These are all features that we get as
we proxy through our service mesh. Upgrading from
crawl to walk, we get intelligent routing: we were
able to create A/B tests, we were able to create canary
releases, we were able to virtually route between versions while
both of them were running simultaneously within our cluster.
And when we upgrade from walk to run, we get a live
network topology diagram that shows us exactly what's happening in
our cluster. Distributed traces, live network diagrams.
We get great monitoring and diagnostics from our
system, because we're proxying between each of those microservices.
Now, a service mesh is not without its costs. On the
left is a typical architecture diagram for kubernetes. We can
see the control plane and the worker nodes, and then we also have a
control plane and Envoy proxies with
a service mesh. Now that means that we're running
more containers. Now, granted, an Envoy proxy is a
lot leaner than a Java Tomcat app, so maybe
we're not running twice the workload, but we're probably running
twice the containers, maybe one and a half the workload, or one and a
third the workload. We will run more stuff,
and that does mean additional hosting costs. So how
do we know when a service mesh is right? Is it worth the investment to
have that level of observability, control and security?
The benefits of a service mesh? We get to observe,
control and secure the system. And if we have these needs, a service
mesh is a really elegant tool. We can watch the traffic
flowing through our cluster. We can create network policies
that route it to beta channels, or just
discard it if it's not coming in the right way. And we
get mutual TLS between all of our services, ensuring that the
services are not attacked by rogue containers
running in our cluster. So when should we use this?
Well, a service mesh is really great if we have
a mix of trusted and untrusted workloads.
So for example, maybe we have very highly sensitive workloads,
PII or PCI workloads, and we need to ensure that they
are completely separate. We'll build a virtual cage for those services,
so that only the pieces that need to can communicate with them.
Or running untrusted workloads:
maybe we have a multi-tenant system, or we're running
things on behalf of others and we're not quite sure what they are. We definitely
need to be able to segregate those out so they don't impact the
majority of our workloads. Maybe I'm running a multi-tenant workload
and I need to be able to segregate different lanes for different environments.
And so now I can create mechanisms where each tenant can
get their own bounded mechanism and
not interfere with other clients running elsewhere in the cluster.
Now, by default, Kubernetes has namespaces, but namespaces
are an organizational boundary, not a security boundary.
By comparison, when I add a service mesh,
I'm able to create those hard boundaries between services to
ensure that only those things that need to are able to reach it.
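For instance, a hard boundary like that might be sketched in Istio as an AuthorizationPolicy; the names here are hypothetical, and Linkerd offers analogous Server and ServerAuthorization resources.

```shell
# Hypothetical sketch: only the product page's service account may call
# the reviews workload. Once an ALLOW policy selects a workload, the
# mesh denies every caller the rules don't match.
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: reviews-allow-productpage
  namespace: default
spec:
  selector:
    matchLabels:
      app: reviews
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/default/sa/bookinfo-productpage"]
EOF
```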
If I need security in depth, if I need HTTPS within
my cluster, not just to the front door of my cluster,
then a service mesh can be a great way to get mutual TLS.
If all I need is mutual TLS, I might find a lighter-weight solution.
But if I need mutual TLS together with observability and control,
then perhaps service mesh is great. If I need
additional features like A/B routing or a beta channel,
a service mesh can be a great opportunity to get that. Now there are
other ways to get multiple versions running at the
same time and virtually route between them. But if that's one of
my needs together with the other needs in this list, then a service mesh might
be a great fit. This has been a lot of fun getting to introduce
you to service meshes and show you when it makes sense,
and maybe when it doesn't make sense. If you're watching this on
demand, find me on Twitter @rob_rich or on Mastodon
@robrich@hachyderm.io. Or find all the other socials
on robrich.org, and you can download this presentation right
now from robrich.org; click on Presentations. If
you're watching this live, I'll see you in a minute at the spot the
conference has designated for live Q&A.
Thanks for joining us for Service Mesh to Service Meshes
here at Conf42 Cloud Native. Thanks for coming.