Transcript
This is a developer's introduction to service mesh. I realized
that a lot of the service mesh resources that I was seeing had
can operator approach to it, which is how do you build a service mesh?
How do you configure it? But it turns out there's a lot of developer
capability that you need to take in service mesh
and use for your applications in order to get value out of it.
So if you're a developer or you're an operator, who needs to maybe
enable developers on how to use a service mesh? This is
a very rapid introduction and overview into all of
the different ways a service mesh affects your application.
So my journey into service mesh started with this very vague statement
of "we must have a service mesh." A security
engineer approached me with this concern. I was a little bit confused.
I wasn't sure what a service mesh was at the time. I was kind
of doing an embed with an application development team as
sort of an operations or infrastructure engineer,
and it was a very interesting statement. I had never heard of it.
And eventually I got to the core of what the
engineer was looking for, and they mentioned to me that they wanted
service-to-service communication with mTLS. Basically, they wanted
each service's communication to be encrypted with a certificate.
And when they approached the application
team with this requirement,
the concern was that it would take way too long to refactor every single
application and all of the code to use certificates.
And that is a valid concern. Do you need
service-to-service mTLS if, for the most part, you've secured
all your applications internally in a private network?
Well, you can never be too sure. So the security
team was looking for a way to secure communications
with mTLS, point to point between applications, and
their research introduced them to service mesh. Now, as I investigated
service mesh a little bit further, it turns out there are a lot of pieces to
how applications connect to each other. It's not just about securing
and encrypting the communications between services.
It turns out services need to discover each other. We usually did this
with DNS: how does service one get to service two? Second,
services load balance. You need to be able to load balance between
instances of a service as well as between different services.
Security was the concern that first came to
me, but besides mTLS, there's also authorization:
are services allowed to communicate with each other on an API?
There were a lot of sophisticated tools out there, as well as code libraries,
that supported this in applications, and did we really want to change that?
Finally, traffic management. Some applications might be
a little bit more sophisticated in how they require retry handling as
well as error handling. And then telemetry, which we were trying to use
for distributed tracing: it was really difficult to implement, and we were trying
to pretty much get metrics unified across the board.
So all of these functions were ways in which
services communicated with each other or the ways that they needed to
interact with each other. We weren't really sure what
a good answer was because right now all of these different kinds of
concepts required multiple tools.
And so I did more research on the service mesh to try to
understand why it solves this problem. And it comes down to this.
Service meshes rely on something called proxies. In this case, we're talking
about Envoy proxy as a tool, but there are many other proxies, as
well as service meshes with other custom proxies. In
this case, we'll just focus a little bit more on Envoy. For every application
instance that you have (for example, I have report v2, report v3,
expense v1, and expense v2), I have a proxy running
next to it. The proxy is responsible for all communications between services.
So anytime you need to communicate out of report v2,
it goes through the proxy. Anytime traffic comes in, it's through the proxy.
This has an interesting side effect. If you have multiple
application frameworks, which is usually the case in larger companies,
you have the ability to, well, direct traffic
through the proxies. And as a side effect, this means that you can build
abstractions or almost a layer on top
of the proxies. For example, the expense proxies represent
as a whole the expense service, whether version one or version two.
Similarly, the report service represents the abstraction of report version
two and version three, and the proxies can represent that.
Now report communicates to expense, and again, everything goes
through the proxies. So the proxies control whether or not report service
can communicate to expense, the upstream service.
All of this, plus some kind of control plane, equals
a service mesh. And when we mention a control plane, we mean
that you can push configurations out to each proxy;
a service mesh pushes configuration out to each proxy.
So if, for example, I wanted to create a report service,
the service mesh would create the abstraction of the report service
and send that configuration to proxies. Now, regardless of
which service mesh you use, for the most part, they're all using a very similar
approach in the way that they're pushing the configuration out to the proxies.
So while most of the configurations you'll see today are
Consul focused or Envoy proxy focused,
you'll see similar functionality in other service meshes.
My hope is that if you're using something else, you'll be able to understand the
terminology, the generic terminology, and apply it
to your application. So in this
case we are able to create a
service mesh configuration and push it out to proxies. Now, if
you're in Kubernetes, it's pretty easy to add the proxy
in place. The idea is that you can use an annotation, or
you can have it injected by default. So most service meshes will allow you to
add a service mesh annotation and it will inject
the proxy for you, and the proxy for
many of the Kubernetes ones is Envoy, although some of the other service meshes
use different proxy tools. So the
idea is that if you're doing this in Kubernetes, you can do the annotation.
Consul does do service mesh outside of
Kubernetes as well; in this case, you do have to add the sidecar proxy
as a process. So if you're doing this on
a virtual machine for a much older application, you will have to deploy
the binary for the proxy and then configure it as a process
on the virtual machine.
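For the Kubernetes path mentioned above, here is a minimal sketch of what the injection annotation might look like with Consul; the deployment name, image, and port are illustrative, not taken from the demo:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: report-v2               # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: report
  template:
    metadata:
      labels:
        app: report
      annotations:
        # Ask Consul's injector to add the Envoy sidecar proxy to this pod.
        consul.hashicorp.com/connect-inject: "true"
    spec:
      containers:
        - name: report
          image: example/report:v2   # illustrative image
          ports:
            - containerPort: 8080
```

With the annotation in place, the injector adds the proxy container automatically; no application code changes are required.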
So next, what does this mean? Well,
this configuration, this abstraction, pushing all
of these things into a service mesh means that you can configure
service discovery, load balancing, security, traffic management and
telemetry in one place irrespective of the
application code and the library. So if you're a
development team, or you're an operations team trying to enable a development team in doing
this, the idea is that you're moving the functionality
that you might have implemented, or already implemented, in code for
service discovery, load balancing, security, traffic management, and arguably telemetry
into a service mesh. So we're going to go through all five of these.
In the case of service discovery, remember there are two sets of
abstractions: the application options as well as the service mesh.
Application-side options typically involve libraries like Eureka if
you're in the Spring ecosystem, or DNS or Kubernetes services
if you're on Kubernetes. A service mesh does this all with
proxy registration. So in the application code
case, in a programming language that allows you to do this,
you can do something as easy as adding an annotation
in your application. So this is Spring, and in
this case I'm enabling the discovery client, and now I've got service discovery for Spring
applications. Problematically, not all applications
are using Spring, or Java for that matter. So you may have heterogeneous
workloads with different kinds of application frameworks, and in which case maybe
the service mesh service discovery approach is actually
much more useful. A service mesh will again create the
abstraction of the report service for v2 and v3, as well
as the expense service for v1 and v2; it doesn't matter what application frameworks
they are. When you look at a service mesh
admin configuration, so if you're looking at the proxy admin configuration,
most of them will have this clusters endpoint.
This clusters endpoint has a list of the service names,
such as expense, as well as the IP addresses. So in this case I'm
going to the proxy, the Envoy proxy, running an API
call just to do some debugging. And if you examine this debug interface,
you'll see that there's an expense mapping, a Jaeger mapping,
and an expense v2 mapping, each to an IP address.
This is actually pushed out because when the proxy registers, it
has information about the service, and Consul
itself pushes that information out to the other proxies.
So that is where you're getting the service discovery piece.
In the case of load balancing, you also have two options.
On the application side, again, you can use a library like Feign;
load balancers and DNS in combination usually give you some
kind of load balancing configuration. In the case of service mesh,
you're using pretty much just proxy configuration. So again, if you're lucky
and you're using something like Spring, you have the enable Feign clients annotation, and
that injects a client that allows you to load balance
between certain service instances or application instances.
In the case of multiple application frameworks,
well, a service mesh again takes that abstraction, pushes it out
into a separate layer. So in this case you can use a service mesh to
push configuration out: 50% to version one, 50%
to version two. What this looks like is that
if you go into Consul, for example,
and I retrieve the service splitter
configuration, it outlines sort of an expense
service splitter. And if I print out the CRD,
the custom resource for it, you'll notice that 50% of the weight goes
to v1 and 50% goes to v2. All of this is
done through my interface of choice for my service mesh. So in this
case, this is a custom resource definition in Kubernetes, but you could
do this with an API call to Consul, or an API call to any other service mesh.
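As a sketch, the Consul custom resource for that 50/50 split might look like the following; the subsets v1 and v2 would be defined by a matching ServiceResolver, which is omitted here:

```yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceSplitter
metadata:
  name: expense
spec:
  splits:
    # Half of the traffic to each subset; the subsets map to the
    # v1 and v2 instances via a ServiceResolver (not shown).
    - weight: 50
      serviceSubset: v1
    - weight: 50
      serviceSubset: v2
```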
Why is this important? Well, when you examine this in
your service mesh configuration or your proxy configuration,
your service mesh is mapping that interface,
that declarative interface that you've made on the weight
to your proxy configuration. So effectively, what it's doing
is that transformation for you. You'll also notice the weights, 5000 and 5000,
for expense and expense v2, as well as the total weight. This
is on the administrative side of the proxy itself.
The proxy has JSON, and this is actually available
for you to see. So basically, service meshes are pushing all this configuration
out to the proxies, and now the proxies have awareness of all
of these weights that you need. The benefit of this
of these weights that you need. The benefit of this
is that if you are accessing it from report service,
so that's what I'm going to do. I'm going to
access this through an API call from my upstream service
to my expense service and I'm just going to get the expense version.
You'll notice that it is load balancing between the
Java version which is o zero one snapshot as well
as the net version which is 60. So all of these are
configure through one interface and pushed out. So irrespective of
whether or not net or Java or any other application framework,
you have a single abstraction to do that.
So security. This is where I started my journey, and this is where I
first heard about service mesh. And there were some misconceptions about it,
right? Security requires a couple of different abstractions when
it comes to loading a certificate or doing API authorization.
Libraries, whether off the shelf or ones you write yourself,
often allow sort of an easy interface to sideload
a certificate or validate it if you want. On top of that,
if you're doing something like API authorization, for example, report can only
access expense on the version endpoint,
that API authorization flow could be done separately by a
special server, or it can be done with OIDC or JWT.
In the case of service mesh, it's a little bit different.
You get mTLS out of the box between proxies,
as well as proxy filters, and the proxy filters
help you filter traffic based on API authorization
endpoints. So I'll actually show this. First we'll talk about the application side
and then we'll talk about service mesh. On the application side, the
complaint that I was getting from a number of developers for quite some time was
that they would have to add their own certificate validation
code into their codebase. And this is taken from the
ASP.NET Core documentation, but for example, in the case of
.NET, you'll have to add a validation event and you'll
have to add your own logic for that. So it can be quite a bit
of code. In the case of service mesh,
mTLS is a little bit different. mTLS
happens between each of the proxies. So proxy to proxy, from report v2
and v3 to expense v1 and v2,
communication across all of these services is mTLS,
so it's all encrypted. However, it's not encrypted between the
proxy and the application, for example between the proxy and report v3. So the connection
between each proxy that is running sidecar with a report or
expense instance and that instance is not going to have any
mTLS. That's where the caveat is, but mTLS
is going to be within the mesh and between the proxies.
Now, if you're looking at this in the service mesh,
you can actually see that it is applying a certificate to each proxy.
So if you do a config dump, which is again through the administrative interface
for Envoy proxy, you'll notice that there's a
certificate chain as well as a private key and a validation context.
All of this is done within the mesh. So you get
mTLS between proxies, effectively point to point;
it's unencrypted between the proxy and the application instance.
The second piece of this is API authorization. API authorization is
whether or not report can communicate to expense.
Can it do it on certain API endpoints? Can it only do it with
certain methods? Now in Spring, it's really easy to
get this done, in that you have an OAuth2 client annotation
as well as a global method security annotation, and then you can configure
how services communicate with each other. But if you have something like
.NET or Go, or something else where that doesn't really exist,
it's not that easy to implement; you have to build it yourself. So in
this case you can push it, once again, into the service mesh.
So for example, in this service mesh, I'm allowing report to access API
expense trip on the expense service. That API authorization
means that if the traffic going through the proxy accesses
an endpoint on the expense service that's not API expense trip, it will
not be allowed to do so. It's a little confusing,
and there's a lot of text in this, but the idea is that if you're
doing a dump on the administrative interface of Envoy
proxy, you'll notice that there's a filter implementation.
This filter implementation adds the rules for
access between services. So in this case, the principal
report can access expense on
the path prefix of API expense trip. However, it's not allowed
to access anything else. Now, if you were to look at this
not as part of Envoy proxy, but in, I would say, a
much more user-friendly way,
you can see this as part of something called intentions
in Consul. Consul basically abstracts
these proxy configurations and will sort of give
you a more intent-driven view of how it works.
But effectively, what happens is that when you create a custom resource
called an intention, the intention describes that you allow report
to access expense on API
expense trip using a GET. Report can do a GET on that API, but
it cannot do anything else. So in
this case, this intention maps down to
the proxy configuration that I showed
earlier.
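A sketch of what such an intention might look like as a Consul custom resource; the exact path is illustrative, and L7 rules like this assume the service protocol is set to HTTP:

```yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  name: expense
spec:
  destination:
    name: expense
  sources:
    - name: report
      permissions:
        # Allow report to GET the expense-trip API; anything else
        # from report to expense is denied.
        - action: allow
          http:
            pathPrefix: /api/expensetrip   # illustrative path
            methods: ["GET"]
```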
So traffic management. This one's a little bit more complicated;
it can get very, very lengthy to describe,
and so I'm going to try to abbreviate it. In
application space, especially with services, we talk a lot about circuit breaking,
retry handling, and the importance of error handling, and most
of these have traditionally been done by libraries. So there
were libraries that would allow you to circuit break based on certain configurations,
or you would write your own kind of retry handling, which does happen.
In the case of service mesh, you can get similar functionality.
There is a bit of a confusing terminology shift in that
if you're using something like Envoy, a circuit breaker is
not quite the same as the circuit breaking pattern. The circuit
breaker sets the maximum pending and concurrent connections for the upstream services,
and then outlier detection does the ejection. So technically, outlier detection
does the circuit breaking and ejects the service instance once a certain
number of failures reaches a threshold. But the
combination of the two implements the circuit breaker
pattern. So if you're familiar with that from an application view, you'll need
the combination. In the case of Spring,
it's really nice: you enable circuit breaker with an annotation there.
It makes it super easy. In the case of .NET, it's a little bit
trickier; you have to write your own circuit breaker policy. So in
this case, the trouble with this is that if you want a holistic
view across all of your services about how they're circuit breaking on
each other and all of their behaviors, you'll have to scan
through all of the code in order to find that information. So in this situation,
you do have to consider how you inject this
information into each application. And if it's not using .NET
and it's using something different, and you're doing this across multiple services,
you need to keep track of what kind of circuit breaker behavior
is happening. So there are some nuances to this, and you can implement
it again as an abstraction in the service mesh:
if there's a certain number of HTTP 500 errors, say greater
than three, eject the service instance and then divert
traffic to the other service version. So this is pretty useful.
If, for example, you rolled out expense v2 and there are a ton of
errors in it, then circuit breaking will eject the
service and divert everything by default to expense v1.
Circuit breaking does require a little bit more time to show,
and I'm not going to show it today for the sake of time,
but if you're interested in seeing this, there are a couple
of interesting videos that show how circuit
breaking in a service mesh works in greater depth.
Now, in order to configure this on the Consul side,
I won't show this in the Envoy config because it's a rather large
config. But if you're configuring this from your service mesh
and you're pushing it into your Envoy config, you would configure
something in Consul called a passive health check.
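A sketch of what that passive health check might look like as a Consul custom resource; the service name and thresholds here are illustrative:

```yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceDefaults
metadata:
  name: report
spec:
  protocol: http
  upstreamConfig:
    defaults:
      # Translated into Envoy outlier detection: once failures cross
      # the threshold within the interval, the instance is ejected.
      passiveHealthCheck:
        interval: 10s
        maxFailures: 3
```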
Finally, telemetry. This is probably the one that is my favorite,
but also the one that I most commonly get asked questions about.
Telemetry is a little bit tricky. There are two sources of
telemetry, and that's for metrics and for traces.
So when I say telemetry, I mean both metrics and traces. But there are actually two
sources of telemetry you need. There are application-side
sources, so this is like the libraries for OpenTelemetry,
the Prometheus exporters, or your own application-side options that you write.
Then there's the service mesh telemetry. So the service
mesh telemetry has proxy metrics, proxy traces.
One of the things that you have to understand with telemetry is that you
must have both the application side and the service mesh side. Just because you have
a service mesh doesn't mean that you get telemetry out of the box.
Not all the information in the service mesh metrics and traces will
help unless you have the application side set up to do that.
So one thing to consider is you need instrumentation for your application.
You cannot omit this. Your application
needs instrumentation specifically for tracing because
it needs to propagate the traces. So if you do not have metrics
or tracing in your application, adding the service mesh doesn't necessarily give
you that out of the box. So you still need that.
If you're looking at something like .NET, I'm using OpenTelemetry:
I just add OpenTelemetry metrics as well as OpenTelemetry
tracing. And easily enough, it creates the metrics as
well as the traces that I need. In this case I'm using
Prometheus, and I'm exporting Zipkin spans.
In the case of OpenTelemetry for Java, there's
an agent, so you don't actually need to add anything to
your application code. Instead you load this library and
then you add some configurations. Again, I'm using Zipkin and Prometheus.
You have to keep these consistent; if they're not consistent, then traces
in particular will not go through correctly.
So the service mesh configuration for tracing is a little bit
different. You first have to configure your
service mesh to expose the proxy traces. So the proxies
themselves carry trace information. You want to expose those.
The way you do that is that if you're in Envoy, or if
you're in a service mesh, the service mesh will push
this tracer config into Envoy.
And if you check the proxies, the proxies will
have the Envoy trace config for, let's say,
Zipkin, and then you can assign the collector cluster.
In this case I'm using Jaeger, as well as the endpoint.
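For example, with Consul you might push a Zipkin tracer configuration into Envoy through the global proxy defaults. This is a sketch: the collector cluster name and endpoint are illustrative, and the static cluster pointing at the Jaeger collector itself (configured separately) is omitted:

```yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  name: global
spec:
  config:
    # Envoy tracer config pushed to every proxy; spans are emitted
    # in Zipkin format to the collector cluster.
    envoy_tracing_json: |
      {
        "http": {
          "name": "envoy.tracers.zipkin",
          "typedConfig": {
            "@type": "type.googleapis.com/envoy.config.trace.v3.ZipkinConfig",
            "collector_cluster": "jaeger-collector",
            "collector_endpoint": "/api/v2/spans",
            "collector_endpoint_version": "HTTP_JSON"
          }
        }
      }
```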
One thing that I found very difficult in this
situation is that you have to make sure that whatever instrumentation
library you're using, and the export format for
traces, matches the tracer that
you're using in Envoy or your service mesh.
For a very long time, older Envoy
versions pretty much supported only Zipkin formats, and that was
pretty much all they would use. Now there are many more
tracer options for you, so just make sure it's consistent.
In this case, I just standardized on Zipkin because
previous libraries did not support, let's say,
OpenTracing or other formats; they were just using Zipkin. So as the
lowest common denominator for all my applications, I just chose Zipkin
spans. And in this case I would use the Zipkin format
specifically for Envoy.
In the case of metrics, you need to expose service mesh and
proxy metrics. Those actually do come out of the box
as long as you enable them. In the case of Consul, for example, I'm just
setting the Envoy Prometheus bind address on port 20200,
and that pretty much enables the proxy metrics
in Prometheus format. Now the trick, however, is that
if you really want to get the benefit of metrics, you have to merge
the metrics that you instrumented in your application with the
proxy metrics endpoint. Most
service meshes allow you to merge the application metrics
with the proxy metrics, and this is something you will need to add, or I
highly recommend you add. In the case of Consul,
you can add an annotation that says enable metrics merging equals true,
and then you tell it which port the service metrics are
available on. The metrics port is on the application;
in this case I have 9464, which was really convenient.
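Putting those pieces together, the pod annotations might look like this sketch, with the ports matching the ones mentioned above:

```yaml
metadata:
  annotations:
    consul.hashicorp.com/connect-inject: "true"
    # Expose Envoy's own metrics in Prometheus format on port 20200.
    consul.hashicorp.com/envoy-prometheus-bind-addr: "0.0.0.0:20200"
    # Merge the application's metrics into the proxy's metrics endpoint.
    consul.hashicorp.com/enable-metrics-merging: "true"
    # Port where the application (here, an OpenTelemetry Prometheus
    # exporter) serves its own metrics.
    consul.hashicorp.com/service-metrics-port: "9464"
```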
The result is that when you
get the metrics endpoint from the proxy (not from
the application, from the proxy), you'll notice that it merges the Envoy
metrics as well as, let's say, the runtime JVM metrics. This is
in the case of Java, but the idea is that you
want to expose the application metrics for Prometheus to use,
merge the metrics into the Envoy proxy endpoint,
and that way Prometheus can scrape it in one place.
Not only that, you protect your application that way, right?
In the case of the mesh, what you're trying to do is keep your
application from being publicly available.
So what you're doing is scraping the Envoy
proxy endpoint, which merges the metrics. So that's the
trick: for those who are trying to do this and have invested in
instrumenting your application, you want to make sure this is done, and this is where
it is.
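As a sketch, a Prometheus scrape job that targets the proxy's merged endpoint rather than the application directly might look like this; the job name and pod-selection logic are illustrative, and the port follows the bind address above:

```yaml
scrape_configs:
  - job_name: consul-mesh-proxies   # illustrative job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that have the Consul sidecar injected.
      - source_labels: [__meta_kubernetes_pod_annotation_consul_hashicorp_com_connect_inject]
        regex: "true"
        action: keep
      # Point the scrape target at the Envoy metrics port, where the
      # application metrics have been merged in.
      - source_labels: [__address__]
        regex: ([^:]+)(?::\d+)?
        replacement: ${1}:20200
        target_label: __address__
```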
If you do all of this right, in the case of service
mesh, what you end up seeing is
a somewhat different kind of trace. It's not vastly
different, but you do get a little bit more information. So here I've been
trying to issue traces across different commands.
Previously, you'll notice that I did some traces; I'm using Kong
as an API gateway, and Kong itself is also in the service mesh,
actually. So you'll notice there's proxy information here
about where it is, its peer, et cetera.
And then you'll notice there's actually a component proxy; this is the Envoy trace
here. The Envoy trace includes the internal span
format and lets me know exactly where it's going: it's
report, it's report v3. So this is where I know it's
going to version three. You'll notice that these are my application
traces, so this is from OpenTelemetry. I added OpenTelemetry
in here and it's tracking the calls to
the controller as well. So you'll notice the OTel library's name
as well as the GET and subsequent nested child spans
here as well. This is calling expense, so you'll notice that it
furthermore calls expense. And then you'll notice that there's the demo expenses call;
this is calling the database. So the full trace here
is available. But the only reason it works is that I have turned on
tracing implementations in every part of the expected
trace. So from the proxies to the
gateways to the internal instrumentation within the
applications, I need to make sure to propagate all of them. All right,
so we talked about these five different concepts: service discovery,
load balancing, security, traffic management, and telemetry. All of
these are very, very central to how services
communicate with each other. You can do this within an application,
but you can also abstract some of these functionalities
away into a service mesh. This isn't a statement on whether or not you should
use a service mesh. The point is that most applications will end up
using a little bit of both internal configuration as well
as a service mesh. The idea is that if you have a lot of different
services and you plan on growing, and you don't want to configure
all of these different code bases, then maybe consider doing
a service mesh as an abstraction. But if you're a developer and you're being asked to
implement it, hopefully this provides a reasonable
mapping of how you would do this in an application,
and then how it impacts and changes as part of a service mesh.
Now, if you want a very thorough example like the live one
I showed today, feel free to go to this URL. It has all
of the in-depth configuration as well as the entire
environment that you would need to set up. Hopefully it provides a
deeper reference. If you have any questions about
what the appropriate configurations are, you're more than welcome to reach out to
me. I appreciate you tuning in to Conf42.