Transcript
Hi, my name is Jason Dudash. I'm a chief architect at Red Hat, and I
focus on modern application development and cloud technologies.
Today I'm going to give an introduction to service mesh,
and I'm going to focus on the how, the what,
and the why of service mesh. And I apologize in advance.
This is going to be a little bit of a fire hose. I've got a
lot of information I want to get through in kind of a short amount of
time to do the talk. So I'll also, at the
end, be demonstrating a lot of the key concepts that I talk about
in the beginning slides. So really to set the
stage, what we're talking about here is in the context of distributed computing,
distributed systems at the most basic level, and distributed systems are
great. There's a lot of capability that you can get from distributing
your software across multiple geographies,
across multiple systems,
and you can meet all these non-functional requirements,
right? Those things that systems engineers like to call the -ilities: the scalability,
the supportability, the reliability.
And so this isn't a new concept, right?
So distributed systems have been around since Ethernet was invented,
like in the late 70s, but it's
just now become more ubiquitous and we see things
happening at much larger scales than ever before.
Distributing your software across multiple systems
is really advantageous, but it also brings a lot of challenges,
and those challenges are inherent to distributing
software capability across networks. And so there were
these things that were identified 25 plus
years ago, known as the fallacies of distributed computing,
and it impacts the way that we develop,
deploy and manage our software systems.
These are things that novice developers don't think about;
maybe experienced developers are writing some additional
code to deal with the challenges they're facing.
But for the most part, people aren't thinking about the reliability
of the network, the latency that you're going to expect
to experience in production, the bandwidth that you'll
have, and even like, security concerns
are often overlooked. And so those sorts of challenges
have existed for a long time, but they're even worse,
and they're even more impactful today
because we're trending towards moving out of our data
centers into cloud environments and we're transitioning monolithic
systems into microservice architectures. And those
microservice architectures bring us all this extra agility,
but it also means that we're distributing things at a much larger
scale than ever before. And so all these independently
scalable, single purpose services that compose your
overall application means
lots and lots of little network connections back and forth and
chains of network connections between all these services.
And so you've probably seen this before. If you've been building
microservice architectures, I've definitely seen it in
my systems and customer systems that I work with.
But once you start building these things, everything looks
good in development, and in a lot of cases everything actually looks good in test.
You fix a lot of bugs in QA and you're
like, cool, let's ship this thing. Everything's good to go.
But once you get into production, things become less predictable.
And especially over time and
under production level loads, things really don't perform
the way you expected them to perform. And scaling up isn't
like the solution that fixes a lot of the problems you have.
In fact, when we fix problems, we often do it with workarounds.
If you haven't done microservices already, you might be thinking, hey,
there are a lot of companies doing this and they're very successful.
And you're right, because they've found ways to work around and
deal with those types of challenges. So historically, what we've
seen is those challenges are addressed by boilerplate
code and third-party libraries. Netflix is probably best known
for creating some of these things, like Eureka and Zuul.
These frameworks get bundled into every microservice and provide
solutions to deal with the things we're talking about. But that's not
really ideal. It can reduce agility,
and we're talking about adding extra work
to developers' plates to incorporate and load these libraries
and actually manage those dependency chains.
Right? So imagine how much more challenging
that gets if you're not only developing Java-based microservices, but you're
also using Go and Node.js.
Now that problem is replicated across the
different tools and the different programming languages that you need to support.
And so what we're talking about here today is
a common approach to dealing with those challenges
by moving the responsibility to the platform,
so you can address it at the infrastructure layer
and developers don't have to reinvent the wheel for each new service
they're developing. And that lets us
apply policy consistently across an
entire application, across an entire series of microservices.
So I think a really good analogy that helps explain
what a service mesh does is with roads and traffic
control. If your company or your organization is a city,
then the roads that connect people's homes and
businesses and places of work, those are the
networks, right? And so if you live in a really
small town, you have just a few roads, you might not
need a whole lot of traffic control, but once you get to a city of
a certain size, you're probably now in a position
where you can't trust everyone to obey the speed limit and do the
right thing, or even that they'll do something the same way as each other if
there's no traffic signs and there's no guidance
for them to do those things. So what do we do in a city?
We put in place traffic control. We have police officers,
we have speed limit signs, we have bike lanes,
we have stoplights and walk signs and don't
walk signs. And we control what's going on
in that city. And the same thing should be true of organizations
that are deploying microservice-based applications across Kubernetes
environments. So you need to
be able to assert control over how traffic moves between
those services. And the service mesh is the control plane
for asserting that control. Right? So you could probably take the
analogy even further if you wanted to, and talk about how
observability is important because cities also
have traffic cameras. And if you can see what's going on in traffic,
you can identify bottlenecks in the system and you can audit and figure out
how to improve those things. Again, service mesh
has observability capabilities as well, and we'll get into that
when I get into the demo. But under the hood it's
pretty straightforward. I'm going to give a high-level architecture
overview of the service mesh. It starts with a
Kubernetes cluster like OpenShift, and so your service mesh is
part of that platform. And there are
two big concepts, a data plane and
a control plane, and I'll explain both of those.
So the data plane is essentially this mediation layer
that controls all the network communication between the microservices.
That's its role in life, and it does that transparently.
And one of the really cool things about how this works is that
the mesh deploys a sidecar container, which is a Kubernetes architecture
pattern where the proxy is colocated with your application.
And so your applications are in this data plane. They're all talking to each
other, but they're doing that through this sidecar proxy,
called Envoy, which is an open source project.
It's a really fast and dynamically configurable
proxy, and it provides an API so that you don't have to reload
anything; it just accepts new configurations.
And so we are able to program
these envoy sidecars to
do all the policy enforcement that we've identified,
and we define that policy in a control plane
layer. So your policy is part of this control plane.
And the reason this is really, really important is because imagine you
have hundreds of microservices and hundreds of proxies. You wouldn't want to
have to go and configure each one of those individually.
The control plane lets you define your policy and it applies it across all
of your proxies for you. And so the separation of the
control and the data planes lets you make changes to
your mesh without having to change any of your application source code.
And honestly, that's probably the coolest part
about all this: it's truly dynamic, and you're
solving your challenges without having to write new code
and without having to rebuild your services. And once your services are
part of the mesh, you don't even have to redeploy your containers
into Kubernetes to apply policy changes.
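To give a concrete sense of what "part of the mesh" means in practice: with OpenShift Service Mesh, you typically opt a workload in with a sidecar injection annotation on its pod template. Here's a minimal sketch; the names, labels, and image are placeholders I made up for illustration, not taken from an actual deployment in this talk:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: userprofile                       # hypothetical workload name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: userprofile
  template:
    metadata:
      labels:
        app: userprofile
        version: v1
      annotations:
        sidecar.istio.io/inject: "true"   # asks the mesh to inject the Envoy sidecar into this pod
    spec:
      containers:
      - name: userprofile
        image: quay.io/example/userprofile:v1   # placeholder image
        ports:
        - containerPort: 8080

Everything else in this talk, the routing, the circuit breaking, the mutual TLS, is then layered on through mesh configuration rather than application code.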
So with that introduction, that firehose of information,
let's take a look at it in action and see how some of this stuff
works. Okay,
let's dig into some observability capabilities of the mesh.
I've got a simple microservices application here with a
single sign on and a user interface and a
profile service and a couple of databases.
And altogether those microservices make up this web application.
So it lets you create boards and
add little items to boards. So I can come over here and add something to share,
say "Red Hat", to the list.
And in this particular
example I've introduced some problems so that we can explore
the observability features in action. So there's three
main observability tools that I want to showcase. The first one is called
Kiali, and it's like the main dashboard
for the service mesh. So I can come into this graph view
here in Kiali and I can see everything that's happening. I can see the ingress
into my mesh, I can see the services that are running
and what workloads are behind them. And so you see the
user interface, called App UI, running the latest version of its container;
a board service, which provides an API to edit
and store data into a MongoDB database;
and then our user profile service, which actually has two different
versions backing it, version one and version two. And so I can
also see that same information in this applications view and
I can click on this app UI, it gives
me the little graph overview, but I can click on traffic,
I can see all the inbound sources of data and the
outbound destinations. I can see the protocol types
and some metrics and their success rates on all these
things. So right now I'm running a
couple of for loops to just ping the application and
simulate some user load. And there are
some problems under this load, so we're going to dig into that.
So over here in the Grafana dashboard, I've opened
up the service viewpoint and it
shows me data and metrics about
what's happening with my services. Right now I've got this user profile
service selected. I could select one of the other services if I wanted
to see data on that. But if I scroll down and
see what's going on, I can right away see that these
graphs show me incoming requests are getting satisfied very slowly
in some cases. We're seeing 20
seconds for the user profile version two service to respond,
and only three to five milliseconds for version one. So that's a problem.
And I can see that same information via trace spans
in this distributed tracing dashboard.
So if I select the services for the user interface
and the operation call to the profile and
I click find traces, we'll see these drastically
different bubbles taking a lot longer in these calls
than these calls down at the bottom.
And this tracing tool comes in really handy when you've got long chains
of microservice calls that eventually return something
to display in a GUI or something like that.
And when things go wrong, it lets you dig down into the details
and see exactly where the problem is. In this case, the problem
is at the end of the chain. So it just looks like the
bars are full. But if something happened in the middle,
it would be really obvious and visualized very nicely to see that.
But yeah, you can get a lot of information from this trace span, all the
HTTP header information. And you can see again like
we saw in Grafana, that this user profile version two
is causing these long delays.
And that looks like this on the app. If I click profile it's
like oh man, it's chugging along, but nothing's happening. It's just
ticking, ticking and ticking, and then finally it
comes up. So those
observability pieces have told us something's going wrong.
So what we're going to do to fix that is we're going to go run
some commands to apply some policy.
So the first thing I'm going to do is create some destination
rules and virtual services, and
I can check what those things are by using OpenShift oc
or kubectl type commands.
And I can see I've got destination rules, and I've got
virtual services now. And if
I want to go and look at what that looks like in Kiali,
things are a little bit different. Now that
we're in Kiali, I can go to this Istio config view and I can
see all those different configuration items that we created.
We'll notice that there's actually a pretty nice capability where, if
you've got problems (in this case, I've got an intentional problem in
a destination rule), it's going to give you an error that tells you something's not
right. But let's go back over and show you
how we can apply some policy. We're going to
change a destination. We saw in Kiali
that we have this virtual service that was, if you remember,
splitting traffic between version one and version two,
and we can see that in the configuration.
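Just to make that concrete, here's roughly what that pair of resources looks like. This is a sketch with placeholder names, and I'm assuming the workload pods are labeled version: v1 and version: v2; it's not the exact YAML from my cluster:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: userprofile
spec:
  host: userprofile            # the Kubernetes service name (placeholder)
  subsets:
  - name: v1
    labels:
      version: v1              # assumes the pods carry a version label
  - name: v2
    labels:
      version: v2
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: userprofile
spec:
  hosts:
  - userprofile
  http:
  - route:
    - destination:
        host: userprofile
        subset: v1
      weight: 50               # split traffic evenly between the two versions
    - destination:
        host: userprofile
        subset: v2
      weight: 50

Applying resources like these with oc apply takes effect without restarting any application pods, which is the whole point of keeping policy in the control plane.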
And if I want to flip that,
which I just applied there (oops, sorry),
I can apply this configuration to send all that
traffic back from version two, which was giving us problems,
to version one of our services. And I'm
already throwing all this load in, so it should happen pretty immediately.
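For reference, the "flip everything back to version one" configuration is just the same virtual service with a single route, something like this sketch (same placeholder names as above):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: userprofile
spec:
  hosts:
  - userprofile
  http:
  - route:
    - destination:
        host: userprofile
        subset: v1             # send 100% of traffic to the version that works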
And we should see that when I
go to the profile now, it should just work. And I'm not getting those long
delay problems right. So that works here. And I'm going to
go to Kiali. We'll see that traffic has shifted over the
last minute or so, and it'll eventually get up to 100%.
So right away you can see how quickly
you can fix a problem with routing by just changing the dynamic rules
behind the scenes.
So another thing I could do is deploy a third version of this service,
and let's do that right now. I'm going
to add a v three.
So let's say we fixed that problem that was in v2, and we want to
go ahead now and patch what we had. So we'll
run a command to create the user
profile v3. We'll take a check here and make sure
it comes up and runs. It's almost
there; it's trying to find a stable state.
Cool. It looks like it's running. That's good.
Now what we want to do is
route traffic to it. But instead of making the same mistake we made
last time, let's do a canary deployment, which is an advanced
deployment technique that sends only some of the traffic there.
So we'll say, let's keep 90% of the
traffic going to the version we know works, but start shifting 10%
over to this newer service.
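That canary rule is just a weighted route in the virtual service. A rough sketch, again with placeholder names, and assuming a v3 subset has been added to the destination rule:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: userprofile
spec:
  hosts:
  - userprofile
  http:
  - route:
    - destination:
        host: userprofile
        subset: v1
      weight: 90               # keep most traffic on the known-good version
    - destination:
        host: userprofile
        subset: v3
      weight: 10               # canary: shift a small slice to the new version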
And let's curl it a bit. Actually, I'm already curling
it, so it's getting traffic already. So now if I go back over
to Kiali, I can see this version three has shown
up,
and it should start getting traffic pretty quickly
now that we've applied
that rule. So let's see.
Yes, there we go. It's starting to shift up to 10% of that traffic
over. And so it's just averaging the last 1 minute of data.
So that's why it took a little while to catch up. But the traffic was
already starting to show up there as soon as I hit enter on that
command in the command line. And so let's
go back over to our board and start popping items in.
That's v3. That's v1.
V1. V1. V1. V1. V1. V3.
V1. Cool. So it's loading fast. We fixed
that 20-to-30-second delay bug. Everything looks good.
So now we could go ahead and shift it to, like,
a 50/50 split if we wanted to and
see that. And then eventually we would just say, hey,
let's put everything over to version three, which would look
like that. Another traffic
management capability I want to showcase is called circuit breaking.
And if we go back to the profile page, we can see that
right now I'm balancing traffic between version three
of the profile service and version one of the profile
service. And you can see there are different colors.
And the circuit breaking concept is essentially
similar to a circuit breaker in your house: if
the current is going to overload the circuit
and cause problems,
the breaker interrupts that flow.
And in this case, if failures are being detected
above a certain threshold,
we'll trip that circuit, and we'll prevent further calls from
being made. We'll essentially eject that workload from the
load-balancing pool. So this looks kind of like this.
We'll create a destination rule, and down
here we'll define an outlier detection section, where we'll say how
many consecutive errors to tolerate and define some other properties and
policy details.
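Here's a rough sketch of that kind of destination rule. The numbers are illustrative rather than the exact values from my demo, and in a real setup you'd usually add this trafficPolicy to the service's existing destination rule rather than create a separate one:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: userprofile-circuit-breaker   # placeholder name
spec:
  host: userprofile
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 1     # trip after a single consecutive 5xx from an endpoint
      interval: 10s               # how often endpoints are scanned for errors
      baseEjectionTime: 3m        # how long an ejected endpoint stays out of the pool
      maxEjectionPercent: 100     # allow ejecting every endpoint behind the service if needed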
Now if I start sending load again and
go over to the Kiali dashboard, I can see.
Give it a second. There we go. So we're getting load, and you can see
this 50/50 balance is happening. You can also see that
it's visualizing that there's a circuit
breaker configured here on the service. And if
I wreak some havoc over in my cluster and just
delete that pod,
the version three pod, and just destroy it.
So where's that going to be? This one. So we killed version
three. If I flip back over to here, we'll notice that version three
will stop working in a second.
And you can see here version three started
getting errors and now the traffic is being
sent back over to version one because that circuit
breaker has tripped.
Okay, so I flipped over here to the container platform dashboard to
showcase something: a lot of the things we've been doing have been with the command line,
but there are also graphical ways to do them. Right now we're looking
at the control plane overview for the installed service mesh.
And we can see that while we turned
on all the observability capability, we didn't turn on security capabilities
for the control plane or the data plane. And we want to do that now
because we decided that maybe our policy
should take into account the need to encrypt all the service-to-service communication.
So if I go over here now, I can kind of visualize and
showcase, with a simple little curl command, that anybody can
jump on, run a container, and
get the data out of these microservices. So we're not
really enforcing a strong identity in a nice, secure
way. So let's fix that by adding a
peer authentication policy. And that's pretty straightforward; it just looks
like the sketch below. We'll do a create command, and now
I need to set some destination rules to tell the rest of the
services to communicate using mutual TLS.
So I'll do that right now. That's going to create the destination rules.
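Roughly, those two pieces look like this. It's a sketch: the namespace and host names are placeholders, and I'm assuming a namespace-wide policy rather than a per-workload one:

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: my-app-namespace     # placeholder namespace
spec:
  mtls:
    mode: STRICT                  # only accept mutual-TLS traffic from mesh identities
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: userprofile-mtls          # placeholder name
  namespace: my-app-namespace
spec:
  host: userprofile.my-app-namespace.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL          # clients in the mesh originate mutual TLS to this service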
And as a quick test, if I run that curl
command again, we'll see that it fails, because it wasn't coming from
a known identity; it wasn't a member of the mesh that's allowed to make
these sorts of calls.
And if I flip back over to the web console, I could have easily done
that in a similar way just by turning the checkboxes on
for the whole mesh.
So that would be another way to do it.
The last thing I want to showcase, security-wise, is just
a quick look at some of the additional resources
that you can configure beyond just the mutual TLS.
You can also have the services verify
JSON web tokens for trust, and that
would be through a request authentication policy.
And in addition to authentication,
authorization can be done via authorization policies.
And those both look a little bit like the sketch below.
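For reference, here's roughly what those two resources look like. The issuer, JWKS URL, and workload labels are placeholders, not values from a real single sign-on setup:

apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: userprofile-jwt           # placeholder name
spec:
  selector:
    matchLabels:
      app: userprofile
  jwtRules:
  - issuer: "https://sso.example.com/auth/realms/demo"                                # placeholder issuer
    jwksUri: "https://sso.example.com/auth/realms/demo/protocol/openid-connect/certs" # placeholder JWKS endpoint
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: userprofile-require-jwt   # placeholder name
spec:
  selector:
    matchLabels:
      app: userprofile
  action: ALLOW
  rules:
  - from:
    - source:
        requestPrincipals: ["*"]  # only allow requests that carried a valid, verified JWT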
So with that demo, that's everything we
had time to demo today.
We just scratched the surface, so there's a whole lot more information.
If you want to go deeper, I recommend you go to developers.redhat.com and
check out the service mesh topic. We've got tutorials and articles
and videos you can check out, so it's a great resource for you.
And if you want to ask me questions, reach out to me directly. Feel free
to scan this QR code, it'll take you to my social accounts
and also Red Hat links are below so you can find out more about
Red Hat. Thank you for watching.