Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everybody. My name is Michael Haberman. I am the
co-founder and CTO of Aspecto. Today I would like to
talk with you about one of the most exciting
topics out there: microservices, in
particular a lot of microservices, and even more particularly,
not breaking your API. So if you run microservices today,
or did in a prior job, you know that
microservices can be complex. In this talk
I'm going to try to help with that: to predict the
challenges you are going to face and show how we can overcome
them. We are going to do that by going through a journey:
how a typical microservices project starts in
a company, how it starts small and simple, and then
gets more and more complex as you go, almost by
definition. And then we need to ask ourselves what tools
we have to overcome these challenges,
this complexity. Specifically, I'm going to talk
about distributed tracing: we are going to take a
look at what distributed tracing is, how we can use it to overcome microservice
issues, and also how we can use it in a cool new
way that I think is really going to help your
development process. So let's get started. But before that,
let me tell you why I feel that I know
a bit about microservices. I've been doing microservices for
about five years. I started as an independent consultant,
helping companies break their monolith into microservices,
or helping when their microservices scaled and they needed
a hand with that. At my last position, I was the chief
architect at a company, managing about
120 microservices at quite some scale. And in
the last two years I decided to do my own thing and
started Aspecto, basically a product focused on helping
developers with microservices. So yeah, let's dive right into
it, looking at your typical microservices journey.
Most companies migrate to
microservices rather than starting a brand-new company
or product with them. You usually have some problems
with your monolith: it's hard to deploy, it takes
a lot of time to test, you have a lot of regression bugs. These are
the usual complaints you hear when starting the migration
process. And when you start that process and
begin to create microservices, it's usually something outside of
the monolith: a new feature, a greenfield
feature, something that doesn't involve a
lot of monolith logic. And then you start to
develop, and you develop those three services that you can see on the
screen. Your architecture is really simple:
you have service A communicating with service B,
which communicates with a database; service B also communicates with the database;
then service B communicates with service C, which uses the
database as well; and maybe even service A communicates with C.
Something quite simple, quite straightforward, very easy to maintain,
very easy to understand. One important thing to say: we are talking about HTTP
at this point, so it's synchronous communication
between services. So how do you identify
that you are at the beginning of your microservices journey? It's very
simple to run it locally: I can just spin up three processes,
maybe with Docker, maybe without, maybe with Docker Compose.
It's very simple to spin it up in your local environment.
And if I were to ask you to go to a whiteboard
and describe your architecture, you would do it quite easily. Maybe the best
way to know that you're at the beginning of your journey is this:
how easy is it for a developer to onboard?
If it's easy, that means you're at the beginning. And when looking
at products in companies, successful projects continue.
Now, most likely your first
days in microservices are successful: you have a small number of
services, you don't have high complexity, and you manage to release
this new feature fast. So product is happy, sales are
happy because they can sell it, the business is happy because
things are going fast; everybody is happy. And when everybody is
happy, you start to get more requirements, more features;
you start to take responsibility away from your monolith,
and it starts to grow. When it starts to grow, your architecture
a few months later starts to look more like this,
with a lot of components on the screen. But as
you start to draw the relations between
them, this is where it starts to get a bit scary, because there
are a lot of relations between them. And I went easy
here; you could have tons of relations between different services
depending on one another. It's starting to get big,
starting to get more and more complex. So if I were
to ask you a few questions: okay, can you please
draw the architecture diagram of your project? That's
getting harder and harder as you go. And if I
would ask you: let's take this service, for instance. Who is
using it? Which other services or clients are calling this
service? And does this service call other services? If you know
the answer, it's going to really help you. But as the picture gets bigger
and the diagram gets bigger, it's harder to remember.
Also, when you are looking at the communication, there is
tons of it. We chose at this point to stay
with HTTP, but even with HTTP you need
to remember the route you are calling, the verb,
and also the structure itself. What is the contract that
this service expects to get? When I'm communicating with another service,
I need to know what the contract between us is.
And sometimes even that isn't obvious.
So I've raised three main concerns
that, at this point in your microservices journey, I think would
be the main ones: understanding the bigger picture, understanding who
is consuming whom, and how they're consuming. Those are
the three questions. But we are developers; we know how to solve
problems, this is what we do. So let's go and
get it fixed. One thing to remember, though, and this is really
important when talking about microservices, or distributed applications
as a whole: as the picture, and when I say picture
I mean your architectural diagram, gets big
and increases over time, it also increases the risk of
having production issues. This is something we need to
remember, because it's going to guide us through this talk.
So we agree that as microservices grow,
as we have more and more services, we are going to
have production issues. The reason is basically that when you
have more to remember, more dependencies, more things
to take into account when you code, it just
gets bigger and harder. And microservices, by their
very definition, keep on growing. If I
were speaking in front of you and could ask you
how many services you have added in the past year, most of
you would say some number. But if I were to ask whether you
removed a service? That doesn't happen a lot.
It does, but not at the same rate. So microservices
kind of dictate that we have a lot of them:
we want separation of concerns, and we have all
kinds of reasons why we are separating. But microservices
usually grow, and when they grow, the picture gets bigger;
and when the picture gets bigger, the risk of production issues increases.
This is an equation we need to take into account
when thinking about microservices. What I want
you to remember at this point: when microservices grow,
the risk of production issues grows as well.
Okay, so we mentioned three issues:
the picture is big, it's hard
to understand who is consuming whom, and it's hard to understand how.
These are problems; let's try to solve them.
From working with a lot of customers on
projects migrating to microservices, I'm going
to go over the main points that repeated across those
projects. The first thing people do when
they're having a hard time understanding their architecture is create
an architecture document. They go to
any kind of tool where you can draw your diagram, and I
bet every one of you has some
kind of architecture chart in your organization. And if I were
to ask how confident you are that it
is accurate, I think all of you would find the honesty to
say, yeah, that's probably not 100% accurate.
It's somewhere in that direction, but it's not 100%.
And I think that makes sense; it's very natural.
Go ahead and create your architecture document, that's
fine. Just remember it's never up
to date; it always lags a step behind. So we have an
architecture document that we made manually. Okay, that helps.
That doesn't fully solve it, but let's go to the second
issue, which is a bit more complex. For a specific service,
I want to know which services it depends on
and which services it serves, which services consume
it. This, again, can
be somewhat difficult. Maybe more documentation,
maybe finding some service-map solution, maybe searching
for some service-catalog solution. You could find a
way, but it's a real challenge either way. In
the early days I would probably go with docs,
but down the road I might take on some vendor to
help with that. And then there's the question of how services communicate with one another.
Well, we can just go with Swagger. Swagger is a great tool,
or OpenAPI, which is the name of the
specification that grew out of Swagger.
OpenAPI really helps you document your HTTP endpoints. And yeah,
again, you can write documentation; it's usually not one hundred percent accurate,
but really helpful at the beginning.
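Just to make that concrete, here is what a tiny OpenAPI document for one hypothetical endpoint might look like (the service name, route, and field names are made up for illustration):

```yaml
openapi: "3.0.3"
info:
  title: service-b        # hypothetical service name
  version: "1.0.0"
paths:
  /v1/items/{id}:
    get:
      summary: Fetch a single item
      parameters:
        - name: id
          in: path
          required: true
          schema: { type: string }
      responses:
        "200":
          description: The requested item
          content:
            application/json:
              schema:
                type: object
                properties:
                  id:   { type: string }
                  name: { type: string }
```

The route, the verb, and the contract we just talked about are all captured in one place.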
Okay, so we did all of that, everything
is good, everybody is happy, but time went by,
and now you have your first downtime. When something is
down, you start to realize it's
quite problematic, because with HTTP,
if service A, as you can see in the diagram below, sends an
API call to service B and service B is
down, not available for some reason, that basically
means you lost data. And we don't want to lose data;
the fact that we chose to build a distributed application doesn't mean we need
to lose data, right? So HTTP
is great, I love working with HTTP, but do remember: if your
service is down, you may lose data. Losing data is something
we should be afraid of, and your boss might
get upset with you because you lost data.
And then you sit outside and just say, I hate HTTP,
I just hate it. But we can fix it. We know how to
fix it: HTTP alone doesn't work for us here, so we need to
move from synchronous to asynchronous communication. And there
are endless options out there; you need to choose the
right one for you. It could be a pub/sub solution, it could be a
queue solution, or to be more specific: Kafka, Redis
Pub/Sub, RabbitMQ, AWS SQS.
There is an endless amount. You just need to pick the one that fits your
needs, or the one your company already works with. I'm going
to refer to it from now on as Kafka,
just because we use Kafka a lot and it's very trendy
these days. So when I'm introducing Kafka,
how is that going to help me? You can see here the diagram
I showed you at the very beginning of this talk:
we have three services, A, B, and C, and they communicate with one another.
But now, if I take the example of how service
A communicates with C, they are not communicating directly;
they are communicating through Kafka. Service A
sends the data to Kafka, and Kafka
receives it and persists it. Then
service C asks Kafka: hey, can you give me more
messages? I'm ready to process more messages.
This thing Kafka does, persisting the data until
service C completes its work on it, basically ensures that
downtime doesn't cost you data. And that is great; that is exactly
what we wanted.
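To make the decoupling concrete, here is a minimal pure-Python sketch of that persist-until-consumed idea. This is an in-memory stand-in, not real Kafka; with the real thing you would use a Kafka client library, but the flow is the same:

```python
from collections import deque

class TinyBroker:
    """In-memory stand-in for Kafka: messages persist until a consumer takes them."""
    def __init__(self):
        self.topics = {}

    def produce(self, topic, message):
        # The producer returns immediately; it never talks to the consumer directly.
        self.topics.setdefault(topic, deque()).append(message)

    def consume(self, topic):
        # The consumer pulls when it is ready; None means "nothing new yet".
        queue = self.topics.get(topic)
        return queue.popleft() if queue else None

broker = TinyBroker()
broker.produce("orders", {"order_id": 1})   # service A fires and forgets
# ... service C may be down right now; the message waits in the broker ...
msg = broker.consume("orders")              # service C pulls when it comes back up
print(msg)                                  # {'order_id': 1}
```

The point is that service A never needs service C to be up at the moment it sends.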
Data is persisted, no data loss; if a service is down,
we'll just spin up a new one, and everybody is happy. But we are experienced
enough with distributed applications and microservices
to know that there are usually downsides to architectural decisions.
So think for a second: what problems are we
going to introduce by introducing Kafka? The first is
that our whole picture, our architecture diagram, just got
way more complicated. There is a ton of stuff that happened
just because we made this tiny change, and we need to take a look
at it. So service A is calling service B,
that's fine. Service A is calling Kafka.
But then, as service A,
who am I communicating with? I just lost the ability to
understand, from service A's perspective, who is going to
consume this data. I can only find out by going to
the code base of service B and figuring out
whether it uses that Kafka topic. That
works easily enough when you have three or ten services, but when you have 100
or 120 microservices, it starts to get really problematic:
I don't know who produces this data, I don't know who consumes
it. So by introducing Kafka,
we didn't only solve the problems we
had with synchronous communication; we also
introduced some problems in terms of understanding
who is communicating with whom. Another thing that is now part
of the architecture, and you just need to use it in the way that
fits your needs: it's one-way communication.
You don't have a request and a response. You just
send your payload and somebody is going to get it. You're not going to get a
response, and that's fine. Also, debugging
locally got more complicated. You can't just spin up service A and service B
and start sending data: you need to spin up service A,
you need to spin up Kafka, you need to spin up service C, and you need
to send data to service A so it populates Kafka
so C can consume it. Everything is starting to
be more complex. And Swagger is no longer relevant, because Swagger
documents HTTP communication rather than arbitrary payloads
in any form. So those are the downsides.
I wanted to emphasize how this looks,
so I took the second diagram I showed you, the one with twelve services,
and started to introduce Kafka in the middle. I think
you kind of get the idea; I hope
you look at this and say, nah, that's too crowded. And as you
can see, I only added Kafka on the left
side of the screen. It gets really complicated, it gets messy,
it's very hard to work with. And again, as the picture
gets bigger, the risk increases. So if
we are able to serve the developer the big
picture in a good, compact way, that is going to simplify
the developer's work and reduce the risk. When I
was a developer looking at the code base of service A,
I knew it communicated with service B; there was no doubt in
my mind. Looking at Kafka solutions, I will have
doubt. If I had the ability to answer, for the developer,
who is consuming this message, I would reduce the risk
of production issues. So,
to summarize so far, what I mean when
I say that microservices are complex:
the name is micro, so you are going to have a lot of them.
I know a lot of companies say, we're not doing microservices,
we only have like five or ten, we don't have
the big setups with thousands of services. But any distributed application
is going to have these issues; it just depends
on how significant they are. The more micro you go,
the more complexity you are going to face, but it's
going to allow you to run faster, to deploy really,
really fast. So that's one thing that makes it more complex:
it's very legitimate to create more and more services.
Also, microservices allow us to choose the right tool for the job:
whether you need a special database or a special programming language, you
just spin up another microservice, and that's absolutely fine.
But you're going to have more of them. And it's not only
HTTP; async communication is very popular,
and we see all kinds of it, each with
its own purpose. At some point
or another you are going to face non-HTTP communication, and it's going
to get complex. So I hope I
scared you just a bit, just the right amount.
Now you know that distributed applications are complicated, and
that we have tools to overcome it. Every time I spoke
about the complexity, I spoke about the big picture:
understanding the big picture, who is communicating with whom,
and how. This is what I want to try to
help you with. There is something out there called distributed tracing,
and it's getting more and more popular. The CNCF, the same foundation responsible
for Kubernetes, is also responsible for a project called OpenTelemetry,
which allows you to do distributed tracing.
The concept, and I'll show it to you in a second, is quite simple:
every microservice reports what's happening.
By what's happening I mean: as a service,
I got an API call, I performed a DB query,
I set some data in Redis, I communicated with
my cloud provider. All of those are reported into one
central place, and there is a link between
them. If I see that some service
set some data in Redis, I have a link to the HTTP
call that initiated that Redis statement.
Here is the way it looks. You can see we have two
services, service A and service B, and within the
application code we have OpenTelemetry.
Service A sends an API call to service B;
service B gets it and does something
with it. You can see that both of them are
writing to a central traces solution. When service
A got the API call, it is basically
the root of this trace, the starting point, the entry
point. It records this
and says: okay, I'm trace number one, this is the first
trace, and I'm also the first span, the first action
that took place under this context.
It will be more clear in a second. Then it sent the
API call. Now, when this API call is
sent, OpenTelemetry actually injects a unique header telling
the next microservice in line: hey, you are not the
first one, I was before you, and I want you to link us together.
So when service B got the API call, it took
the reference it got from service A,
and as you can see, when it reports to the central trace place,
it uses the same trace but span number two.
So now, if I ask you what's in trace one,
the answer is span one and span two.
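That linking mechanism can be sketched in a few lines of plain Python. This is a simplified stand-in for what the OpenTelemetry SDK does for you with the W3C traceparent header; the header name is real, the rest is illustrative:

```python
import uuid

collected_spans = []  # stand-in for the central tracing backend

def start_span(name, headers):
    """Record a span, joining the caller's trace if a traceparent header is present."""
    parent = headers.get("traceparent")
    trace_id = parent.split("-")[1] if parent else uuid.uuid4().hex
    span_id = uuid.uuid4().hex[:16]
    collected_spans.append({"trace": trace_id, "span": span_id, "name": name})
    # Inject the context so the next microservice in line links itself to us.
    return {"traceparent": f"00-{trace_id}-{span_id}-01"}

# Service A handles the incoming request; no traceparent yet, so it is the root.
outgoing_headers = start_span("service-a: GET /items", headers={})
# Service B receives A's API call, with the injected header attached.
start_span("service-b: GET /users", headers=outgoing_headers)

# Both spans now carry the same trace id: one trace, two spans.
print(collected_spans[0]["trace"] == collected_spans[1]["trace"])  # True
```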
Span one represents the API call of microservice
A, and span two represents the API
call of service B. Being able
to see it all together gives you the story of
what happened to a particular API call. Let's even see that in
action. Here you can see Jaeger. Jaeger is
a very well-known tool that lets you
visualize traces, OpenTelemetry traces and other
types. This is some flow in our backend:
you can see we have Aspecto API docs, we
have Aspecto account, and we have a versions API lambda.
You can see the process: we got an API call
to OpenAPI packages, then we sent an API
call to Aspecto account to get the user, probably to
authenticate the user. Once it was authenticated, we invoked
a lambda called versions API, which ran a query
on DynamoDB. And I can see here the entire flow.
So I have three microservices involved, the API docs, account,
and versions services, and I
can see the interaction between them. If I
click on one of them, I can even see all the
relevant data I need in order to
understand what this thing is doing. Basically,
OpenTelemetry gives you the ability to take one
particular request and visualize it all together.
Let me give you a bit more detail on
how it looks, looking at how to implement
OpenTelemetry. It's usually kind of simple.
You have an SDK; this one is our OpenTelemetry
distribution, but there is plain open-source
OpenTelemetry you can implement.
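As a sketch of that setup, here is roughly what wiring the open-source OpenTelemetry SDK looks like in Python (the idea is the same in every language; the ConsoleSpanExporter here just prints spans instead of shipping them anywhere, and the service and span names are made up):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Wire the SDK: every span that finishes is handed to an exporter.
# Swap ConsoleSpanExporter for an OTLP exporter to send spans to
# Jaeger, a vendor, or your own collector.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("service-a")

with tracer.start_as_current_span("GET /v1/items"):
    # Handler logic goes here; spans created inside this block
    # (DB queries, Redis calls) are linked to it automatically.
    pass
```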
Basically, it's an SDK within your code that
sends data out to whatever destination you
choose. A destination could be directly
something like Jaeger, which would
take it, persist it in some database, and let you visualize
it. You can send it to some vendor that will
visualize it for you, or you can send it to your own database and then
query the data in whichever fashion you
want. For instance, Aspecto takes
this Jaeger-style UI and
presents it in a somewhat
different way, I would say; you can see here how
it looks within Aspecto. So once
you've implemented OpenTelemetry
and you're sending the data, what you get
out of it is two important things. The first one is the ability to
debug whenever you have an issue: now you have the whole story, all the
breadcrumbs together, to understand how you reached
this situation, the situation where you have this
bug, and now you can understand it. It also helps
you visualize: we took some actions and put
them together on a graph, and now it's more visual for
the developer, which is definitely better. I'm not sure it
fully answers the big-picture question yet, though.
One really, really important
thing: in your logs, you ship all
kinds of metadata, right? When you
have a bug in production today, you get some error,
maybe try to reproduce it, maybe go to your logging
solution and try to find the exception that caused it.
Imagine you found an exception, quite a generic
one, but you also logged the trace id. The trace id
allows you to throw it back
into Jaeger. So you got an exception, you took the trace
id, as you can see here, that's the trace id,
you throw it in, and now you can see the whole process
that caused this exception. It's a really, really cool
trick, a simple one that you should definitely do. And I would urge
you to start with OpenTelemetry and Jaeger; it's kind of easy to set
up, it just works, and it's amazing.
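Here is one simple way to sketch that trick in plain Python. In a real service you would pull the id from your tracing SDK's current span context; here it is just passed in, and the logger name is made up:

```python
import logging

# Every log line carries the current request's trace id, so any error
# can be thrown straight back into Jaeger.
logging.basicConfig(format="%(levelname)s trace_id=%(trace_id)s %(message)s")

def get_logger(trace_id):
    # LoggerAdapter stamps every record with the request's trace id.
    return logging.LoggerAdapter(logging.getLogger("service-a"),
                                 extra={"trace_id": trace_id})

log = get_logger("4bf92f3577b34da6a3ce929d0e0e4736")
log.error("failed to charge credit card")
# logs: ERROR trace_id=4bf92f3577b34da6a3ce929d0e0e4736 failed to charge credit card
```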
Now, I started to talk about visualization and
said I'm not sure it answers the big-picture question. Let's say
you have some complex system,
a lot of services communicating with one another, and
everything is being sent to Jaeger. Here you
can see how Jaeger shows you the diagram
of dependencies between services. But this is
communication at the service level. It just tells you, hey,
the scraper service is communicating with the user service, and it
also communicates with the Wikipedia service. It just tells you service
A communicates with B, who is communicating
with whom. It doesn't answer the important question of
when, from which endpoint, they
communicate with one another.
So let's give an example. On the left-hand side
you can see the service A, service B, service C communication.
That's cool. Now I'm going to show
you a different view of service A: service A
has an endpoint, /v1/items, from which it calls service
C, but it calls service B only from /v1/purchase.
This is the resolution the developer is looking for:
I want to know when services communicate with
one another. The fact that they communicate is important,
but it doesn't tell me all the details just yet. Those are
the kinds of things Aspecto is really good at helping the
developer with, because we try to look at it from the developer's perspective
and answer what they are looking for. So we understood the
problem, and now we know OpenTelemetry, at least briefly,
and what it can help us with. Let's assume you
started to collect distributed tracing data.
Let's talk about why it's extremely important. We
already said that tracing as
a whole helps us, but not enough: it still doesn't
fully give me the big picture,
it doesn't exactly help me with dependencies, and
it doesn't help me narrow the gap between production and dev.
Let me emphasize. Say you want to replay traffic.
I don't know how you would
solve that today, but it's hard; you need to start working on it
and introduce tools that allow you to do it.
If you want to generate mocks, you usually write static mocks. If you
want API documentation, you do it manually, but maybe you could
auto-generate it. Think about all of those things: all of them
are present in your tracing data. If you have the raw
data, the raw network communication between
the services, you can just take this data and create documentation
based on it. So I think you should use OpenTelemetry
data. And it's very simple: you have SDKs deployed in
all kinds of different microservices, and all of them report
to a component called the OpenTelemetry Collector,
basically something that knows how to receive all the spans and send
them to a database of your choosing, such as Elasticsearch.
On top of that you have Jaeger; Jaeger can communicate
with Elasticsearch, I think it's their best
practice. And then, whenever
you have a question about how things
operate in your production environment, go and ask your database.
It's already there. Just to give you an idea:
say you need to generate a mock for your unit tests.
How are you doing it today? Probably looking at the
code and saying, okay, I assume I need
this data to look one way or another, then creating
the basic static thing I need for my mock,
and never changing it, right?
Unless something really significant happens. But what
happens if we use the database to fetch mocks?
Now I have real, relevant traces with real,
varied usage, and I can easily reproduce
my production environment much better in my tests.
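A rough sketch of the idea, assuming you have already pulled spans out of your trace database; the span shape here is simplified for illustration, not the real OpenTelemetry schema:

```python
# Spans fetched from your trace database (shape simplified for illustration).
stored_spans = [
    {"service": "service-b", "route": "GET /v1/users/42",
     "response": {"id": 42, "name": "Jane", "plan": "pro"}},
    {"service": "service-b", "route": "GET /v1/users/7",
     "response": {"id": 7, "name": "Ada", "plan": "free"}},
]

def mock_from_traces(service, route_prefix):
    """Build a unit-test mock from real recorded traffic instead of a static guess."""
    for span in stored_spans:
        if span["service"] == service and span["route"].startswith(route_prefix):
            return span["response"]
    raise LookupError(f"no recorded traffic for {service} {route_prefix}")

# In a test: stub service B's user endpoint with a production-shaped payload.
user = mock_from_traces("service-b", "GET /v1/users")
print(user["plan"])  # pro
```

Instead of hand-writing a static payload, the mock is whatever production actually returned.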
This is very easy to do and could really improve
your tests. So I think we found some cool ways, and I just
threw an idea out there; there are tons of things you
can do with distributed tracing data.
If you're interested in that, go check out Aspecto and our
blog; we talk about it quite often. And yeah,
we started from three microservices: very simple,
easy to reproduce locally, easy to tell another
developer about. Then things started to get more and
more complicated, so we introduced async communication using
Kafka. Then we kind of lost sight of what's happening,
so we had to introduce tracing. And then we found out what
we can do with tracing, and we can do a whole bunch with it.
My suggestion to you: get familiar with OpenTelemetry,
get to know distributed tracing, and understand how
to implement it in your microservices; it's going to be super
helpful. Always log your trace id: in any log in our
system, we have the trace id, so if something happens, we can always throw
it into Jaeger and visualize it. And once you have all
of that, go to your database once a week and have a look;
I'm pretty sure you're going to find some interesting stuff.
So thank you very much, I really enjoyed talking about
it. And if you have any questions, feel free to reach me by email,
Twitter, whatever. Thank you, and hope to see you next
time.