Transcript
This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE? A developer, a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with Chaos Native. Create your free account at Chaos Native Litmus Cloud.
Hi folks, I am Dan, and today I'm going to be talking to you
about how to achieve reliability nirvana. And we
will do that by utilizing event driven design.
So let's get to it.
First things first, let me just give you a quick disclaimer.
So everything that I talk about I have done in production,
none of this is theoretical. This is stuff that I have personally done.
And if I haven't done some of it either in production or
maybe I've just done it in staging, I will let you know that that's the
case. I also try to keep it real,
meaning that if there is something that is really difficult, I will let
you know that it's actually difficult. And some of the stuff that looks difficult,
it is actually easy. I'll let you know about that, too. Really,
the final piece is that my own personal goal for this
talk is that I just want you to walk away having learned
something, right? Just have something tangible
at the end of a talk. So with
that said, let me talk about myself first, because every
single talk is supposed to have at least one slide about the presenter. So let
me be a little self-indulgent here. So I am Dan. I reside in Portland, Oregon. I have worked in the back end for about ten years, probably more at this point. I've been saying "ten years" for a while. So I
love building and operating distributed systems. I figured
out a while ago that I'm really into the architecture part
and the design portions of distributed systems. It's really fun
and really interesting. I was previously an SRE at
New Relic. I was an SE at InVision, DigitalOcean, and Community. I also spent a lot of time in data centers as well, basically wiring stuff up, well, gluing things together,
really, between systems and software.
And most recently, I co-founded a company called Batch, and we focus on basically providing observability for data that is usually found on high-throughput systems such as Kafka or RabbitMQ, et cetera, essentially message brokers. And a cool fact is that we got into Y Combinator, which is pretty sweet.
And one fact that you can immediately take away with you is that I am
originally from Latvia and there are at least literally
dozens of us that like distributed systems
and are from Latvia. So there you go. You've already
got one thing, you know. All right, so what
is reliability nirvana?
Well, it is not being woken up at 3:00 a.m. on Saturday night, or really any night at 3:00 a.m. I do not
want to be woken up at all. We want to have predictable service failure
scenarios. We want to have well defined service boundaries.
We want to be security conscious. We want to have self
healing services, and we want to be highly scalable and highly reliable.
Of course, now you might be saying that, well, we already
have this whatever tech. And yeah,
you're right, we do. We have the microservices pattern
that deals with the monolith problem that we've had of being able
to decouple things and basically slice them up. And after we slice
them up, we slice them into containers, which is perfect. We're now able to
have reproducible builds. We have a really nice dev flow because
everything is in a container now. We can make use of container orchestration, such as Kubernetes or Mesos and so on, to tackle all the concerns around lifecycles, right, the lifecycle of a service. And then we have things like service meshes, which ensure that our services are able to talk to each other correctly and so on. And there's probably a whole slew
of other things that you might have that
are going to help you achieve even better reliability as well,
such as, say, APM solutions like New Relic and Datadog and Honeycomb, and logging platforms and so on. There's really a ton of stuff, like tracing, right? Like request tracing and so on. Things have gotten pretty good overall, really. So what is the overall problem if things
are already pretty great? Well, the issue is that
if you want to go a little bit higher, if you want to go and
basically take the next step and you want to achieve even higher
reliability, it gets exponentially harder.
It gets exponentially harder.
Case in point, if we're just talking about the microservices pattern
itself, if you have, I don't know, 20 or 30 microservices or something like that, you have a massive failure domain. You actually have no idea how big your failure domain is, because one service might actually take
down six other services and cause three other services
to become really, really slow and so on. So the usual
approach to solving this sort of a thing is to say, well, we're just going
to employ circuit breakers everywhere. But as it turns out,
with circuit breakers, it is really easy to shoot yourself
in the foot as well. And they're notoriously difficult to configure and get right. It's not that circuit breakers are difficult to actually implement; it's that getting them right is actually really hard. For those of you not familiar with Hystrix-style circuit breakers, the idea is basically that they're patterns, code patterns, to introduce fault tolerance into your requests, right?
So when you're making an HTTP request,
if it fails, then it's going to be retried automatically for you, for however many times, with an exponential backoff and so on.
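For illustration, here is a minimal Go sketch of that retry-with-exponential-backoff behavior. The attempt count, base delay, and URL are made-up example values, and a real Hystrix-style breaker would additionally track failure rates and trip open; this only shows the retry piece being described.

```go
package main

import (
	"fmt"
	"log"
	"math"
	"net/http"
	"time"
)

// getWithBackoff retries an HTTP GET with exponential backoff.
// A full circuit breaker would also track failure rates and "trip open"
// to stop calling a service that keeps failing.
func getWithBackoff(url string, maxAttempts int) (*http.Response, error) {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		resp, err := http.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil
		}
		if err == nil {
			resp.Body.Close()
			lastErr = fmt.Errorf("server returned %d", resp.StatusCode)
		} else {
			lastErr = err
		}
		// Exponential backoff: 100ms, 200ms, 400ms, ...
		delay := time.Duration(100*math.Pow(2, float64(attempt))) * time.Millisecond
		time.Sleep(delay)
	}
	return nil, fmt.Errorf("all %d attempts failed: %w", maxAttempts, lastErr)
}

func main() {
	resp, err := getWithBackoff("http://localhost:8080/health", 5) // example URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("status:", resp.Status)
}
```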
In my experience, when it's been done, it is
very easy to not get it right. And in some cases
the shooting-yourself-in-the-foot part is basically misconfiguring it in such a way that it trips when it's not supposed to. So when the service is not actually down, something will
happen and will cause it to trip and thus real
requests are going to get dropped. And at the same time you also have
to figure out how to avoid cascading failures. But the good
news is that there's always somebody who is going to say, well, my service
knows how to deal with that and it's going to prevent everything else after it
from failing in a cascading fashion,
which is of course totally not true. And whoever
that is, their service is actually maybe not that great. So at the same time,
we also want some sort of self-healing, right, at the service level, meaning that just because we have autoscaling in place, what happens to that request which was mid-flight in the middle of the autoscaling event? Does it just get dropped? And are we just simply okay with a 0.2% failure rate because of a deployment or something like that? So that
by itself already is hard to achieve as well.
And at the same time, you need to keep security in mind.
And as we all know, PMs just absolutely love it when you are spending time not shipping features and just working on something that may seem to them kind of useless.
So point being that getting to that
next level afterwards is really hard. It's akin
to moving from VMs over to containers
for the first time. It's a fairly big deal at that point.
There's not unfortunately like some sort of a silver bullet which is just going
to automatically solve all these problems. So what
you might be seeing here is that a pattern is starting to emerge here. We're really talking about services.
It seems like the infrastructure part and the platform components
and so on, they're actually fine. There's nothing really inherently wrong with them. As a matter of fact, the microservice pattern is great, Kubernetes is great, Docker is great. All these things are really fantastic. So the place where we need to put our focus is on the services themselves.
So how would we achieve that? Well, we would want
to have some sort of a solution which ensures that maybe the services shouldn't
rely on each other anymore, meaning that service A does not have to talk to service B. And as a matter of fact, none of them should do that at all, right. We would also want a situation where developers do not have to write code specifically to deal with a service coming back after it has dropped a bunch of requests.
Right. Not having to write these sort
of fairly complex fault tolerance systems and so on in
place. Similarly, for SREs: what would be really awesome is if SREs did not have to write any sort of per-service firewall rules, just punching holes in different places whenever service E needs to talk to service X, and so on and so on. It would be nice to just be able to put a blanket rule down and it just simply works, right?
And then on top of that, it doesn't even have to be touched again at
some point in time. A nice bonus to all of
this would be that we would be able to investigate every single state change
that takes place as part of a
single request. To some extent we can already do that,
because if we have request tracing and so on, you could potentially
do that. But again, not everyone has request tracing, so it would be nice if something like that just basically came for free, that we were just able to get it. And finally,
a really big one is that it would be super awesome that all this
really concrete, very specific systems
data that is flowing from somewhere which is representative of the
current system state would be available ultimately for
your future analytics uses. So, meaning maybe you don't have
a data science team right now, but maybe you will have in about six months
or something like that, being able to show them that, hey,
you can hook into here and you can see all the messages that are there.
Everything that's ever transpired on our systems would be super awesome
because they could then actually build various systems to analyze
the data, predict things and so on, and create dashboards
and whatever other cool stuff that data science teams do. I'm not
a data scientist, as it turns out. This is kind of what I do instead.
So what you might be noticing here is that reliability nirvana
is actually not just service reliability or
systems reliability. It's actually effortless
service reliability. We want something that is able
to go the next step where I do not have to spend time
building something that is just for a singular purpose only,
such as implementing some sort of a circuit breaker pattern.
Well that gets me closer to better reliability. But it's not the end
all be all. It is just one piece of the puzzle.
And what you might be seeing here is that what I'm going towards is
that there kind of is a solution which is able to address all of
those things. It is not easy by any means. However,
it does exist and it is totally doable.
You can actually implement it and it is totally possible. But there are certainly
some caveats and that's the sort of stuff we're going to get into. But before
we do that, let's define what
is event-driven, right? So the Wikipedia entry defines event-driven as a software and systems architecture paradigm promoting the production, detection, consumption of, and reaction to events.
So really what it means is that there are just three actions here. Something emits an event or a message. Oh, I added an extra slide, sorry about that. Something emits an event, something consumes an event, and something reacts to an event or a message, right? That's essentially all there is to it. And it always goes in that pattern. After the reaction, it might not emit an event; it might just simply continue consuming other events.
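As a tiny illustration of those three actions, here is a Go sketch. The Event type and channel stand in for a real schema and broker client; they are assumptions for the example, not anything from the talk.

```go
package main

import "fmt"

// Event is a hypothetical envelope; a real system would use a schema
// such as Protobuf and a broker client instead of an in-process channel.
type Event struct {
	ID     string
	Action string
}

func main() {
	bus := make(chan Event, 16)

	// Emit: something publishes an event onto the bus.
	bus <- Event{ID: "1", Action: "order_created"}
	close(bus)

	// Consume + react: something reads events and acts on them.
	for evt := range bus {
		fmt.Printf("reacting to %s (%s)\n", evt.Action, evt.ID)
	}
}
```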
Secondly, your event must be the source
of truth. This is an extremely important aspect of
event driven, is that you no longer want the entire
system to know, like all of your services to
know the entire state of the entire system. You want your
one particular service to only know its own state and it shouldn't care
about any of the states of anything else happening. As a matter of fact,
there shouldn't be a single service that knows the entire state of anything. It should
be just made up of single services that
only know about their own state and nobody else's. That is
an extremely important point. You of course want to communicate all
your events through a message bus of some sort,
like a broker: Kafka, Rabbit, MQTT, whatever it is. Another super
important point is that you want everything to be completely idempotent. And that sounds complicated, but it's really not. All it basically means is that your services are able to deal with situations where they might receive a duplicate event, or a couple of events that have come out of order.
And the way you deal with that is you simply keep track of the events
that you've taken care of already. And if it's the same event
coming through again, well, you just ignore it. And if
it's coming out of order, well, then you apply business logic to that.
Do you care that it's an older event that's appearing after a later event has already happened? Maybe you don't.
So you could just simply discard it. It's not terribly difficult
to pull it off, and it sounds much more
impressive, really, as written down, than it actually is.
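Here is a minimal Go sketch of that kind of idempotent consumer. The Event type and in-memory map are illustrative; a real service would persist the processed-ID set (for example in etcd, as discussed later) so it survives restarts.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical envelope with a unique ID.
type Event struct {
	ID   string
	Body string
}

type IdempotentConsumer struct {
	mu        sync.Mutex
	processed map[string]bool // IDs we have already handled
}

func NewIdempotentConsumer() *IdempotentConsumer {
	return &IdempotentConsumer{processed: make(map[string]bool)}
}

// Handle ignores events it has already seen and records new ones.
func (c *IdempotentConsumer) Handle(evt Event) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.processed[evt.ID] {
		fmt.Println("duplicate, ignoring:", evt.ID)
		return
	}
	// ...apply business logic here...
	c.processed[evt.ID] = true
	fmt.Println("processed:", evt.ID)
}

func main() {
	c := NewIdempotentConsumer()
	c.Handle(Event{ID: "42", Body: "create user"})
	c.Handle(Event{ID: "42", Body: "create user"}) // duplicate is dropped
}
```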
And ultimately, you must be okay with eventual consistency.
The fact here is that you're essentially getting really high availability and high reliability in exchange for this eventual consistency thing; you're basically exchanging one thing for another. In this case, you will
no longer be able to say that you have guaranteed consistency
because you do not. You will probably have 99.99%
consistency. However, there is no longer a
way for you to guarantee that. You can just simply say that right
now the system is mostly correct. But there's no
way to say that it's 100% correct all the time. You just know that it
will eventually become correct. So let's explore the components that actually make up an event-driven system. Number one, of course, there is an
event bus. I would 100% choose RabbitMQ for this.
RabbitMQ is extremely versatile when it
comes to the sort of things you can do with it, rather than having
to reinvent certain functionality, such as, let's say if
you took another message broker, let's say like Kafka,
you would probably have to basically build a lot of your
own stuff. For example, dead lettering.
That doesn't really exist in Kafka, so you would have to build it yourself.
Routing based on headers. That sort of a thing doesn't exist in
Kafka, but it definitely does exist in RabbitMQ. So utilizing something with such a versatile way to do routing, a system that has so many different features, is not going to pigeonhole you into designing a foundation that might be a little less than perfect.
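As a rough sketch of the kind of broker-side features being described, here is what declaring a headers exchange, a dead-lettered queue, and a header-routed publish might look like in Go with the rabbitmq/amqp091-go client. The exchange, queue, and header names are hypothetical.

```go
package main

import (
	"context"
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// A headers exchange routes on message headers instead of routing keys.
	if err := ch.ExchangeDeclare("events", "headers", true, false, false, false, nil); err != nil {
		log.Fatal(err)
	}

	// Queue with a dead-letter exchange: rejected messages get rerouted there
	// (the "events.dlx" exchange would need to be declared separately).
	if _, err := ch.QueueDeclare("billing", true, false, false, false, amqp.Table{
		"x-dead-letter-exchange": "events.dlx",
	}); err != nil {
		log.Fatal(err)
	}

	// Bind the queue so it only receives messages whose headers match.
	if err := ch.QueueBind("billing", "", "events", false, amqp.Table{
		"x-match": "all",
		"team":    "billing",
		"type":    "invoice_created",
	}); err != nil {
		log.Fatal(err)
	}

	// Publish with headers; the broker does the routing for us.
	err = ch.PublishWithContext(context.Background(), "events", "", false, false, amqp.Publishing{
		Headers: amqp.Table{"team": "billing", "type": "invoice_created"},
		Body:    []byte(`{"invoice_id": 123}`),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```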
Besides that, RabbitMQ is decently fast.
It's able to do upwards of around 20,000 messages a second,
which is not too bad. I think it's fine,
especially for something that serves as your internal event
bus. Essentially, if you needed more than that,
then you would probably have to go to something else. And RabbitMQ unfortunately is not distributed. You can pretty much only scale it vertically before you're going to have to go to sharding. So at that point in time you may want to look at some other tech, but for all intents and purposes, in most cases, folks that think they need Kafka probably don't need Kafka; they can probably go with Rabbit.
You're also going to want to have some sort of a config layer
or a caching layer, and that is basically going to serve the purpose for
each service. It's going to be like essentially a dumping
ground for each one of your services to be able to write
some sort of intermediate state to it, right? So as
a service, when we're talking about, for instance, idempotency, the services will want to record at some interval that, oh, these are the messages that I've already processed, so that in case the service gets restarted, it is going to be able to pick up right where it left off. For this purpose, etcd is fantastic. etcd is distributed.
It is really rock solid. When I say
highly latency resilient, what it really means is that you can stick it with 150 to 200 milliseconds of latency between links and etcd is going to survive without any issues whatsoever. It is decently fast. It says it can do about 20,000 messages a second. I have never seen an etcd doing 20,000 messages a second, but let's just say that's probably not what you should be shooting for anyway. You should probably be shooting for even under 1,000 a second. And the other thing is that etcd is used heavily by Kubernetes, so the chances of it going away are pretty much, well, next to nothing at this point.
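Here is a minimal Go sketch of using etcd as that checkpoint/dumping ground, with the official go.etcd.io/etcd/client/v3 client. The key layout and values are just an example convention.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Record the last event this service finished processing
	// (key layout "/services/<name>/last-processed" is illustrative).
	if _, err := cli.Put(ctx, "/services/billing/last-processed", "event-12345"); err != nil {
		log.Fatal(err)
	}

	// On restart, read the checkpoint and pick up where we left off.
	resp, err := cli.Get(ctx, "/services/billing/last-processed")
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("resume after %s\n", kv.Value)
	}
}
```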
You're also going to want to have someplace to store all of your events.
Everything that has ever happened on your RabbitMQ, you're going to want to put somewhere, and for that place there's really nothing much better than S3. If you have the ability to use something external, meaning you don't have to run it yourself like you would with MinIO or Ceph or something like that, then S3 is fantastic, because it's super cheap, it is fast, and it's plenty reliable. In some cases you might experience some hiccups with trying to write to S3, but overall, if you get around that sort of an issue, everything else after that is
going to be fine. And finally you're going to
want to actually fill up that event store with something.
And for that purpose you're going to need to build an event archiver of some sort. Now, if you're building it completely from scratch, you would probably do it in Go, or you could use some sort of Spark jobs, or you could glue some things together to just move all of the events off of RabbitMQ into S3.
That is essentially what comprises an event driven system.
Those are all the big components that are there. So you can kind of already tell what the big pieces of it really are. It's not the infrastructure; the infrastructure is not actually that terribly complex. It's really the organizational aspect of it, right. And it's kind of a paradigm shift, really, of how you think about things.
How exactly does this translate to improved reliability? Well, number one is you do not have to think about service outages anymore. And by that I mean, well, if a service goes down in the midst of dealing with a request, it has obviously not acknowledged that the message has been completed. All right, so even if it dropped in whatever state it was in before it actually completed doing the work, it's going to pick the message up when it comes back up. And you do not need to write any extra code for
that. That is just basically part of how all of the services in your
stack should be operating. They pick up messages, they react to them
and upon reaction, upon finishing working on the
message, they acknowledge them and they move on. Right? So basically,
yeah, service outages are a significantly smaller thing.
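A minimal Go sketch of that pick-up-where-you-left-off behavior with RabbitMQ: consume with manual acknowledgements and only ack once the work is done, so an unacked message is redelivered if the service dies mid-flight. The queue name and processing logic are placeholders.

```go
package main

import (
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// autoAck=false: the broker keeps the message until we explicitly ack it.
	// If this process dies mid-work, the message is redelivered on restart.
	msgs, err := ch.Consume("billing", "", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	for msg := range msgs {
		if err := process(msg.Body); err != nil {
			// Nack and requeue so another (or restarted) instance picks it up.
			msg.Nack(false, true)
			continue
		}
		// Only acknowledge once the work has actually finished.
		msg.Ack(false)
	}
}

func process(body []byte) error {
	log.Printf("handling event: %s", body)
	return nil
}
```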
The fact that you are no longer relying on any other services around you.
You are the master of the domain. There is a single service that
only cares about itself. That means that your failure domain is
really small and you can put all your efforts into making sure that that failure
domain is actually solid. Realistically,
really what we're talking about is one service and the two or three dependencies that it has, which is your event bus and etcd and whatever third or fourth dependency that you have. But that is definitely no longer just a service by itself. So when you're talking about thinking through your failure domain,
you no longer have to think about, well, what happens when service A or B or C goes down? And what if D becomes slow and so
on, because they're not really part of your failure domain anymore.
True service autonomy, that is one thing that was promised
to all of us when the microservice pattern emerged as like
a clear winner that, oh, everybody gets to work on their own stuff now.
Well, the fact of the matter is that that's not entirely true because,
yes, even though you own the code for your service,
you are still highly dependent on this other team that owns service B that you have a dependency on.
If they change their API, well, now you have to update all
your code as well and you're going to be delayed. So what we're talking about
here is that, again, because we do not care about anybody else but ourselves,
we are truly becoming autonomous now.
A well-defined development workflow. What this is referring to
is the fact that you are now going to have some sort of centralized
schema which is going to represent
everything that can happen on your system. Basically what we're talking about is
a single message envelope that is going to contain actions,
parameters, all kinds of stuff inside of it.
And as a result of doing that, you're centralizing on one way to communicate your interfaces for services. That means that you no longer have situations where one service is using Swagger, another service is using, I don't know, Insomnia or Postman or something like that. There is now one repo which has all your schemas that say, this is the sort of stuff that I expect to be in this message, and that's the end. And that is super, super wonderful.
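To make the envelope idea concrete, here is an illustrative Go struct showing roughly the shape such a single message envelope might take. In practice this would live as a Protobuf (or Avro) schema in that one shared repo; the field names here are assumptions, not Batch's actual schema.

```go
package events

import "time"

// Envelope is an illustrative shape for a single, centralized message
// envelope. In practice this would be defined once as a Protobuf (or Avro)
// schema, and every service would generate code from it.
type Envelope struct {
	ID         string            // unique event ID, used for idempotency checks
	Action     string            // e.g. "user_created", "invoice_paid"
	Source     string            // emitting service
	OccurredAt time.Time         // when the event happened
	Params     map[string]string // action-specific parameters
	Payload    []byte            // serialized domain object
}
```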
There is really an entirely separate talk that we could do just on Protobuf schemas alone, so we're just going to leave it at that. But the point is, Protobuf is not even really that hard to begin with, or Avro. None of those message encoding formats really are that complex to begin with. And then
finally we have dramatically lowered our attack surface.
The fact that we no longer need to talk between services means
that we are able to implement some
sort of rules, like a blanket rule set on our firewall,
to simply say like, no, we no longer need to
accept any inbound connections and the only outbounds that we allow are talking
to the event bus and talking to etcd or whatever else.
So you no longer have to basically punch firewall holes
all over the place just to be able to allow one service to talk to
another one. And that is amazing, absolutely amazing.
And security is going to be super happy about it. So one thing that's really important here is that folks probably want to see what it actually looks like to do something completely event-driven. Batch is 100% event-driven and we have about 19 or 20 services or so. And every single one of them is based off of a single Go template that we constantly keep up to date. You can go and check it out; it is public, just go at it. We use Kafka and RabbitMQ. RabbitMQ is our system bus, basically the bus where we communicate state and so on. And Kafka is used basically for high-throughput stuff.
Yeah, feel free to take a look at that. Now, you're probably thinking that this all sounds terribly complicated, and yeah, it's pretty complicated, it turns out. From a technical perspective, it's actually not entirely that complicated, depending on the expertise of your engineering team. However, the complicated part is really, I guess, the political aspect of it, trying to communicate all these changes and so on across engineering. That part is really complicated.
I also think it's really important for you to understand
your message bus inside and out. There is nothing worse than actually
designing a foundation that you think is really beautiful
and realizing some months later that this feature
that you built into your foundation that you handcrafted
and so on was actually something that was completely supported in
the message bus itself. And I say that only because I have
totally done that myself several times,
only to realize that I really should have just sat down and just gone through
the docs to know that like, oh, wait a second, I can just do
this automatically. The bus can take care of this for me and I
don't need to come up with some sort of a solution.
Another big part is you absolutely need to accept
that the event bus or the events really are the source of truth.
That is super important. It is really
one of the main, main points of event driven
and the event sourcing architectures.
You should also embrace eventual consistency, and same with idempotency. And you should just anticipate complex debugging: where debugging was actually fairly straightforward-ish if you had request tracing with HTTP, now debugging has gotten quite a bit more difficult. This space is still kind of greenfield. There aren't a whole lot of tools, so you should probably expect to have to build some of this stuff yourself. I figured it would
be probably helpful to maybe put down how
much time certain parts of this are going to take,
at least from the technical perspective. I have a couple more slides at the very end of the presentation which go into how much time the organizational aspect of it takes as well, but I'll leave that for later. So first things first, setting up the actual foundational infrastructure: I think it's the easiest part by far. It really shouldn't take even one week, maybe max two weeks or something like that, especially if you're using some third-party things from a vendor, such as S3, right? Or maybe for the event bus you're using some sort of platform as a service, such as Compose.io or something like that. Defining the
schemas might take a little bit longer. It also
varies highly based on how much expertise you have in regards to how your product works. Do you understand every single part of it, or do you understand only a portion of it? If so, that means that you're going to have to talk to more people and so on to make sure that it fits correctly.
And then when I'm talking about schema publishing and consumption,
all that I'm talking about really is CI for creating releases
for your schemas.
It can be anywhere from medium to hard. Really,
it entirely depends on how complicated your schemas
are. Are you using Protobuf? What are you actually using for the message encoding, all that sort of stuff? A really important thing is to provide an example service that is event-driven. You do not want to give your developers basically just a mandate that they should be doing event-driven. You want to provide something like libraries, probably wrappers, that sort of a thing, to say, this is how you do it, and you just plug and play, essentially, and you get everything. And then the last parts that
have an asterisk next to them, they are by far the hardest
parts of all of this. With that said, they are
also potentially not needed right away. You'll probably want them eventually,
but they're probably not needed right now.
The most important one by far is having some sort
of an event archiving solution. And that is basically something that
is going to consume all the events from Rabbit or Kafka and stick them into long-term storage, like S3. There is a little bit of complexity in that, in the fact that you probably need to group the events. You need to put them together, because you don't want to have a million files in one directory or whatever, one object prefix in S3. So you're going to want to group it, munge it, and compress it, that sort of a thing.
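Here is a rough Go sketch of such an archiver: consume from a queue, group a batch of events, gzip them into one object, upload to S3, and only then ack. The queue name, bucket name, and batch size are made-up example values.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
	amqp "github.com/rabbitmq/amqp091-go"
)

const batchSize = 1000 // how many events to group per S3 object (example value)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	msgs, err := ch.Consume("archive", "", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	uploader := s3manager.NewUploader(session.Must(session.NewSession()))

	var batch []amqp.Delivery
	for msg := range msgs {
		batch = append(batch, msg)
		if len(batch) < batchSize {
			continue
		}
		if err := flush(uploader, batch); err != nil {
			log.Println("flush failed, leaving messages unacked:", err)
			continue
		}
		// Only ack once the batch is safely in S3; multiple=true acks the whole batch.
		batch[len(batch)-1].Ack(true)
		batch = nil
	}
}

// flush gzips a batch of events into a single object and uploads it.
func flush(uploader *s3manager.Uploader, batch []amqp.Delivery) error {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	for _, msg := range batch {
		gz.Write(msg.Body)
		gz.Write([]byte("\n"))
	}
	gz.Close()

	key := fmt.Sprintf("events/%s.json.gz", time.Now().UTC().Format("2006/01/02/150405"))
	_, err := uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("my-event-archive"), // hypothetical bucket name
		Key:    aws.String(key),
		Body:   &buf,
	})
	return err
}
```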
But the rest of them, such as a replay mechanism or an event viewer, maybe you don't need right away; you will probably need them eventually.
Some quick tips in regards to when to put this down and what the best approach to all this is. So number one is that if this is brand new, you have a brand new platform, everything is new, and you know what you're doing, this is awesome. That is the prime time. It is absolutely amazing to implement an event-driven platform. It is fantastic. But you should really still only do it if you have a complete understanding of everything, how everything works, how you can imagine all the services are going to interact. You probably need to have a very good idea of all the flows that are going to be happening in your system, really.
This kind of goes without saying, but I'm just going to point it out anyway: this is largely based on your engineering capability.
If you have experience in this and you have experience dealing with
architecture and with design, you could
probably pull it off. If you do not, then you
might want to have some friends around who are going to be able
to put on their architecture hats together with you
and think through all of this. With that said, you almost certainly
want to do this with somebody else, even if they're
less experienced. Just to make sure that
you're not doing something totally silly and egregious.
One final thing: do not use CDC as your
primary source of truth or as your only source of truth.
You can totally use CDC or change data capture,
but use it as a helper, not as the primary way to
create events. Actually let your services create the events,
not your database. So in
most cases, though, you're probably going to have an existing org, and in that case, definitely move to event-driven gradually. Do not try to do it in one fell swoop. It never works. I have never seen it work. It is going to be a massive waste of time and there are going to be problems. Functionality will be missed. The timetables that you assign to it are going to be exceeded dramatically.
Just do it gradually, little by little.
Basically you can utilize CDC or change data capture
for this, where you are just going to expose some database changes little by little. Like, every single update that is happening in the DB, you're going to push out as an event and have some services rely only on that. The only caveat there is that you do not want to have a service half rely on CDC or directly on the database, and then have half of the same service also rely on events as well. Either it relies on one or it relies on the other.
Now, in regards to some more reality, really: where does SRE fit into all of this? I think that SRE in this particular case is by far the most important part of the conversation. They need to be involved in anything that deals with distributed system design, really. To a certain degree, the only folks that truly understand how things work at a platform level are SREs.
So if you are not involved in the conversation, you should be involved.
And if you are involved but you're not a lead, you should be a
lead. You should be leading the charge on all this sort of
stuff, right? Another thing is, you should know
that this is a totally greenfield area,
this event-driven stuff in general. Granted, you might see some stuff about React or something like that; that is event-driven as well. But in essence, when it comes to event
driven for systems, there's not a whole lot of stuff
out there. It's part of the reason why I wanted to start a company in
this space, because there's not a lot of stuff out there,
and I really wanted to build something that addresses some
of these issues. Another thing is, you need
to get comfortable wearing an architecture hat. You will wear it no
matter what. It's going to happen. And if you are being a
thought leader and you are providing documentation
and talks and so on and so on, like it or not, you will have
an architecture hat. It's just simply going to happen.
Another thing then is, I guess the final bit
to all these tips is that really, I would focus heavily
on documentation and really on written culture in general,
that everything should be written down. A lot of us now are completely remote anyway, so it makes even more sense to do it than before, but you absolutely must have
a written culture in place with an event driven system.
A lot of stuff feels a lot like magic when it just works properly.
But the thing is that when it doesn't work,
you will absolutely wish that you had some sort of flows, some sort
of flow diagrams, runbooks of how things are supposed to work,
maybe proposals for how a certain part of the
architecture is going to work, and so on. So written culture is super
important. So in essence,
in exchange for a pretty decent amount of complexity, you are going to
gain a lot, right. You are going to gain real autonomy
for both services and teams. You will be able to rebuild state,
you will have predictable failure scenarios,
your recovery time should improve massively because you will no longer have to hunt down various teams and so on. You will be able to sustain really long-lasting outages. Not that that's a terribly great thing, to have long-lasting outages, but you will be able to
sustain them. Security will totally thank you because
you have just added this massive improvement
on the entire platform and you have a really solid
and super well defined and very robust foundation.
It's not going to take very much explaining to say this is how it works, because there's basically only one thing here, which is: you emit an event, and some number of services are going to do something about that event and are going to update certain parts of it. That's it.
And then really the final part is that you will have a lifetime
of historical records. And that is super amazing because you now have
all the things in relation to any sort of certifications
that you need and so on. You essentially have an audit trail, and on top of that you have analytics for your future data science teams and so on. Just as a
quick side note, I think I already alluded to this, but Batch is 100% event-driven. So I'll just give you some quick stats as to what that means for us. Number one, we're a complete AWS shop. We are using EKS, we're using MSK, which is the managed Kafka, and we're using lots of EC2, plus whatever sort of random assortment of their other services, the basics like Route 53 for DNS and so on, and RDS in some cases. We have a
total of 19, I think somewhere around there, like 19 or 20, maybe a little bit more, Golang services. All of them are based off of that Go template that I mentioned. We are 100% event-driven, except of course for the front end part; the front end has to talk to a public API, and those are synchronous requests. But we have zero inter-service dependencies. Most services have only three dependencies, and that is Rabbit, etcd, and Kafka. And that is it.
That means that we have an extremely
locked down network. I'm skipping a few points there, but that
is the result of that. We do not have a service mesh,
we have no service discovery. We do not need any of
it because we don't really care about accessing those
services, and those services don't need to talk to each other. We don't care about
having to have mTLS between every single service and so on. The only thing that we have is mTLS between the service itself and Kafka, and, you know, Rabbit and etcd, and that's about it. And what this means, of course, is that instead of triggering behavior via curl or Postman, we are actually triggering behavior by emitting an event onto the event bus. And we use a tool called plumber that we developed. It's also open source; you can see it under the batchcorp GitHub org.
It's a slightly different paradigm, right? Instead of just doing curl for an HTTP request, well, now we do plumber, but it's essentially the same concept. All we're emitting is essentially a message that is a Protobuf message,
right? I already mentioned, yeah, that our
network is massively locked down. We're barely able to talk to anything; the services are only able to talk to very specific other IPs on the network. And then the stats, really, are that our average event size is about 4 KB. We have about 15 million system events, and that takes up about 100 gigs of storage in S3, which really translates to a couple of bucks or so a month that you need to pay for S3. In the grand scheme of things, it's absolutely nothing.
So while it may seem that you are super amazing now at event-driven, you're probably not quite there yet. I'm not there yet either, but I have a decent-ish idea. To get there you should probably spend a little bit of
time doing a little bit more research and reading on this topic.
And you should totally start off with Martin Fowler's event-driven article. It basically just talks about the fact that, well, event-driven actually contains multiple different architectures, so you should be aware of what you're talking about when you say event-driven in the first place. It's just a good overview, really, of event-driven in general. Then there's the event sourcing architecture pattern, which is basically what we discussed here; when you're using replays, that's essentially what you're implementing. It's all under the same event-driven umbrella, really. And CQRS, same thing. It's another architecture pattern that fits really well within the event-driven umbrella.
Now, in regards to idempotency, everyone should probably read about what an idempotent consumer is. The microservices.io site does it in very few sentences, much better than I could, to explain what actual idempotency is. Now there is one article that I
found, which is written by Ben Morris and it is super awesome.
There is one particular
quote that I really liked, which was that bear in mind that if you rely
on message ordering, then you are effectively coupling your applications together
in a temporal or time based sense. And if
you read between the lines, really what it's saying is that, well, you probably don't want to do it. You're creating potential problems for yourself. So in other words, do not rely on ordering. Create your system so that order doesn't really matter. And that's where idempotency comes into play. And by the same token, exactly-once delivery, that's another topic that can be discussed in detail for a very long time. But exactly-once delivery is very difficult and potentially snake oil.
There, I just said it. Yes,
Kafka does talk about it and yes, technically it is
doable, but there are many,
many caveats to it. So it's simply much better to
design your system so that your services are
able to deal with duplicate events in the first place.
Okay. And that is basically it.
I hope I was able to show you something new. And if this is the sort of stuff that is interesting to you, you should totally come and talk to me. If you want to work with stuff like this, you should also come
and talk to me. This whole
space is super interesting and it is still
fairly fresh. There aren't a lot of people doing it.
And in general, high throughput observability is
not really a thing yet. And we are trying to make it
happen. So if you are thinking about going event-driven, or if you have evaluated event-driven and you are not interested in it or you're afraid of it, come talk to me. I would love to nerd out about this sort of stuff in any case. All right, that is all
I have. Thank you very much and goodbye.