SLOs for Event-Based Systems: Navigating the Triad of Availability, Freshness, and Correctness
Abstract
In the dynamic landscape of event-based systems, ensuring optimal performance is crucial for delivering a seamless user experience. This talk dives into the fascinating world of Service Level Objectives (SLOs) and explores their application in the context of event-driven architectures.
Summary
- Event-driven systems offer flexibility, scalability, and a real-time focus. Events can power everything from user interactions in web applications to complex data processing in IoT or financial systems. To achieve true reliability in event-driven systems, let's concentrate on three fundamentals: availability, freshness, and correctness.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone and welcome to Conf42 Site Reliability
Engineering 2024. My name is Ricardo Castro and today
we're going to talk about SLOs for event-driven architectures.
In this talk, we'll examine an approach for monitoring and maintaining
reliability of event-driven systems.
We will focus on three core pillars: availability,
freshness, and correctness, and we will learn how they impact
our complex distributed systems. Let's dive in.
Event-driven architectures are transforming the way we build distributed
systems. They offer flexibility,
scalability, and a real-time focus.
Events can power everything from user interactions in web applications
to complex data processing in IoT or financial systems.
Understanding how to guarantee their reliability is key to
success. To achieve true
reliability in event-driven systems, let's concentrate on
three fundamentals. Freshness assures
timely data; availability ensures services
stay responsive; correctness prevents flawed decision-making
by maintaining data integrity.
These aren't just buzzwords; these are measurable metrics
that drive user experience. Imagine a
stock trading system where price updates are delayed, or
a self-driving car getting an outdated sensor reading.
Freshness is paramount in scenarios where the timeliness
of data directly impacts outcomes.
Freshness is measured as the time from an event's
creation until it is consumed. Delayed or stale
data leads to poor decision-making and user frustration.
Real time is a relative concept depending on the use case.
Let's look at a simple example. We have
two services that communicate over events.
In this context, freshness represents how long
an event took to get from service one to service
two. It can be implemented and self-reported,
for example at the service level, by using a
histogram. In a more complex scenario,
one or even more events can trigger multiple other events.
Although freshness can be measured in a similar way,
meaning self-reported at the service level by using a histogram,
it needs to be done at different places.
It will of course have different thresholds of acceptance
depending on the context and on where it is measured.
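As a sketch, a consumer could self-report freshness like this, assuming each event carries its creation timestamp. The bucket boundaries and field names are illustrative; in practice you would likely use a metrics library (for example, a Prometheus client histogram) rather than a hand-rolled dict:

```python
import time

# Illustrative freshness measurement on the consumer side. We assume
# each event carries the producer's creation timestamp ("created_at").
# Bucket boundaries (seconds) are example thresholds, not a standard.
BUCKETS = [0.1, 0.5, 1.0, 5.0, float("inf")]

histogram = {bound: 0 for bound in BUCKETS}

def record_freshness(event, now=None):
    """Record how long the event took from creation to consumption."""
    now = time.time() if now is None else now
    age = now - event["created_at"]
    for bound in BUCKETS:
        if age <= bound:
            histogram[bound] += 1
            break
    return age

# Example: an event created 0.3 seconds ago falls into the 0.5s bucket.
age = record_freshness({"created_at": 100.0}, now=100.3)
```

The same recording logic can be deployed at each hop of a more complex flow, with different bucket boundaries per context.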
In event-based systems, we have to move beyond the simple notion
of uptime. Availability is not just about
whether a system is online or offline; it centers on
whether core functionality is accessible within acceptable time
frames. Components may be technically
running, yet users might not get an answer.
Partial outages, where a consumer fails and event processing
slows down, can have real consequences. Our focus
must be on ensuring critical event flows remain operational,
even under stressors like component failures or load
spikes. Focusing again on our simple scenario,
availability means that an event triggered in service one actually
arrived at service two. For a simple system like this,
it should be fairly straightforward to check that the event arrived at
service two. But for a more complex scenario,
like the one we're seeing here, availability might not be as simple.
One or multiple events can trigger a cascade of
many events. Availability needs to be measured at
different points within the system. For a complex system,
it might not be feasible to do it online.
A way to achieve this could be to leverage synthetic monitoring.
Think of it as using a robot to continuously test your
system like a real event would. Simulated checks
run at regular intervals, allowing you to find issues
before they affect your users. Synthetic monitoring
provides controlled, predictable tests, while real user
monitoring tracks actual user behavior, which can be messier.
These approaches work best when coupled together.
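A minimal sketch of such a synthetic probe, with an in-memory queue standing in for the real broker; all names here are illustrative, and in production the publish and await calls would target your actual messaging system:

```python
import queue
import time
import uuid

def synthetic_availability_check(publish, await_arrival, timeout=5.0):
    """Publish a uniquely tagged test event and verify it arrives."""
    probe_id = str(uuid.uuid4())
    start = time.monotonic()
    publish({"type": "synthetic-probe", "id": probe_id})
    arrived = await_arrival(probe_id, timeout)
    latency = time.monotonic() - start
    return arrived, latency

# Demo wiring: service two's inbox simulated as an in-memory queue.
inbox = queue.Queue()

def publish(event):
    inbox.put(event)  # real code would call the broker's producer API

def await_arrival(probe_id, timeout):
    """Poll the consumer side until the probe shows up or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            event = inbox.get(timeout=0.1)
        except queue.Empty:
            continue
        if event.get("id") == probe_id:
            return True
    return False

ok, latency = synthetic_availability_check(publish, await_arrival)
```

Run at regular intervals, the `arrived` flag and the measured latency feed directly into availability and freshness metrics.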
Have you ever heard the saying garbage in,
garbage out? It rings especially true for
event-driven systems. Event payloads must
be valid, accurate, and align with the expected structure.
Incorrect data can propagate through a system undetected:
a bad sensor reading, an invalid transaction.
This can ripple through the system, leading to inaccurate reports
or, worse, irreversible actions.
Validation is crucial, yet it must be
balanced with performance. Coming back to our simple
scenario, correctness means that an event triggered in service
one has to arrive at service two with the right format.
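As a sketch, a consumer-side format check for our simple scenario might look like this. The schema and field names are hypothetical; real systems often rely on JSON Schema, Avro, or Protobuf for this instead of hand-written checks:

```python
# Hypothetical expected structure for events arriving at service two.
EXPECTED_FIELDS = {"order_id": str, "amount": float, "currency": str}

def is_correct(event):
    """Return True if the event matches the expected structure."""
    if not isinstance(event, dict):
        return False
    for field, field_type in EXPECTED_FIELDS.items():
        if not isinstance(event.get(field), field_type):
            return False
    return True

good = is_correct({"order_id": "A-1", "amount": 9.99, "currency": "EUR"})
bad = is_correct({"order_id": "A-1", "amount": "9.99"})  # wrong type, missing field
```

Rejecting or quarantining events that fail this check stops bad data at the boundary instead of letting it ripple through the system.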
Again, just as with availability checking,
this should be straightforward for such a simple scenario.
But in our more complex scenario, measuring correctness
is again not trivial. Again,
one or multiple events can trigger a cascade of
potentially different events. They are correlated,
but they are not the same event. Again,
correctness needs to be measured at different points,
and it might not be feasible to do it online.
Synthetic monitoring is again a very good option to achieve this.
Synthetic tests don't just check if a system responds,
they can examine its output.
These tests can be designed to send specific event data
and assert that the expected outcome occurs.
This might mean checking if calculations are correct or if a database
update is actually performed correctly. They can help uncover
incorrect responses, unexpected data transformations,
or flawed logic in your event flows. This is proactive
error prevention through correctness checks.
Correctness checks can add some overhead,
so it's essential to strike a balance between rigorous testing and
system speed requirements. Synthetic monitoring
for correctness can help verify that event-based systems
adhere to business rules and maintain data consistency.
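A sketch of such an outcome-asserting synthetic test; `process_order` is a hypothetical stand-in for a real event handler, and the 10% tax is an invented business rule used only to illustrate the idea:

```python
def process_order(event):
    """Hypothetical handler: computes the total including a 10% tax."""
    return {"order_id": event["order_id"],
            "total": round(event["amount"] * 1.10, 2)}

def synthetic_correctness_check():
    """Feed a known event through the pipeline and assert on the outcome."""
    probe = {"order_id": "synthetic-1", "amount": 100.0}
    result = process_order(probe)
    # Assert the business rule, not merely that an answer came back.
    return result["total"] == 110.0

ok = synthetic_correctness_check()
```

The key difference from a plain availability probe is the final assertion: the test fails if the system responds with a wrong value, not only if it fails to respond.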
Don't think of SLOs as merely setting targets.
This is a journey of continuous improvement for your event-based
systems. It begins with identifying the most
impactful metrics for availability, freshness, and correctness.
You will refine these over time to ensure that they always align with
real user experience and business goals.
Remember, strong SLOs are the result of a close dialogue between
technical teams and those who understand the overall system goals.
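As an illustration of turning these metrics into an SLO, here is a sketch computing a freshness SLI over a window of events; the threshold and target are example values that each team would set together with its stakeholders:

```python
FRESHNESS_THRESHOLD_S = 1.0   # example: events should be consumed within 1s
SLO_TARGET = 0.99             # example: 99% of events should meet it

def freshness_sli(event_ages):
    """event_ages: seconds from creation to consumption, one per event."""
    if not event_ages:
        return 1.0
    fresh = sum(1 for age in event_ages if age <= FRESHNESS_THRESHOLD_S)
    return fresh / len(event_ages)

ages = [0.2, 0.4, 1.5, 0.3]   # one slow event out of four
sli = freshness_sli(ages)
meets_slo = sli >= SLO_TARGET
```

The same shape of computation applies to availability (probes that arrived) and correctness (probes that produced the right outcome); only the good-event predicate changes.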
And this is all for my part in this talk. We explored a
high level overview of how we can define reliability for event
driven architectures. In subsequent talks,
I will explore in much more detail how this can actually be
implemented and measured. Thank you so much for attending my talk
and have a great conference.