Conf42 Site Reliability Engineering (SRE) 2024 - Online

SLOs for Event-Based Systems: Navigating the Triad of Availability, Freshness, and Correctness

Abstract

In the dynamic landscape of event-based systems, ensuring optimal performance is crucial for delivering a seamless user experience. This talk dives into the fascinating world of Service Level Objectives (SLOs) and explores their application in the context of event-driven architectures.

Summary

  • Event-driven systems offer flexibility, scalability, and a real-time focus. Events can power everything from user interactions in web applications to complex data processing in IoT or financial systems. To achieve true reliability in event-driven systems, we concentrate on three fundamental pillars: availability, freshness, and correctness.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, and welcome to Conf42 Site Reliability Engineering 2024. My name is Ricardo Castro, and today we're going to talk about SLOs for event-driven architectures. In this talk, we'll examine an approach for monitoring and maintaining the reliability of event-driven systems. We will focus on three core pillars, availability, freshness, and correctness, and we will learn how they impact our complex distributed systems. Let's dive in.

Event-driven architectures are transforming the way we build distributed systems. They offer flexibility, scalability, and a real-time focus. Events can power everything from user interactions in web applications to complex data processing in IoT or financial systems. Understanding how to guarantee their reliability is key to success. To achieve true reliability in event-driven systems, let's concentrate on three fundamentals: freshness assures timely data, availability ensures services stay responsive, and correctness prevents flawed decision-making by maintaining data integrity. These aren't just buzzwords; they are measurable metrics that drive user experience.

Imagine a stock trading system where price updates are delayed, or a self-driving car getting an outdated sensor reading. Freshness is paramount in scenarios where the timeliness of data directly impacts outcomes. Freshness is measured as the time from an event's creation until it is consumed. Delayed or stale data leads to poor decision-making and user frustration. Real time is a relative concept that depends on the use case.

Let's look at a simple example: two services that communicate over events. In this context, freshness represents how long an event took to get from service one to service two. It can be implemented and self-reported, for example at the service level by using a histogram. In a more complex scenario, one or more events can trigger multiple other events. Although freshness can be measured in a similar way, meaning self-reported at the service level by using a histogram, it needs to be done at different places. It will, of course, have different acceptance thresholds depending on the context and on where it is measured.

In event-based systems, we have to move beyond the simple notion of uptime. Availability is not just about whether a system is online or offline; it centers on whether core functionality is accessible within acceptable time frames. Components may be technically running, yet users may not get an answer. Partial outages, a failed consumer, or slowed event processing can have real consequences. Our focus must be on ensuring critical event flows remain operational, even under stressors like component failures or load spikes.

Focusing again on our simple scenario, availability means that an event triggered in service one actually arrives at service two. For a simple system like this, it should be fairly straightforward to check that the event arrived at service two. But for a more complex scenario like the one we're seeing here, availability might not be as simple. One or multiple events can trigger a cascade of many events. Availability needs to be measured at different points within the system, and for a complex system it might not be feasible to do it online. A way to achieve this could be to leverage synthetic monitoring. Think of it as using a robot to continuously test your system like a real event would. Simulated checks run at regular intervals, allowing you to find issues before they affect your users.
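To make this concrete, here is a minimal sketch of what such a synthetic probe could look like in Python. It publishes a test event, waits for it to show up downstream, and records the end-to-end latency in a Prometheus histogram, covering both availability and freshness. The bus client and the topic names (orders.incoming, orders.processed) are hypothetical placeholders; the pattern, not the specific API, is what matters.

# Hypothetical synthetic probe: publishes a test event and verifies it can be
# consumed downstream within an acceptable window. The "bus" object and the
# topic names below are placeholders, not a real library.
import time
import uuid

from prometheus_client import Counter, Histogram

PROBE_RESULTS = Counter(
    "synthetic_probe_total", "Synthetic probe outcomes", ["outcome"]
)
EVENT_FRESHNESS_SECONDS = Histogram(
    "synthetic_event_freshness_seconds",
    "Time from probe event publication to downstream consumption",
    buckets=(0.1, 0.5, 1, 2, 5, 10, 30),
)


def run_probe(bus, timeout_seconds=30):
    """Publish one probe event and wait for it to appear downstream."""
    probe_id = str(uuid.uuid4())
    published_at = time.time()

    # Availability input: inject a synthetic event into the entry topic.
    bus.publish("orders.incoming", {"probe_id": probe_id, "ts": published_at})

    # Availability output: poll the downstream topic for the matching event.
    deadline = published_at + timeout_seconds
    while time.time() < deadline:
        event = bus.poll("orders.processed", timeout=1.0)
        if event and event.get("probe_id") == probe_id:
            # Freshness: record end-to-end latency, self-reported by the
            # probe rather than by the services themselves.
            EVENT_FRESHNESS_SECONDS.observe(time.time() - published_at)
            PROBE_RESULTS.labels(outcome="success").inc()
            return True

    PROBE_RESULTS.labels(outcome="missing").inc()
    return False

A probe like this would typically be scheduled at a regular interval, and the resulting counter and histogram become the raw material for availability and freshness SLOs.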
Synthetic monitoring provides controlled, predictable tests, while real-user monitoring tracks actual user behavior, which can be messier. These approaches work best when coupled together.

Have you ever heard the saying garbage in, garbage out? It rings especially true for event-driven systems. Event payloads must be valid, accurate, and aligned with the expected structure. Incorrect data can propagate through a system undetected: a bad sensor reading or an invalid transaction can ripple through the system, leading to inaccurate reports or, worse, irreversible actions. Validation is crucial, yet it must be balanced with performance.

Coming back to our simple scenario, correctness means that an event triggered in service one has to arrive at service two with the right format. Just as with availability checking, this should be straightforward for such a simple scenario. But in our more complex scenario, measuring correctness is again not trivial. One or multiple events can trigger a cascade of potentially different events; they are correlated, but they are not the same event. Correctness needs to be measured at different points, and it might not be feasible to do it online. Synthetic monitoring is again a very good option to achieve this.

Synthetic tests don't just check whether a system responds; they can examine its output. These tests can be designed to send specific event data and assert that the expected outcome occurs. This might mean checking whether calculations are correct or whether a database update is actually performed correctly. They can help uncover incorrect responses, unexpected data transformations, or flawed logic in your event flows. This is proactive error prevention. Correctness checks can sometimes add overhead, so it's essential to strike a balance between rigorous testing and system speed requirements. Synthetic monitoring for correctness can help verify that event-based systems adhere to business rules and maintain data consistency.

Don't think of SLOs as merely setting targets. This is a journey of continuous improvement for your event-based systems. It begins with identifying the most impactful metrics for availability, freshness, and correctness. You will refine these over time to ensure that they always align with real user experience and business goals. Remember, strong SLOs are the result of a close dialogue between technical teams and those who understand the overall system goals.

And this is all for my part. In this talk, we explored a high-level overview of how we can define reliability for event-driven architectures. In subsequent talks, I will explore in much more detail how this can actually be implemented and measured. Thank you so much for attending my talk, and have a great conference.
...

Ricardo Castro

Principal Engineer, SRE @ FanDuel



