Transcript
Hello, I'm Mariah Peterson, and I'm really excited to speak to you today at this Site Reliability Engineering session of Conf42. I am currently a member of technical staff at Telscale, and we will be talking today about data reliability engineering and why it's crucial to your data organization. So we'll start with: what is data reliability engineering?
Just as Google's published SRE practices have been implemented and re-implemented across various software engineering organizations, data reliability engineering brings those same practices to data organizations, data engineering, and your data services. Data reliability engineers implement these practices with a focus on decreasing downtime and improving the client experience, or the experience of data users and data practitioners.
So the main topics are determining whether your data is reliable, decreasing your data downtime, utilizing data-level metrics, and creating a platform for data observability.
Reliable data is a little bit different from what you would call a reliable web service. There are many aspects to reliable data. They include accuracy, freshness, missing data, and duplication, but they also include the latency at which data is available. Whether you can access data depends on the databases, the data stores, and the data gateways, and you want to be able to maintain a first class, four-nines or five-nines experience across all of these aspects of your data.
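To make those aspects a bit more concrete, here is a minimal sketch of what a few such checks might look like in Python; the record fields and the 24-hour threshold are illustrative assumptions, not anything prescribed in this talk.

    import datetime

    # Illustrative checks over a batch of records pulled from some data store.
    # Field names like "order_id" and the thresholds are made up for this sketch.

    def is_fresh(latest_update, max_age=datetime.timedelta(hours=24)):
        # Freshness: the newest record is no older than the allowed age.
        return datetime.datetime.utcnow() - latest_update <= max_age

    def has_no_duplicates(records, key="order_id"):
        # Duplication: every record carries a unique business key.
        keys = [r[key] for r in records]
        return len(keys) == len(set(keys))

    def has_no_missing_fields(records, required=("order_id", "amount", "created_at")):
        # Missing data: every required field is present and non-null.
        return all(r.get(field) is not None for r in records for field in required)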
Additionally, when you're building systems for reliable data, you have to understand that systems are not perfect. That's why we're not aiming for 100% reliability. We do want to create budgets that allow for failure, so that by leveraging those budgets we have a first class team that understands what a data incident is and can respond quickly, right? That gives us a cushion for intervention. We can respond quickly and efficiently, so that customers don't notice large outages, so that we have minimized our downtime, and so that we have created an experience that says this data is reliable and we are willing to create that first class experience.
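As a back-of-the-envelope illustration of that cushion, an error budget can be derived directly from an availability target; the 99.9% figure below is just an example, not a number from the talk.

    # Turning an example availability target into an error budget for a 30-day window.
    slo_target = 0.999                 # e.g. 99.9% of the time the data meets expectations
    window_minutes = 30 * 24 * 60      # 30 days expressed in minutes

    error_budget_minutes = (1 - slo_target) * window_minutes
    print(f"Budgeted data downtime this window: {error_budget_minutes:.0f} minutes")
    # Roughly 43 minutes of data downtime can be absorbed before the target is at risk,
    # which is the cushion the team spends on incidents and manual intervention.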
Data downtime, like I said, can have various parts to it. Barr Moses, the CEO of Monte Carlo, which is a data observability platform, has long championed a definition where it's not just your data being unavailable: if your data is erroneous, if it's only part of the picture, if it's inaccurate in any way, that is data downtime. And that's what we're trying to minimize in an iterative way, leveraging signals and metrics, our error budgets, and our engineering time to make our data as reliable, as close to the customer's expectation, the practitioner's expectation, as possible. There are things we
can do, just like with service level metrics.
You can put metrics at the data level to capture the stability of your data pipelines, your data stores, your data gateways, or other data services. You can create a picture of whether your queries on data stores are taking too long, whether you're taking too long to train a model, or whether your data gateway has unexpected latency or errors or is returning an inappropriate error. You can put all of this information together to create a picture and explain visually, through metrics and other metadata, whether your data is reliable, and, through anomalies, create alerts that allow us to investigate them.
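One possible way to capture those data-level metrics is sketched below using the Prometheus Python client; the metric names and the query wrapper are assumptions for illustration, not part of any specific platform mentioned here.

    import time
    from prometheus_client import Counter, Histogram, start_http_server

    # Hypothetical data-level metrics for a query path against a data store.
    QUERY_LATENCY = Histogram("warehouse_query_seconds", "Time spent running warehouse queries")
    QUERY_ERRORS = Counter("warehouse_query_errors_total", "Warehouse queries that raised an error")

    def timed_query(run_query, sql):
        # Run a query while recording latency and errors for later dashboards and alerts.
        start = time.time()
        try:
            return run_query(sql)
        except Exception:
            QUERY_ERRORS.inc()
            raise
        finally:
            QUERY_LATENCY.observe(time.time() - start)

    # Expose the metrics so a collector can surface latency, traffic, and error anomalies.
    start_http_server(8000)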
Like I said, we can investigate latency, see which are our most heavily trafficked services, and see which services are experiencing more or less traffic than we expect. We can see whether error messages are being thrown when they shouldn't be, or servers are returning 400 errors or 500 errors on API gateways, or maybe there is saturation on your gateway or on your database or on another data store that you don't expect. These four things are part of the golden signals from the SRE handbook. Observability is one step beyond those metrics. We take these metrics and create SLAs, SLIs, and SLOs to make sure that our data services are performing to what your data practitioners need.
That way, as things come up, right, and our metrics are reporting something unexpected, through those SLOs and SLIs we can address it: either immediately, depending on the criticality of that error, or by filing it as a bug, or by handing it to a support engineer or a customer reliability engineer. And using procedures, we can minimize this downtime and drive down the time to resolution depending on severity. For example,
say we have a data pipeline that is taking CRM data in, training a model on it, creating a couple of dashboards, deploying that model, and then releasing those dashboards to sales staff. Right? The model is used to make predictive leads, and the dashboards are then used by sales leaders to motivate and encourage salespeople and to help with marketing. There's a variety of things that can be done with those. So we create an SLA for this CRM pipeline: the data behind the model and the dashboards cannot be any older than 24 hours. We take that and put it into an SLA: a revenue dashboard with data no more than one day old.
We take that and translate it into an SLO, right, an objective for the data service or pipeline that is creating that dashboard. So we know that we want this pipeline to extract data from the CRM and complete its transforms, training, and analysis at least once a day. And if something happens, we want it to give us an error within a reasonable time, right? So that we can manually intervene and our SLA is never broken. If something comes up and the automation fails, this allows us to keep that customer or practitioner expectation without compromise, with some kind of a fallback plan, while maintaining that reliability for our systems.
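One lightweight way to write that agreement down, sketched here with made-up names and numbers, is a declarative description that both the pipeline and its alerting can read.

    # A hypothetical, declarative statement of the agreement from the CRM example.
    crm_revenue_dashboard = {
        "sla": {
            # Promise to the sales stakeholders: dashboard data is never more than 24 hours old.
            "max_data_age_hours": 24,
        },
        "slo": {
            # Objective for the pipeline behind it: extract, transform, train, and publish
            # at least once a day, and raise an error early enough for manual intervention.
            "min_successful_runs_per_day": 1,
            "alert_after_hours_without_success": 20,
        },
    }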
So then we have SLIs, right? These are the indicators that alert us if we need to get data reliability engineering involved. These can be errors if something happens in the pipeline, right? Maybe our model doesn't complete training, maybe one of the transformations doesn't happen, maybe we can't connect to the CRM. Maybe we time out and we are not able to write to our data store; that timeout could be on the CRM or on the training. And these create alerts, right? These SLIs, if they go above your threshold, create alerts, and they let the practitioner know, oh, there needs to be a manual intervention to maintain that reliability standard that we have with our data practitioners or any of the stakeholders at the other end of our observability pipelines.
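A freshness SLI like that could be checked with something as small as the sketch below; the warehouse lookup and the alerting hook are placeholders standing in for whatever the real integrations are.

    import datetime

    def check_dashboard_freshness(get_latest_load_time, alert, max_age_hours=24):
        # SLI check: page a human if the CRM data behind the dashboard has gone stale.
        # get_latest_load_time and alert are placeholders for the real warehouse client
        # and paging integration.
        latest = get_latest_load_time()              # e.g. max(loaded_at) from the warehouse
        age = datetime.datetime.utcnow() - latest
        if age > datetime.timedelta(hours=max_age_hours):
            alert(f"CRM dashboard data is {age} old; manual intervention is needed "
                  f"to keep the 24-hour freshness SLA.")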
So great, data reliability makes sense. It's transposing your normal reliability practices onto data systems. Why do we need it? Right, great, the CRM stuff works, but what about more? That example isn't as critical as some of the other systems that rely on data. One of my favorite books, Designing Data-Intensive Applications, describes the types of data projects that all need data reliability infused into them.
The first one is your software data stack. Right. So the three are, and we'll get to each of them, your software data stack, your enterprise data stack, and of course your modern data stack. Let's start with the software data stack. What is it? It is the data store for any customer-facing application, right. This could be a database behind a microservice, this could be an in-memory cache, this could be cookies stored in a browser or other kinds of application information, but it is any kind of data that interfaces directly with your software service. Typically, who manages the reliability of these data stores? This usually falls within the SRE spectrum. Site reliability engineers have the tools; they often come from operations backgrounds, they understand how to spin up databases, they have hosted databases, and it doesn't usually escalate to the point where you need some kind of specialized knowledge for this kind of reliability interface.
The enterprise data stack is quite a bit different from that software data stack. First things first, it is your infrastructure or data platforms for enterprise data services. These are your large scale distributed databases, your large data warehouses that are used for reporting and BI, your SAP databases, Oracle databases, any kind of large data infrastructure that has grown into what large scale enterprise services need. Who manages their reliability? Reliability here requires a special set of data skills, and it is typically handled by your DBA or a data reliability engineer: somebody who has that specialized database administration knowledge, can do sharding and query optimization, and has managed many flavors of database in SQL or other custom database languages, but who would still be the first line of defense if an error shows up in your metrics or a reliability issue comes up.
Now, what is the modern data stack? I mean, what's left? We've talked about software, we've talked about your large scale data. What's left? These are any kind of pipeline and analytics used for machine learning. These are your ETL transformations that are used when you're extracting data for data scientists, basically anything that can be used for data, right? Creating data gateways, creating a data mesh, creating custom machine learning pipelines on top of your data warehouse, perhaps where you're taking that data another step: extracting it from the warehouse and maintaining ETL pipelines that perform platform-crucial transformations, right. Oftentimes, it's very common to see information from your analytics data warehouse being used in software applications. You're seeing machine learning models being shipped to production as part of a software product that is being maintained by back end engineers, and they need somebody with more data expertise to step in and help maintain the reliability of those pipelines. Or gateways, like I've mentioned, with something like your data mesh, where you have data gateways that protect golden data sets but allow your data to be accessed by a variety of teams of consumers, stakeholders, or other kinds of data practitioners.
And this is very different. This is not a data warehouse maintenance load. This is not a site reliability maintenance load. Who would manage this data and make sure it stays reliable? This is what your data reliability engineer is trained to do. They have the skill set to understand what ingress and egress are, to understand ETL transformations in both streaming and batch, and to understand storage varieties, from databases to large file stores, in-memory stores, and distributed stores, plus basic data governance and data analysis. They're not going to be your DBA who might be resharding databases, but they'll have a variety of skills that prepare them for handling data in motion as opposed to data at rest.
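As a rough picture of what that data-in-motion work looks like compared to administering a database at rest, here is a tiny sketch of a streaming-style transform that keeps track of malformed records as it goes; the record shape and the callback are invented for illustration.

    # A tiny sketch of a data-in-motion transform: process records as they stream
    # through, and count malformed ones so reliability can be tracked along the way.
    def transform_stream(records, on_bad_record):
        malformed = 0
        for record in records:
            try:
                # Hypothetical transformation step: normalize an amount field.
                yield {**record, "amount": float(record["amount"])}
            except (KeyError, ValueError):
                malformed += 1
                on_bad_record(record, malformed)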
They're going to focus on automating pipelines,
automating services. They're going to be doing monitoring and observability
on these data in motion systems and services, and they're going to understand
modern architectures that are part
of these systems, so that as they work with data engineers,
they can create these contracts
and SLAs that the practitioners
need to maintain a quality of service.
So that is the data reliability engineer, and you understand the importance of the role. I was on a team where data reliability became an essential part of the work and of the orchestration we were doing to provide our data practitioners with the data they needed to perform daily software tasks. It was very much fulfilling a contract: keeping your data fresh, not letting your data age out, keeping it viable and non-erroneous. And it can quickly go from something trivial to something that is very much a requirement for software services to run successfully.
I think it is applicable to all data-centric services. Any service that relies on data to function needs to have some kind of data reliability in place as data is democratized and used across the stack. We see this with movements like data mesh, where more and more teams need data and they need a variety of data sets. And it's not just your analytics teams; it is software teams, it is operations teams, it is finance teams. And as this data is democratized across the organization, reliability in that data becomes more crucial to every step of the process, so that nobody's getting calls because dashboards are outdated. Instead, you can know in advance and have preemptive steps in place, like the SRE practices and handbook prescribe.
And a data reliability engineer is specialized and different from any other reliability engineer because they work with data stacks and with data engineers, and they understand the data ecosystem, which is very different from a lot of software ecosystems. There are different kinds of optimization, design, and architecture choices that need to be made.
I want to thank you guys for coming to
my talk. I hope you enjoyed it and that you take away a desire to bring some new data reliability practices into your organization.
If you have any questions about data
reliability, where to get started,
what to do, how to do it, please reach out to me. I'm available
on Twitter and on LinkedIn, and I do weekly Twitch streams. I'm happy to talk about this, brainstorm, and go further. I want to thank Conf42 again for allowing me to speak on this topic at this Site Reliability Engineering conference, and I
want to thank any sponsors and most importantly
the attendees. And I hope to see you guys again next time.