Transcript
This transcript was autogenerated. To make changes, submit a PR.
I'm Jorge, Platform Engineer and Consultant at Petalim.
Over the next 30 minutes, we will be talking about bringing your
data up to the Data Lakehouse.
If you don't have either of those two, no worries, this talk is actually for you.
We will be discussing the benefits and challenges that may lie ahead on your path.
And if you are already on your way, or if you are very deep into this topic, I think
this talk will also spark your interest.
I encourage you to stick around, maybe also share some of your thoughts in the comments or throughout the conference, and hopefully we can learn from your trials and tribulations as well.
The thesis of this talk is twofold.
It is a well known fact that companies face recurring challenges whenever they engage with data projects, whether because of inconsistencies in data quality, delays in insight delivery, or misalignment between stakeholders and developers.
My opinion is that applying DataOps and embracing the data lakehouse architecture is not just another fix that some engineers cooked up, but rather a paradigm shift that not only systematically addresses the issues I mentioned, but also drives insight delivery and trust in data.
I don't know.
Is that arguable, polemical?
Let's find out together.
My hope is that you walk away from the talk understanding what DataOps is, why it matters, and how it approaches the problems it claims to address.
But more important than the boring what, why, and how, what I really want to share with you is some of my appreciation for the topic, some of my justified love for it, so that you hopefully walk away with more perspective on the huge opportunity that this topic represents for your organization or your personal career.
About my personal career: I've been working on data related projects for the last six years, across different layers of the stack, everywhere from platform to real-time data analytics.
I have consulted for literally a dozen organizations and supported them on their IT journey.
And over the last 24 months, I've been in love with this topic, trying to find ways to introduce DataOps into my daily work.
To keep the conversation very pragmatic, very down to earth and easy to follow, I would like to start by telling you the story of Fixit Incorporated, a fictional company that has issues with their existing data infrastructure and is trying to introduce DataOps to address them.
We will also see the grand vision behind DataOps.
But when we want to start understanding better what it is, we'll need to touch on some key concepts. Although a little bit technical, we will always keep on the horizon three core fundamentals for the business, namely cost, performance, and understandability, so that we force ourselves to measure how much better we are doing than whatever is already out there.
We will then move on to the more spicy part of the talk, I would like to say, where we will discuss typical technical and organizational challenges that you may face when you want to embrace DataOps fully.
We will then conclude with a summary, and I'll give you some pointers so that you know where to go next if you decide that you want to join us in this adventure.
Without further ado, let's start.
So, to understand why DataOps is so important, it makes sense to first try to understand the problems that it's trying to address.
To illustrate this, I will tell you the story of Fixit.
This is an interplanetary megacorporation headquartered on Earth.
They recently hit their 500,000-employee mark.
They are deep into research and development, aerospace
engineering and space mining.
They position themselves as data-driven and cutting edge, and therefore they recently hired their first ever Chief Data Officer, who launched a digital transformation program to drive growth.
Now, central to this initiative is the data engineering department, which has around 400 employees and big investments in cloud-native compute, storage, and Martian BI tooling.
So the CDO, tech-savvy as he is, starts by mapping out the technological landscape of the company.
What he quickly finds out is that all of the departments have very well defined boundaries.
They all use their own tables, and essentially the organization has grown organically in that sense, without any top-level alignment.
They are all subject matter experts on their topic, yet they rely on data from each other to actually perform.
Now the CDO wants to know how this works.
Organic growth, by the way, is a situation in which many companies find themselves without really realizing it.
We want to see how this organic growth has impacted the typical workflows inside the company.
So let's take one example.
Let's talk about customer support reviews.
Let's assume that one of Fixit's customers buys one of these deep space mining technologies or pieces of equipment.
There is a technical issue.
They place a ticket on the support desk.
One of our colleagues helps them solve the issue, hopefully.
And afterwards, we get a review of how well we did.
Now, Fixit wants to know if any of this review information can be used to drive improvements in customer support.
The way they decide to do this is by running sentiment analysis on these reviews that we got from the customers, alongside some other signals.
The issue here is that the data currently lives in a relational database system managed by the customer success department, but it needs to reach the data science team, who are the experts on sentiment analysis.
The data science team doesn't have direct access to this data, nor the expertise or resources to move the data to where they need it, and therefore they rely on the data engineering department, who build a pipeline that copies this data and puts it into a data lake for the data scientists, on an hourly basis, let's say.
The data scientists get the data, perform the modeling, generate predictions, and then push this to the fifth and last step of the pipeline: a BI report, which itself will likely have some kind of caching or storage, and on top of which the visualizations are built.
The manufacturing managers can then look at this information and make decisions accordingly.
Now on paper, this approach might seem very straightforward, right?
Very intuitive, nothing complicated here.
But if you look with a more clinical eye, you will start seeing some issues cropping up.
For example, there is at least one hard data copy that we know of, between steps two and four.
This is where the data engineers, not knowing exactly what they're moving around, just know that they need to get data from one place to the other.
They don't complicate their lives; they simply copy it as it is and put it somewhere else.
We also see there is no centralized point of orchestration for the whole sequence.
So if something fails on step two, there is very little chance that the people sitting at step four will have visibility over that, let alone the ones consuming the dashboard at the end of the stream.
Finally, since we have multiple different tools interacting, some of them relational databases, some of them data lakes, some of them BI reports, we can already presume that there will be different formats.
So data will be transformed into at least three different formats throughout the whole process.
And that is inefficient, of course.
What happens then when all of these issues are cropping up in one workflow, and we look at scale, at all the possible use cases that the company has already built or is planning to build?
The CDO, knowing that no amount of documentation will help him wrap his head around this situation, decides to do a full review, a full interview process of the company, let's say, to try to extract knowledge from the colleagues.
So he sends out a survey with some rigorous questions.
Let's see what comes out of that.
The CEO tells us that the IT architecture grew organically over the years; there are data silos everywhere.
We kind of suspected this already, so nothing new.
The head of data engineering tells us they maintain hundreds of pipelines moving data between all departments, some of them even cyclically dependent.
Okay, this is new information, and not good news, actually.
That means that in the event there is an issue with some pipeline, there is a chance that multiple pipelines will go down all at the same time.
The new recruits from Data Science tell us that there are many obscure column names and transformations across data stores.
Even after six months of onboarding, there is still, let's say, an ongoing training plan.
No doubt, this is frustrating for the recruits, but also for the recruiters who
have to invest six months of training, even before they start seeing any tasks
delivered by these new colleagues.
The Head of IT tells us that, indeed, there are too many systems and too many permission schemes.
The monthly bill is also unpredictable.
So not only do we have little visibility on costs, but security is also hard to enforce due to the complexity of the permission schemes.
Finally, the managers, the poor dashboard users.
This comment is actually far more generic than what the others mentioned, but I found it much more insidious; there is some very dark evil lurking behind this statement.
He says: can I trust the data in this dashboard?
Last time I checked, it looked funny.
It also takes forever to load.
So this means there is not only a user experience problem here, but trust is also broken.
It begs the question: when this incident was happening, and it took three hours to solve, would it also take only three hours to recover the confidence of this user?
Most likely not.
Perhaps three months.
Perhaps even three years.
So, all the information is in.
Our CDO has tabulated the key findings into one table.
Across the board, we see data silos everywhere, multiple copies and formats, and no single access control scheme.
Metadata management is a topic in some parts of the stack, but not in others, which means we have a huge blind spot towards the side of data ingestion and storage.
And orchestration and monitoring are not centralized.
At least we should be happy there is some in place.
That's it.
But this is far from ideal, right?
And at this point, the CDO takes some time and ponders the question: what if we introduce DataOps and mold it into a data lakehouse architecture?
What would that look like?
What would it bring us?
Well, in an ideal scenario, we could cross out all of this complexity, first of all by introducing a single data lake where all data resides.
This will in turn enable us to guarantee that there will only be one hard copy of the data.
Well, besides any backups for disaster recovery; but there is no need to make hard copies of data in between stages of a workflow, since we know that everything resides in one single place, in one single lake.
This also enables us to start thinking about centralized metadata management, a single pane of glass for managing metadata.
We can also think about enforcing a single data format at some point, hopefully an open source format that will allow us to further develop our own tooling in the end, without really worrying about vendor lock-in or licensing.
We additionally now have the opportunity to talk about centralized orchestration and monitoring, and last but not least, centralized access management, whether we do this via traditional role-based access control or more fancy tag-based access control using metadata.
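A minimal sketch of what such metadata-driven, tag-based access control could look like; the tags, roles, and column names here are invented for illustration, not taken from any particular product.

```python
# Tag-based access control sketch: access is decided per metadata tag on a
# column, rather than per table or per system. All names are made up.
COLUMN_TAGS = {
    "reviews.customer_email": {"pii"},
    "reviews.rating": {"public"},
}
ROLE_ALLOWED_TAGS = {
    "data_scientist": {"public"},
    "support_admin": {"public", "pii"},
}

def can_read(role: str, column: str) -> bool:
    # A column is readable only if every tag on it is allowed for the role.
    return COLUMN_TAGS.get(column, set()) <= ROLE_ALLOWED_TAGS.get(role, set())

print(can_read("data_scientist", "reviews.customer_email"))  # False
print(can_read("support_admin", "reviews.customer_email"))   # True
```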
But in practice, is that even possible?
What would that look like?
And more importantly, what would that entail for cost, performance, and understandability, which are the core fundamentals that we want to, let's say, uphold from the business side of things?
Well, let's take a look at that.
We will need to start with a data lake.
In simple words, we might have different business tools, some of them third party, but all of them will write data to a single location.
And this is actually a natural response to the trends that we see in system integration, because of the sheer amount of unstructured and semi-structured data compared to the structured data that we actually would like to have.
But that's actually not a bad thing, given that modern data platform cost models favor storage, meaning storage is cheap compared to the pre-cloud era.
It also opens up the opportunity for us to enforce, as I mentioned, an open source data format, ideally something like Iceberg or Delta Lake, which are not only widely supported but extremely performant because of their binary nature, and which now also support ACID operations, let's call them.
This simply means that you can read and write these files without having to worry that you will compete with write and read operations from other users.
Every operation is atomic and therefore there is no chance of collisions.
This is a very well known principle of relational database systems, actually.
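To make that a bit more concrete, here is a minimal sketch of atomic writes to such an open-format table in Python, assuming the `deltalake` package and a made-up local path; an Iceberg table would follow the same idea.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

reviews = pd.DataFrame(
    {"ticket_id": [1, 2], "rating": [5, 2], "comment": ["great", "slow response"]}
)

# The first write creates the table; every subsequent write is an atomic
# commit to the table's transaction log, so readers never see half-written data.
write_deltalake("./lake/reviews", reviews)
write_deltalake("./lake/reviews", reviews.assign(ticket_id=[3, 4]), mode="append")

table = DeltaTable("./lake/reviews")
print(table.version())    # monotonically increasing commit version
print(table.to_pandas())  # readers always get a consistent snapshot
```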
We can further improve our situation by bringing staging schemas into scope.
And this is actually when the strength of the system starts to ramp up.
Think about it like this.
We start with very raw data that needs to be refined until we have
a final product which is ready for the business to consume.
Now we know what these typical refining operations are.
All that we need to do is to put them in very well defined stages and execute them.
The good thing about having these stages well defined is that if something goes wrong, we know exactly where to look for the problem.
We also have these smaller steps, essentially a lot of very small steps, and this incremental process simplifies troubleshooting enormously.
To give you an example, suppose that we have a staging stage, with staging models, where we do only language translation for the column names and the table names.
Additionally, we do some enforcement of data type formats.
So if a column is a string but contains a date, then we save that as an actual date object in the database and save some performance and cost.
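As a small illustration of such a staging model, here is a minimal sketch in Python with pandas; the source column names and their translations are invented for the example.

```python
import pandas as pd

# Made-up translation map for column names in the source system.
RENAMES = {"kunden_id": "customer_id", "bewertung": "rating", "datum": "created_at"}

def stage_reviews(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.rename(columns=RENAMES)
    # A date stored as a string becomes a proper timestamp, which is cheaper
    # to filter and partition on downstream.
    df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
    df["rating"] = df["rating"].astype("int32")
    return df

raw = pd.DataFrame({"kunden_id": ["A1"], "bewertung": ["4"], "datum": ["2024-05-01"]})
print(stage_reviews(raw).dtypes)
```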
Then we introduce another, intermediate stage, where we enforce the actual open source format that we wanted.
Maybe we add some partitioning and some compression for the files.
Since our files are now stored in a much more performant format, we can also think about introducing data quality tests and freshness tests.
Moving forward, we go into the curated stage, or however you like to call it.
Here is where we start building business objects or entities that have some meaning for whatever comes afterwards.
And all of this can be done while orchestrating the interdependencies between the models.
That means if some business object requires two of the raw models, we can ensure that the raw models are updated before the business object.
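To make the dependency idea concrete, here is a minimal sketch of that ordering using Python's standard `graphlib`; the model names are invented, and in practice a tool like dbt or an orchestrator resolves this graph for you.

```python
from graphlib import TopologicalSorter

# Each model maps to the models it depends on (its upstream parents).
dependencies = {
    "curated.support_reviews": {"staging.tickets", "staging.ratings"},
    "staging.tickets": set(),
    "staging.ratings": set(),
}

def run_model(name: str) -> None:
    print(f"building {name}")   # in reality: run the SQL / transformation

# static_order() yields parents before children, so the raw/staging models
# are always refreshed before the business object that depends on them.
for model in TopologicalSorter(dependencies).static_order():
    run_model(model)
```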
All in all, this is very easy to build, troubleshoot, and version.
And that is the key word here.
We are not really dividing data into development, testing, and production, like it's usually done.
Rather, for us everything is production; but the same way we test a new feature on an application and roll out a new version of it, we do the same with the tables.
I want to create something new, a new column for whatever business reason.
Then I can just create a new model next to my previous one, with a new version. Let's call it v2.
And then I am ready to test.
Of course, we need to communicate this to my consumers down the line.
At some point, maybe I want to deprecate my previous version, so we need to let them know: hey, please migrate to the new one, if necessary.
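As a rough sketch of this side-by-side versioning, again assuming the `deltalake` package and made-up table names and paths:

```python
import pandas as pd
from deltalake import write_deltalake

orders_v1 = pd.DataFrame({"order_id": [1], "amount": [100.0]})
orders_v2 = orders_v1.assign(currency="EUR")   # the new business column

# v1 keeps serving existing consumers; v2 is published next to it.
write_deltalake("./lake/curated/orders_v1", orders_v1)
write_deltalake("./lake/curated/orders_v2", orders_v2)

# Consumers are told to migrate; once nobody reads v1 any more, it is retired.
```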
One point that I forgot to mention here: for this orchestration between the model dependencies, there are tools already available.
There are a few; I want to recommend dbt simply because it's open source and it has a huge community.
We will talk a little bit more about it later.
The next step in our stack is actually bringing the house to the lake, in the lakehouse.
By this I mean introducing a warehouse.
Why do we want to do this?
Well, think about it like this.
If the staging schemas are the perfect place for data exploration,
the warehouse is the perfect place for data exploitation.
This is because warehouse technologies or solutions have much more computational power, and they can help you meet time-critical use cases, like when you need to get a report at a very frequent interval and you need these queries to really complete on time.
Here we can also think about introducing fancier materialization strategies, like incremental models, which means upserts.
So, rather than completely overwriting tables, we only insert the new rows and update the rows that already exist.
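Here is a minimal sketch of such an upsert, done in pandas purely for illustration; in a warehouse this is typically a MERGE statement or an incremental model, and the table contents are made up.

```python
import pandas as pd

existing = pd.DataFrame({"order_id": [1, 2], "status": ["open", "open"]})
increment = pd.DataFrame({"order_id": [2, 3], "status": ["closed", "open"]})

def upsert(target: pd.DataFrame, source: pd.DataFrame, key: str) -> pd.DataFrame:
    # Rows from the increment win on key collisions; everything else is kept.
    merged = pd.concat([target, source], ignore_index=True)
    return merged.drop_duplicates(subset=key, keep="last").reset_index(drop=True)

print(upsert(existing, increment, key="order_id"))
# order 2 is updated to "closed", order 3 is inserted, order 1 is untouched
```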
We can also introduce fancier tooling that comes out of the box with these technologies, for example machine-learning-based queries or some statistical transformations that might be required for whatever business reason.
And here I need to make a point.
There are multiple data warehousing solutions, but not all of them are able to
read and write directly into the one lake.
So if we want to uphold the principles that we are trying to build here, we need to select this tool carefully, so that we don't build a new silo on top of the lake.
The cherry on top for us will be wide tables.
And I'll tell you why.
First of all, there are benchmarks showing that in modern warehouses, querying wide tables is 25 to 50 percent more performant than running joins across tables.
They are also extremely intuitive for the analysts and the business users, so they can actually get to results very fast.
They don't need too much onboarding to start utilizing these tables.
Hence, reducing all this cost of highly skilled labor.
The BI tools also sometimes tend to ask you for wide tables as their input, so you can even think of them as more or less a requirement, depending on which BI tooling you're using.
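As a small illustration, here is a minimal sketch of assembling such a wide, denormalized table in pandas; the joins are done once, up front, and the table and column names are invented.

```python
import pandas as pd

tickets = pd.DataFrame({"ticket_id": [1, 2], "customer_id": ["A", "B"],
                        "rating": [5, 2]})
customers = pd.DataFrame({"customer_id": ["A", "B"],
                          "segment": ["mining", "aerospace"]})
sentiment = pd.DataFrame({"ticket_id": [1, 2], "sentiment": [0.9, -0.4]})

# One flat row per ticket, with everything the dashboard needs already joined.
wide_reviews = (
    tickets
    .merge(customers, on="customer_id", how="left")
    .merge(sentiment, on="ticket_id", how="left")
)
print(wide_reviews)
```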
The caveat here is that you might of course end up with duplicated columns in some wide tables.
For example, you might have some table with conversions over some time range, and some other table where there are also conversions, but on a less granular, monthly basis.
And of course, you need to make sure that the users understand that conversions can look different in different tables, and that there is metadata that they can use.
I'm talking about catalog, glossary, and lineage.
This should all be information available to the users, so that they can trace back all the changes from the beginning of the data lake all the way to their consumption layer.
That way you can preemptively answer any questions or doubts that may crop up in the daily work of the analysts.
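A minimal sketch of what such a catalog entry with glossary and lineage information could look like; the structure and names are an illustration, not the schema of any particular catalog product.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    table: str
    description: str                                      # glossary: what the business term means
    upstream: list[str] = field(default_factory=list)     # lineage back to the lake

entry = CatalogEntry(
    table="curated.wide_reviews",
    description="One row per support ticket, including customer segment and "
                "review sentiment. Conversions here are counted on a daily basis.",
    upstream=["staging.tickets", "staging.ratings", "raw.crm_exports"],
)
print(entry)
```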
And that brings me to the challenges, right?
So, we already discussed all the benefits of this in terms of cost, performance, and understandability, but there is still a chance you will not be able to make this pitch in your organization and get a thumbs up right away.
Some typical technical challenges are, for example, existing long-running licenses with enterprise tooling.
Maybe you get past that and start implementing DataOps, but you introduce way too much trivial testing.
This is, to be fair, also something that you will see in other areas of software development, but it gives you a false sense of security that might backfire.
We also highlighted that there are no dev, test, and production stages.
We just have everything as production, with versions on the products, or on the tables, or on the models.
What is critical here is to have good communication.
So if you create a new version of your model and don't properly communicate it, you might end up pulling the rug out from under somebody, and it is hard to estimate the blast radius that this might have.
Finally, one that I see often, although I think it is self-explanatory at this point: you should always try to build a mock-up of your use cases, even if it's an Excel sheet with dummy data, as a pre-step to align yourself with the stakeholders, which is one of the core principles of DataOps practices.
But sometimes this is rushed, it's brushed under the table, just to get to the results faster.
Now, all of this, I think, can be addressed with a cool head.
However, the organizational side of things may look different because, yeah, people don't always operate with a cool head.
Some comments that you might come across are: the conversion reports you build are not matching the ones from the other team.
So this is the discrepancy that I mentioned: you can end up with two different wide tables with the same metric, but with different versions of this information.
Somebody can also mention, cool, but I'm not a technical person.
Why don't you talk to the engineers?
There is a huge misconception here of what DataOps is.
As I mentioned, one of its principles is that you want to align the stakeholders with the engineers.
And therefore, this is a joint effort.
From the side of the engineers, you might also hear things like: hey, we have worked with tool X for so many years, and now you want me to change to tool Y.
But at the end of the day, the only thing that is constant in life is change.
I think everyone can agree with that, and DataOps is built for that.
It's about embracing change and staying nimble, staying on your toes, always ready for whatever comes around the curve.
Because requirements will change, KPIs will change, data will change, sources will change, and tools, people, and best practices themselves might change.
But the whole idea is that you come up with a translation of DataOps that fits your organization.
You put it to the test, you review it, and you make it better and better and better, iterating in small steps, but very quick steps.
Putting it all together: the Data Lakehouse and open source storage formats are the natural response to the trends that we see in system integration.
And they are beneficial for us from a cost perspective because they take advantage of the billing models of modern data platforms; they essentially trade computational costs for storage costs, which in turn are lower.
We talked about how wide tables are intuitive and performant, and even about their pervasiveness due to requirements from the BI tooling.
We also mentioned they have a downside because they are denormalized, but the benefits outweigh these issues.
Metadata, I cannot stress it enough, is the key to removing silos; it is really the key to boosting understandability.
And given the fact that the costs of skilled labor are still high and will remain high, understandability is a huge factor in your cost control.
So the more intuitive it is, the easier it is for people to find information in a self-service manner, the better off you are for the future.
Finally, DataOps, as I mentioned, is a paradigm shift.
It's not just something that the engineers cooked up; it's the way that you treat data and metadata, and how you adapt to change itself.
If you are interested in learning a little bit more about this topic, I can recommend a few things.
First of all, take a look at the DataOps Manifesto: 18 principles, very clear and direct to the business question.
You can also read the best practices from the devs at dbt.
As I mentioned, dbt is an open source tool that is very opinionated, but the documentation is very well written, and the engineers will tell you, every time they take a decision, why they did it.
So you can internalize the fundamentals behind it, instead of just falling blindly into whatever the tool offers you.
I can also recommend you check out the book by Dave Fowler, which is called Cloud Data Management, Four Stages for Informed Companies.
It's an open, free book; you can find it online.
It's a good read.
And, yeah, also maybe join the community on Slack.
The dbt community is very active, very humorous at times.
I think there are upwards of 70,000 people in there right now.
And yes, you will always find somebody with answers to your questions as well.
That's all I have for you.
I appreciate your time.
And yeah, I hope to see you at the next one.
Ciao ciao.