Transcript
This transcript was autogenerated. To make changes, submit a PR.
I'm Jorge, Platform Engineer and Consultant at Petalim.
Over the next 30 minutes, we will be talking about bringing your
data up to the Data Lakehouse.
If you don't have either of those two, no worries, this talk is actually for you.
We will be discussing the benefits and challenges that may lie ahead on your path.
And if you are already on your way, or if you are very deep into this topic, I think
this talk will also spark your interest.
I encourage you to stick around, maybe also share some of your thoughts in the comments or throughout the conference, and hopefully we can learn from your trials and tribulations as well.
The thesis of this talk is twofold.
It is a well known fact that companies face recurring challenges whenever they engage with data projects, whether because of inconsistencies in data quality, delays in insight delivery, or misalignment between stakeholders and developers.
My opinion is that applying DataOps and embracing the data lakehouse architecture is not just another fix that some engineers cooked up, but rather a paradigm shift that not only systematically addresses the issues I mentioned, but also drives insight delivery and trust in data.
I don't know.
Is that arguable, polemical?
Let's find out together.
My hope is that you walk away from the talk understanding what DataOps is, why it matters, and how it approaches the problems it claims to address.
But more important than the boring what, why, and how, what I really want to share with you is some of my appreciation for the topic, some of my justified love for it, so that you hopefully walk away with more perspective on the huge opportunity that this topic represents for your organization or your personal career.
About my personal career: I've been working on data related projects for the last six years, across different layers of the stack, everywhere from platform to real-time data analytics.
I have consulted for literally a dozen organizations and supported them on their IT journey.
And over the last 24 months, I've been in love with this topic, trying to find ways to introduce DataOps into my daily work.
To keep the conversation very pragmatic, very down to earth and easy to follow, I would like to start by telling you the story of Fixit Incorporated, a fictional company that has issues with their existing data infrastructure and is trying to introduce DataOps to address them.
We will also see the grand vision behind DataOps.
But when we want to start understanding better what it is, we'll need to touch on some key concepts. Although a little bit technical, we will always keep on the horizon three core fundamentals for the business, namely cost, performance, and understandability, so that we force ourselves to measure how much better we are doing than whatever is already out there.
We will then move on to the more spicy part of the talk, I would like to say, where we will discuss typical technical and organizational challenges that you may face when you want to embrace DataOps fully.
We will then conclude with a summary, and I'll give you some pointers so that you know where to go next if you decide that you want to join us in this adventure.
Without further ado, let's start.
So, to understand why DataOps is so important, it makes sense to first try to understand the problems that it's trying to address.
To illustrate this, I will tell you the story of Fixit.
This is an interplanetary megacorporation headquartered on Earth.
They recently hit their 500,000-employee mark.
They are deep into research and development, aerospace
engineering and space mining.
They position themselves as data-driven and cutting edge, and therefore they recently hired their first ever Chief Data Officer, who launched a digital transformation program to drive growth.
Now, central to this initiative is the data engineering department, which has around 400 employees and big investments in cloud-native compute, storage, and Martian BI tooling.
So the CDO, tech-savvy as he is, starts by mapping out the technological landscape of the company.
What he quickly finds out is that all of the departments have very well defined boundaries.
They all use their own tables, and essentially the organization has grown organically in that sense, without any top-level alignment.
They are all subject matter experts on their topic, yet they rely on data from each other to actually perform.
Now the CDO wants to know how this works.
Organic growth, by the way, is a situation in which many companies find themselves without really realizing it.
We want to see how this organic growth has impacted the typical workflows inside the company.
So let's take one example.
Let's talk about customer support reviews.
Let's assume that one of Fixit's customers buys one of these deep space mining technologies or pieces of equipment.
There is a technical issue.
They place a ticket on the support desk.
One of our colleagues helps them solve the issue, hopefully.
And afterwards, we get a review of how well we did.
Now, Fixit wants to know if any of this review information can be used to drive improvements in customer support.
The way they decide to do this is by running sentiment analysis on these reviews that we got from the customers, alongside some other signals.
The issue here is that the data currently lives in a relational database system managed by the customer success department, but it needs to reach the data science team, who are the experts on sentiment analysis.
The data science team doesn't have direct access to this data, nor the expertise or resources to move the data to where they need it, and therefore they rely on the data engineering department, who build a pipeline that copies this data and puts it into a data lake for the data scientists, on an hourly basis, let's say.
The data scientists get the data, perform the modeling, generate predictions, and then push this to the fifth and last step of the pipeline: a BI report, which itself will likely have some kind of caching or storage, and on top of which the visualizations are built.
The manufacturing managers can then look at this information and make decisions accordingly.
Now on paper, this approach might seem very straightforward, right?
Very intuitive, nothing complicated here.
But if you look with a more clinical eye, you will start seeing some issues cropping up.
For example, there is at least one hard data copy that we know of, between steps two and four.
This is where the data engineers, not knowing exactly what they're moving around, just know that they need to get data from one place to the other.
They don't complicate their lives; they simply copy it as it is and put it somewhere else.
We also see there is no centralized point of orchestration for the whole sequence.
So if something fails on step two, there is very little chance that the people sitting at step four will have visibility over that, let alone the ones consuming the dashboard at the end of the stream.
Finally, since we have multiple different tools interacting, some of them relational databases, some of them data lakes, some of them BI reports, we can already presume that there will be different formats.
So data will be transformed into at least three different formats throughout the whole process.
And that is inefficient, of course.
What happens then when all of these issues are cropping up in one workflow, and we look at scale, at all the possible use cases that the company has already built or is planning to build?
The CDO, knowing that no amount of documentation will help him wrap his head around this situation, decides to do a full review, a full interview process of the company, let's say, to try to extract knowledge from the colleagues.
So he sends out a survey with some rigorous questions.
Let's see what comes out of that.
The CEO tells us that the IT architecture grew organically over the years; there are data silos everywhere.
We kind of suspected this already, so nothing new.
The head of data engineering tells us they maintain hundreds of pipelines moving data between all departments, some of them even cyclically dependent.
Okay, this is new information, and not good news, actually.
That means that in the event there is an issue with some pipeline, there is a chance that multiple pipelines will go down all at the same time.
The new recruits from Data Science tell us that there are many obscure column names and transformations across data stores.
Even after six months of onboarding, there is still, let's say, an ongoing training plan.
No doubt, this is frustrating for the recruits, but also for the recruiters who
have to invest six months of training, even before they start seeing any tasks
delivered by these new colleagues.
The Head of IT tells us that, indeed, there are too many systems and too many permission schemes.
The monthly bill is also unpredictable.
So not only do we have little visibility on costs, but security is also hard to enforce due to the complexity of the permission schemes.
Finally, the managers, the poor dashboard users.
This comment is actually far more generic than what the others mentioned, but I found it much more insidious; there is some very dark evil lurking behind this statement.
He says: can I trust the data in this dashboard?
Last time I checked, it looked funny.
It also takes forever to load.
So this means there is not only a user experience problem here, but trust is also broken.
It begs the question: when this incident was happening, and it took three hours to solve, would it also take only three hours to recover the confidence of this user?
Most likely not.
Perhaps three months.
Perhaps even three years.
So, all the information is in.
Our CDO has tabulated the key findings into one table.
Across the board, we see data silos everywhere, multiple copies and formats, and no single access control scheme.
Metadata management is a topic in some parts of the stack, but not in others, which means we have a huge blind spot towards the side of data ingestion and storage.
And orchestration and monitoring are not centralized.
At least we should be happy there is some in place.
That's it.
But this is far from ideal, right?
And at this point, the CDO takes some time and ponders the question: what if we introduce DataOps and mold it into a data lakehouse architecture?
What would that look like?
What would it bring us?
Well, in an ideal scenario, we could cross out all of this complexity, first of all by introducing a single data lake where all data resides.
This will in turn enable us to guarantee that there will only be one hard copy of the data.
Well, besides any backups for disaster recovery; but there is no need to make hard copies of data in between stages of a workflow, since we know that everything resides in one single place, in one single lake.
This also enables us to start thinking about centralized metadata management, a single pane of glass for managing metadata.
We can also think about enforcing a single data format at some point, hopefully an open source format that will allow us to further develop our own tooling in the end, without really worrying about vendor lock-in or licensing.
We additionally now have the opportunity to talk about centralized orchestration and monitoring, and last but not least, centralized access management, whether we do this via traditional role-based access control or more fancy tag-based access control using metadata.
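A minimal sketch of what such metadata-driven, tag-based access control could look like; the tags, roles, and column names here are invented for illustration, not taken from any particular product.

```python
# Tag-based access control sketch: access is decided per metadata tag on a
# column, rather than per table or per system. All names are made up.
COLUMN_TAGS = {
    "reviews.customer_email": {"pii"},
    "reviews.rating": {"public"},
}
ROLE_ALLOWED_TAGS = {
    "data_scientist": {"public"},
    "support_admin": {"public", "pii"},
}

def can_read(role: str, column: str) -> bool:
    # A column is readable only if every tag on it is allowed for the role.
    return COLUMN_TAGS.get(column, set()) <= ROLE_ALLOWED_TAGS.get(role, set())

print(can_read("data_scientist", "reviews.customer_email"))  # False
print(can_read("support_admin", "reviews.customer_email"))   # True
```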
But in practice, is that even possible?
What would that look like?
And more importantly, what would that entail for cost, performance, and understandability, which are the core fundamentals that we want to, let's say, uphold from the business side of things?
Well, let's take a look at that.
We will need to start with a data lake.
In simple words, we might have different business tools, some of them third party, but all of them will write data to a single location.
And this is actually a natural response to the trends that we see in system integration, because of the sheer amount of unstructured and semi-structured data compared to the structured data that we actually would like to have.
But that's actually not a bad thing, given that modern data platform cost models favor storage, meaning storage is cheap compared to the pre-cloud era.
It also opens up the opportunity for us to enforce, as I mentioned, an open source data format, ideally something like Iceberg or Delta Lake, which are not only widely supported but extremely performant because of their binary nature, and which now also support ACID operations, let's call them.
This simply means that you can read and write these files without having to worry that you will compete with write and read operations from other users.
Every operation is atomic and therefore there is no chance of collisions.
This is a very well known principle of relational database systems, actually.
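To make that a bit more concrete, here is a minimal sketch of atomic writes to such an open-format table in Python, assuming the `deltalake` package and a made-up local path; an Iceberg table would follow the same idea.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

reviews = pd.DataFrame(
    {"ticket_id": [1, 2], "rating": [5, 2], "comment": ["great", "slow response"]}
)

# The first write creates the table; every subsequent write is an atomic
# commit to the table's transaction log, so readers never see half-written data.
write_deltalake("./lake/reviews", reviews)
write_deltalake("./lake/reviews", reviews.assign(ticket_id=[3, 4]), mode="append")

table = DeltaTable("./lake/reviews")
print(table.version())    # monotonically increasing commit version
print(table.to_pandas())  # readers always get a consistent snapshot
```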
We can further improve our situation by bringing staging schemas into scope.
And this is actually when the strength of the system starts to ramp up.
Think about it like this.
We start with very raw data that needs to be refined until we have
a final product which is ready for the business to consume.
Now we know what these typical refining operations are.
All that we need to do is to put them in very well defined stages and execute them.
The good thing about having these stages well defined is that if something goes wrong, we know exactly where to look for the problem.
We also have these smaller steps, essentially a lot of very small steps, and this incremental process simplifies troubleshooting enormously.
To give you an example, suppose that we have a staging stage, with staging models, where we do only language translation for the column names and the table names.
Additionally, we do some enforcement of data type formats.
So if a column is a string but contains a date, then we save that as an actual date object in the database and save some performance and cost.
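As a small illustration of such a staging model, here is a minimal sketch in Python with pandas; the source column names and their translations are invented for the example.

```python
import pandas as pd

# Made-up translation map for column names in the source system.
RENAMES = {"kunden_id": "customer_id", "bewertung": "rating", "datum": "created_at"}

def stage_reviews(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.rename(columns=RENAMES)
    # A date stored as a string becomes a proper timestamp, which is cheaper
    # to filter and partition on downstream.
    df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
    df["rating"] = df["rating"].astype("int32")
    return df

raw = pd.DataFrame({"kunden_id": ["A1"], "bewertung": ["4"], "datum": ["2024-05-01"]})
print(stage_reviews(raw).dtypes)
```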
Then we introduce another, intermediate stage, where we enforce the actual open source format that we wanted.
Maybe we add some partitioning and some compression for the files.
Since our files are now stored in a much more performant format, we can also think about introducing data quality tests and freshness tests.
Moving forward, we go into the curated stage, or however you like to call it.
Here is where we start building business objects or entities that have some meaning for whatever comes afterwards.
And all of this can be done while orchestrating the interdependencies between the models.
That means if some business object requires two of the raw models, we can ensure that the raw models are updated before the business object.
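To make the dependency idea concrete, here is a minimal sketch of that ordering using Python's standard `graphlib`; the model names are invented, and in practice a tool like dbt or an orchestrator resolves this graph for you.

```python
from graphlib import TopologicalSorter

# Each model maps to the models it depends on (its upstream parents).
dependencies = {
    "curated.support_reviews": {"staging.tickets", "staging.ratings"},
    "staging.tickets": set(),
    "staging.ratings": set(),
}

def run_model(name: str) -> None:
    print(f"building {name}")   # in reality: run the SQL / transformation

# static_order() yields parents before children, so the raw/staging models
# are always refreshed before the business object that depends on them.
for model in TopologicalSorter(dependencies).static_order():
    run_model(model)
```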
All in all, this is very easy to build, troubleshoot, and version.
And that is the key word here.
We are not really dividing data into development, testing, and production, like it's usually done.
Rather, for us everything is production; but the same way we test a new feature on an application and roll out a new version of it, we do the same with the tables.
I want to create something new, a new column for whatever business reason.
Then I can just create a new model next to my previous one, with a new version. Let's call it v2.
And then I am ready to test.
Of course, we need to communicate this to my consumers down the line.
At some point, maybe I want to deprecate my previous version, so we need to let them know: hey, please migrate to the new one, if necessary.
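As a rough sketch of this side-by-side versioning, again assuming the `deltalake` package and made-up table names and paths:

```python
import pandas as pd
from deltalake import write_deltalake

orders_v1 = pd.DataFrame({"order_id": [1], "amount": [100.0]})
orders_v2 = orders_v1.assign(currency="EUR")   # the new business column

# v1 keeps serving existing consumers; v2 is published next to it.
write_deltalake("./lake/curated/orders_v1", orders_v1)
write_deltalake("./lake/curated/orders_v2", orders_v2)

# Consumers are told to migrate; once nobody reads v1 any more, it is retired.
```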
One point that I forgot to mention here: for this orchestration between the model dependencies, there are tools already available.
There are a few; I want to recommend dbt simply because it's open source and it has a huge community.
We will talk a little bit more about it later.
The next step in our stack is actually bringing the house to the lake, in the lakehouse.
By this I mean introducing a warehouse.
Why do we want to do this?
Well, think about it like this.
If the staging schemas are the perfect place for data exploration,
the warehouse is the perfect place for data exploitation.
This is because warehouse technologies or solutions have much more computational power, and they can help you meet time-critical use cases, like when you need to get a report at a very frequent interval and you need these queries to really complete on time.
Here we can also think about introducing fancier materialization strategies, like incremental models, which means upserts.
So, rather than completely overwriting tables, we only insert the new rows and update the rows that already exist.
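Here is a minimal sketch of such an upsert, done in pandas purely for illustration; in a warehouse this is typically a MERGE statement or an incremental model, and the table contents are made up.

```python
import pandas as pd

existing = pd.DataFrame({"order_id": [1, 2], "status": ["open", "open"]})
increment = pd.DataFrame({"order_id": [2, 3], "status": ["closed", "open"]})

def upsert(target: pd.DataFrame, source: pd.DataFrame, key: str) -> pd.DataFrame:
    # Rows from the increment win on key collisions; everything else is kept.
    merged = pd.concat([target, source], ignore_index=True)
    return merged.drop_duplicates(subset=key, keep="last").reset_index(drop=True)

print(upsert(existing, increment, key="order_id"))
# order 2 is updated to "closed", order 3 is inserted, order 1 is untouched
```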
We can also introduce fancier tooling that comes out of the box with these technologies, for example machine-learning-based queries or some statistical transformations that might be required for whatever business reason.
And here I need to make a point.
There are multiple data warehousing solutions, but not all of them are able to
read and write directly into the one lake.
So if we want to uphold the principles that we are trying to build here, we need to select this tool carefully, so that we don't build a new silo on top of the lake.
The cherry on top for us will be wide tables.
And I'll tell you why.
First of all, there are benchmarks showing that in modern warehouses, querying wide tables is 25 to 50 percent more performant than running joins across tables.
They are also extremely intuitive for the analysts and the business users, so they can actually get to results very fast.
They don't need too much onboarding to start utilizing these tables.
Hence, reducing all this cost of highly skilled labor.
The BI tools also sometimes tend to ask you for wide tables as their input, so you can even think of them as more or less a requirement, depending on which BI tooling you're using.
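As a small illustration, here is a minimal sketch of assembling such a wide, denormalized table in pandas; the joins are done once, up front, and the table and column names are invented.

```python
import pandas as pd

tickets = pd.DataFrame({"ticket_id": [1, 2], "customer_id": ["A", "B"],
                        "rating": [5, 2]})
customers = pd.DataFrame({"customer_id": ["A", "B"],
                          "segment": ["mining", "aerospace"]})
sentiment = pd.DataFrame({"ticket_id": [1, 2], "sentiment": [0.9, -0.4]})

# One flat row per ticket, with everything the dashboard needs already joined.
wide_reviews = (
    tickets
    .merge(customers, on="customer_id", how="left")
    .merge(sentiment, on="ticket_id", how="left")
)
print(wide_reviews)
```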
The caveat here is that you might of course end up with duplicated columns in some wide tables.
For example, you might have some table with conversions over some time range, and some other table where there are also conversions, but on a less granular, monthly basis.
And of course, you need to make sure that the users understand that conversions can look different in different tables, and that there is metadata that they can use.
I'm talking about catalog, glossary, and lineage.
This should all be information available to the users, so that they can trace back all the changes from the beginning of the data lake all the way to their consumption layer.
That way you can preemptively answer any questions or doubts that may crop up in the daily work of the analysts.
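A minimal sketch of what such a catalog entry with glossary and lineage information could look like; the structure and names are an illustration, not the schema of any particular catalog product.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    table: str
    description: str                                      # glossary: what the business term means
    upstream: list[str] = field(default_factory=list)     # lineage back to the lake

entry = CatalogEntry(
    table="curated.wide_reviews",
    description="One row per support ticket, including customer segment and "
                "review sentiment. Conversions here are counted on a daily basis.",
    upstream=["staging.tickets", "staging.ratings", "raw.crm_exports"],
)
print(entry)
```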
And that brings me to the challenges, right?
So, we already discussed all the benefits of this in terms of cost, performance, and understandability, but there is still a chance you will not be able to make this pitch in your organization and get a thumbs up right away.
Some typical technical challenges are, for example, existing long-running licenses with enterprise tooling.
Maybe you get past that and start implementing DataOps, but you introduce way too much trivial testing.
This is, to be fair, also something that you will see in other areas of software development, but it gives you a false sense of security that might backfire.
We also highlighted that there are no dev, test, and production stages.
We just have everything as production, with versions on the products, or on the tables, or on the models.
What is critical here is to have good communication.
So if you create a new version of your model and don't properly communicate it, you might end up pulling the rug out from under somebody, and it is hard to estimate the blast radius that this might have.
Finally, one that I see often, although I think it is self-explanatory at this point: you should always try to build a mock-up of your use cases, even if it's an Excel sheet with dummy data, as a pre-step to align yourself with the stakeholders, which is one of the core principles of DataOps practices.
But sometimes this is rushed, it's brushed under the table, just to get to the results faster.
Now, all of this, I think, can be addressed with a cool head.
However, the organizational side of things may look different because, yeah, people don't always operate with a cool head.
Some comments that you might come across are: the conversion reports you build are not matching the ones from the other team.
So this is the discrepancy that I mentioned: you can end up with two different wide tables with the same metric, but with different versions of this information.
Somebody can also mention, cool, but I'm not a technical person.
Why don't you talk to the engineers?
There is a huge misconception here of what DataOps is.
As I mentioned, one of its principles is that you want to align the stakeholders with the engineers.
And therefore, this is a joint effort.
From the side of the engineers, you might also hear things like: hey, we have worked with tool X for so many years, and now you want me to change to tool Y.
But at the end of the day, the only thing that is constant in life is change.
I think everyone can agree with that, and DataOps is built for that.
It's about embracing change and staying nimble, staying on your toes, always ready for whatever comes around the curve.
Because requirements will change, KPIs will change, data will change, sources will change, and tools, people, and best practices themselves might change.
But the whole idea is that you come up with a translation of DataOps that fits your organization.
You put it to the test, you review it, and you make it better and better and better, iterating in small steps, but very quick steps.
Putting it all together: the Data Lakehouse and open source storage formats are the natural response to the trends that we see in system integration.
And they are beneficial for us from a cost perspective because they take advantage of the billing models of modern data platforms; they essentially trade computational costs for storage costs, which in turn are lower.
We talked about how wide tables are intuitive and performant, and even about their pervasiveness due to requirements from the BI tooling.
We also mentioned they have a downside because they are denormalized, but the benefits outweigh these issues.
Metadata, I cannot stress it enough, is the key to removing silos; it is really the key to boosting understandability.
And given the fact that the costs of skilled labor are still high and will remain high, understandability is a huge factor in your cost control.
So the more intuitive it is, the easier it is for people to find information in a self-service manner, the better off you are for the future.
Finally, DataOps, as I mentioned, is a paradigm shift.
It's not just something that the engineers cooked up; it's the way that you treat data and metadata, and how you adapt to change itself.
If you are interested in learning a little bit more about this topic, I can recommend a few things.
First of all, take a look at the DataOps Manifesto: 18 principles, very clear and direct to the business question.
You can also read the best practices from the devs at dbt.
As I mentioned, dbt is an open source tool that is very opinionated, but the documentation is very well written, and the engineers will tell you, every time they take a decision, why they did it.
So you can internalize the fundamentals behind it, instead of just falling blindly into whatever the tool offers you.
I can also recommend you check out the book by Dave Fowler, which is called Cloud Data Management, Four Stages for Informed Companies.
It's an open, free book; you can find it online.
It's a good read.
And, yeah, also maybe join the community on Slack.
The dbt community is very active, very humorous at times.
I think there are upwards of 70,000 people in there right now.
And yes, you will always find somebody with answers to your questions as well.
That's all I have for you.
I appreciate your time.
And yeah, I hope to see you at the next one.
Ciao ciao.