Conf42 DevOps 2024 - Online

4 Tricks That Help You Leverage Your Airflow Pipelines


Abstract

Unlock the full potential of Apache Airflow with 4 game-changing tricks. From CSV-based DAGs to event-driven workflows, Aliaksandr reveals the keys to supercharge your data engineering experience. Don’t miss out!

Summary

  • Four tips that help you leverage your Airflow pipelines. Today I'm going to walk through four real-world cases and show how I creatively solved non-standard data engineering tasks with Airflow.
  • You don't need to create separate pipelines for every similar workload; you can auto-generate the pipelines from a configuration file. Having a separate DAG per entity gives you more control over each individual pipeline. Cons: it may require some additional understanding of Airflow.
  • In my case my client wanted to add configuration and pipelines on his own, so I grab the data via the Google Sheets API and create pipelines based on it. There are only two cons, and they amount to a lot of validation work. Overall, a good option that may save you a lot of time.
  • How to generate DAGs based on your Airflow entities: all the configuration (the connections to the MySQL databases) is already in your Airflow metadata DB. Pros: no need to handle external API calls. Cons: the configuration can be lost along with the metadata DB.
  • Don't rely only on a schedule; create Airflow entities, and in case you need the results of your pipelines as soon as some event is triggered, make your Airflow pipelines event-driven. If you have any questions, add me on social networks.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, thank you for joining my talk, "Four tips that help you leverage your Airflow pipelines." Today I'm going to walk you through four real-world cases and show how I creatively solved non-standard data engineering tasks with Airflow. So yeah, let's go. I'm not going to talk about who I am, because I believe that was covered in the intro, and since we only have 30 minutes, let's go straight to the first tip.

Tip number one: generate typical DAGs from a CSV file. I know that most of you, even with very little data engineering experience, have faced the situation where you have a lot of similar pipelines that differ from each other only in small details: IDs, start dates, which API is called, and so on. You don't need to create separate pipelines for all of them; you can auto-generate the pipelines from some configuration. In this particular case, my client had a lot of different decentralized organizations we needed to grab data from. The API was the same; only the ID of the data we should download differed, and each organization's data started from a different start date. So I put the information about all of these decentralized organizations into one CSV file and created a DAG factory that builds an individual DAG for each of these pipelines.

What are the pros of this approach? It's really easy to add a new entity: adding a new entity means adding a new DAG, and that's just one line in the config with the start date, the IDs, and so on, and the scheduler picks it up for you. Less code, obviously: the DAG factory creates the DAGs for you, so there are no new files; it's a single file, you add configuration, and it works. It's flexible: in my view flexible enough, because we can set a start date for each particular pipeline and make sure we don't create garbage DAG runs that add no value and download no data. And isolation: we can control the start date per pipeline. Why did I add this point? Because we also had the option of creating one single DAG for all of these entities and loading everything in that one DAG. But then, for some of the pipelines, like for Optimism, there would be a lot of API calls returning empty data, which we really don't want, and as soon as we added something with a start date in the past, we would have to re-trigger all of those runs in the past to download the data, which is not really great. Having a separate DAG per entity gives you more control over each individual pipeline. It's also covered by data validation tests, which I haven't mentioned yet: we had a separate pipeline for data validation, and it wasn't referenced anywhere in the config, but adding one decentralized organization to the config was automatically picked up by the data validation pipeline, so the data downloaded for it was validated by our data validation tests without any additional effort, which is really great.

Cons: it may require some additional understanding of Airflow. There are a lot of ways to break something here, or to lose patience while the scheduler renders all of these pipelines, so you may expect some complications with this approach.
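To make this concrete, here is a minimal sketch of what such a CSV-driven DAG factory can look like in Airflow 2.x. The file path, the column layout, and the download function are illustrative assumptions, not the exact code from the talk.

import csv
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def download_org_data(org_id: str) -> None:
    # Hypothetical placeholder: call the shared API for one organization.
    print(f"Downloading data for {org_id}")


def build_dag(org_id: str, start_date: datetime) -> DAG:
    with DAG(
        dag_id=f"ingest_{org_id}",
        start_date=start_date,
        schedule="@daily",
        catchup=True,  # backfill only from this organization's own start date
        tags=["csv-factory"],
    ) as dag:
        PythonOperator(
            task_id="download",
            python_callable=download_org_data,
            op_kwargs={"org_id": org_id},
        )
    return dag


# orgs.csv holds one line per organization, e.g. "optimism,2023-06-01".
with open("/opt/airflow/dags/orgs.csv") as config_file:
    for org_id, start in csv.reader(config_file):
        # Registering each generated DAG in globals() is how the scheduler discovers it.
        globals()[f"ingest_{org_id}"] = build_dag(org_id, datetime.fromisoformat(start))

Adding an organization is then literally one new CSV line, and the per-row start date keeps each backfill scoped to that organization.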
One more con is what I'd call kind-of-similar pipelines. Sometimes you think your pipelines are similar, but they're not really: they have a lot of different conditions and so on, and you may end up with a DAG factory that is a spaghetti bowl of if-conditions. That can create more confusion than it saves, so sometimes it really is better to create separate pipelines for those kind-of-similar cases. Overall, though, it's a good approach, and I'd recommend everyone at least try it, because having this tool under your belt is really, really useful. So let's go further.

Tip number two: you can generate typical DAGs based on a shareable config. Basically it's an upgrade of the previous case, but in my situation the client wanted to add configuration and pipelines on his own. I gave him that opportunity by creating a shareable config as a Google Sheet: in the Airflow code I grabbed this data via the Google Sheets API and created pipelines based on it. That's mostly it, but later I upgraded this pipeline: I downloaded the data from the API, validated it, loaded it into a local file, and then let the scheduler create pipelines based on the information in that local file. I'll tell you why a bit later.

Let's go to the pros. Obviously, other people can create Airflow DAGs for you: you don't need to maintain all the configuration, the people who understand what they need can do it themselves, and you can focus on other things instead of switching focus to maintain these pipelines, which is a really good thing if you ask me. Changes are available without a deployment: in the previous approach you had to commit changes to the configuration file and somehow deploy them to the Airflow server, and only then were the DAGs rendered and created. Here, as soon as you add one line to the Google Sheet config, it is picked up by the scheduler and rendered on your Airflow server. That's one step fewer. Great.

There are really only two cons, but they add up to a lot of validation work. First of all, performance: before doing this, you have to make sure the API you are grabbing data from responds with a small delay, because two or three seconds may be crucial for the scheduler and may end up crashing it and stopping your running Airflow pipelines, which is not great. Second, you need to handle the user experience. Those of you who haven't had the chance to work directly with users may be surprised how easily a user can break everything, so you obviously need documentation that explains to your users how to add their pipelines, and it had better be extensive and understandable. That brings us to input data validation, the step I mentioned earlier: grabbing the data from the API, validating it, loading it into a local file, and building your pipelines from that file. It exists precisely to validate the data, because even though you wrote the documentation and invested a lot of time in it, a user may skip it or misread it, so you'd better validate the data and not pass garbage to the scheduler, because that may break your scheduler, which is not a good thing. And even with documentation and validation tests, you're also human: you can miss something, and you will eventually find the user who breaks your cluster.
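The fetch-validate-cache step described here might look roughly like the sketch below. It assumes the sheet is readable through its CSV export URL, that refresh_config() is called from a small maintenance DAG rather than at parse time, and that build_dag is the same illustrative factory as in the previous sketch (imported here from a hypothetical module).

import csv
import io
import json
from datetime import datetime
from pathlib import Path

import requests

from ingest_factory import build_dag  # hypothetical module holding the DAG factory

CONFIG_CACHE = Path("/opt/airflow/dags/org_config.json")  # cheap for the scheduler to read
SHEET_CSV_URL = "https://docs.google.com/spreadsheets/d/<SHEET_ID>/export?format=csv"


def refresh_config() -> None:
    """Fetch the shared sheet, validate every row, and cache the result locally."""
    response = requests.get(SHEET_CSV_URL, timeout=10)  # hard timeout: a slow sheet must not hang anything
    validated = []
    for row in csv.DictReader(io.StringIO(response.text)):
        org_id = (row.get("org_id") or "").strip()
        try:
            start = datetime.fromisoformat((row.get("start_date") or "").strip())
        except ValueError:
            continue  # drop garbage rows instead of passing them to the scheduler
        if org_id:
            validated.append({"org_id": org_id, "start_date": start.isoformat()})
    CONFIG_CACHE.write_text(json.dumps(validated))


# At parse time the scheduler only reads the local cache -- no external API calls.
if CONFIG_CACHE.exists():
    for entry in json.loads(CONFIG_CACHE.read_text()):
        globals()[f"ingest_{entry['org_id']}"] = build_dag(
            entry["org_id"], datetime.fromisoformat(entry["start_date"])
        )

Keeping the Google Sheets call out of the parse path is what stops a slow or broken sheet from taking the scheduler down with it.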
So you'd better have error notifications in place so you can react as soon as possible. Overall it's a good option that can save you a lot of time, but be ready to invest time in safety checks to make sure everything works fine.

Let's go to tip number three: how to generate DAGs based on your Airflow entities. Basically, it's a combination of the two previous cases. Why can it be useful? In my experience I had a case where my client had a lot of different MySQL databases. They were created and dropped all the time, and I always had to create new pipelines based on a new connection to each database, grab the data, drop the pipeline, and so on: a lot of maintenance. And as you may understand, putting credentials into a configuration file is not a good thing. So I did it like this: I created Airflow connections to the MySQL databases with a certain prefix, and since in Airflow you can use SQLAlchemy to query the metadata DB, I list all the connections, filter only the ones with that prefix, and create Airflow pipelines based on what I find (there's a sketch of this below). It's a really nice setup: you don't have to maintain any configuration, because all the configuration is already in your metadata DB, and if the connection works, everything should work. Pros: no need to handle external API calls; there is no external configuration or API, everything lives in your metadata DB, and one way or another the scheduler already talks to the metadata DB anyway. What are the cons? The configuration can be lost. I don't know if you have ever faced the situation where your Airflow metadata database was lost; it's not great. Or, for example, you have to migrate your Airflow from one server to another, which is another way the metadata DB can be lost. Once you lose it, you also lose all of your connections and therefore all of your configuration, so it's better to keep a backup of it. There is also a slightly higher load on the metadata DB, but the scheduler already makes a lot of calls during its regular run, so one additional query will not be a problem. That's this one, so let's go to the next.

Okay, tip number four: make your Airflow pipelines event-driven. Some of you may say that Airflow is for scheduled batch jobs, so how can we make it event-driven? But there are options. Sometimes you have a low volume of events, say two or three an hour, but you need the result of the pipeline as soon as possible once the event occurs. I had pretty much that case, and here's how I did it (see the second sketch below). Our application put the event onto Google Cloud Pub/Sub, basically an event queue; once the event landed there, it triggered a Cloud Function, and the Cloud Function triggered the Airflow DAG via the Airflow API. We could have called the API directly, but we didn't have control over the application; we only had the opportunity to put the information onto the event queue. Basically, if you create an Airflow pipeline without a schedule, you can trigger it manually, and if you have an API, you can trigger it via the API, a sort of manual run over HTTP. What are the pros of this approach? The results of the execution don't rely on a schedule, which is basically the definition of event-driven. Really, really good stuff.
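Here is the first sketch, for tip number three: generating DAGs from Airflow connections that share a prefix. The prefix, the DAG naming, the query, and the use of the common-sql provider's SQLExecuteQueryOperator are assumptions layered on top of the idea from the talk.

from datetime import datetime

from airflow import DAG
from airflow.models import Connection
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator
from airflow.utils.session import create_session

PREFIX = "ingest_mysql_"  # assumed naming convention for the relevant connections

# List matching connections straight from the metadata DB via SQLAlchemy.
with create_session() as session:
    conn_ids = [
        row.conn_id
        for row in session.query(Connection).filter(Connection.conn_id.like(f"{PREFIX}%"))
    ]

# One generated DAG per matching connection -- no external config to maintain.
for conn_id in conn_ids:
    with DAG(
        dag_id=f"export_{conn_id}",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        SQLExecuteQueryOperator(
            task_id="extract",
            conn_id=conn_id,
            sql="SELECT * FROM events WHERE created_at >= '{{ ds }}'",  # illustrative query
        )
    globals()[dag.dag_id] = dag

Adding or removing a database then comes down to adding or removing a prefixed Airflow connection; the scheduler picks up the change on its next parse.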
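And here is the second sketch, for tip number four: an unscheduled DAG plus the kind of HTTP call a Cloud Function (or any other event handler) could make against Airflow's stable REST API. The host, the credentials, and the payload shape are illustrative, and it assumes the basic-auth API backend is enabled.

from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def handle_event(**context):
    # The triggering event arrives through dag_run.conf.
    print("Processing event:", context["dag_run"].conf)


with DAG(
    dag_id="event_driven_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # no schedule: this DAG only runs when explicitly triggered
    catchup=False,
) as dag:
    PythonOperator(task_id="process", python_callable=handle_event)


def trigger_from_event(event_payload: dict) -> None:
    """What the Cloud Function would do when a Pub/Sub message arrives.
    This function lives in the Cloud Function, not in the DAG file."""
    requests.post(
        "https://<your-airflow-host>/api/v1/dags/event_driven_pipeline/dagRuns",
        json={"conf": event_payload},
        auth=("<api_user>", "<api_password>"),  # assumes basic-auth API access
        timeout=10,
    )

The event payload shows up in the DAG run's conf, so the pipeline knows exactly which event it was triggered for.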
Another way to do this would be to create a pipeline that runs once a minute and checks the number of events in the queue; if there are more than zero, it grabs all of them and runs the pipeline for each one. But in that case you end up with tons of one-minute DAG runs that flood your metadata DB with garbage runs and slow down the metadata DB and your whole Airflow experience. What are the cons? The obvious limitation of this approach is that if your pipeline gets, I don't know, thousands of events an hour, it is not applicable to Airflow and it will not work. Probably the best Airflow can give you is roughly one event every five minutes, which to my mind is pretty good. Sometimes that's enough, and if it's enough, why not do it this way?

So these are the four tips I wanted to tell you about. Let's sum them up. Generate your pipelines from a CSV file, then upgrade it to a shareable config if some of your colleagues, or non-software-engineering folks, want to add pipelines themselves. Generate them from Airflow entities if the things you would build the configuration from are credentials you already store in Airflow. And in case you need the results of your pipelines as soon as some event is triggered, make your Airflow pipelines event-driven. Yeah, so this is it. If you have any questions, add me on social networks and feel free to ask anything; I would be glad to hear it. Thank you for your time, I hope you all enjoy the conference, and see you there.

Aliaksandr Sheliutsin

Freelance Data Engineer & Architect

Aliaksandr Sheliutsin's LinkedIn account


