Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, thank you for joining my talk: four tips that help you leverage your Airflow pipelines. Today I'm going to walk you through four real-world cases where I creatively solved nonstandard data engineering tasks with Airflow.
So yeah, let's go. I'm not going to spend time on who I am, because I believe that was covered in the intro, and we only have 30 minutes. So let's go straight to the tips.

Tip number one. I know that most of you, even with very little data engineering experience, have faced the situation where you have a lot of similar pipelines which differ from each other only in small details: I don't know, IDs, start dates, which API is called, and so on and so forth. You don't need to create separate pipelines for all of them; we can auto-generate the pipelines using some configuration.
In this particular case we can create typical DAGs based on a CSV file. My client had a lot of different decentralized organizations we needed to grab data from. The API was the same; we just had a different ID for the data we should download, and the data for each organization started from a different start date. So I added all the information about these decentralized organizations to the same CSV file and created a DAG factory that generates an individual DAG for each of these pipelines.
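To make that concrete, here is a minimal sketch of such a DAG factory, assuming Airflow 2.4+; the file path, column names, and the download task are placeholders of mine, not the exact code from that project.

```python
import csv
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

CONFIG_PATH = "/opt/airflow/dags/config/entities.csv"  # hypothetical location


def download_entity_data(entity_id: str, **context):
    # Placeholder for the real API call that downloads data for one entity.
    print(f"Downloading data for entity {entity_id}")


def build_dag(entity_id: str, start_date: str) -> DAG:
    with DAG(
        dag_id=f"download_{entity_id}",
        start_date=datetime.fromisoformat(start_date),
        schedule="@daily",
        catchup=True,
    ) as dag:
        PythonOperator(
            task_id="download",
            python_callable=download_entity_data,
            op_kwargs={"entity_id": entity_id},
        )
    return dag


# One CSV row per decentralized organization: entity_id,start_date
with open(CONFIG_PATH) as f:
    for row in csv.DictReader(f):
        dag = build_dag(row["entity_id"], row["start_date"])
        # Register each generated DAG in the module globals so the
        # scheduler discovers it as a separate DAG.
        globals()[dag.dag_id] = dag
```

Adding a new organization is then just one more line in the CSV, and the scheduler picks it up on the next parse.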
What are the pros of this approach? It's really easy to add a new entity: adding a new entity means adding a new DAG, and it's just one line in the config with the start date, the ID, and so on and so forth, and the scheduler will pick it up for you. Less code, obviously: we don't need to write any additional code. The DAG factory just creates a DAG for you, so there is no need for new files; you add one line to a single configuration file and it works. It's also flexible: in my view it's flexible enough that we can set a start date for each particular pipeline and make sure we don't create any garbage DAG runs that add no value and don't download any information.
Isolation: we can control the start date, as I said. Why did I add this point? Because we also had the option to create one single DAG for all of these entities and load the information in that single DAG. Obviously, for some of the pipelines, like Optimism, that would mean a lot of API calls returning empty data, which we don't really want. And as soon as we add something with a start date in the past, we would have to re-trigger all of those DAG runs in the past to download the information, which is not really great.
So having a single DAG per entity allows you to have more control over each pipeline. It's also covered by data validation tests, which I didn't mention yet. We have a separate pipeline for data validation, and it wasn't referenced anywhere in the config. But adding one line, one decentralized organization, to this config was automatically picked up by the data validation pipelines and included in the data validation tests. So if we add any new decentralized organization, the data downloaded for it is already validated by our data validation tests without any additional effort, which is really great.

Cons: it may require some additional understanding of Airflow. There are a lot of ways to break something in Airflow, or you might not have the patience to let the scheduler render all of these pipelines.
So yeah, you may expect some complications with this approach. Another con is kind-of-similar pipelines: sometimes you may think that your pipelines are similar, but they're not really similar. They have a lot of different conditions and so on and so forth, so you may end up with a DAG factory that is a spaghetti bowl with a lot of if conditions. That may create more confusion than it saves, so sometimes it's really better to create separate pipelines for these kind-of-similar pipelines. Overall, it's a good approach. I would recommend everyone to at least try it, because having this experience, having this tool under your belt, is really, really useful.
So let's go further. Tip number two: you can create typical DAGs based on a shareable config. Basically it's an upgrade of the previous case, but in my case my client wanted to add configuration and add pipelines on his own. I gave him this opportunity by creating a shareable config as a Google Sheet. Basically, what I did in the Airflow code is grab this data via the Google Sheets API and create pipelines based on it. Mostly, that's it.
But later I made an upgrade to this pipeline: I download the data from this API, validate it, load it to a local file, and then ask the scheduler to create pipelines based on the information in that local file. Why I did that, I'll tell you a bit later.
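Here is a minimal sketch of that upgrade, assuming a service account with read access to the sheet; the sheet ID, range, file paths, and validation rules are placeholders of mine. The refresh can run in its own small maintenance DAG, while the DAG factory itself only reads the local cache.

```python
import json

from google.oauth2 import service_account
from googleapiclient.discovery import build

SHEET_ID = "your-google-sheet-id"                         # hypothetical
LOCAL_CONFIG = "/opt/airflow/dags/config/pipelines.json"  # hypothetical


def refresh_config():
    """Fetch the shareable config, validate it, and cache it locally."""
    creds = service_account.Credentials.from_service_account_file(
        "/opt/airflow/secrets/sheets-sa.json",
        scopes=["https://www.googleapis.com/auth/spreadsheets.readonly"],
    )
    sheets = build("sheets", "v4", credentials=creds)
    rows = (
        sheets.spreadsheets()
        .values()
        .get(spreadsheetId=SHEET_ID, range="pipelines!A2:B")
        .execute()
        .get("values", [])
    )

    # Validate user input before it ever reaches the scheduler:
    # skip incomplete or malformed rows instead of passing garbage on.
    validated = [
        {"entity_id": row[0], "start_date": row[1]}
        for row in rows
        if len(row) >= 2 and row[0] and row[1]
    ]

    with open(LOCAL_CONFIG, "w") as f:
        json.dump(validated, f)


# The DAG factory file then only reads LOCAL_CONFIG, so DAG parsing never
# waits on the Google Sheets API:
#     with open(LOCAL_CONFIG) as f:
#         for entry in json.load(f):
#             ...build one DAG per entry, as in tip number one...
```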
So let's go to the cons, to the pros, sorry, pros. Obviously, other people can create Airflow DAGs for you. You don't need to maintain all the configuration: the people who understand how to do this can do it for you, and you can focus on other things instead of switching focus to maintaining these pipelines, which is a really good thing if you ask me. Changes are available without deployment: in the previous approach you were required to commit changes to the configuration file, somehow deploy that to the Airflow server, and only after that would it be rendered and created on your Airflow server. Here, as soon as you add one line to the Google Sheet config, it is picked up by the scheduler and rendered on your Airflow server. That's one step less. Great. Yeah, there are a couple of cons, and they mostly come down to a lot of validation work for this option.
First of all, performance. Before doing this, you have to make sure the API you are grabbing data from responds with a small delay, because two or three seconds may be crucial for the scheduler; you can end up crashing your scheduler and stopping your running Airflow pipelines, which is not great.
You also need to handle user experience. Those of you who haven't had the opportunity to work with users may be surprised how easily a user can break everything. So obviously you need documentation that explains to your users how to add pipelines, and it had better be extensive and understandable.
Then input data validation: the step I told you about earlier, grabbing data from the API, validating it, loading it to a local file, and building your pipelines based on that file, is really there for validating the data. Even though you wrote the documentation and invested a lot of your time in it, a user may skip it, may not understand it right, and so on and so forth. So you'd better validate the data and not pass garbage to the scheduler, because it may break your scheduler, which is not a good thing. And even though you have documentation and you have your validation tests, you're also human, you can miss something, and you will find the user who breaks your cluster. So you'd better have error notifications so you can react as soon as possible.
Overall, a good option that may save you a lot of time, but you'd better invest time in a lot of safety checks to make sure that everything works fine. Let's go to tip number three: how to generate DAGs based on your Airflow entities. Basically, it's a combination of both of the previous cases.
Why can it be useful? In my experience, I had a case where my client had a lot of different MySQL databases. They were created, dropped, and so on and so forth, and I always had to create new pipelines based on a different connection to each of these MySQL databases, grab the data, drop the pipeline, and so on and so forth. A lot of maintenance. But as you may understand, adding credentials to a configuration file is not a good thing. So I decided to do it like this: I created an Airflow connection to each MySQL database with a certain prefix, and in Airflow you can use SQLAlchemy to query the metadata DB. Then you list all the connections you have, filter only the ones with that certain prefix, and create Airflow pipelines based on all the information you found.
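A minimal sketch of that pattern, assuming Airflow 2.x; the prefix is a placeholder, and in the real pipeline the EmptyOperator would be replaced by whatever operator actually reads from MySQL using that connection.

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Connection
from airflow.operators.empty import EmptyOperator
from airflow.utils.session import create_session

PREFIX = "mysql_source_"  # hypothetical naming convention for the connections

# Query the metadata DB through Airflow's own SQLAlchemy session and keep
# only the connections that follow the naming convention.
with create_session() as session:
    conn_ids = [
        conn.conn_id
        for conn in session.query(Connection).filter(
            Connection.conn_id.like(f"{PREFIX}%")
        )
    ]

# One DAG per matching connection; no separate config file to maintain.
for conn_id in conn_ids:
    with DAG(
        dag_id=f"extract_{conn_id}",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        EmptyOperator(task_id="extract")  # stand-in for the real MySQL extract
    globals()[dag.dag_id] = dag
```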
As it is, it's a really good thing: you don't have to maintain any configuration. All the configuration is already in your metadata DB, and if the connection is fine, everything should work fine. Pros: no need to handle external API calls. No need to handle any configuration or external APIs; everything is in your metadata DB, and one way or another the scheduler already queries your metadata DB, and everything is in it. What are the cons? The configuration can be lost. I don't know if you've ever faced the situation where your Airflow metadata database was lost; it's not really great. Or, for example, you have to migrate your Airflow from one server to another, which is also a situation where the metadata DB can be lost. Once you lose it, you also lose all of your connections and all of your configuration. So it's better to have a backup of your configuration. There is also a slightly higher load on the metadata DB, but one way or another the scheduler makes a lot of calls during its regular run, so one additional call will not be a problem. Yeah, that's it for this one. So let's go next.
Okay, tip number four: make your Airflow pipelines event-driven. Some of you may say that Airflow is for scheduled batch jobs, so how can we make it event-driven? But there are options. Sometimes you have a low volume of events, say a few, like two or three an hour, but you need the result of the pipeline as soon as possible once the event occurs. I had pretty much that case, and here is how I did it. Our application put an event into Google Cloud Pub/Sub, which is basically just an event queue, and once the event was in it, it triggered a Cloud Function. The Cloud Function triggered an Airflow DAG via the Airflow API. We could have made it more direct, but we didn't have control over the application; we only had the opportunity to put the information into the event queue. Basically, if you create an Airflow pipeline without a schedule, you can trigger it manually. And if you have an API, you can trigger it via the API, which is sort of a manual API call.
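Here is a minimal sketch of that Cloud Function, assuming an Airflow 2.x deployment with the stable REST API and basic auth enabled (on Cloud Composer you would authenticate through IAM instead); the environment variables, topic wiring, and DAG id are placeholders of mine.

```python
import base64
import json
import os

import requests

AIRFLOW_URL = os.environ["AIRFLOW_URL"]    # e.g. https://airflow.example.com
AIRFLOW_AUTH = (os.environ["AIRFLOW_USER"], os.environ["AIRFLOW_PASSWORD"])
DAG_ID = "process_incoming_event"          # hypothetical DAG defined with schedule=None


def trigger_dag(event, context):
    """Pub/Sub-triggered Cloud Function entry point."""
    # Pub/Sub delivers the message payload base64-encoded.
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    # POST /api/v1/dags/{dag_id}/dagRuns starts a new run, passing the event as conf.
    response = requests.post(
        f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
        auth=AIRFLOW_AUTH,
        json={"conf": payload},
        timeout=30,
    )
    response.raise_for_status()
```

On the Airflow side the DAG is defined with schedule=None, so the only runs it ever gets are the ones triggered by these API calls.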
What are the pros of this approach? The result of the execution does not rely on a schedule; basically, that's the definition of event-driven, which is a really, really good thing. Another way to do this would be to create a pipeline that runs once a minute and checks the number of events in the queue, and if there are more than zero, grabs all of them and runs the pipeline for each of them. But in that case you end up having tons of one-minute DAG runs, which will overload your metadata DB with tons of garbage DAG runs and end up slowing down the metadata DB and your whole Airflow experience.
What are the cons? The obvious limitation of this approach is that if your pipeline has, I don't know, thousands of events an hour, it will not be a good fit for Airflow and it will not work. Probably the best Airflow can give you is about one event every five minutes, which is pretty good in my mind. Sometimes that's enough, and if it's enough, why can't we do it this way?
So basically, these are the four tips I wanted to tell you about. Let's sum them all up. Create your pipelines based on a CSV file, then upgrade it to a shareable config if some of your colleagues, non-developers or non-software-engineers, want to add pipelines themselves. Create pipelines from your Airflow entities if the things you need to build them from are sensitive, like credentials, and are already stored in Airflow. And in case you need the results of your pipelines as soon as some event is triggered, you can make your Airflow pipelines event-driven. Yeah.
So this is it. If you have any questions, add me on social networks and feel free to ask anything; I would be glad to hear from you. So yeah, this is it. Thank you for your time. Hope you all enjoy the conference, and see you there.