Abstract
DevOps revolutionized software engineering by adopting agile and lean practices and fostering collaboration. The same need exists in data engineering.
In this talk, Antoni will go over how to adopt the best DevOps practices in the space of data engineering, and the challenges in adopting them given the different skill sets and needs of data engineers.
- What is the API for Data?
- What types of SLOs and SLAs do data engineers need to track?
- How do we adapt and automate the DevOps cycle (plan, code, build, test, release, deploy, operate, and monitor) for data?
Those are challenging questions, and the data engineering space does not have a good answer yet.
Antoni will demonstrate how a new open-source project, Versatile Data Kit, answers those questions and helps introduce DevOps practices in data engineering.
Transcript
Let's talk DataOps. DevOps revolutionized software engineering by adopting agile and lean practices and fostering collaboration, and the same needs to happen in data engineering. That's what I'd like to talk about today: what we can learn from DevOps, and how we can adapt and adopt it for data to make DataOps. Now, data is in our everyday lives. Everything around us relies on it: music, movies, healthcare, shopping, travel, school, university. And it is no exaggeration to say that every company needs to be a data company; those that are not will not be very successful, not for long. And it's no secret that our efficiency in using data is still pretty bad. We are nowhere near as efficient at creating data products as we are at creating software products. There are way too many examples of failed data projects. Gartner and other sources regularly publish statistics showing that easily 60, 70, or 80% of data and AI projects fail to reach production.
That's pretty bad. DataOps promises to fix that. DataOps promises to ensure the most efficient creation of business value from data. That is probably the only thing that people who study DataOps can agree on: its goal. But there is a wide variety of ways to achieve it, of what to do and what it even means.
And there is still no converging opinion about what the solution should be. There are some common themes, though, and one of them is that we can learn from what has succeeded in DevOps and try to adapt it for data, because data is not the same. And we will show how we do that today. If we're going to talk about DataOps, we should mention DevOps. It has a very similar goal. At its simplest, the goal is to ensure the most efficient software development from an idea to reality, to a software product.
Again, how it's done varies, though there are now much better established best practices. Still, depending on whom you ask, you may get a completely different idea of how to do DevOps. There really are a lot of best practices in the DevOps community, though, and we can borrow, apply, adopt, and most importantly adapt them for the data community. Before that, let's look at the problem from the perspectives of the different stakeholders involved; we'll group them into two categories for simplification purposes.
On one hand, we have the first hero of our story: the infrastructure and operations team. When I talk about infrastructure, I mean the people who understand how to provision containers and virtual machines, how to set up firewalls and networks, and how to provision a Spark or Kafka cluster. They understand the performance implications of that infrastructure: maybe it's better to use small messages with Kafka and big files with HDFS, for example.
And the operations people are those who know the best operations and DevOps practices: how to build continuous integration and continuous delivery, and how to ensure code is versioned and traceable.
Their goal, ultimately, is to make sure everything works as expected; they need to optimize for reliability and availability. The other hero of our story is the data practitioners, the people who actually create the end products from data. Those could be data engineers, data scientists, data analysts, analytics engineers, ML engineers; there are a lot of titles. They have the domain and business knowledge and are responsible for answering analysis requests from different stakeholders, such as marketing or executives, so that the company can make correct decisions quickly or create compelling products. They tend to have more domain knowledge: they understand how to build data projects, how to join different datasets and tables together, how to report numbers, and how to create predictive models and recommendation systems. And their focus is on optimizing for agility.
Nowadays businesses need to move at very high speed, and if the data cannot catch up, the business will be forced not to use data, and will probably fail. In some ways, the goals of these two personas are fairly conflicting, as they tend to be between product and operations teams, because their priorities conflict. And this is fairly similar to what we observed between development and operations before DevOps, 20 or 30 years ago. Here, the data person wants to optimize the quality and value of the data, while operations would like to optimize the availability of that data. And how do we solve that? There is often no clear separation or clarity between the teams' responsibilities, and the operations team ends up debugging data engineering work while data engineers need to provision infrastructure. Well, let's see how we can adopt and adapt the DevOps lessons. One of the particular lessons we need to learn from DevOps is that we need to start treating data as a product, and not just as a side effect, all the way from the source to the end, be it a report or another product. And Versatile Data Kit is a framework aimed at helping both the data teams and the infrastructure teams, so that everyone knows what needs to be done and everyone is responsible for their own part. It enables easy contribution to create new data projects, and it separates ownership. It does this by introducing two high-level concepts.
One is automating and abstracting the data journey, and the other is automating and abstracting the DevOps, or DataOps, cycle. Automating and abstracting the data journey is primarily the responsibility of the Versatile Data Kit SDK, which is a library for automating data extraction, transformation, and loading, together with a very versatile plugin framework that allows users to extend it according to their specific requirements. So the people who know best, for example, that you cannot send big messages into Kafka can create very simple plugins that automatically chunk the data before it is even sent.
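To make the "library for data extraction, transformation, and loading" part a bit more concrete, here is a minimal sketch of a data job step of the kind the SDK runs. The overall shape, a run function receiving a job_input object, follows Versatile Data Kit's data job model as I understand it, but the specific method calls, query, and table names are illustrative assumptions rather than verified API details.

```python
# Minimal sketch of a Versatile Data Kit data job step. The job_input helper
# methods used here are assumptions based on the project's documented job
# interface; check the versatile-data-kit docs for the exact signatures.


def run(job_input):
    # Extract: read rows from a (hypothetical) source table.
    rows = job_input.execute_query(
        "SELECT id, amount FROM raw_sales WHERE amount > 0"
    )

    # Transform: plain Python, nothing framework-specific.
    total = sum(amount for _id, amount in rows)

    # Load: send the result for ingestion into a (hypothetical) destination table.
    job_input.send_object_for_ingestion(
        payload={"metric": "total_sales", "value": total},
        destination_table="sales_metrics",
    )
```

The step contains only the domain logic; how the query reaches the database and how the payload reaches its destination is handled by the SDK and by whatever plugins the infrastructure team has installed.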
The second high-level concept is the control service, which automates and abstracts the DevOps cycle. It allows users to create, deploy, and manage those data jobs in a runtime environment in an automated way, with automatic versioning and deployment.
At the same time, it allows the DevOps people in the company, who know best how to build CI/CD, to extend it with their own knowledge and the best practices they want to apply in their own organization. Well, let's look at an example. We're talking about automating and abstracting the DevOps journey and the data journey. Here we can see how an infrastructure team can, for example, intercept through plugins every single SQL query being sent to a database before it even leaves the job, including when the job is being run locally during development and debugging, and apply some kind of optimization or other processing.
In this concrete example, in the picture, there's a plugin that collects lineage information to enable easier troubleshooting and inspection of jobs, so that one can see where the data comes from. But this is just an example, really. The sky is the limit: the infrastructure team can build any kind of plugin, and plugins can be applied across all jobs. The data teams can also create their own plugins.
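As a rough illustration of such a plugin (the import path, hook name, and cursor methods below are assumptions for this sketch and may not match the current versatile-data-kit plugin API exactly), a query-auditing plugin could hook into query execution along these lines:

```python
# Hypothetical sketch of a VDK-style plugin that observes every SQL query a data
# job sends. Hook name and cursor API are assumptions for illustration only.
import logging

from vdk.api.plugin.hook_markers import hookimpl

log = logging.getLogger(__name__)


class QueryAuditPlugin:
    @hookimpl
    def db_connection_decorate_operation(self, decoration_cursor):
        # Inspect (or rewrite) the SQL before it is executed, for example to
        # record which tables a job reads and writes for lineage purposes.
        query = decoration_cursor.get_managed_operation().get_operation()
        log.info("About to execute query: %s", query)
```

A plugin like this is installed once by the infrastructure team and then applies to every data job, locally and in the deployed runtime, which is exactly the separation of ownership described above.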
And then let's look at the DevOps cycle. Can we do something to automate how the development process moves through the DevOps cycle? What Versatile Data Kit, and the control service in particular, does is flatten it. It is important to provide a self-service environment in which data engineers can create end-to-end data pipelines. This self-service environment automates a large part of the DevOps cycle, so as far as the data engineers or the data team are concerned, they may just click a single deploy button or run one CLI deploy command, and building, testing, releasing, and deploying happen automatically.
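A minimal sketch of that self-service flow from the command line could look like the following. The subcommand names follow the VDK CLI as I understand it, but exact flags, prompts, and defaults vary by version, so treat the details as illustrative.

```sh
# Illustrative self-service flow (flags and prompts may differ between versions).
vdk create                # scaffold a new data job locally
vdk run my-data-job/      # run and debug the job on your own machine
vdk deploy                # hand the job to the control service, which builds,
                          # releases, and deploys it automatically
```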
At the same time, we need to enable either a central data team or, to go with our personas, the operations or infrastructure team to ensure that there is consistency and correctness across those data jobs, and that all the compliance, quality, and company policies that are in place are followed. Since those are the people with the best knowledge of how to implement these kinds of policies, and especially DevOps best practices, correctly, there is a way for them to enforce this across all jobs. A quick example again: on the DevOps side, the extension points are essentially Docker images that can be extended. One can, for example, extend the build-and-test phase by extending the default job builder image, and let's say they add some central consistency tests to ensure quality.
Or say we want to make sure that no job can execute arbitrary files, so we remove all execution privileges. That's very easy to do: it's a pretty simple Docker image, which can be configured when installing the Versatile Data Kit control service, as sketched below.
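Purely as an illustration of that idea (the base image name, paths, and tooling below are invented for the sketch and are not the documented VDK builder contract), such an extended job builder image could look roughly like this:

```dockerfile
# Hypothetical extension of a default job builder image. The base image name and
# the assumption that job files end up under /job are illustrative only.
FROM registry.example.com/vdk/job-builder:latest

# Organization-wide quality gate: run central consistency tests during the build.
RUN pip install --no-cache-dir pytest
COPY central_tests/ /central_tests/
RUN python -m pytest /central_tests/

# Company policy: data job files must not be directly executable.
RUN find /job -type f -exec chmod a-x {} +
```

The policy lives in one image owned by the operations team and is applied to every data job build, without the data engineers having to change anything in their jobs.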
That's our intro to Versatile Data Kit. If you want to learn more, talk with us about these problems, and try solving them together with us, contact us through any of our channels. The easiest one is through GitHub, in the vmware/versatile-data-kit repository. Thank you.