Transcript
This transcript was autogenerated.
My name is Agita, and today I will talk a bit about DevOps best practices and how they influence DataOps and data mesh, or the DataOps mesh concept. I will provide an overview of each of these paradigms and explore how realistic it is to combine them. It's very beginner friendly. If you are a DevOps engineer and you're curious about data best practices or DataOps, this talk is for you, and if you are a data engineer curious about DevOps best practices, it can also be useful for you.
About me: at the moment I work at VMware. I'm part of the Versatile Data Kit team. It's an open source project that automates the DataOps process, which is why I'm curious to talk about it. I'm going to touch a little bit on promoting Versatile Data Kit, because I'm on a mission to create a community around this project at the moment, and I believe it's relevant to the topic and solves some problems that I will address. My personal connection and interest in the topic is that at the moment I'm a community manager, but I actually have a DevOps background. I worked as a DevOps engineer for around five years before transitioning to community management, with a three-year burnout period when I didn't work. So I have some context and personal interest in talking about these concepts. And as a DevOps, or ex-DevOps, engineer, I'm actually very excited to learn and use my experience to envision how data engineering teams can reach their full potential by following DevOps best practices.
Learning about DataOps and data mesh as part of the project that I'm working on now made me excited about this combination. I'm a bit of an efficiency freak, and I believe this can be a very efficient combination. At the same time, I believe there are some challenges that might arise, and I'm going to address those challenges as well. One final thing: I also want to say thanks to open source in general, to the existence of open source projects, because, as I mentioned, after the burnout I realized that I want to make this world a little bit better, and for me, coming back to the DataOps or DevOps world was only possible by working on an open source project. Also, in these slides you will see some graphics that I made. I just discovered that I have this little pen and that my screen is a touch screen that works with the pen, so I created several graphics here. They are not professional at all; this is my first time drawing, but I did it for my own pleasure and fun.
So let's start. My agenda for today: I will start with an introduction to DevOps best practices, and then I will explain DataOps and how the two are correlated. Then I will give a very brief introduction to Versatile Data Kit, because I'm going to explain the DataOps lifecycle in the context of, or with an example of, that tool or framework. Then I will explain what data mesh is, and then I'm going to combine everything I just talked about into this DataOps mesh concept, which, well, basically I came up with by myself, but I have also seen that some people are talking or writing about this combination, a powerful combination in my opinion. Then I will go through the problems that I believe might come up for people who are trying to implement or combine these two concepts, and also some solutions that I came up with during the research for this talk. Finally, I will talk a little bit about open source projects and invite you to offer a little contribution.
For me, DevOps best practices are mainly focused on, or I want to highlight, two things: first automation, and second collaboration. I experienced both of these personally while working on two projects. My DevOps experience was that I first joined a smaller project of seven people, where I worked for three months, and the second project was quite large; I think there were several thousand people in that project when I joined as a DevOps engineer, and I stayed there for four years. To explain what I was doing, the first project emphasizes automation more, so I will speak about it in the context of automation, and about the second one in the context of collaboration.
When I joined the first project, what happened on my first day was that I arrived as a junior DevOps engineer and they gave me maybe seven to ten pages of A4 printed paper with instructions that I had to follow every day. My task was to support the migration of the code, or the tool the developers were working on, from an older Spring version to a newer Spring version. To support this migration I had to change several lines in some files, copy some folders, rename some things, do testing on Linux, do testing on Windows, and, well, deployment tasks that were done manually, completely manually, before I joined.
What actually happened in my first week is that I managed to go through the list of these instructions once, and by the end of the day I was really super happy. I turned to my lead and said, hey, hooray, I finished this, I went through the instructions, so it's done. And my lead turned back to me and said, you know, today we actually have to repeat this four more times. It was approximately 6:00 p.m., so it was the end of the working day, and I decided to prove myself, stay at work, and actually follow the instructions four more times. I think it took me four hours to do it the second, third and fourth time. And yeah, it got me frustrated enough, so to say, to promise myself that I would not repeat this. So the next day, when I arrived at the office, I just decided to start writing code.
I started writing code for the simpler things, like copying files and renaming folders, and then kept going, automating more and more, up to changing the lines in the code. By the end of those three months in the project, I had automated absolutely everything: I wrote code to automate every step of the process that was written across those pieces of paper. Basically, I learned, and to me it was quite obvious, that if something can be written down on paper, it can definitely be automated. The team was actually really surprised; I think for them it wasn't that obvious, but they were super happy to see how I automated the whole process, so that by the end of those three months the deployment could be done completely automatically, I think around 280 times in 2 hours or something. So yeah, literally I had nothing left to do there, and even though the team wanted to keep me around, there were just no tasks to give me.
This is how I believe DevOps supports automation. It also ties into the second point on my slides, which is faster, better, cheaper: there is no human error, everything happens way faster than any manual work, and of course it's cheaper to have automation doing things instead of people. I was using CI/CD tools, writing jobs on Jenkins at the time and making pipelines that automated this process, basically following the DevOps lifecycle. Orchestration was not part of my job in that project, actually. So on to the next project. When I finished those three months in the first project, I got invited to another project that was basically infrastructure automation, or orchestration, with Chef.
But what struck me the most in that project was that it was really focused on collaboration. At the time I was living in Latvia and the project was on site in Germany, so every Monday I was traveling to Germany and every Thursday I was traveling back. At first it didn't make any sense, but with time, and actually with experience, I understood that it makes a lot of sense to be in the same room with other people who are working on the same thing. As I mentioned, the project was huge; there were many developers and apps and things getting built, but all the people working on it, or most of them, were in the same house, in the same office. Even further, the people who were automating things, not just DevOps people but everyone doing any type of automation, were in the same room. So I got to meet the people I was working with, and it was a great experience to learn the methodology, the mindset and this collaboration, to grow in this collaborative space where, whenever something breaks, we all get together and solve the problem instantly, on the spot, as soon as possible. It just increases efficiency to the maximum when I know that if I have a problem, I can go to a particular person, or ask someone who knows someone else who might connect me directly with that person, and then we talk face to face. Of course, that was pre-Covid, and now the world is a little bit different, but I still believe it is possible to have the same level of collaboration between people if the teams are really connected. And this is going to be really relevant when I talk about data mesh, because splitting teams, or putting all the relevant people in the same room or the same team, even virtually, makes a big difference.
So, DataOps best practices. In a nutshell, DataOps is the data engineering equivalent of DevOps best practices. The idea is to speed up data deployment while improving quality. The DataOps lifecycle is similar to the DevOps lifecycle; I'm going to dig deeper into it in the next slide. Here I wish to mention that the DataOps lifecycle actually hasn't really been agreed upon in the community. There are several options and several ways to see it, and I'm going to present my personal, let's say subjective, view on it after reading and researching the topic.
The difference between traditional data engineering and DataOps is that DataOps engineers work with data in an automated way, building their workflows, or data pipelines, and jobs that run automatically. This language of pipelines and jobs is similar to, or also taken from, the DevOps world. So if previously one data team might have been dependent on another data team, or on an infrastructure team, because of the sequence of the data journey or some infrastructure permissions or accesses, now that is largely solved. DataOps is solving this, let's say decreasing dependencies, but I would not say that it completely eliminates them. I also believe that the people who are building data pipelines should be enabled to set up the infrastructure and orchestrate the entire data journey. So in the perfect-case scenario, data pipelines and infrastructure are built and maintained by the same people, which is not always the case, but I still want to keep this in mind when I'm talking about DataOps. Data teams are sometimes still separated from infrastructure teams, but I believe that shouldn't be the case, and as I talk about DataOps I take into consideration that they are doing the same thing together: orchestration and building the pipelines.
As I'm going to jump into the DataOps lifecycle and give a practical example of it with the Versatile Data Kit tool, I just wanted to highlight and explain a little bit what it does. Versatile Data Kit is an open source project, as I mentioned, and it can be found on GitHub; the code is there. Basically, it is a framework created to build, run and manage data pipelines with basic Python or SQL knowledge, on any cloud. The emphasis that is maybe relevant, and I believe relevant for this talk, is on the word basic, and it addresses, a little bit later, one of the problems that I'm going to raise.
So let me also explain what a data pipeline is. A data pipeline is a series of data processing steps scheduled and executed in a sequence, the same as in the case of DevOps, but this pipeline exists in order to ingest, load and transform the data from its source to where data analysts can work with it. In some cases the steps run in sequence; in some cases independent steps might run in parallel as well.
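To make this concrete, here is a minimal sketch of such a pipeline in plain Python. The function names, the CSV source and the SQLite destination are all hypothetical; they just illustrate extract, transform and load steps running in sequence.

```python
import csv
import sqlite3

# Hypothetical example: extract rows from a CSV source,
# transform them, and load them where analysts can query them.

def extract(path):
    # Read raw records from the source system (here: a CSV file).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Clean and reshape the data, e.g. normalize names and cast types.
    return [
        {"customer": r["customer"].strip().lower(), "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, db_path):
    # Load the transformed rows into an analytics database (here: SQLite).
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales (customer, amount) VALUES (:customer, :amount)", rows
    )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    # The three steps run in sequence: extract -> transform -> load.
    load(transform(extract("sales.csv")), "analytics.db")
```

In a real DataOps setup, a scheduler or a framework like VDK would run steps like these automatically instead of a person invoking the script by hand.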
Now I will go to the DataOps lifecycle to explain how it all comes together. So basically, this is the DataOps lifecycle as I see it: plan, code, orchestrate, test, deploy, execute, monitor and feedback, and then it cycles again. It is similar to the DevOps lifecycle, in my opinion, but a little bit different, so I'm going to go through each step and you can see the difference for yourself. The most important part of the lifecycle, I believe, is to plan. As we know, failing to plan is planning to fail, so this is the crucial step where the business value, users and requirements are gathered, the tools are selected, and everything that needs to be answered is answered. If this plan is solid, we can proceed to the second step.
The second step is code. Coding in DataOps means writing the code for a pipeline to ingest and transform the data, and to test it locally. In our case, Versatile Data Kit automates this part by introducing a software development kit. The code can be written in Python and SQL interchangeably, as I mentioned, and is used to create data jobs with steps that run in a specified sequence. A database is also selected and configured, and data jobs are executed locally to test the code and make sure that data is ingested and transformed properly. A simple command like vdk run will run the whole pipeline locally and give me the output, and I can check whether the desirable outcome is there.
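As a rough illustration of what such a step can look like with VDK's SDK, here is a small ingestion step written from my reading of the VDK documentation; treat the exact method names, the example table and the payload as assumptions to check against the current API.

```python
# 10_ingest_users.py - one step of a VDK data job (hypothetical example).
from vdk.api.job_input import IJobInput

def run(job_input: IJobInput):
    # VDK discovers this run() function and calls it with the job_input API.
    payload = {"id": 1, "name": "Ada", "signup_date": "2022-01-01"}

    # Queue one record for ingestion into a destination table;
    # the configured ingestion plugin decides where it actually lands.
    job_input.send_object_for_ingestion(
        payload=payload,
        destination_table="users",
    )
```

Running vdk run inside the data job directory would then execute this step locally, together with any other steps of the job.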
There are many other tools that automate this step, maybe using other languages, or even taking the coding part out of the equation entirely and providing an interface where data practitioners can do this by clicking buttons. For some people that might be very helpful, and for some it might be frustrating that things cannot be debugged as easily.
The third step is orchestrate. This is a crucial operations part of the cycle. The step is actually independent from the code step, but it definitely needs to be brought in before the testing part. It can also be the case that an infrastructure team sets up the orchestration and the data team does the coding part, but as I said previously, it's better when both are done by the same team, to have full ownership over the pipelines. The infrastructure is built here for the data to go through the environments, like staging, where it gets tested and further promoted to a production or pre-production environment and used for analysis. Typically the code is pushed to Git, and then data is ingested into staging before it gets deployed to production. At this point, scheduling and orchestration of the pipeline are configured in the configuration files. In the case of VDK, the orchestration tooling can schedule jobs, manage workflows and coordinate dependencies among tasks. VDK has implemented a Control Service component that takes care of the infrastructure setup. So besides automating the code stage with the SDK, it also takes part in automating orchestration by introducing this Control Service that creates the infrastructure.
The next step is testing. Once a data job runs on staging, the data can be tested. There are alerts for any user or system errors, which can be tracked end to end. Testing ensures that no existing functionality is impacted, covers things that maybe haven't been considered in the workflow, the what-would-happen-if scenarios, and any bugs or discrepancies are fixed at this stage. The deploy stage comes after validating the data and resolving issues: the job can then be deployed to production. Deployments can be fully automated, so if the data gets to staging and is tested automatically, it can automatically get deployed to production as well, or another approach can be introduced.
Then the execute step: the pipeline is now running automatically based on the configuration. Automated data lifecycle processing is in place, which schedules ingestion, transformation, testing and monitoring, and reports can now be generated. The VDK Control Service has functionality for both deployment and execution of the jobs. Monitoring is in place to track failures as quickly as possible. Usually automation is set up to alert on the data jobs: if a data job fails, an alert is sent to a user-specified email containing the data job name and the type of error. With VDK it is also possible to detect whether it's a user error or a system error, such as a configuration error or a platform error, and that is also included in the alert message. Monitoring then provides this information to help users troubleshoot and fix their pipelines as soon as they get the alert. The final step is feedback.
At this stage, additional requirements might arise: some data might be missing, or something might need to be improved. When this feedback is collected, the planning can be done again and the cycle repeats.
Here I sketched, with my little pen, some of the components of VDK that I was mentioning in the previous slide, as a visual representation, just because I understood for myself that I would need something like this in order to understand it better. It shows some of the components, but not all of them. Basically, there is the data part and the Ops part, combined in the Versatile Data Kit project. Data jobs run in Python and SQL, and there is a command line interface where I can run jobs locally and test them. The data jobs follow their steps, and the steps are prefixed so they run in alphanumerical order; the name of a file also defines the sequence in which it runs. VDK does ETL or ELT automation, that is, extract, transform and load, by providing plugins and templates and in general automating these parts, so no really in-depth knowledge is necessary to do them. So basically the SDK is for running these jobs locally, and the other, Ops, component is the scheduling and execution of these jobs in a Kubernetes environment using Git: we upload the code to Git, then we deploy it, and from there we can set secrets if necessary and monitor the pipeline.
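To illustrate the step naming convention, here is a hypothetical layout of a small data job, plus one transform step that runs SQL through the job_input API; the directory name, file names and table are invented for the example, and the exact API should be checked against the VDK docs.

```python
# Hypothetical data job directory "example-job/":
#   10_ingest_users.py   <- runs first (alphanumerical prefix 10)
#   20_transform.py      <- runs second (prefix 20)
#   config.ini           <- job configuration (owner team, schedule, ...)
#
# 20_transform.py - a transform step that executes SQL against the
# database configured for the job.
from vdk.api.job_input import IJobInput

def run(job_input: IJobInput):
    # Build a cleaned-up table on top of the ingested raw data.
    job_input.execute_query(
        """
        CREATE TABLE IF NOT EXISTS clean_users AS
        SELECT id, LOWER(name) AS name, signup_date
        FROM users
        """
    )
```

Because the prefixes sort alphabetically, the ingest step runs before the transform step without any extra orchestration code inside the job itself.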
And now I jump into data mesh. Data mesh is actually quite a new concept, I would say young, way younger than DevOps and younger than DataOps as well, but nevertheless it's extremely popular now; I think people really consider it a good practice. It was introduced by Zhamak Dehghani in 2019. Basically, it's a type of data platform architecture that leverages a domain-oriented, self-serve design to embrace the ubiquity of data in enterprises. In this case, a domain is a business area, meaning each business area owns its data, and data mesh fosters data ownership among data owners, who are held accountable for delivering their data as products. This is where data as a product comes in. Each domain is responsible for managing its pipelines, and once the data has been served and transformed by a given domain, the domain owners can then leverage the data for their analytics or operational needs. As the data mesh paradigm argues, data architectures can be most easily scaled by being broken down into smaller, domain-oriented components. In short, data mesh means that the data owners, or domain owners, the people who are directly involved with particular data, are also building and maintaining their data pipelines. Users become the owners, and so the data becomes a product and is self-served. This alone does not completely eliminate dependencies, because if we implement data mesh and the orchestration, testing and CI/CD are set up and managed by another team, then the domain owners will still depend on the infrastructure engineers who support the orchestration.
This is the reason why I decided to combine these two paradigms, or concepts. So here is data mesh on the left, adding governance, self-service, data as a product and domain ownership to the data cycle, and the DataOps part of it, which provides CI/CD, testing, observability and orchestration. This is not my original image; I found it in an article and then turned it into my own graphics, but this is how I was also inspired to think about the combination of DataOps and data mesh, which I strongly believe enables true collaboration and powers up the data engineering process by completely eliminating dependencies. So if DataOps means the engineer owns the data pipelines and is enabled to have ownership over the infrastructure and orchestration of the data cycle, and data mesh means the domain owners own the data as well as the pipelines, and in this case the infrastructure too, then I believe the result will make data-driven projects as fast as possible, with no dependencies. I believe this is already happening in the DevOps world, because sometimes full stack developers or DevOps engineers can set up the infrastructure, build the pipelines, or be in the same room with the developers who are directly building the product.
So how does VDK support data mesh? Well, data mesh is an organizational concept; it is implemented by managing the teams and enabling them to work fully with their data. Besides this automation of the process, Versatile Data Kit introduces the functionality of creating teams: as a data job is created, the prerequisite is to also specify which team is going to be working on it. Besides the team functionality, VDK also has templates and plugins to support these teams, so that, in order not to reinvent the wheel each time, teams can share the work they've done with each other and collaborate more efficiently. These are some of the functionalities that can support an efficient DataOps mesh implementation.
So what can go wrong? Well, I believe two things can go wrong, or possibly prevent the DataOps mesh from happening, and the first one is skills. While I was researching this, I was thinking that the skills required to execute DataOps combined with data mesh successfully are the following. The people in each domain team should have knowledge about the domain. Then they have to have data engineering skills, like Python or SQL or anything else, depending on the tool they're using to write their pipelines. And the third thing they need to know is how to set up their infrastructure, so basically DevOps skills.
As I was creating these slides, my question was whether it's easy, or even possible, to find a person who has all the skills that are necessary for this, or whether it's possible to train them, or even whether they would be willing to learn. But as I presented this talk, this presentation and actually this topic in general, to my team, it became obvious that the skills issue is actually solved by VDK or other tools that can automate the data engineering and operations process. As I said in the beginning, basic Python and SQL skills are enough to build, run and manage data pipelines with VDK, and as far as I know, also from my personal DevOps experience, we learn every day as DevOps engineers; we use new tools and new languages every day. So any person with understanding and the capability of learning could, I believe, create a simple pipeline to start with, and then more difficult pipelines with time, just by following the documentation. And actually not just create a pipeline, but also set up the infrastructure by following the documentation. This could be just one of many solutions to implement DataOps and data mesh in real life, but it is one that I see creating this possibility.
The second thing that I wanted to highlight, which came to my mind as a possible issue, is that, in my opinion, trust is important, because domains will now have full ownership over their data and infrastructure. While it is possibly more reliable to have the infrastructure set up separately from the domain and simply give access to the data, some companies, some leads or some management might not have enough trust to allow their domain owners to own the infrastructure as well, because of the risks that come with full power over it, like accidental deletion and other kinds of human error. I believe this can be solved by implementing some rules and functionality too, but still, this question remains somewhat unanswered to me, and I'm curious to see how DataOps and data mesh will evolve in the future.
With these questions, that kind of sums up my side of the DataOps and data mesh story. To close this talk, I wanted to thank you and say that I'm really open to hearing feedback. You can find me by my name, Agita Yancem, on LinkedIn or any social media; I'm also on Twitter. Connect with me; I'm really happy to explore these concepts even more and find out if they work for some people.
Now I want to dedicate a little moment to speak about open source projects. I believe that open source projects like Versatile Data Kit, and others, enable us not to pay a lot of money for some functionality, so we actually get free tools that we can use. I also believe they deserve the visibility that is crucial for open source tools and projects to get more known and used, and to gain potential contributors. What I want to say is that a little support, a little contribution like giving a star, can go a long way. So I invite you, and I will be really grateful, to support me, my team, and what we do at the moment by creating this tool, by giving us a star. You can just scan the QR code, or google Versatile Data Kit, and in the top right corner there is a little star. If you would just spend a minute or so doing this, it would support me greatly and enable more people to find the project, to use it, contribute to it, or at least try it out.
So yeah, that actually concludes my talk. I wanted to say thank you so much for taking the time to be with me; I deeply appreciate having the opportunity to be at Conf42. I also want to thank the organizers; this conference is organized very professionally, and so far I feel really positive about how it is managed. So, as I said, I welcome feedback, connect with me, my name is Agita, and thank you so much and see you next time. Bye.