Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, my name is Antoni Ivanov, and today
I'm going to talk about applying DevOps practices to data and
data as a service. Before we
dive in, let's talk a little
bit about Moneyball.
Have you watched Moneyball? It's a really great movie and book
about the Oakland Athletics baseball team and its general manager
Billy Beane. The story of Moneyball is a real
one. It starts in the early 2000s with the Oakland Athletics, a Major League Baseball team
faced with budget constraints that severely limited its ability
to compete for top talent.
So they were at the bottom, right? Beane cooperated
with Paul DePodesta, a Harvard graduate with a degree in
economics, to apply statistical analysis to player recruitment.
That means that he was looking at players that most
of the other scouts were missing, because
he was using statistics instead of intuition to
make his decisions. Not only that, but during games
he was prioritizing on-base percentage
over batting average, understanding that getting on base
was often more valuable than hitting home runs.
This totally changed the team: in the
2002 season the Athletics went on to win 20
consecutive games, setting an American League record.
This created a paradigm shift in all of baseball,
where analytics and data are now actively
used by all successful teams.
This just goes to show that good decisions, correct decisions,
are generally data-driven decisions, and intuition-based
decisions generally don't make it.
Today's agenda: we're going to talk
about data, applications and APIs for data; what
SLOs are and SLOs for data; the DevOps cycle for data; and
the open source product that I work on, Versatile Data Kit, and how it
tries to solve some of these challenges
for data engineers. So what are applications?
There are many different types of applications, right? We have ecommerce
applications where people want to buy things.
Of course, there are a lot of mobile applications,
for iPhone and Android, and different IT
systems that are used to track different business processes.
All those applications have one thing in common:
every single application generates data.
Huge amounts of data. Things like databases,
which we use to store data about customers,
right? Or log files, which
we use to store all our operational data; click streams with
events about what is happening and how the customer is using the product; and metrics.
Operational metrics help us maintain our
services. One thing that most people don't realize is that
this data is extremely useful and is
actually being used to create data applications.
Data analytics, for example:
we use all kinds of usage data to be able
to better understand how our customers interact with
our products, in order to make them better.
We use business intelligence to
understand how our products behave and to
make better data-driven decisions. Data
science teams build machine learning models so that
they can recommend to customers the products and features
they need. Finance teams
also need to build forecasting models, and
so on. Those are all things
that rely on this kind of data, let's call it data exhaust,
that's produced by those applications.
A data application is also an application; it simply
focuses on data and is structured slightly differently. So what
is the usual data journey? As an example,
I'm going to use building an ecommerce app. The primary function of
an ecommerce app is to
enable customers to buy and sell products,
right? It has a product catalog and a shopping cart, and it processes transactions.
And this kind of data is saved in
different systems: events are usually saved in
Kafka or another message queue, and the rest in databases and logs.
So if I want to create data applications,
I need to ingest the data. That's what we call data ingestion.
All these kinds of data sources need to be ingested into some kind
of analytics system so they can be aggregated and joined with each other.
After the data is ingested, we need to ensure
that it's accurate, consistent and up to date. That's why we need to transform it into a
format that other teams can now start to use.
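To make this concrete, here is a minimal sketch of what "ingest, then transform" can look like in code. Everything here is illustrative: the table names are made up, and SQLite stands in for the real operational database and the analytics warehouse.

```python
import sqlite3  # stand-in for a real OLTP source and an analytics warehouse

# Placeholder source data so the sketch is self-contained.
source = sqlite3.connect("ecommerce_oltp.db")     # hypothetical operational DB
warehouse = sqlite3.connect("analytics_lake.db")  # hypothetical analytics store
source.execute("CREATE TABLE IF NOT EXISTS orders (id, product_id, amount, created_at)")
source.execute("INSERT INTO orders VALUES (1, 42, 19.99, '2024-01-01 10:00:00')")

# 1. Ingest: copy the raw data as-is into the analytics system.
rows = source.execute("SELECT id, product_id, amount, created_at FROM orders").fetchall()
warehouse.execute("CREATE TABLE IF NOT EXISTS raw_orders (id, product_id, amount, created_at)")
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", rows)

# 2. Transform: produce a consistent, aggregated table other teams can use.
warehouse.execute(
    """
    CREATE TABLE IF NOT EXISTS daily_revenue AS
    SELECT date(created_at) AS day, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY date(created_at)
    """
)
warehouse.commit()
```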
So our data application is also going to be used to create
other data applications. So we need a good data model,
a sort of stable model, with quality
tests. And then, if
we want to work on training ML models, we need to do feature engineering,
data labeling, data partitioning and so on to create
our ML model. And this is used by all
kinds of other applications. This could be BI applications to
make data-driven decisions and reports, or other data-driven products,
like a recommendation service. All of this,
of course, relies on a lot of data infrastructure that needs to be built.
And the whole thing is what we call a data application.
That's the data journey. The data teams are the ones that are generally responsible
for the data applications, and some kind of, let's call it, infra operations team
is the one that provides the infrastructure to the
data teams. But how do multiple applications
communicate with each other? Regardless of whether we're talking about data
applications or not, an API is the
standard through which applications
communicate with each other.
What are the API components? First, we have the interface
and contract, which includes the endpoint definitions
and the request and response formats, whatever they are.
This sets how the API is going to be used, right? Of course, we need
to make sure that the API is secure, with an authentication mechanism,
so only people who have the right to access the endpoints
have access. It's important that the API is usable: it's easy,
intuitive, it provides client libraries, and
its features are easy to use and integrate into other applications. And
of course it needs to be monitored and operated:
logging, monitoring, tracking usage, managing traffic and ensuring smooth operations are
critical.
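As a toy illustration, not tied to any particular framework, the interface and contract are essentially the endpoint plus the request and response shapes, with authentication and operability wrapped around them. All the names below are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical contract for a "get product" endpoint of our ecommerce app:
# the endpoint path, request parameters and response format together form
# the interface that clients rely on.

@dataclass
class GetProductRequest:
    product_id: str   # required path parameter
    auth_token: str   # authentication: only authorized callers allowed

@dataclass
class GetProductResponse:
    product_id: str
    name: str
    price_usd: float

def get_product(request: GetProductRequest) -> GetProductResponse:
    """GET /products/{product_id} - the documented, stable contract."""
    # A real implementation would also log and emit metrics here (operability).
    return GetProductResponse(request.product_id, "example product", 9.99)
```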
The same thing goes for an API for data,
but one of the main differences is: what's
really the usual interface and contract for an API
for data? Continuing with our ecommerce app example:
we have two data sources, an OLTP database and an S3 service,
which contain different types of product information. We need to
ingest this data, generally copying it
into our data lake.
These days this often means databases like Snowflake,
Redshift and so on, not necessarily just
blob storage. And now
we want to expose a more processed,
combined product view that
we guarantee contains all the necessary
product data. So let's call it our product data model,
which in this case joins our two data sets, but you can
imagine it could get much more complex.
So where is the API here? Well, an API is how two
applications communicate. We have, of course, an API
between the source applications and our data application, and we
need an API between our data application and downstream
applications like BI tools or other
software applications and services. Let's focus
on the right part. We don't actually have
good APIs for data, not even between the sources and the raw data.
So how can we build those?
I'll give an example using the right part, our product
data model, but the same thing goes for the
API for data needed on the left part.
So let's say that we want to create this
kind of table or entity of products, which contains information
about them.
So what do we consider the API of
this kind of product data set? Well, there's the schema,
which is very important. In this case, for example, it could be
the different column names and the type of each column,
as part of the database schema definitions.
What else? Data semantics:
each column has certain semantics, like
name being a non-null string representing the latest official product name.
Notice that we need to be as specific as possible.
And finally, we need a way to
access it. So, data access: it
could be tables in a database accessed using SQL, or we could access
it from Python by reading Parquet or Arrow
data formats.
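To make the idea concrete, here is one possible, purely illustrative way to pin down those three pieces, schema, semantics and access, for the product data set. The column names, rules and locations are assumptions for the example, not an existing standard.

```python
# Hypothetical "API for data" definition of the product data set:
# schema (names and types), semantics (what each column means and must satisfy),
# and access (how consumers read it).
PRODUCT_DATA_API = {
    "schema": {
        "product_id": "BIGINT",
        "name": "VARCHAR",
        "price_usd": "DECIMAL(10, 2)",
        "updated_at": "TIMESTAMP",
    },
    "semantics": {
        "name": "Non-null string holding the latest official product name.",
        "price_usd": "Current list price in US dollars, never negative.",
        "updated_at": "UTC time the record was last refreshed.",
    },
    "access": {
        "sql": "SELECT * FROM analytics.product",    # tables queried with SQL
        "files": "s3://datalake/product/*.parquet",  # or Parquet/Arrow files read from Python
    },
}

def validate_row(row: dict) -> bool:
    """Tiny example of turning the semantics into executable checks."""
    return bool(row.get("name")) and row.get("price_usd", 0) >= 0
```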
So if we go back to our data journey, where do we need APIs
for data? Where do applications communicate?
As we established: between the data sources and our main data
application, and between our data application's output,
be it a data model or an ML model, and the BI tools
and data-driven programs that want to derive value and insights
out of this data. But we can also, of course, think
about the data application itself. It's a huge thing, and we
probably want to break it into subcomponents. So we probably
need internal APIs even within the different components of our
data application, for example between the data
that's being ingested into the raw data lake and the dimensional model,
or between the dimensional model and the ML model that's
being created by training on the data.
And this is currently missing. There are no tools
and no conventions for creating APIs
for data, and that's one of the biggest gaps
that we have in data engineering; it's a gap that generally
does not exist in software engineering.
The challenge here is that the data sources are generally in the
control of the application developers. That is, if
you are developing any application, service, mobile app,
ecommerce app, whatever, you are outputting this kind of data into
different database entries
in your OLTP database, and those things are going to
be used in a data application.
So we need to actually put them under an API.
We should consider them an API as well. What else do we
need for this API? Well, we need to
have service level objectives, right? Metrics which guarantee
certain properties to the API users. And that's where
service level agreements also come in. Generally those
come, to a large extent, from the data semantics. Once we know the
data semantics, we can create rules, and
those rules would calculate our data accuracy SLOs.
Then we have other, more standard SLOs in the data world,
like availability SLOs, right?
How often is the data queryable?
We also, of course, have freshness SLOs, which are very important:
the data needs to be available in
the analytics system within one hour, for example.
It depends. It's important that
our data freshness SLO is the right one: for some use cases
hours are okay, for others seconds are
needed.
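A hedged sketch of what checking such SLOs could look like in practice; the one-hour threshold and the semantic rules are example assumptions, not prescriptions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness SLO: the product data must have been refreshed
# in the analytics system within the last hour.
FRESHNESS_SLO = timedelta(hours=1)

def is_fresh(last_refresh: datetime, now: datetime | None = None) -> bool:
    now = now or datetime.now(timezone.utc)
    return now - last_refresh <= FRESHNESS_SLO

# Hypothetical accuracy SLO derived from the data semantics: the share of
# rows that satisfy the semantic rules must stay above some agreed threshold.
def accuracy(rows: list[dict]) -> float:
    valid = sum(1 for r in rows if r.get("name") and r.get("price_usd", 0) >= 0)
    return valid / len(rows) if rows else 1.0

assert is_fresh(datetime.now(timezone.utc) - timedelta(minutes=30))
```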
These APIs for data and SLOs for data can also be thought of as
creating a data contract. There is a lot being
said about data contracts in the industry at the moment. I encourage
everyone to go check out the blog posts by Chad Sanderson at
this link to learn much more about it.
They're pretty great. He dives deeply into
the topic. The third part we want to discuss is
the DevOps cycle for data. For simplicity,
let's flatten out the DevOps cycle a bit. We know
the standard DevOps cycle well, right?
Plan, code, build, test, release, deploy, operate, monitor.
So how do we map the data journey to
the DevOps cycle? It's fairly intuitive,
right? The first thing you need to do is plan.
So you need to go and discover the data. You need
to go and explore the data, find where it is,
find for example where the product info is kept. Is it in S3,
is it in a database? And try to access it.
Then you need to ingest it: actually code the ingestion,
build the data ingestion pipelines, and
probably transform it. The transformation logic is another type of code;
it needs to be encoded, usually using SQL
or Python or something like that, and built into a data pipeline.
Those roughly map to the plan, code and build stages,
or generally the tasks that you do during those stages.
Once the data has been ingested and transformed,
it's ready to be deployed for use in reports,
data visualizations, machine learning models or other applications,
be they data applications or normal software applications. So we
have the whole DevOps flow: we need to test
the data, release it and
deploy it, similarly to what we do in the DevOps
cycle. And of course we need to maintain it.
It must be consistently managed to ensure it stays accurate,
secure and available, similar to how applications are
operated and monitored in a DevOps environment. Let's call
this managing the data.
Now that we've covered all
the aspects of how we apply DevOps to data - APIs,
SLOs and the DevOps cycle - I want
to introduce Versatile Data Kit. It's an open
source framework which provides a solution
for users to have a self-service environment for data engineers
to create end-to-end data pipelines in a code-first,
decentralized, fully automated way.
Its focus is more on the DevOps part and
the data journey part. But let's
see.
see. So it provides really two things. One is the
SDK or a framework almost. You can figure the spring for data
which you can allow you to develop data jobs using iterpython
so you have some methods to extract raw data or secure.
You can do things like parameterized transformations
and it allows us to deploy monitor using
control service and operations UI.
This data and data applications that you're building,
we call them here data jobs. Looking at the data journey
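For illustration, here is roughly what a single Python step of such a data job can look like; the table names and payload are made up, and the exact method names should be checked against the current Versatile Data Kit documentation.

```python
# A data job in Versatile Data Kit is a directory of SQL and Python steps.
# A Python step exposes a run() function that receives the job input object.
from vdk.api.job_input import IJobInput

def run(job_input: IJobInput) -> None:
    # Transformation logic: SQL executed against the configured database
    # (table and column names here are hypothetical).
    job_input.execute_query(
        "CREATE TABLE IF NOT EXISTS product_model AS "
        "SELECT p.id, p.name, i.quantity "
        "FROM product p JOIN inventory i ON p.id = i.product_id"
    )

    # Ingestion: send a payload to be ingested into a destination table.
    job_input.send_object_for_ingestion(
        payload={"id": 1, "name": "example product"},
        destination_table="raw_product",
    )
```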
Looking at the data journey, Versatile Data Kit ultimately fits into the ingestion and transformation
part. It can also be used to train ML models, and
generally you can use it to export data as well.
Versatile Data Kit is a framework, so it allows you to write
the code that you need to ingest, transform, train and
export the data, and it integrates
with different databases,
like Snowflake, Impala, Postgres,
MySQL, or with different compute
engines, like Spark,
to actually do the
heavy-lifting computation. A lot of the focus
has been on simplifying and hiding the complexity of the data infrastructure
and the general DevOps work from the data team as much as possible.
So it basically allows you to plug in
between the data infrastructure and the data applications,
giving a lot of control, both over the DevOps cycle and over
the monitoring aspects and the infrastructure, to
the, let's say, infra operations team. So the data teams can focus
only on the data and worry less
about the software part.
So, for example, let me show a
quick demo of what I mean. Let's say that we have
these kinds of databases, and the data teams are developing their data
application, running SQL queries, and maybe they're
running such big SQL queries that they are breaking the database.
So what do we do?
We have to either block the data team or ask them to stop.
It would be much nicer if, instead of finding out when those
queries land in production, we find out at development time, as
early as possible. And that's what you can do with Versatile Data Kit.
Let's say the data team here is developing
their job, running these kinds of SQL queries - a count from employees
and different others, using all kinds of
different methods. Completely transparently to the data teams
(of course they will know about it), the platform
team, operations team or central data team could develop
this kind of VDK query validation plugin. It could be as simple as
comparing the size of a query, or it could be much more complex, doing
some kind of analysis on the query. It doesn't matter, because VDK
allows you to intercept both the data and the metadata
in plugins. You can intercept each query statement
and decide what to do with it through a plugin.
In this case, let's say that we reject all queries
bigger than, say, 1000 expressions.
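Here is a rough sketch of such a plugin. The hook name and signature below are my assumption about vdk-core's connection hooks and may not match the current plugin API exactly; the point is only to show that every query can be intercepted and rejected before it runs.

```python
# Sketch of a VDK plugin that rejects overly large queries before they run.
# NOTE: the hook name and signature are assumed for illustration; consult the
# vdk-core plugin documentation for the exact connection hook spec and for how
# to register the plugin.
from vdk.api.plugin.hook_markers import hookimpl

MAX_QUERY_LENGTH = 1000  # example threshold

class QueryValidationPlugin:
    @hookimpl
    def db_connection_validate_operation(self, operation: str, parameters) -> None:
        # Called for every SQL statement a data job tries to execute.
        if len(operation) > MAX_QUERY_LENGTH:
            raise ValueError(
                f"Query rejected: {len(operation)} characters exceeds the "
                f"{MAX_QUERY_LENGTH}-character limit set by the platform team."
            )
```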
While the data user or data team is developing their
job locally, they'll get the error immediately.
The query wouldn't even leave their work
environment. I think there's a
huge benefit in being able to do this so quickly.
This type of query
validation can be used to enforce data
quality rules and data APIs
as early as possible, at the query level,
while the query is being developed
and the data model is being created. And that's pretty
powerful.
Now let's look more at the
operations, monitoring and deployment parts,
not just the development part. So how can we help with the deployment
part? Let's go back to our DevOps cycle.
What we want to do is enable the infra operators or a centralized data team
to have this kind of control over
how the data applications, the data jobs, are
being built and even tested, and how they are being released and deployed.
Because there are a lot of concerns: you have to think about versioning, you have
to think about containerization, you have to think about adding
metrics and metadata and so on, creating Docker images,
creating cron jobs or Kubernetes objects.
All of this should be completely hidden
from the data engineers and data users, because you
want them to focus on core data modeling and on applying
business insights and business transformations to
the data, and not to worry about all these kinds of software engineering
concerns, DevOps concerns. And that's
very important for them.
So the data teams
focus on planning and coding the data application - of course they also need some way
to monitor their data - while the, let's call
it, IT team, this small DevOps team, can establish
policies through the
extensibility mechanisms that VDK provides. What policies?
Well, let's look at a very simple example.
By default, VDK installs the dependencies
and packages the data team's job so that it's ready
for automatic execution in the cloud environment;
in VDK's case it's using Kubernetes. But let's say
that we want to make sure that we add some centralized system tests
for all jobs, tests that verify a certain level of quality, or an
SLO or API contract, or send certain metrics.
We can do this very simply. The way plugins are built into the
control service is through Docker
images. Basically, we can extend the Versatile
Data Kit job builder image with anything: you can
run system tests, you can remove execution privileges.
And this script is run during the
build and test phase. So before anything is released and deployed,
after the job is built,
we can run this kind of test and change
the job's scripts, and in this way guarantee a certain
level of quality - if the system tests fail,
the job won't be deployed - and security, by removing
unnecessary permissions.
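As an example of what such a centralized check could be, here is a small, hypothetical system-test script a platform team might run during the build and test phase. The specific checks and the file layout are assumptions for illustration, not part of the product.

```python
import pathlib
import stat
import sys

# Hypothetical system test run against every job during the build-and-test
# phase: verify a basic contract (a config file exists) and tighten security
# by removing execute permissions from the job's scripts.
def check_job(job_dir: str) -> int:
    job = pathlib.Path(job_dir)

    if not (job / "config.ini").exists():
        print("FAIL: job is missing its config.ini")
        return 1  # a non-zero exit fails the build, so the job is never deployed

    for script in job.glob("*.py"):
        script.chmod(script.stat().st_mode & ~stat.S_IEXEC)  # drop the execute bit

    print("OK: job passed centralized checks")
    return 0

if __name__ == "__main__":
    sys.exit(check_job(sys.argv[1] if len(sys.argv) > 1 else "."))
```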
In summary,
what do we think about with DevOps for data?
Well, one aspect is APIs for data, right? An API for
data means we need the data schema and the data semantics defined.
We need some kind of validation, and documentation so that people can
explore the data, right? They need to be able to access it easily,
find out where it is, and explore it; it's
best if it's well documented, the same as a normal API. We need
SLOs and SLAs for data,
so we need to be able to collect metrics about the data:
things like freshness, things like unique
values, and all kinds of other metrics.
And we need the data to follow the
DevOps cycle. Basically, the whole pipeline from
sources to insights should be automated using the
best DevOps practices, which means the ability
to plan and code the ingestion and transformation logic, and the ability to
deploy that logic, to manage it and to operate it.
Versatile Data Kit actively helps with a lot of those,
especially with the DevOps-for-data part. Through plugins
it could also help, as we saw,
to create and enforce APIs for data.
This is a direction where we need a lot more research and a lot
more work in Versatile Data Kit.
So it would be really helpful if you find anything
here interesting and, if you'd
like to learn more, get in contact with us.
You can do this by going to the GitHub repo and starting
a discussion, contacting us in our Slack
channel, writing an email, or reaching out to
us on LinkedIn. There is a lot of work
to do to make sure that data engineering practices
are able to adapt and adopt good software engineering and
DevOps practices. A lot of what we discussed is what
it would be nice for data engineering to already have, and it's still
missing; with the tools we've started, we try to bridge
the gap. We know that we have a lot of work to do,
and your input, anybody's input, will
be extremely valuable. I hope
you get in contact with us, and thank you again
for listening to these 30 minutes of me talking.
Have a nice day or night, or
whichever part of the day you're in.