Transcript
Hi, my name is Ismail. I am a cloud native developer at WeScale. In today's talk with Charles, we want to introduce you to the famous notion of data mesh, coined by Zhamak Dehghani in 2019 in her famous article. With Charles, we believe that this term has become more and more of a buzzword, meaning that most people talk about it but do not really master it. As software developers, we saw that this data mesh notion is deeply rooted in software design considerations, and we want to share this understanding with you, so that data mesh is no longer a mysterious notion for you and you are able to implement it efficiently. But first of all, let's start with a quote that is not from us, but that will guide us through this presentation: there is no sense in talking about the solution before we agree on the problem, and no sense talking about the implementation steps before we agree on the solution. This will serve as a guideline for this presentation. We will first show what problem the data mesh is trying to solve. Then we will see what the data mesh is and for what reasons it seems to be the solution to the problem we just introduced. Finally, we will share with you a possible implementation, and I insist on the word possible, of the data mesh.
So, prologue. When we talk about data, we are in fact talking about a wide reality. There are many jobs and many notions involved, but we gather all of them under the name of data. And this is very important for an enterprise, because it is from the data that we will fetch the insights that will be important to create new features. The cycle begins with the final users, who create what we call operational data, serialized inside a relational database for instance, and which represents the business entities being manipulated. Then, from this operational world, we want to get a broad understanding of our business in order to maybe fix it, or more likely evolve it and enhance it, to answer new kinds of needs from the final users. This operation consists in bridging this very operational data into an analytics world, where we mix the facts from the operational data with new dimensions coming from third party providers, in order to join all those data and make them explicit, in graphs for instance, that business owners and analysts will share with product owners, who will then be in a position to create new features for the final users. So we have this cycle, and there is no secret in the sense that when we call data the new oil of the 21st century, it is true, because it is from the data that you will get the valuable insights to make your applications evolve. And let's not forget about the data people, who are key to this cycle: data engineers, database administrators, data scientists and so on. It is thanks to those people that we create a current that makes the dialogue between the operational and analytics worlds possible.
We are talking about operational and analytics: what are the fundamental differences? In the operational world, we focus on the business entities and their relationships. Moreover, we require consistency over availability, maybe real time, and we usually handle volumes on the order of gigabytes. Conversely, in the analytics world, we want a broad understanding of the business. It is not about business entities, it is about the business as a whole. We manipulate facts rather than business entities, and we mix them with dimensions in order to build this broad understanding, and maybe a better, new understanding of the business from which to create new features.
So what would be the problem? Because we have different needs between the operational world and the analytics world, we most likely want to bridge these different approaches in order to extract from the operational world the information needed to do analytics. And usually we go through a dedicated pipeline, called an ETL pipeline, for extract, transform, load, which makes this transition between the operational world and analytics. We first extract the data from the databases, a SQL database for instance, we transform it, and then we load it into a dedicated analysis database that we usually call a data warehouse. So let's take a closer look at this bridge between operational and analytics. We have the operational world represented by, for instance, a MySQL database, the analytics world represented by the data warehouse, and in between the transformation pipeline, whose logic is called extract, transform, load: ETL.
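To make this ETL logic concrete, here is a minimal Python sketch, assuming a MySQL source read through SQLAlchemy and a SQL data warehouse as destination; the connection strings, table and column names are hypothetical.

```python
# Minimal ETL sketch: extract from the operational database, transform in
# memory, load into the analytics warehouse. All names are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

operational = create_engine("mysql+pymysql://app:secret@mysql-host/orders_db")
warehouse = create_engine("postgresql://analytics:secret@warehouse-host/dwh")

def extract() -> pd.DataFrame:
    # Extract: read the operational table; note that this couples the
    # pipeline to the current schema of the source table.
    return pd.read_sql(
        "SELECT order_id, customer_id, amount, created_at FROM orders",
        operational,
    )

def transform(orders: pd.DataFrame) -> pd.DataFrame:
    # Transform: turn operational rows into analytics facts
    # (daily revenue per customer).
    orders["day"] = pd.to_datetime(orders["created_at"]).dt.date
    return orders.groupby(["customer_id", "day"], as_index=False)["amount"].sum()

def load(facts: pd.DataFrame) -> None:
    # Load: write the facts into the warehouse table.
    facts.to_sql("fact_daily_revenue", warehouse, if_exists="replace", index=False)

if __name__ == "__main__":
    load(transform(extract()))
```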
Like I said, there are two problems with this approach. The first one is that we try to put an entire business domain inside the very same data warehouse. So we have to think about a consistent way to put all these operational facts inside the analytics database, which is not a trivial issue, as we need to keep the understanding that gives us the right insights into our business domain. The second problem is the coupling between the operational and analytics worlds. What happens if we decide to change the schema of a table here? We break the pipeline, because at some point we use those schemas as a contract between the operational side and the ETL pipeline. So from a database administration point of view, we would say: wait a minute, I cannot change this schema, because I know we have some hundreds of pipelines sourcing from this very same table. I don't think that is a valid reason. From the operational perspective, we should not have to care about the analytics world, and the transformation has to be somewhat agnostic of the schema we have here.
And this is why we introduce the data lake technology: we first extract and load into a data lake, in order to get ownership back over the schema when we later transform the data to load it into the data warehouse. Here we are no longer worried that some schema may change because of database administration operations: we have the ownership back, since we extracted and loaded the raw data into the data lake.
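As a contrast with the ETL sketch above, here is a minimal sketch of the ELT variant, assuming the raw extract is dumped as-is into object storage acting as the data lake before any transformation; the bucket, paths and column names are hypothetical.

```python
# ELT sketch: extract and load the raw data into the data lake first, then
# transform later from the raw copy we own, so the transformation no longer
# reads directly from the operational schema. Names and paths are hypothetical.
from datetime import date
import pandas as pd
from sqlalchemy import create_engine

operational = create_engine("mysql+pymysql://app:secret@mysql-host/orders_db")

def extract_and_load_raw() -> str:
    # Dump the operational table as-is (all columns) into the lake.
    raw = pd.read_sql("SELECT * FROM orders", operational)
    path = f"s3://company-data-lake/raw/orders/{date.today()}.parquet"
    raw.to_parquet(path)  # writing to s3:// requires pyarrow and s3fs
    return path

def transform_from_lake(path: str) -> pd.DataFrame:
    # Transform step, run later and sourced from the raw copy in the lake,
    # not from the live operational database.
    raw = pd.read_parquet(path)
    raw["day"] = pd.to_datetime(raw["created_at"]).dt.date
    return raw.groupby(["customer_id", "day"], as_index=False)["amount"].sum()
```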
But we still have the first problem of putting all the data indistinctly inside the data lake, which then becomes a data swamp that is hard to make sense of.
So if we sum up, it is not just about the two problems I mentioned; it is also an organizational issue, because the classical approach, as Conway's law states, is to split our projects or products by technical teams. We have the data engineering team, the DBA team, and the data analyst and data science team. All of these communicate by Jira tickets. For instance, the data science team asks the data engineering team for new dimensions, and the data engineering team has no clue what this means in terms of business: all the exchanges here are expressed only in terms of technical needs. Another problem, between the data science team and the data engineering team, is that we usually get a bottleneck, because this central team becomes the single point everyone goes through whenever a team has an issue or a need. And what usually happens is: OK, you don't deliver the feature in time, I will do it myself. So we get shadow IT appearing, different sources of truth, which obviously harms the broad understanding of our business. So I would say that the problem is not really technical; all these technologies will scale. The problem is mainly organizational. It is hard to maintain hundreds of pipelines, and it is hard to maintain efficient communication between all those teams, from the operational team to the data engineering team, but also from the data engineering team to the data analyst and data science teams. We have an issue to solve.
So what would be the solution? In her article, Zhamak Dehghani points us to solutions coming from the software design world. What I did not mention is that at the time, Zhamak was an employee of Thoughtworks, a software consultancy firm specialized in software design, and I think it is no coincidence that it was one of their employees who came up with this notion of data mesh, because, as Zhamak says, we can find some insights inside the domain-driven design approach. This approach is about discussion, with a strategic phase and a tactical phase. In the first, strategic phase, we are going to understand the business domain, dividing it into subdomains and gathering them into bounded contexts, which appear as boundaries between concepts that should communicate with one another, but at the same time remain autonomous in their growth. This discussion happens inside a multidisciplinary team made of business analysts, product owners, product managers and also developers, so that we begin to build a ubiquitous language that we will use to write the different user stories.
At that point, when we have the different nouns and verbs, the relationships between our business entities, when we have this ubiquitous language, we can define the bounded contexts and begin to implement them through the tactical phase. This implementation comes with technical patterns such as hexagonal architecture, CQRS, event sourcing and so on, in order to have an application that is testable, maintainable and evolvable.
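As an illustration of one of these tactical patterns, here is a minimal hexagonal architecture sketch in Python, built around a hypothetical order-taking use case: the domain core depends only on a port, and concrete adapters are plugged in from outside, which is what makes it testable.

```python
# Hexagonal architecture sketch: the use case depends on a port (interface),
# not on a database. The repository names are hypothetical.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Order:
    order_id: str
    amount: float

class OrderRepository(Protocol):
    # Port: what the domain needs, expressed in domain terms.
    def save(self, order: Order) -> None: ...
    def get(self, order_id: str) -> Order: ...

class PlaceOrder:
    # Use case living in the domain core; works with any adapter.
    def __init__(self, repository: OrderRepository) -> None:
        self.repository = repository

    def execute(self, order_id: str, amount: float) -> Order:
        order = Order(order_id=order_id, amount=amount)
        self.repository.save(order)
        return order

class InMemoryOrderRepository:
    # Adapter used in tests; a SQL adapter would implement the same port.
    def __init__(self) -> None:
        self._orders: dict[str, Order] = {}

    def save(self, order: Order) -> None:
        self._orders[order.order_id] = order

    def get(self, order_id: str) -> Order:
        return self._orders[order_id]

use_case = PlaceOrder(InMemoryOrderRepository())
print(use_case.execute("order-1", 42.0))
```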
We talked just before about the data swamp, meaning a data lake that is really hard to understand. We can draw a parallel with software design, where we have the same kind of notion, called the big ball of mud, and using domain-driven design is a good approach to avoid this big ball of mud at all costs. Domain-driven design is a kind of cycle: it does not end once we have our ubiquitous language, or even the code representing the different user stories. We will need new features, so we will enrich our ubiquitous language with new verbs and new nouns, and maybe create another language for a new bounded context. And the idea of Zhamak was to apply this way of thinking to the data world, in particular the data analytics world.
If we sum up, the main goal of DDD, domain-driven design, is to make different ubiquitous languages emerge, which are protected by bounded contexts, then expressed as domain models decomposed into subdomains, and finally implemented using software design patterns. You may know some of them: MVC (model-view-controller), hexagonal architecture, CQRS, event sourcing and so on. If we take a business domain like marine, we may have several different subdomains. The idea of this slide is to show you that the discussion between the product owners, the business analysts and the developers may end up with different stories: we may have different bounded contexts according to the kind of discussion we have, and especially the kind of issue we want to tackle. So at some point we may have four different bounded contexts, or three; it depends on your business needs. And because all of those bounded contexts are part of the same business domain, they have to communicate, they cannot live alone. So communication also has to be consistent in terms of the models we use to communicate between all those bounded contexts.
And we have some patterns to enforce this consistency. For instance, an anti-corruption layer gives the consumer, here represented by context two, the guarantee that what it consumes from the provider, context one, will match its needs in terms of types and fields. Conversely, the provider, the data provider, can also apply a logical layer that allows it to define its own published language, so as not to pollute its inner ubiquitous language, and to keep the autonomy we want for each bounded context.
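Here is a minimal sketch of that anti-corruption layer idea, with hypothetical field names: the consuming context translates the provider's published language into its own model, so a change on the provider side stays confined to the translator.

```python
# Anti-corruption layer sketch: context two never manipulates the raw payload
# published by context one; it goes through a translator that maps the
# published language into its own model. Field names are hypothetical.
from dataclasses import dataclass

@dataclass
class Customer:
    # Model belonging to the consuming bounded context (context two).
    customer_id: str
    full_name: str

class CustomerAntiCorruptionLayer:
    # Translates the provider's published language into context two's model.
    def translate(self, published: dict) -> Customer:
        # If context one renames or restructures its fields, only this
        # mapping has to change, not the rest of context two.
        return Customer(
            customer_id=str(published["id"]),
            full_name=f'{published["first_name"]} {published["last_name"]}',
        )

# Payload consumed from context one's published language.
payload = {"id": 42, "first_name": "Ada", "last_name": "Lovelace"}
print(CustomerAntiCorruptionLayer().translate(payload))
```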
In terms of organization, DDD requires you to have a single team per bounded context. It is very important to have this single team so that ownership is clearly stated: this specific team will be in charge of, and accountable for, the quality of its bounded context, first to provide SLAs and SLOs, for instance, but also accountable for the global consistency, the global rules we have inside this business domain split into bounded contexts. We do not want all those bounded contexts to communicate each in their own way; we want them to communicate as if they were part of the same business domain. So here, for instance, we have team one, which is accountable for bounded contexts one and two. That is possible, it is a one-to-many relationship, but a given bounded context cannot have two accountable teams. That is why the second team here will not be in charge of the first context, because team one is already in charge.
So here we are: data mesh. So what is, for God's sake, the relationship between DDD and data mesh? Well, it is kind of obvious: instead of thinking in terms of operational world, analytics world, and then the ETL pipeline bridge, we apply the same mechanism we just saw in DDD. We have a multidisciplinary team which discusses a business domain and starts subdividing it into bounded contexts. So in a way, the data mesh, where we talk about data domains, should really be called data bounded contexts. And this mesh is made of nodes, represented by those bounded contexts or data domains, and edges, represented by the different communications between those domains. Let's not forget that each domain has ownership of a given ubiquitous language, but it is not alone: it has to consume data coming from other domains in order to produce the different analytics it needs.
And what is inside each domain is up to you. In fact, inside a domain we can go back to the very legacy way of thinking, with the operational world being bridged to the analytics one through ETL, and this is what we usually observe. So the data mesh is not saying that this approach is wrong; it just tells us to take a step back and think the same way DDD teaches us, in order to scale organizationally.
What we have to understand is that the data mesh is a sociotechnical concept which brings, above all, organizational scaling, not really technical scaling. The technical scaling is already solved, in my opinion: we have all the databases and data warehouses we need, and the pipelining scales with, for instance, Apache Beam, Spark and so on. So the problem is not there; it is about tackling an organizational issue. And with the data mesh, as in domain-driven design, we have teams owning well-designed data domains, and we have to apply some pillars, where domain ownership is the main one, backed up by three other pillars: data as a product, self-serve platform and federated computational governance.
Domain ownership is what we specified in DDD: stop thinking of your business domain as a monolith. You have to split it into bounded contexts, so that a team is in charge of each domain, to curate it and treat it as a product, to provide SLAs and SLOs, to provide a quality of service through a self-serve platform which provides you with the technical assets you need, technical assets that will scale, especially if we consider managed services on the cloud. But we still need, at the same time, a federated computational governance, so that, for instance, we do not exceed API quotas, and we stay in line with the naming policy, with the communication rules between all those data domains, and so on.
What we have to consider, though, is that unlike the service mesh, for the DevOps folks who are listening, the data mesh is not a purely technical concept. It is not like: OK, I am on my cloud platform and I will install a data mesh. It does not work like that. You first have to think about your business domain and have a discussion between all the different jobs you have, developers, data scientists, business analysts and so on, in order to make the different data domains emerge; activities such as event storming can help you do so. So, like I said, it is a sociotechnical concept which solves an organizational scalability issue. So be cautious about solutions that sell themselves as data-mesh-ready solutions. What does exist, on the other hand, are enablers; but a solution that tells you, OK, you just have to put the coin inside the machine and here is your data mesh, that does not exist.
A data mesh is a path to better collaboration. It is not an end in itself; it is a means to reach better collaboration between your teams and a better understanding of your data. Where it shines is where your business domain is complex: where you have different subdomains, where you have rich communication between entities, there the data mesh will shine. But if your business domain is simple enough, there is nothing wrong with the legacy approach, considering only the operational world, the analytics one, and the bridge in between represented by pipelines; it is perfectly fine to act this way. But once you begin to have organizational issues, once you begin to not understand what your business is, to not have the right insights to make your business evolve, maybe the data mesh is a solution for you. So, we were talking about the data mesh from a theoretical point of view. Let's see what kind of implementations we can imagine. And I insist on imagine: like I said, there is no data-mesh-ready solution. So Charles, it's up to you.
So let's dig into the catalog. The catalog is the place where every domain can push its own products. By product, let's understand the data that each domain collects, stores and wants to make available for the other domains. Let's see it like a marketplace, a catalog where every producer of data, that is, a data domain, can push and make available a product, which is an aggregated, formatted set of data that subscribers can consume. So in the catalog we will find a place where the data domains, let's call them producers, can push their own products, and where subscribers, people outside the domain, can subscribe to those products and start pulling them.
Each data owner will be in charge of describing its product and defining a few parameters. Each product will have a set of characteristics which basically define what a product is. So you will of course find the schema; you can also find information related to the API you pull from, the refresh frequency, and any other information that the producer finds relevant. This helps all the people from outside, the subscribers, to pull the data correctly and to automate the pulling phases.
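As an illustration, a product entry pushed to the catalog could look like the following sketch; the fields simply mirror the characteristics just mentioned (schema, API endpoint, refresh frequency, owner), and every name and value here is hypothetical.

```python
# Sketch of a data product descriptor, as a producer domain could push it to
# the catalog. Field names and values are hypothetical.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str                    # product name in the catalog
    domain: str                  # producing data domain
    schema: dict                 # column name -> type
    api_endpoint: str            # where subscribers pull the data from
    refresh_frequency: str       # e.g. "hourly", "daily"
    owner_contact: str           # who is accountable for SLA/SLO
    derived_from: list = field(default_factory=list)  # upstream products, if any

sales_by_region = DataProduct(
    name="sales_by_region",
    domain="sales",
    schema={"region": "string", "day": "date", "revenue": "decimal"},
    api_endpoint="https://catalog.example.com/products/sales_by_region",
    refresh_frequency="daily",
    owner_contact="sales-data-team@example.com",
    derived_from=["raw_orders"],
)
```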
We can also consider that subscribers can build their own products based on those products. How does it work? You open a contract, subscribe to a product, and from there build your own set of data based on this product, enrich it, and build your own product on top of it. This means that you create a product from another product by aggregating and transforming its data. This is definitely doable, and it needs to be included in the product characteristics, stating that this product is based on that one and that we apply a transformation on the first product. How it works in practice, we will dig into in the third chapter.
So now let's talk about the orchestrator. The orchestrator is in charge of managing the data from the moment we pull it from the data sources to the moment it becomes available for pulling by the subscribers. The orchestrator also monitors all the stages of the data pipeline: how the data is ingested, whether all the data has been ingested correctly, whether the data is transformed correctly based on the description of the transformation in the catalog, and whether the data is correctly loaded into the data stores. From that moment, the orchestrator works with the catalog to make sure that the state of the product is correctly updated, saying for example that the last refresh time is the 22nd of March 2023.
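A minimal sketch of that orchestration loop, in plain Python rather than any particular scheduler: each stage is run and checked, then the catalog entry is updated with the product state and last refresh time. The stage functions are hypothetical placeholders for the real jobs.

```python
# Orchestrator sketch: run the pipeline stages, verify them, then report the
# product state back to the catalog. The stage functions are placeholders.
from datetime import datetime, timezone

def ingest(product_name: str) -> list[dict]:
    # Placeholder: pull the raw rows from the data sources.
    return [{"region": "EU", "amount": 10.0}]

def transform(rows: list[dict]) -> list[dict]:
    # Placeholder: apply the transformations described in the catalog.
    return rows

def load(rows: list[dict]) -> None:
    # Placeholder: write into the data stores (S3, Redshift, Aurora, ...).
    print(f"loaded {len(rows)} rows")

def update_catalog(product_name: str, state: str, last_refresh: str) -> None:
    # Placeholder: call the catalog API to update the product entry.
    print(f"{product_name}: {state}, last refresh {last_refresh}")

def run_product_pipeline(product_name: str) -> None:
    rows = ingest(product_name)
    if not rows:
        raise RuntimeError(f"{product_name}: ingestion returned no data")
    load(transform(rows))
    update_catalog(product_name, "available",
                   datetime.now(timezone.utc).isoformat())

run_product_pipeline("sales_by_region")
```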
The orchestrator can also provide an administration panel. This is helpful when you want to debug and see what is happening in the pipeline. For example, you just noticed that a product has not been refreshed as it should, and you want to see what is happening. The administration panel lets you see which job is currently running and maybe why it is taking so much time. We can for example notice that a data source is taking much longer than usual to give us the data. If the usage of the data mesh is growing within the organization, it can become hard to debug and to see all the jobs, and the state of all the jobs, at a given time. So having an administration panel that lets you see graphically, through a graphical interface, what is happening within the data mesh infrastructure can help you gain a lot of time when debugging.
So now let's dig into what kind of architecture can be built to host those services. Here we are: this is a zoom on a data domain. On the right-hand side you see the data sources. Those data sources are basically sets of data: databases, another application, CSV files, whatever you want, and they will be used as sources for our products. Let's start with the catalog. The catalog is in charge of creating the products and, based on those characteristics and parameters, the transformations to apply, the orchestrator is in charge of making each product available. So we start by ingesting the data from the data sources; it can come from one to N sources. Once downloaded, the data is pushed to a cache. Storing the data in the cache avoids re-downloading all the data from the sources if there is any issue with later operations like the transformation. For the transformation, here we use Spark with EMR on AWS to help us run all those transformations, basically the transformations that have been defined in the catalog.
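Such a transformation job submitted to EMR could look like the following PySpark sketch, assuming the raw data was staged as Parquet in the cache bucket; the bucket paths and column names are hypothetical.

```python
# Spark transformation sketch (e.g. submitted to EMR): read the raw data from
# the cache, apply the transformation declared in the catalog, and write the
# result where the serving stores can pick it up. Paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-by-region-transform").getOrCreate()

raw = spark.read.parquet("s3://domain-cache/raw/orders/")

sales_by_region = (
    raw.withColumn("day", F.to_date("created_at"))
       .groupBy("region", "day")
       .agg(F.sum("amount").alias("revenue"))
)

# Write back to S3; loading into Redshift or Aurora PostgreSQL would follow
# from here (COPY from S3, JDBC write, and so on).
sales_by_region.write.mode("overwrite").parquet("s3://domain-products/sales_by_region/")
```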
Once all those transformations are done, the data is pushed to S3, Redshift or Aurora PostgreSQL. Why propose these three data stores? Because of the differences we can have in terms of data complexity or amount of data. For example, S3 is very useful if we have a large amount of data, while Redshift also allows us to use SQL queries. So Redshift can be very useful with a large amount of data when the exploration application uses, for example, a JDBC driver and wants to run SQL queries against the data store. In the same way, Aurora PostgreSQL can be very useful if the concurrency is very high. We all know that Redshift is a very powerful tool, but concurrency is very hard to deal with on that kind of data store. Aurora PostgreSQL allows us to be very efficient in terms of queries on a large amount of data, can give us a very high amount of IOPS, and can also sustain a very high number of concurrent queries, thanks to two main features: read auto-scaling, of course, and the fact that we can have very big instances. And last but not least, the exploration application.
This application is in charge of retrieving the data efficiently from our data stores and making all those data, that is, the product, available to all the subscribers. This application needs to take charge of those operations, meaning it needs to control the way it retrieves the data, so as not to put too much pressure on the data stores and not impact the other subscribers. And it needs to be intelligent enough to load balance, shard or optimize customer queries.
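A sketch of the kind of routing logic this exploration application could apply, with hypothetical thresholds: highly concurrent queries go to Aurora PostgreSQL, very large scans go to S3, everything else to Redshift.

```python
# Sketch of the exploration application's routing logic: choose the data store
# per query so subscribers do not overload one backend. Thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class QueryProfile:
    estimated_rows: int       # how much data the query will scan
    concurrent_clients: int   # how many subscribers are hitting the product

def choose_store(profile: QueryProfile) -> str:
    if profile.concurrent_clients > 50:
        # Aurora PostgreSQL handles high concurrency better than Redshift.
        return "aurora-postgresql"
    if profile.estimated_rows > 100_000_000:
        # Very large scans: read the files on S3 directly.
        return "s3"
    return "redshift"

print(choose_store(QueryProfile(estimated_rows=5_000_000, concurrent_clients=120)))
```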
This is what a data domain can put in place on AWS, for example, to build their own data mesh. Here we can use containers within ECS, for example. The usage of containers is recommended, as some of those operations may need data locally inside the container or run for a long time. So I would recommend using containers here instead of Lambda, whereas we can use Lambda, for example, to run our orchestrator or catalog applications, based on DynamoDB, Lambda and API Gateway, basically a serverless stack. So these are all the things we can build to make our products available within our data domain. Now I will hand back to Ismail, who will introduce a Google product that aims to provide all those services and manage everything under the hood on the Google side. Thank you, Charles.
Before concluding, I want to present a quick overview of a product called GCP Dataplex. As we saw, when we talk about data-mesh-ready products, we have to be very cautious, because the real problem is not technical but rather organizational, so it requires you to think about your business domain more than it requires buying another product. But we also saw that there are enablers that help you implement the pillars of domain ownership, data as a product, self-serve platform and federated computational governance. We think that GCP Dataplex from Google is a good example of such a product, because it offers you a logical layer that federates the different existing services on GCP, such as BigQuery, Dataflow, Cloud Storage and so on, in order to give you a sense of what a data mesh should be.
If we look at this logical layer, we have a lake, which in fact represents the data domain, and which relies on a certain number of services, such as the Data Catalog, which stores the metadata related to the different data that you store and compute on, and also, of course, Google Cloud IAM, which gives you the federated computational governance over the different GCP assets, such as BigQuery or Cloud Storage. Another point is that the lake is separated into zones, which represent a kind of logical separation of your data; we can interpret them as packages, if we reason in terms of programming languages, for instance. Each zone is attached to different kinds of assets, depending on what the team associated with the zone wants to do. Each asset benefits, by design, from technical metadata, such as the schemas of BigQuery tables for instance, which is automatically reported to the lake. It is interesting to observe that the different assets we attach to the lake are not necessarily part of the same GCP project. In fact, we can see a Dataplex lake as the same kind of abstraction as a GCP project, but only for data: where the GCP project lets you abstract billing and API quotas, the Dataplex lake lets you abstract the notion of data mesh through this federated computational governance, which is no longer per project but per lake.
Let's also not forget that a given lake, which represents one data domain, is not enough: we also have other lakes, which represent other data domains, and as we saw, in the end they will be able to communicate according to the permissions we set in the federated computational governance layer, and obviously according to the communication needs between those domains.
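To fix ideas before the demo, here is a plain Python sketch of the logical hierarchy Dataplex exposes, a lake containing zones that point at existing BigQuery datasets or Cloud Storage buckets; this only models the concept, it is not the Dataplex client API, and all names are hypothetical.

```python
# Conceptual sketch of the Dataplex logical layer: one lake per data domain,
# zones inside the lake, and assets pointing at existing GCP resources.
# This is a plain-Python model of the idea, not the Dataplex API.
from dataclasses import dataclass, field

@dataclass
class Asset:
    name: str
    resource_type: str   # "bigquery_dataset" or "storage_bucket"
    resource: str        # pointer to the existing GCP resource

@dataclass
class Zone:
    name: str
    assets: list[Asset] = field(default_factory=list)

@dataclass
class Lake:
    name: str            # one lake per data domain
    zones: list[Zone] = field(default_factory=list)

people_domain = Lake(
    name="people",
    zones=[Zone(
        name="curated",
        assets=[
            Asset("customers", "bigquery_dataset",
                  "projects/acme-analytics/datasets/customers"),
            Asset("raw-files", "storage_bucket", "gs://acme-people-raw"),
        ],
    )],
)
```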
So let's see how this looks in practice. Here I am on the GCP console; obviously, when we consider Dataplex, we have to be familiar with the Google environment. In the Manage section we have the different lakes that represent our data domains. We can create a new one if we have the right permissions, and inside each lake we can perform a certain number of actions. For instance, we can federate the permissions on the different zones of this lake and grant access to those specific zones. Here we have three different zones, in one of which I created two different assets. So if I go inside the zone, I am able to see my different assets. I can also create and delete assets, and of course add permissions on the zone itself, but also on the asset itself. Here my two assets are a BigQuery dataset and a Cloud Storage bucket. In Dataplex, those are the main asset types. That does not mean you cannot reach other kinds of assets, but it would be through those pillar assets, since BigQuery, for instance, allows you to later fetch information from external sources, blob objects, or even on-premise databases; see BigQuery Omni.
From the Manage section, I will not be able to access the content of the assets; that is done from the catalog, the data cataloging feature of Dataplex, which is a kind of search engine relying on the metadata: the technical metadata of course, the names of my different schemas, the names of the columns and their types, but also the business metadata, and we will see how to provide that to Dataplex. Here I can see that I can indeed access my assets, and I have a certain number of filters that allow me to add more criteria to my search. If I open an asset, I can see different kinds of technical metadata on it. I can access the schema, and we can see that I can associate the fields with business terms. Those are the specific business metadata I was talking about, which are in fact fed by a glossary that explains the data domain we are working on.
This is very important, because it allows new users to get context on the kind of business we are working on. You can see that here I created a people domain, which I documented, and finally I created a new term with a definition, on which I can create relationships. So here, a commercial is related to a customer by this definition, and I added the link. I can also add a steward, a kind of owner of this definition, so that anyone with a question about this notion is able to contact the right person. And if we go back to the assets which use those terms, I am able, in the catalog, to search for customer, for instance, and see that the assets associated with this notion are brought back by the search engine. So this is very interesting in terms of data exploration and of data and business understanding.
Of course, the goal of my data lake is not just to expose my data so as to understand the business data domain, but also to apply transformations on it and to get quality insights on it. The Process section allows you to create tasks on the different data that you consume and store inside the assets we just saw. Those processes are provided by Google services, which you do not have to create yourself, and which are offered to you through templates for the common tasks. But keep in mind that you are also able to provide your own business logic, through Dataflow pipelines for instance. You also have the capacity to define specific processing to get more insight into the quality of your data. This feature is still in preview, but it relies on a dedicated data quality project from Google, which allows you to express the different rules you want to apply on your data through a YAML file. It is quite interesting in terms of the possibilities this feature opens up.
And of course we also have the Secure section, which gives you a broad overview of what kind of access you grant on the lake, but also on the zones and the assets attached to them. So, as we saw, this product is more of an abstraction layer than a real service like BigQuery: it gives you pointers to different assets, BigQuery datasets and Cloud Storage buckets, to federate the different processes you will apply on your data, but also the different permissions you will apply on it. And last but not least, it also enables the exploration of your entire set of data, with elements that are technical, the type of data you are looking for, but also related to the business, thanks to the glossary we just saw. That will conclude this presentation. Thank you for your attention, and if you have any questions, feel free to join us on the chat. See you.