Transcript
This transcript was autogenerated. To make changes, submit a PR.
Welcome to today's conference where we delve into
modern data architecture. In an era where data has
become the cornerstone of innovation and
competitive advantage, organizations worldwide
are racing to adopt modern data strategies that can
unlock its full potential. But why
do we need modern data architecture? Today, we will explore
this question and more: understanding how the landscape of
data has evolved and why traditional approaches
are no longer sufficient, from the explosion of data sources
to the demand for real-time insights,
the challenges organizations face in harnessing that
data, and how the solution lies in modernizing
our approaches to data architecture. Our discussion
will not only focus on the theoretical aspects of modern
data strategy, but will also delve into practical
insights on how to build modern data
architectures that are agile, scalable and resilient.
We will explore the latest technologies,
methodologies and frameworks that empower
organizations to transform data into actionable insights,
driving innovation and fueling growth.
Throughout this talk, we will share best practices
gleaned from industry leaders and experts,
offering key takeaways that you can implement within
your own organizations. So whether you are a seasoned
data professional or just beginning your journey into the world
of modern data architecture, I encourage you to engage,
collaborate and immerse yourself in the wealth of knowledge
and insight that this talk has to offer.
Welcome to the journey of building modern data architectures.
So, before going
into the details of how
we architect modern data organizations,
let's take a step-by-step approach.
In every organization, there will be a need to write data
to some area and then for consumers to read that data.
So in a very simple format, the data lake
is like a simple storage where structured and unstructured
data can be stored and consumed from.
So in short, in its basic form,
an organization will need a place to write data
that will be consistent and durable. They will also need
a way for consumers to consume that data in a consistent way.
This could be as simple as S3 in AWS, which will store data
that can be structured, unstructured or semi-structured.
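To make this basic form concrete, here is a minimal sketch of writing and reading a single object in S3 with boto3; the bucket name and key layout are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key layout for a raw "data lake" area.
bucket = "acme-data-lake"
key = "raw/orders/2024-05-01/orders.json"

# Publisher side: write an object (structured, semi-structured or unstructured bytes).
s3.put_object(Bucket=bucket, Key=key, Body=b'{"order_id": 1, "amount": 42.5}')

# Consumer side: read the same object back.
obj = s3.get_object(Bucket=bucket, Key=key)
print(obj["Body"].read())
```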
However, this simple storage will have a number of challenges.
If every publisher writes data in their native format,
then consumers will not have a consistent way to read the data.
The data also needs to be organized as it is written so
that consumers know where to read from. And data
is added at some interval, which creates
a problem in presenting a comprehensive view. Going
into a little more detail on this part: in its rudimentary form, the
data lake serves as a reservoir for both structured and unstructured data,
providing a centralized repository in which
information can be stored and retrieved. Writers deposit
data into the data lake while readers access
and extract data from its storage. However,
this seemingly straightforward approach presents
the number of challenges that we just discussed.
So addressing these challenges is essential
for optimizing the functionality and efficiency
of the data lake, thereby enhancing its utility
and value within the organization's ecosystem. Now,
I will be talking a little bit more about how to
organize the data storage.
So basically, in this talk, we are trying to build this up
in a step-by-step way. We were just talking about the simple form;
now, going a little deeper, how can we store
the data? We will follow this pattern through the entire talk.
Coming back to organizing data storage:
every data publisher will need to write data to the
storage, but we do not want each of them to invent
a new way to write this data. Hence, there will
be a need for a standard way to write the data.
This can be fulfilled by creating a component called
data writer. Data writers are assigned
specific areas within the data lake designated for their
respective sub-organizations. They must
authenticate their identity and specify the
data set they are writing to. Meanwhile,
the data readers allow consumers to access data based on their
identity and the data set that they are authorized
to access. Data within each sub-organization
is structured into distinct data sets.
Within each dataset's designated area,
data is organized into delta files and aggregates.
Additionally, data is partitioned for efficient
storage and retrieval. Standard formats such as Parquet
are employed for storage, while formats like Iceberg
and Hudi may be utilized to enable transactional
capabilities. In short, I am right now focusing on the write
part: how the publisher will leverage the data writer component
to store the data, a little bit on how to organize that data
within an organization, and then how to manage the deltas,
which will come on a constant basis or at some predefined
frequency. We need to make sure that as the deltas come in,
they are applied to the master files.
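As a rough illustration of what such a data writer might do, the sketch below writes a delta as partitioned Parquet into a dataset's designated area using pandas and pyarrow (and assumes s3fs for the S3 path); the sub-organization, dataset and partition names are hypothetical, and merging deltas into master files with Iceberg or Hudi is not shown.

```python
import pandas as pd

# Hypothetical delta arriving from a publisher in the "payments" sub-organization.
delta = pd.DataFrame(
    {
        "order_id": [101, 102],
        "amount": [42.5, 13.0],
        "event_date": ["2024-05-01", "2024-05-01"],
    }
)

# The data writer owns the layout: one area per sub-organization and dataset,
# partitioned by a date column and stored as Parquet for consistent reads.
delta.to_parquet(
    "s3://acme-data-lake/payments/orders/",   # designated area for this dataset
    partition_cols=["event_date"],            # partitioning for efficient retrieval
    engine="pyarrow",
)
```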
Now, let's transition from storage to utilization.
Basically, we will be talking here about the
consumption side. Before going into
the details, the very high-level questions
would be: who will consume the data, where will they
consume it from, and how will they consume it?
Now there will be various personnel who will consume this
data. These could be data analysts,
data scientists, applications, or consumers
that we do not know of yet. And why am I saying
consumers that we do not know of? There could
be an innovative scenario where
the consumer does not fit into the
data analyst, data scientist, or application bucket. So that's why I'm
creating another bucket for the consumers that
we do not know of right now. All of these would
need the data to be stored in a fit-for-consumption
platform. This means that
the data for a batch application will be stored
in a data lake. Data for real time
access would most likely be accessed from a
database or from a low latency storage.
Additionally, for consumption patterns that we
do not know of, the platform will have to enable
the creation of custom sinks to fulfill
specific consumption needs. Further, the consumers
will need tools that are appropriate
for their consumption. These could be APIs for
real-time access or notebooks for data scientists.
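To make the two common consumption paths concrete, here is a minimal sketch contrasting a batch read from the lake with a real-time lookup from a low-latency store; the Parquet path, DynamoDB table name and key are hypothetical.

```python
import boto3
import pandas as pd

# Batch consumption: a data analyst or batch application reads Parquet from the lake.
orders = pd.read_parquet("s3://acme-data-lake/payments/orders/")  # hypothetical path
print(orders.head())

# Real-time consumption: an application fetches a single record from low-latency storage.
table = boto3.resource("dynamodb").Table("orders-serving")        # hypothetical table
item = table.get_item(Key={"order_id": 101})
print(item.get("Item"))
```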
Now, even with this upgrade,
the remaining challenges are consistency
of data across the various sinks, the duplication of
effort to write this data, and the
controls to prevent unauthorized access.
As we move forward, we will see how to address
these challenges with a more holistic view. Now let's discuss
how the challenges that we talked about earlier can be addressed.
Multiple data publishers write data in various formats.
For example, a publisher may write data in
CSV format to the data lake, and there
could be another publisher that may want to write
data in Avro format to streams.
So we would need a component that will enable publishers
to publish in a uniform way, and that will be able
to move the published data into any of the data
platforms that consumers want to consume
from. This component will also write data in
a standard format, for example Parquet.
We call this the data pipeline.
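As a rough sketch of that uniform-publish idea, the function below accepts a publisher's payload in CSV or Avro and normalizes it to Parquet before it is routed to the sinks; the function itself is hypothetical, and the Avro branch assumes the fastavro package.

```python
import io
import pandas as pd
import fastavro

def normalize_to_parquet(payload: bytes, fmt: str) -> bytes:
    """Convert a publisher's CSV or Avro payload into Parquet bytes."""
    if fmt == "csv":
        df = pd.read_csv(io.BytesIO(payload))
    elif fmt == "avro":
        records = list(fastavro.reader(io.BytesIO(payload)))
        df = pd.DataFrame(records)
    else:
        raise ValueError(f"unsupported format: {fmt}")
    buf = io.BytesIO()
    df.to_parquet(buf, engine="pyarrow")
    return buf.getvalue()

# The pipeline would then route the normalized bytes to each sink (lake, stream, database).
```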
Some of the challenges that this component will have to solve are:
ensure that there is consistency in the data
that is being written to these various platforms;
ensure that the data is being written in
a standard format so that there is predictability from
the consumer's perspective; and implement governance so
that there is trust in the data and the burden of implementing
governance does not fall on each individual platform.
Hence, it is important for us to establish a first-class
data storage platform. If there
is an inconsistency in the data, then this
first-class store is what we will use to rebuild the other platforms.
For consumers to consume data,
we will need discoverability
of the data. This can be achieved by establishing
a data catalog. The data catalog tells consumers
what data is available for consumption,
but we will also have to solve for who can access what
data. This is where data access control comes
into the picture. Now, before
doing a deep dive on the data pipeline, let's spend a
few minutes on the role of the data writer.
So the data writer component
plays a pivotal role in managing both real-time and batch data
streams within the data ecosystem. Real-time
streams are critical for instantaneous decision
making and monitoring processes. Data can
originate from APIs or change data capture
mechanisms; regardless of the source, real-time
data is channeled into a continuous flow,
forming a dynamic data stream. Batch data
processing is essential for handling large volumes
of data at scheduled intervals.
Batch data may originate from file-based storage, logs or
databases. Batch datasets are directed towards a centralized
data lake for storage and subsequent analysis.
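A very small sketch of that split might look like the following, where real-time records are pushed to a stream and batch extracts land in the lake; the broker address, topic and lake path are hypothetical, and the streaming branch assumes the kafka-python package.

```python
import json
import pandas as pd
from kafka import KafkaProducer

# Real-time path: change-data-capture or API events are pushed onto a stream.
producer = KafkaProducer(
    bootstrap_servers="broker:9092",                       # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send("orders-events", {"order_id": 103, "amount": 7.0})
producer.flush()

# Batch path: periodic extracts from files, logs or databases land in the lake.
batch = pd.DataFrame({"order_id": [104, 105], "amount": [9.5, 3.2]})
batch.to_parquet("s3://acme-data-lake/payments/orders/batch-2024-05-01.parquet")
```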
Let's now discuss the data pipeline.
The basic idea of the data pipeline is to be
able to preprocess the data before it is written to
the storage platforms. The basic steps within the pipeline
are implementing data governance,
performing light transformations, and routing
data to the appropriate sinks. Additionally,
more steps should be addable in the future, so the pipeline
should be flexible enough to accommodate custom steps down the
line as well. All of these steps are organized
using an orchestrator; this could be something like Flink
or Airflow. Now let's talk about
each of these steps. Data governance
implements policies and procedures for managing data access,
security and compliance. It will also do
a schema check to
validate the schema and ensure conformity with
predefined standards and structures. It will also
do quality checks to identify and address data integrity issues
or anomalies. And it will scan for
sensitive data in order to
identify and protect it, ensuring compliance
with privacy regulations.
The data pipeline will also route the data to its destinations,
directing data streams to diverse destinations based on predefined routing
rules and criteria, and ensure data is
eventually retained, implementing
mechanisms to guarantee data persistence and reliability
even under challenging conditions. It will also do
light transformations, applying lightweight transformations
to optimize the data for downstream processing,
such as changing data types or formatting.
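To picture how these steps hang together, here is a minimal sketch of an orchestrated pipeline, assuming Airflow 2.4 or later as the orchestrator; the pipeline_steps module and its functions are hypothetical placeholders for the governance, quality, scanning, transformation and routing logic described above.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical stage implementations; each would be provided by the platform team.
from pipeline_steps import (
    check_schema,
    run_quality_checks,
    scan_sensitive_data,
    light_transform,
    route_to_sinks,
)

with DAG(
    dag_id="data_pipeline_orders",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    schema = PythonOperator(task_id="schema_check", python_callable=check_schema)
    quality = PythonOperator(task_id="quality_checks", python_callable=run_quality_checks)
    sensitive = PythonOperator(task_id="sensitive_data_scan", python_callable=scan_sensitive_data)
    transform = PythonOperator(task_id="light_transformation", python_callable=light_transform)
    route = PythonOperator(task_id="route_to_sinks", python_callable=route_to_sinks)

    # Governance checks run first, then light transformation, then routing to the sinks.
    schema >> quality >> sensitive >> transform >> route
```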
Now, achieving all of the above can be streamlined with
a data pipeline offering a plug-and-play approach for adding and
updating data governance. For batch data, it routes to
both streaming and batch platforms; for real-time
data, it batches before routing to the lake
and sends it to the other downstream storage sinks simultaneously.
Before sharing the responsibilities of a data transformation platform,
let's revisit the challenges associated with it.
If each team must develop its own computing platform,
optimal capacity utilization becomes challenging.
Second, each team will expend effort maintaining infrastructure
individually. And third, there is
no standardized approach to implementing
DataOps, leading to inconsistency across teams.
To tackle these challenges, a data transformation
platform can be used to, first, accept transformation
scripts in various formats via a CI/CD pipeline;
second, initiate and execute jobs on on-demand
computing infrastructure;
third, dynamically grant access to specific
data based on job context; and
fourth, transfer the transformed
data to a data writer for storage on the data
platforms.
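As a loose sketch of what a submitted transformation script could look like under those four responsibilities, the job below reads from the lake, applies a transformation and hands the result to a data writer client; the platform_sdk module and its API are assumptions, not an existing library.

```python
import pandas as pd

# Hypothetical client provided by the platform; credentials and data access
# would be granted dynamically based on the job's context.
from platform_sdk import DataWriterClient

def run_job() -> None:
    # Read the dataset this job has been granted access to (hypothetical path).
    orders = pd.read_parquet("s3://acme-data-lake/payments/orders/")

    # Transformation logic owned by the team that submitted the script via CI/CD.
    daily_revenue = orders.groupby("event_date", as_index=False)["amount"].sum()

    # Hand the transformed data back to the data writer for storage on the platforms.
    DataWriterClient().write(dataset="payments/daily_revenue", dataframe=daily_revenue)

if __name__ == "__main__":
    run_job()
```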
As you can see in this diagram, I am going to talk about data organization.
The left-hand side of the diagram covers
how to organize the data in different
LOBs (lines of business) and the role of individual publishers
within each LOB. The right-hand side shows
what the enterprise catalog would look like and
the role of the different sub-LOBs within
the overall framework. In general,
every organization is composed
of sub-LOBs. Each of these sub-LOBs will have multiple
publishers, and each of these publishers may publish
various data sets. These data sets may
in turn be sent to various data platforms like the lake
or a database. We will call this data distribution.
Further, there can be multiple
consumers in each of these sub-LOBs which
may want to access data from any of the data
sets across any sub-LOB.
So it is imperative that this data is discoverable
by the consumers. Hence, the publisher will be responsible
for registering the data in the catalog. In many organizations,
each sub-LOB may have invested in
its own catalog. In such cases,
they will need to roll up all these registrations
into a central catalog so that the data is discoverable.
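As a hedged sketch of what such a registration might carry, here is a plain payload posted to a hypothetical central catalog endpoint; a real catalog product would have its own API and schema.

```python
import json
import urllib.request

# Hypothetical registration payload a publisher might submit to the central catalog.
registration = {
    "lob": "retail-banking",
    "sub_lob": "payments",
    "dataset": "orders",
    "owner": "payments-data-team@example.com",
    "schema": {"order_id": "long", "amount": "double", "event_date": "date"},
    "distributions": [
        {"platform": "data-lake", "path": "s3://acme-data-lake/payments/orders/"},
        {"platform": "database", "table": "payments.orders"},
    ],
}

# Hypothetical endpoint; a real catalog (for example Glue or DataHub) has its own API.
req = urllib.request.Request(
    "https://catalog.example.com/api/datasets",
    data=json.dumps(registration).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
urllib.request.urlopen(req)
```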
Now let's take a look at things that are
needed to enable publishers and consumers.
We will need a UI for the data users.
We will call this the data portal;
many organizations will call it by different names.
As we have already discussed, we will need a place for
the admins to define an organization and
its sub-organizations. We will also need
to let publishers define their data in the catalog.
This would be data sets, data distributions and
other things. After defining
the catalog, the publishers may be asked
to define the data quality, lineage and
other things so that consumers have confidence in
the data that they are consuming. The basic idea
here is to reduce friction for the data to be published.
Hence the tiers: tier one, just publish but no
consumption; tier two, publish data with basic governance information
like the schema; and tier three, additional information like
data quality. Besides supporting the publisher
and consumer, the data portal would need to support other
use cases like reporting, searching and access
requests. Let's now focus on machine learning.
How can this platform that we have defined till now support
machine learning? As we can see in the diagram
at a rudimentary level, there are a few stages.
Raw data will be available in the data lake.
A transformation job within the data transformation platform
will extract the features and store them in a feature platform.
This feature platform may be created using the following:
feature registration in the data portal, data storage
in the data lake and perhaps a low-latency cache, and feature
serving using an API. A job within the transformation
platform trains the model.
Then the trained model will create offline predictions
and store them in the data lake and in a
low-latency storage. One other thing
to note here is that we are not covering real-time prediction.
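As a rough end-to-end sketch of those stages, the job below extracts features from raw lake data, trains a model and writes offline predictions back; it assumes scikit-learn for the model, leaves out the feature-serving API and low-latency cache, and all paths and column names are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stage 1: raw data is available in the data lake (hypothetical paths throughout).
raw = pd.read_parquet("s3://acme-data-lake/payments/orders/")
labels = pd.read_parquet("s3://acme-data-lake/labels/churn.parquet")  # customer_id, churned

# Stage 2: a transformation job extracts features and stores them in the feature platform.
features = raw.groupby("customer_id", as_index=False).agg(
    order_count=("order_id", "count"),
    total_amount=("amount", "sum"),
)
features.to_parquet("s3://acme-data-lake/features/customer_order_stats.parquet")

# Stage 3: a job within the transformation platform trains the model.
train = features.merge(labels, on="customer_id")
model = LogisticRegression().fit(train[["order_count", "total_amount"]], train["churned"])

# Stage 4: the trained model creates offline predictions, stored back in the lake
# (and, in practice, also in a low-latency store for fast lookups).
features["churn_score"] = model.predict_proba(features[["order_count", "total_amount"]])[:, 1]
features.to_parquet("s3://acme-data-lake/predictions/churn.parquet")
```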
Now putting everything together, we will have these
components. The data writer,
which will be able to read and write data from various sources and in
various formats, scale up and down according to
load, and cater to both batch as well as real-time
data. The storage platforms: things
like data lakes, streams and other
storage systems. A data pipeline to
implement the various preprocessing steps,
binding all of these steps together through an orchestrator.
Potentially, a data transformation layer or platform
which will help us do the needed transformations
to send the data to the various sinks. The consumption
tools for consuming data from the various platforms in a
way that is convenient for the consumer. And finally,
a UI that binds all this together:
that is the data portal. Now I
will spend some time to share
some of the challenges related to the cloud.
The first is how
each LOB will create or share their AWS accounts;
this is a common challenge.
The solution could be to create one central account
to store the data, with each LOB creating their own account.
There could be a potential challenge with data security. There is an added
concern of a data breach exposing sensitive data. The solution
for that is to scan for sensitive data
and perform appropriate remediation, such
that even if the data is compromised,
it will be of very little use.
Some options are encryption, masking or deletion.
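As a very small sketch of the masking option, the function below replaces detected email addresses with irreversible hash tokens; the regex-based scan is a stand-in for a proper sensitive-data classification service.

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask_emails(text: str) -> str:
    """Replace each detected email with a short, irreversible hash token."""
    return EMAIL_RE.sub(
        lambda m: "user-" + hashlib.sha256(m.group().encode()).hexdigest()[:10],
        text,
    )

print(mask_emails("contact jane.doe@example.com for the report"))
# A compromised copy of the masked text reveals very little about the original.
```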
Another challenge is data integration: integrating on-premises data
with new data. The solution options could be
keeping historical data on premises, migrating historical
data to the cloud, or disconnecting from the on-premises data.
There could be a challenge with workflow migration, and that could be
solved by migrating the on-premises jobs to the cloud
so that new data is created in the cloud, or by letting
the jobs remain on premises, creating new data on premises,
and creating another job to move the data. From an
ops point of view, each service and storage is tagged
with the sub-LOB's identity so that this information can be
used for billing. There could be a challenge with data
governance; in that case
I would recommend CDMC as a standard, and
also using a custom data governance component.
Cloud capacity is
also a concern. The cloud is someone
else's data center; although it is a good abstraction, we will need
to plan for capacity, as each account has
soft and hard limits on the capacity available.
Well, thank you once again for your time and attention.
It has been a pleasure to share my
talk with you all today. I have shared my email
and I would be happy to take any questions you may
have. Thank you.