Transcript
Hi, my name is Satish Mane. I will talk about data lake table
formats and their integration with AWS Analytics Services to
build a cloud-native data lakehouse on the AWS Cloud.
Hi, my name is Rajeev Jaiswal. In this session I'll take
you on a journey to the data lake. Thanks for joining in.
Before we dive further, let's first understand a couple of trends we
see in businesses. The first one is the expectations of customers.
In the digital era, customers expect the kind of
experience they get from Airbnb, Uber and all
these technologies. Beyond those experiences, the bar is
personalized experiences, demonstrating a true understanding
of a customer and their context. These expectations
carry from one industry to another, so experiences have to be offered contextually.
The second trend is data volume.
Data is growing at an unprecedented rate,
exploding from terabytes to petabytes and sometimes exabytes.
Traditional on-premises data analytics approaches do not scale
well and are too expensive to handle these volumes of data.
We often hear from businesses that they are trying to
extract more value from their data, but are struggling to capture,
store and analyze all the data generated by today's modern
digital business. Data grows exponentially.
It comes from new sources, is becoming more diverse, and needs to
be securely accessed and analyzed by any number of applications
and people. All this brings us
to the subject of technology. Before diving into the broader analytics
architecture, let's first understand how legacy or traditional on-premises
data analytics stacks up. There is typically
an operational database for storing customer records
and transactions, followed by a reporting
database for data mart and data warehouse type use cases. There
are four main problems with this type of architecture.
First, the analytics implementation cycle is too long,
as moving data sets and building dashboards can take weeks or
even months. The second issue is scalability and
higher cost, because you always have to plan ahead to
buy more hardware and pay for more licenses.
Third, this architecture is not suitable for modern analytics use cases
such as machine learning and ad hoc queries for data science use cases.
Finally, organizations struggle to keep up with the
pace of changing business needs.
Now, how can you solve all these problems?
The answer is a data lake. A data lake makes it easier to
derive insights from all your data by providing a single place
to access structured data, semi-structured data
and unstructured data. Customers need a highly
scalable, highly available, secure and flexible data
store that can handle very large data sets at a reasonable
cost. Therefore, three key points are important for data lakes:
store data in its original form and format, no matter how much,
what kind, and how fast it is generated;
define structures and processing rules only when
necessary, also known as schema on read; and, as data
is used by a large community, democratize data.
What you are seeing is a high-level architecture of a data lake in the cloud.
As a low-cost, durable and scalable store, Amazon S3
provides the storage layer, which is completely decoupled from
data processing and the various big data tools, and has zero
operational overhead. Customers can choose a data lake file format such
as Apache Parquet. Spark-powered AWS services
such as AWS Glue, Amazon EMR and Amazon Athena enable
access and compute at scale. The metadata layer stores metadata
about tables, columns and partitions in the AWS Glue Data Catalog.
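To make the catalog concrete, here is a minimal sketch, assuming a hypothetical Glue database named sales_lake and a table named orders, of how an application can read the table metadata that the Glue Data Catalog stores.

```python
# A minimal sketch (hypothetical database and table names): reading the metadata
# that the AWS Glue Data Catalog keeps for a data lake table.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Fetch the catalog entry for the table and print its S3 location and columns.
response = glue.get_table(DatabaseName="sales_lake", Name="orders")
table = response["Table"]
print("Location:", table["StorageDescriptor"]["Location"])
for column in table["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])
```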
To keep the data in its original form and format, you need the ability to
handle various file formats such as JSON, CSV, Parquet,
Avro and more. Each format is suitable for
different use cases. For example,
CSV is popular for its low volume and human-readable format.
CSV is what we call row-oriented storage. Parquet
files organize data into columns; column-store files
are more optimized because you can perform better compression on
each column. Parquet is well suited for bulk processing
of complex data.
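As an illustration, here is a minimal PySpark sketch, with hypothetical S3 paths, that converts row-oriented CSV files into columnar Parquet so that engines can compress and scan each column independently.

```python
# A minimal sketch (hypothetical S3 paths): converting row-oriented CSV into
# columnar Parquet with PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read raw CSV files as they landed in the data lake (schema inferred for brevity).
orders = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://example-bucket/raw/orders/")
)

# Rewrite the same data as Parquet; each column is stored and compressed
# separately, so queries that touch only a few columns scan far less data.
orders.write.mode("overwrite").parquet("s3://example-bucket/curated/orders/")
```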
Now that you understand the data lake components,
if you're creating your own data lake, there are some components
that you will most likely need to create and maintain.
Data needs to be collected into scalable
storage without ETL transformation.
All data must be cataloged, because without a catalog you cannot
manage data, find data or organize access control.
All kinds of analytics are needed, including batch
analytics, stream analytics and advanced analytics
like machine learning; therefore, end-to-end data ingestion and analysis
processes need to be coordinated.
Data should be available to all kinds of people,
users and roles. Most importantly,
you need a framework for governing analytics and data,
because without governance, finding a good solution is impossible.
Data analysts can query data lakes directly using fast
compute engines such as Amazon Redshift and their preferred language, SQL.
Data scientists then have all the data they need to
build robust models. Data engineers can also
easily simplify data pipelines rather than focus on infrastructure.
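For example, an analyst could submit SQL directly against the cataloged data with Amazon Athena. The sketch below is only an illustration; the database, table and result bucket names are hypothetical.

```python
# A minimal sketch (hypothetical names): submitting an ad hoc SQL query to
# Amazon Athena with boto3 and checking its status.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = athena.start_query_execution(
    QueryString="SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id",
    QueryExecutionContext={"Database": "sales_lake"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)

status = athena.get_query_execution(QueryExecutionId=query["QueryExecutionId"])
print(status["QueryExecution"]["Status"]["State"])
```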
So let's understand the benefits of serverless data lakes. Serverless
is a native architecture of the cloud, allowing you to offload more operational
responsibilities to AWS. It increases agility
and innovation by allowing you to focus on writing the business
logic that serves your customers. Serverless
technology offers automatic scaling, built-in high availability,
and a consumption-based billing model for cost optimization.
Serverless allows you to build and run applications and services without
worrying about infrastructure. It eliminates infrastructure management tasks
such as server or cluster provisioning, patching,
operating system maintenance and capacity provisioning.
AWS offers many other serverless services which I won't cover here,
such as DynamoDB, Redshift Serverless and so on.
Now I'm going to hand over the call to Satish, who is going to deep
dive into lake house architecture. Thank you, Rajeev. Now that you understand
regular data lakes, let me explain the building blocks of lake house architecture.
Data lakes have become the default repository for all kinds of data.
A data lake serves as a single source of truth for a large number of
users querying from a variety of analytics and machine learning tools.
Is your data lake getting unmanageable? Do you want to build
a highly scalable, cost-effective data lake with transactional capabilities?
Are you struggling to comply with data regulations as to how customer
data in data lakes can be used? If you are facing these challenges,
then this session talks about how lake house architecture solves
those challenges.
What challenges do typical data lakes face?
Regular data lakes provide scalable and cost-effective storage.
However, regular data lakes fall short when you are
continuously ingesting data and need transactional capabilities
to query from many analytics tools. At the same time,
under CCPA and GDPR regulations,
businesses must change or delete all of a customer's data
upon request to comply with the customer's right to be
forgotten or a change of consent to the use
of data. It is difficult to make these kinds of record-level
changes in regular data lakes. Some customers find
change data capture pipelines difficult to handle.
This is especially true for recent data or erroneous data
that needs to be rewritten. A typical data lake would
have to reprocess missing or corrupted data due to job
failures, which can be a big problem.
Regular data lakes do not enforce a schema when writing,
so you cannot avoid ingesting low-quality data.
Also, one should know the partition or table structure
to avoid full table scans and listing files from
all partitions.
So let's see how an open table format can be used
to address the challenges mentioned on the previous slide. One of
the key characteristics expected of lake house architecture
is transactional or ACID properties.
You do not have to write any code for this with a transactional
data lake format, which I will cover in the next
few slides. Transactions are automatically written to a log, presenting
a single source of truth. Advanced features such as
time travel, data transformation with DML, and concurrent
reads and writes are also expected in a data lake to handle use
cases such as change data capture and late-arriving streaming
data. Over time, you can also expect a data lake to have
features such as schema evolution and schema enforcement.
These features allow you to update your schema over time and to
ensure data quality during ingestion.
Engine neutrality is also expected in the future of
data architecture. Today you use one compute engine to process data,
but tomorrow you can use a different engine for new needs.
For time travel, a data lake table format
versions the big data that you store in the data lake.
You can access any historical version of the data,
simplifying data management with easy-to-audit rollback
of data in case of accidental bad writes or deletes, and reproducible
experiments and reports.
Time travel enables reproducible queries by allowing two
different versions to be queried at the same time.
Open table formats work at scale by automatically checkpointing
and summarizing large amounts of data, many files and their
metadata.
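To make these capabilities concrete, here is a minimal Spark sketch of a record-level delete and a time travel read. It uses Delta Lake syntax and a hypothetical table path purely as an illustration; Hudi and Iceberg expose equivalent operations with their own syntax.

```python
# A minimal sketch (Delta Lake syntax, hypothetical path): a record-level delete
# for a "right to be forgotten" request, followed by a time travel read.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("table-format-features")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Delete one customer's records; the table format rewrites only the affected
# files and records the change as a new transaction in the log.
spark.sql(
    "DELETE FROM delta.`s3://example-bucket/lake/customers` WHERE customer_id = '42'"
)

# Time travel: read the table as it was at an earlier version of the log.
previous = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("s3://example-bucket/lake/customers")
)
previous.show()
```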
So what are your options for creating a data lake house
architecture to solve those regular data lake challenges?
Customers often face a dilemma when it comes to choosing the right
data architecture for building a data lake house. As such,
some customers use a data warehouse to
eliminate the need for data lakes and the complexity
that comes with a data lake. However, a new pattern that
is emerging as a popular pattern for implementing
a data lake house on AWS is to combine both
data lake and data warehousing capabilities.
This pattern is known as the lake house architectural pattern.
There are two options for creating a lake house on AWS,
which I will talk about in the next few slides.
So before diving into each data lake table format and
the lake house architectural options on AWS,
let me quickly compare the building blocks
that I discussed on the previous slide.
Depending on your needs, a typical organization will
need both a data warehouse and a
data lake, as they serve different needs and use cases.
Data lakes store both structured and unstructured
data from various other data sources such as
mobile apps, IoT devices, and social media.
The structure of the data or schema is not defined at
the time of data collection. This means you can store
all your data without having to plan carefully
or know what questions you will need to answer in the future.
A data warehouse is a database optimized for analyzing
relational data from transactional systems. Data structures and
schemas are predefined to optimize fast SQL queries,
the results of which are typically used for operational reporting
and analysis. Data is cleaned, enriched, and transformed
so that it can serve as a single source of truth that users
can rely on. However, once organizations
with a data warehouse recognize the benefits of a data lake
house that provides the functionality of both a data lake
and a data warehouse, they can evolve their data warehouse
to include a data lake house and enable various
query capabilities.
So the first lake house architecture option is a ready-to-use
platform on AWS. This approach
allows for a separate data warehouse with transactional capabilities,
such as Amazon Redshift, and a cost-effective,
scalable data lake on Amazon S3. Technologies
such as Amazon Redshift Spectrum can then be used to
integrate strategically distributed data in both the data
warehouse and the data lake. This approach definitely
simplifies the engineering effort, freeing developers
to focus on feature development and leave the infrastructure to the cloud, harnessing
the power of serverless technology from the storage to the processing
and presentation layers. In this pattern,
data from various data sources is aggregated
into Amazon S3 before transformation or loading
into the data warehouse. This pattern is useful if
you want to keep the raw data in the data lake and process data in
the data warehouse to avoid scaling cost. With this option,
you can take advantage of Amazon Redshift's transactional
capabilities and also run low-latency analytical queries.
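As a sketch of that integration, the example below registers a Glue Data Catalog database as an external schema so Redshift Spectrum can query the data lake alongside warehouse tables. The cluster identifier, database, user, catalog database and IAM role ARN are all hypothetical placeholders.

```python
# A minimal sketch (hypothetical cluster, user, database and role ARN): exposing
# a Glue Data Catalog database to Amazon Redshift Spectrum as an external schema.
import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_lake
        FROM DATA CATALOG
        DATABASE 'sales_lake'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-spectrum-role'
    """,
)
```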
The second option is a do-it-yourself option for creating
a larger data lake house. Why do it
yourself? This pattern is growing in popularity because
of three table formats, Apache Hudi, Delta Lake,
and Apache Iceberg, that have emerged over the past few
years to power data lake houses that
support ACID transactions, time travel and
granular access control, and deliver very good
performance compared to a regular data lake. These open
data lake table formats combine the scalability and
cost effectiveness of a data lake on Amazon S3
with the transactional capabilities, reliability and
performance of a data warehouse to ensure greater scale.
Table formats, or data lake table formats, are instrumental for
getting the scalability benefits of the data lake and the underlying
Amazon S3 object store, while at the same time getting the data
quality and governance associated with data warehouses.
These data lake table format frameworks also
add additional governance compared to a regular data lake.
Optionally, you can connect Amazon Redshift for
low-latency OLAP access to business-ready data.
Now I will quickly walk through three popular table formats.
The first one is Apache Hudi. Apache Hudi follows a
timeline-based transaction model. A timeline
contains all actions performed on the table at different instants
of time. The timeline provides instantaneous views
of the table and supports retrieving data in the order of arrival.
Apache Hudi offers both multi-version
concurrency control and optimistic concurrency control.
Using multi-version concurrency control, Hudi provides
snapshot isolation between an ingestion writer and
multiple concurrent readers. It also applies
optimistic concurrency control between concurrent writers.
Hudi supports file-level optimistic
concurrency control; that is, for any two commits or writers
happening on the same table, if their writes
do not change overlapping files, both writers
are allowed to succeed. The next one is time
travel. You can also do time travel according to the Hudi
commit time. Hudi supports schema evolution to
add, delete, modify, and move columns, but it does not
support partition evolution; you cannot change the partition
column. When it comes to storage optimization,
auto file sizing and auto compaction are great for ensuring
storage optimization by avoiding small files in
Apache Hudi. And the last one is indexing. By default,
Hudi uses an index that stores the mapping between a record
key and the file group ID it belongs to. When modeling,
use a record key that is monotonically increasing,
for example one with a timestamp prefix, for the best index performance
through range pruning to filter out files.
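To illustrate, here is a minimal PySpark sketch of an upsert into a Hudi table. The table, field and path names are hypothetical, and it assumes the Hudi Spark bundle is available on the cluster, as it is on EMR or Glue with the right job settings; the record key and precombine field drive the index lookup described above.

```python
# A minimal sketch (hypothetical names and paths, Hudi Spark bundle assumed):
# upserting a small batch of records into an Apache Hudi table.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-upsert")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# A toy batch of records; in practice this would come from CDC or a stream.
orders_df = spark.createDataFrame(
    [("o-1", "2024-01-01", "2024-01-01T10:00:00", 19.99)],
    ["order_id", "order_date", "event_ts", "amount"],
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.operation": "upsert",
}

(
    orders_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/hudi/orders/")
)
```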
The second table format is Apache Iceberg.
Apache Iceberg follows a snapshot-based transaction
model. A snapshot is a complete list of files in
the table. The table state is maintained in metadata
files. All changes to the table state create a new metadata file
that replaces the old metadata file with an atomic swap.
Iceberg follows optimistic concurrency control.
Writers create table metadata files optimistically, assuming
that the current version will not be changed before the writer commits.
Once a writer has created an update, it commits
by swapping the table's metadata file pointer from the base version
to the new version. If the snapshot on which the update is
based is no longer current, the writer must retry
the update based on the new current version. For time
travel, users can also do time travel according
to a snapshot ID or a timestamp. When it
comes to storage optimization, you can clean up unused older
snapshots by marking them as expired based on a certain time period
and then manually running a Spark job to delete them.
To compact small files into larger files, you need
to run a Spark job in the background manually.
And the last one is indexing. Apache Iceberg uses
value ranges for columns to skip data files
and partition fields to skip manifest files
when executing a query.
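As a sketch of that maintenance, the example below calls Iceberg's Spark procedures to expire old snapshots and compact small data files. The catalog name glue_catalog, the warehouse path and the table db.orders are hypothetical, and an Iceberg-enabled Spark session is assumed.

```python
# A minimal sketch (hypothetical catalog, warehouse and table names): Apache
# Iceberg table maintenance via Spark SQL procedures.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-maintenance")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://example-bucket/iceberg/")
    .getOrCreate()
)

# Mark snapshots older than a cutoff as expired and delete their unused files.
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'db.orders',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")

# Compact many small data files into fewer, larger ones.
spark.sql("CALL glue_catalog.system.rewrite_data_files(table => 'db.orders')")
```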
The third one is Delta Lake. Delta Lake has a transaction model
based on a transaction log. It logs the file operations
in JSON files and then commits them to the table using
atomic operations. Delta Lake automatically generates a checkpoint
file in Parquet format every ten commits.
Delta Lake employs optimistic concurrency control. Optimistic
concurrency control is a method of dealing with concurrent transactions that
assumes that transactions, or changes made to a
table by different users, can complete without conflicting with one another.
Users can also do time travel queries according to the timestamp
or version number. Delta Lake supports,
or lets you update, the schema of
a table by adding new columns or reordering existing columns.
And when it comes to storage optimization, Delta Lake
does not have automatic compaction, as it follows copy-on-write;
hence file sizing is manual.
You need to run the vacuum and optimize
commands to convert small files into large files.
Delta Lake collects column statistics for
data skipping, so at query time it takes
advantage of this information, the minimum and maximum values
of each column, to provide faster queries.
It uses the Z-order indexing technique
to colocate related data in the same set of files
for a particular column used in the Z-order.
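Here is a minimal sketch of those maintenance commands, with a hypothetical table path, assuming a recent open source Delta Lake release where OPTIMIZE and Z-ordering are available.

```python
# A minimal sketch (hypothetical path, recent Delta Lake release assumed):
# compacting small files, Z-ordering by a column, and vacuuming old files.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-maintenance")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

table_path = "s3://example-bucket/delta/orders"

# Rewrite small files into larger ones and colocate rows by customer_id so the
# per-file min/max statistics can skip files at query time.
spark.sql(f"OPTIMIZE delta.`{table_path}` ZORDER BY (customer_id)")

# Remove data files no longer referenced by the transaction log (the default
# retention is 7 days, i.e. 168 hours).
spark.sql(f"VACUUM delta.`{table_path}` RETAIN 168 HOURS")
```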
So this is a quick snapshot of how these table
formats are integrated with AWS analytics
services. Amazon Athena has better integration with
Apache Iceberg in terms of read and write operations, whereas it
supports only read operations on Apache Hudi
and Delta Lake. Amazon Redshift Spectrum supports
both Apache Hudi and Delta Lake for reading data.
EMR and Glue support both read and write
against these three table formats.
You can also manage permissions in Amazon Athena
using AWS Lake Formation for the Apache Hudi and Apache Iceberg
table formats. Similarly, you can manage permissions in
Amazon Redshift Spectrum using AWS Lake Formation for
Delta Lake and Apache Hudi.
To conclude, here are final thoughts on choosing a data
lake table format for building lake house architecture on AWS,
based on your use case as well as integration with AWS
analytics services. Apache Hudi is considered
suitable for streaming use cases, whether it's IoT data or
change data capture from a database. Hudi provides
three highly flexible types of indexes for optimizing
query performance and also optimizing data storage.
Because of the auto file sizing and clustering optimization features
backed by index lookup, it is great for streaming use cases.
It comes with a managed data ingestion tool called
DeltaStreamer, unlike the other two table formats.
The second option is Apache Iceberg. If you're looking for
easy management of schema and partition evolution,
then Apache Iceberg is a suitable table format. One of
the advantages of Apache Iceberg is how it handles partitions.
Basically, it derives partition values from the data fields used
in a SQL WHERE condition, so
one does not need to specify the exact partition key in
the SQL query. Unlike Hudi
and Delta Lake, Iceberg allows you to easily change the
partition column on a table; it simply starts writing to
the new partition. The third option is Delta
Lake. This table format is suitable if your data platform is
built around the Spark framework with deep integration of Spark features.
As Delta Lake stores all metadata state information in
the transaction log and checkpoint files instead of a metastore,
you can use a separate Spark cluster to build the table
state independently without relying on a central metastore,
and this really helps to scale and meet the performance requirements
depending on your Spark cluster size. Thank you for listening. My
colleague and I hope you all enjoyed this session.