Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, I'm Guy Adornock, founding engineer at Treeverse, the company behind LakeFS, an open source platform for managing data at scale. I've spent the past few years working on LakeFS, helping teams version and manage their data more efficiently.
In this talk, we'll explore how data workflows have shifted from
running locally to the cloud.
We'll look at Python libraries like Pandas and TensorFlow and
how they handle cloud storage.
A key part of this transition is fsspec, a Python library that makes working with different storage systems seamless.
We'll see how it helps both users accessing object stores
and storage providers looking to improve their compatibility.
Finally, we'll dive into how fsspec can be implemented for an object store, using LakeFS as an example.
Let's start by looking at the local environment. Working locally is simple and convenient, and here's why. First of all, built-in library support.
Popular Python libraries like Pandas, TensorFlow, and PyArrow have always
worked seamlessly with local files.
Second, standard file handling. Python's built-in file interface makes it easy to work with files using familiar methods like read and write. And third, no setup required. Just point to a file and it works. No extra configuration needed.
As we see in these examples, working locally is simple and straightforward. Whether we're using Pandas to read a CSV or TensorFlow to load a model, we just point to a file path and it works. No extra steps, no setup required.
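As a rough sketch of what these examples look like (the file paths here are just placeholders):

```python
import pandas as pd
import tensorflow as tf

# Pandas reads a local CSV with nothing but a path
df = pd.read_csv("data/lakes.csv")

# TensorFlow loads a saved model from a local path the same way
model = tf.keras.models.load_model("models/my_model.keras")
```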
However, while working locally is easy, it does have its limits.
As data grows and projects scale, those limitations become more noticeable. The cloud offers some clear advantages in overcoming these challenges. Let's look at the key benefits of moving to the cloud.
Redundancy.
Cloud storage keeps your data backed up and protected from local failures.
Scalability.
You can handle massive data sets without worrying about running
out of storage or compute power.
Collaboration.
Teams can easily share and access data without needing to pass around files.
Security.
Cloud providers offer strong security measures, often better than what's
available on personal machines.
And accessibility.
Your data is available from anywhere, making it easy to work
across devices and locations.
So moving to the cloud unlocks new possibilities, but it
also brings in new challenges.
Let's take a closer look at what this transition looks like. Transitioning to the cloud poses challenges, especially when it comes to using familiar libraries. How can we bridge the gap between our favorite tools and cloud storage?
One approach is to manually sync data between local and cloud storage. For instance, using the AWS CLI or boto3 to download and upload the data all the time. So while this approach can get the job done, it's not the most practical, and it becomes error prone, cumbersome, and hard to maintain as things change.
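To make that concrete, here's a sketch of the manual-sync loop with boto3 (bucket and key names are placeholders): every script has to download before working and upload after.

```python
import boto3
import pandas as pd

s3 = boto3.client("s3")

# Pull the data down before we can touch it
s3.download_file("my-bucket", "datasets/lakes.csv", "/tmp/lakes.csv")

# Work locally as usual
df = pd.read_csv("/tmp/lakes.csv")
df.to_csv("/tmp/lakes_clean.csv", index=False)

# Push the result back up by hand
s3.upload_file("/tmp/lakes_clean.csv", "my-bucket", "datasets/lakes_clean.csv")
```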
Another approach is using library specific connectors.
Some libraries, like Pandas and TensorFlow, offer built in
support for certain cloud storage providers, making integration easier.
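For instance, TensorFlow ships its own file layer, tf.io.gfile, which can talk to certain cloud stores natively. A small sketch (the gs:// path is a placeholder, and which schemes actually work depends on how TensorFlow was built):

```python
import tensorflow as tf

# tf.io.gfile resolves the gs:// scheme through TensorFlow's own
# connector rather than through a shared abstraction
with tf.io.gfile.GFile("gs://my-bucket/datasets/lakes.csv", "r") as f:
    header = f.readline()
```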
At first, this sounds great. It allows for seamless access to cloud data without extra steps. But there are tradeoffs. Not every object store is supported, meaning if you're using a less common provider, you might be out of luck. Each library also needs to maintain its own implementation, which leads to duplicated effort. And even when support exists, the way each library handles configuration, dependencies, and even basic operations makes things inconsistent. So while this approach can work, it's not one size fits all. We need something more flexible and standardized. That's where fsspec comes in.
The creators of fsspec saw the need for a unified way to interact with file systems, both local and remote. Instead of each library implementing its own cloud integration, fsspec provides a standard interface that simplifies the process.
fsspec provides a unified interface: local and remote file systems behave the same. Multiple backend support: access a variety of storage systems via the same interface. Ease of integration: libraries and tools can leverage fsspec without worrying about backend-specific details. And enhanced capabilities: features like caching, transactions, and concurrency come built in, improving efficiency. Instead of every library re-implementing these best practices, with fsspec we only need a single, optimized implementation that handles them all. Once it's in place, everyone can use it.
By using fsspec, we get the flexibility of cloud storage without giving up the simplicity of working locally. For the fsspec audience, we have the end users, the library implementers, and the backend providers. So fsspec allows you to interact with cloud storage just as you would with local files. For example, with fsspec's S3 implementation, s3fs, you can read a file from S3 just like a local file. Moving to any other provider will only require changing the file system initialization and the paths, as you can see in the example.
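A minimal sketch of that idea, with placeholder bucket and key names (s3fs needs to be installed):

```python
import fsspec

# s3fs is used under the hood when we ask for the "s3" protocol
fs = fsspec.filesystem("s3")
with fs.open("my-bucket/datasets/lakes.csv", "r") as f:
    print(f.readline())

# Moving to another provider is mostly a matter of changing the
# protocol name and the paths
```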
fsspec simplifies the process in two ways. Explicit configuration: you can manually provide the necessary configuration when initializing the file system. This gives you control over how the connection is made. The other option: no initialization needed at all. In many cases, you don't need to initialize the file system. With just an open function call, fsspec automatically handles the connection, pulling the required configuration directly from your environment, just like the AWS CLI does. This means you don't have to worry about the underlying setup at all.
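As a sketch of that zero-setup style (the URL is a placeholder), fsspec.open picks the backend from the URL scheme and pulls credentials from the environment:

```python
import fsspec

# No file system object created explicitly; the s3:// scheme selects
# s3fs, and credentials come from the environment, AWS-CLI style
with fsspec.open("s3://my-bucket/datasets/lakes.csv", "r") as f:
    print(f.readline())
```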
For library developers: some libraries integrate fsspec directly, allowing seamless cloud storage access without additional setup. For example, Pandas uses fsspec under the hood. This means that any storage backend implementing fsspec automatically works with Pandas.
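So, assuming s3fs is installed, a single Pandas call can read straight from S3 (the URL is a placeholder):

```python
import pandas as pd

# Pandas hands the s3:// URL to fsspec, which routes it to s3fs
df = pd.read_csv("s3://my-bucket/datasets/lakes.csv")
```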
And beyond just using fsspec, you can extend it by creating a new backend for a storage system. Let's take a real world example. As I mentioned earlier, I work at Treeverse, the company behind LakeFS. And one of the users from the community recognized the value of integrating LakeFS with fsspec and decided to contribute lakefs-spec to make that happen. Now, while LakeFS itself isn't an object store, it adds data versioning on top of one and can be accessed like one. So we'll showcase the lakefs-spec implementation and demonstrate how it enables compatibility with Pandas, PyArrow, TensorFlow, and many other libraries.
But before we dive into lakefs-spec, let's take a step back and talk about what LakeFS is and why it's so useful. LakeFS is an open source project that sits on top of your object store and provides Git-like capabilities such as commit, merge, revert, tag, and so on. If we take a look at your current data ecosystem, working with LakeFS is almost identical to working with S3. When accessing the data, the only difference is that now, in addition to the path, you need to reference the branch you are working on.
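Roughly, the paths compare like this (repository, branch, and object names are placeholders):

```python
# Plain S3: bucket + object path
s3_path = "s3://my-bucket/datasets/lakes.parquet"

# LakeFS: repository + branch + object path; the branch is the only addition
lakefs_path = "lakefs://my-repo/main/datasets/lakes.parquet"
```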
LakeFS is accessible by UI, a CLI called lakectl, and clients in Python, Java, and many more.
Let's take a high level look at how it works, mainly to emphasize that there is no need to copy data. A commit is a list of pointers to objects in your object store. Say we edit one file and commit it. As you can see, the next commit will point to all the unchanged data from before, but instead of pointing to object 790, we now point to object 214. So if we were working on a dataset of maps that is a few terabytes and we want to test our code, all we need to do is create a new branch, and we get a dev environment in no time, with no duplication of data. Or, if we trained our model a few months ago and we had some change in our code, we can just go back in time, check out from a commit back then, and check our updated model on the exact same data.
Now, let's see what would be needed in order to implement fsspec for LakeFS. In order to implement fsspec, all we need to do is subclass AbstractFileSystem. The class AbstractFileSystem provides a template of methods that a potential implementation should supply, as well as default implementations of functionalities that depend on these. Methods that could be implemented are marked with NotImplementedError or pass. Pass is used for directory operations that might not be required for some backends where directories are emulated. Note that not all of the methods need to be implemented. For example, some implementations may be read only, in which case things like pipe, put, touch, rm, and so on can be left as not implemented. You might also choose to raise some read-only exception or a PermissionError. If your backend supports async operations, you can implement AsyncFileSystem, which extends AbstractFileSystem, and that way you offer asynchronous capabilities.
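As a small sketch of what a custom backend looks like, here is a toy in-memory example; "myfs" is a made-up protocol, and a real implementation would translate these methods into your storage system's SDK calls:

```python
import fsspec
from fsspec import AbstractFileSystem


class MyFileSystem(AbstractFileSystem):
    protocol = "myfs"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # A real backend would hold an SDK client here instead
        self._store = {"data/hello.txt": b"hello from myfs"}

    def ls(self, path, detail=True, **kwargs):
        # Translate a listing into fsspec's expected info dicts
        path = self._strip_protocol(path).strip("/")
        names = [k for k in self._store if k.startswith(path)]
        infos = [{"name": n, "size": len(self._store[n]), "type": "file"}
                 for n in names]
        return infos if detail else names

    def cat_file(self, path, start=None, end=None, **kwargs):
        # fs.cat() fetches bytes through this method
        key = self._strip_protocol(path).strip("/")
        return self._store[key][start:end]


# Manual registration, the alternative to entry points mentioned next
fsspec.register_implementation("myfs", MyFileSystem)
print(fsspec.filesystem("myfs").cat("data/hello.txt"))
```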
To register a new backend with fsspec, you can use entry points and setuptools. For example, to add a new file system protocol, myfs, you'd add these to your setup.py. Alternatively, you can register it manually, or submit a PR to the fsspec repository to include it in the known implementations.
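A sketch of that entry point registration in setup.py, assuming the hypothetical myfs module from the sketch above; fsspec discovers backends through the "fsspec.specs" entry point group:

```python
from setuptools import setup

setup(
    name="myfs",
    version="0.1.0",
    py_modules=["myfs"],
    entry_points={
        # fsspec scans this group to resolve unknown protocols lazily
        "fsspec.specs": [
            "myfs=myfs.MyFileSystem",
        ],
    },
)
```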
LakeFS has an fsspec implementation called lakefs-spec that was contributed by the community, specifically by a cool company called Material.ai. I won't go into the full implementation, but aside from optimization and caching, most of it is pretty straightforward. ls would call LakeFS's ls using the SDK, copy would call LakeFS's copy, and so on.
One particularly interesting aspect is how they handle transactions
using ephemeral branches.
When a transaction starts, an ephemeral branch is created.
Once the transaction is committed, we do a regular LakeFS commit and
merge it back using LakeFS merge.
What's really cool here is how naturally LakeFS and fsspec fit together. fsspec defines a file system interface, and LakeFS, with its commit-based model, happens to provide an elegant way to support atomic transactions.
It's one of those cases where the pieces just align perfectly.
By using ephemeral branches, every transaction starts in isolation, allowing changes to be made safely. And when everything is ready, we can merge, making the process atomic and reliable.
This means users get the benefits of versioning and consistency
without any extra complexity.
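Putting it together, here is a sketch of the transaction flow as lakefs-spec exposes it; the repository, file, and column names are placeholders, and the exact API details should be checked against the lakefs-spec docs:

```python
import pandas as pd
from lakefs_spec import LakeFSFileSystem

# Configuration is picked up from the environment / lakectl YAML
fs = LakeFSFileSystem()

with fs.transaction("my-repo", "main") as tx:
    # tx.branch is the ephemeral branch backing this transaction
    df = pd.read_parquet(f"lakefs://my-repo/{tx.branch.id}/lakes.parquet")
    canada = df[df["Country"] == "Canada"]
    canada.to_parquet(f"lakefs://my-repo/{tx.branch.id}/canada_lakes.parquet")
    tx.commit(message="Add Canada lakes")
# Leaving the block merges the ephemeral branch into main in one atomic step
```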
Now, let's see fsspec in action. So, let's move from the slides to the demo. The first thing we'll do is go to LakeFS. We have a cloud installation, under the Treeverse organization. We will need to log in. Here's my Google Workspace.
I already logged into Google Workspace, so it'll just take me there. And we will create our new repository. So you can see here a bunch of repositories. Let's create our own and call it fsspec-example. We need to provide a storage namespace. This is basically the place where the data will exist in S3. As I said, we don't store the data ourselves; it lives in our underlying storage, in this case S3. And we will add some sample data. Okay, so this is our repository. We have our sample data here, and there's the lakes.parquet file. We could take a look at it, and, yeah, here it is.
Now, let's go back to Pandas and use fsspec and the LakeFS implementation in order to read the data. The first thing we'll do is import pandas as pd and create a DataFrame with read_parquet. Let's go back and take the URI for the object and use it here. And let's take a look.
That's it. We basically used Pandas, and Pandas uses fsspec, and fsspec has a LakeFS implementation, so once we gave it the lakefs:// scheme, it all works.
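In code, the demo step amounts to something like this (the repository name and object path follow the demo and are placeholders; lakefs-spec must be installed for the lakefs:// scheme to resolve):

```python
import pandas as pd

# Pandas -> fsspec -> lakefs-spec, driven entirely by the URI scheme
df = pd.read_parquet("lakefs://fsspec-example/main/lakes.parquet")
print(df.head())
```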
Now, let's take a look at what a transaction would look like. Let's say we want to make a few changes: we want to create a branch, do our changes, and eventually merge them all back in. So first of all, we would need to use lakefs-spec, so we will import the LakeFS file system and initialize it. We don't need to pass it anything, because it will read the configuration from, in this case, the lakectl YAML file. And now let's do a transaction: fs.transaction. We provide the repository we're working on, fsspec-example, and the branch. Let's see what this did.
It's closed.
we started a transaction, and now let's read the data frame.
it's basically this, but instead of that, we will use the transactionSuppositor.
Instead of branch, we will use the transaction, sorry,
the transactionBranchId.
Mmkay.
now, let's filter out the lakes from Canada.
Okay, what's it?
Dixie.
And once we've filtered out the lakes from Canada, Yeah, this looks Okay.
Yeah, this looks good.
we take the Canada lakes, write them to pocket, write them on the
repository, on the branch, call it Canada lakes, dot pocket, and that's it.
And now we can commit: tx.commit, and we add "Add Canada lakes" as the message.
Now let's do the same thing for Germany: we'll take the Germany lakes, the country will be Germany, and here we'll call it germany_lakes.parquet, and commit that too.
Once we run this, if everything is correct, we will create a new branch as a transaction, read the data from our ephemeral branch, filter out the lakes from Canada, write them to our branch, and commit it. We will do the same thing for Germany and commit that as well. And once the transaction is over, it will merge all the data in one atomic action into main.
Let's see, let's move here. Let's look if we have any branches, if we're lucky. Yeah, we can see the transaction branch. So as you can see, the transaction branch has two commits. If I refresh, we won't have it anymore, because it was an ephemeral branch and it was deleted. Let's go back. Yeah, as I said, it was deleted. Now we have only one branch, and as we can see, we have the data here.
We could also take a look at it. So here are the Canada lakes. You can see it, okay.
We could also see when we inserted the data. So we could do a blame and look at the commit. So this is the "Add Germany lakes" commit. We could see what was added here, see the changes, and see the parents. The one before was "Add Canada lakes". We could also look at all the commits, and we could see that the last commit was basically a merge commit that merges the transaction branch into main.
Yeah, that's about it. To sum it up, we've explored the evolution from working with local files to handling data in the cloud, and how tools like fsspec bridge the gap, providing a unified interface for working with various storage systems. By implementing fsspec for LakeFS, we've seen how it enables seamless integration with Python libraries like Pandas and TensorFlow. lakefs-spec, contributed by the community, not only makes LakeFS more accessible, but also introduces powerful transaction support using ephemeral branches, ensuring atomic and reliable operations. As data continues to scale, solutions like LakeFS and fsspec help teams work efficiently, combining the flexibility of cloud storage with the simplicity of local workflows. If you're looking to bring version control and transactional consistency to your data workflows, LakeFS with fsspec is a great tool to explore.
Thank you.
Happy to take any questions.