Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, I'm Guy Adornock, founding engineer at Treeverse, the company behind LakeFS, an open source platform for managing data at scale. I've spent the past few years working on LakeFS, helping teams version and manage their data more efficiently.
In this talk, we'll explore how data workflows have shifted from
running locally to the cloud.
We'll look at Python libraries like Pandas and TensorFlow and
how they handle cloud storage.
A key part of this transition is fsspec, a Python library that makes working with different storage systems seamless.
We'll see how it helps both users accessing object stores
and storage providers looking to improve their compatibility.
Finally, we'll dive into how fsspec can be implemented for an object store, using LakeFS as an example.
Let's start by looking at the local environment. Working locally is simple and convenient, and here's why. First of all, built-in library support.
Popular Python libraries like Pandas, TensorFlow, and PyArrow have always
worked seamlessly with local files.
Second, standard file handling. Python's built-in file interface makes it easy to work with files using familiar methods like read and write. And third, no setup required. Just point to a file and it works. No extra configuration needed.
As we see in these examples, working locally is simple and straightforward. Whether we're using Pandas to read a CSV or TensorFlow to load a model, we just point to a file path and it works. No extra steps, no setup required.
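As a rough sketch of what these examples look like (the file paths here are just placeholders):

```python
import pandas as pd
import tensorflow as tf

# Pandas reads a local CSV with nothing but a path
df = pd.read_csv("data/lakes.csv")

# TensorFlow loads a saved model from a local path the same way
model = tf.keras.models.load_model("models/my_model.keras")
```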
However, while working locally is easy, it does have its limits.
As data grows and projects scale, those limitations become more noticeable. The cloud offers some clear advantages in overcoming these challenges. Let's look at the key benefits of moving to the cloud.
Redundancy.
Cloud storage keeps your data backed up and protected from local failures.
Scalability.
You can handle massive data sets without worrying about running
out of storage or compute power.
Collaboration.
Teams can easily share and access data without needing to pass around files.
Security.
Cloud providers offer strong security measures, often better than what's
available on personal machines.
And accessibility.
Your data is available from anywhere, making it easy to work
across devices and locations.
So moving to the cloud unlocks new possibilities, but it
also brings in new challenges.
Let's take a closer look at what this transition looks like. Transitioning to the cloud poses challenges, especially when it comes to using familiar libraries. How can we bridge the gap between our favorite tools and cloud storage?
One approach is to manually sync data between local and cloud storage. For instance, using the AWS CLI or boto3 to download and upload the data all the time. So while this approach can get the job done, it's not the most practical, and it becomes error prone, cumbersome, and hard to maintain as things change.
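To make that concrete, here's a sketch of the manual-sync loop with boto3 (bucket and key names are placeholders): every script has to download before working and upload after.

```python
import boto3
import pandas as pd

s3 = boto3.client("s3")

# Pull the data down before we can touch it
s3.download_file("my-bucket", "datasets/lakes.csv", "/tmp/lakes.csv")

# Work locally as usual
df = pd.read_csv("/tmp/lakes.csv")
df.to_csv("/tmp/lakes_clean.csv", index=False)

# Push the result back up by hand
s3.upload_file("/tmp/lakes_clean.csv", "my-bucket", "datasets/lakes_clean.csv")
```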
Another approach is using library specific connectors.
Some libraries, like Pandas and TensorFlow, offer built in
support for certain cloud storage providers, making integration easier.
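For instance, TensorFlow ships its own file layer, tf.io.gfile, which can talk to certain cloud stores natively. A small sketch (the gs:// path is a placeholder, and which schemes actually work depends on how TensorFlow was built):

```python
import tensorflow as tf

# tf.io.gfile resolves the gs:// scheme through TensorFlow's own
# connector rather than through a shared abstraction
with tf.io.gfile.GFile("gs://my-bucket/datasets/lakes.csv", "r") as f:
    header = f.readline()
```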
At first, this sounds great. It allows for seamless access to cloud data without extra steps. But there are tradeoffs. Not every object store is supported, meaning if you're using a less common provider, you might be out of luck. Each library also needs to maintain its own implementation, which leads to duplicated effort. And even when support exists, the way each library handles configuration, dependencies, and even basic operations makes things inconsistent. So while this approach can work, it's not one size fits all. We need something more flexible and standardized. That's where fsspec comes in.
The creators of fsspec saw the need for a unified way to interact with file systems, both local and remote. Instead of each library implementing its own cloud integration, fsspec provides a standard interface that simplifies the process.
fsspec provides a unified interface: local and remote file systems behave the same. Multiple backend support: access a variety of storage systems via the same interface. Ease of integration: libraries and tools can leverage fsspec without worrying about backend-specific details. And enhanced capabilities: features like caching, transactions, and concurrency come built in, improving efficiency. Instead of every library re-implementing these best practices, with fsspec we only need a single, optimized implementation that handles them all. Once it's in place, everyone can use it.
By using fsspec, we get the flexibility of cloud storage without giving up the simplicity of working locally. For the fsspec audience, we have the end users, the library implementers, and the backend providers. So fsspec allows you to interact with cloud storage just as you would with local files. For example, with fsspec's S3 implementation, s3fs, you can read a file from S3 just like a local file. Moving to any other provider will only require changing the file system initialization and the paths, as you can see in the example.
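A minimal sketch of that idea, with placeholder bucket and key names (s3fs needs to be installed):

```python
import fsspec

# s3fs is used under the hood when we ask for the "s3" protocol
fs = fsspec.filesystem("s3")
with fs.open("my-bucket/datasets/lakes.csv", "r") as f:
    print(f.readline())

# Moving to another provider is mostly a matter of changing the
# protocol name and the paths
```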
fsspec simplifies the process in two ways. Explicit configuration: you can manually provide the necessary configuration when initializing the file system. This gives you control over how the connection is made. The other option: no initialization needed at all. In many cases, you don't need to initialize the file system. With just an open function call, fsspec automatically handles the connection, pulling the required configuration directly from your environment, just like the AWS CLI does. This means you don't have to worry about the underlying setup at all.
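As a sketch of that zero-setup style (the URL is a placeholder), fsspec.open picks the backend from the URL scheme and pulls credentials from the environment:

```python
import fsspec

# No file system object created explicitly; the s3:// scheme selects
# s3fs, and credentials come from the environment, AWS-CLI style
with fsspec.open("s3://my-bucket/datasets/lakes.csv", "r") as f:
    print(f.readline())
```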
For library developers: some libraries integrate fsspec directly, allowing seamless cloud storage access without additional setup. For example, Pandas uses fsspec under the hood. This means that any storage backend implementing fsspec automatically works with Pandas.
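So, assuming s3fs is installed, a single Pandas call can read straight from S3 (the URL is a placeholder):

```python
import pandas as pd

# Pandas hands the s3:// URL to fsspec, which routes it to s3fs
df = pd.read_csv("s3://my-bucket/datasets/lakes.csv")
```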
And beyond just using fsspec, you can extend it by creating a new backend for a storage system. Let's take a real world example. As I mentioned earlier, I work at Treeverse, the company behind LakeFS. And one of the users from the community recognized the value of integrating LakeFS with fsspec and decided to contribute lakefs-spec to make that happen. Now, while LakeFS itself isn't an object store, it adds data versioning on top of one and can be accessed like one. So we'll showcase the lakefs-spec implementation and demonstrate how it enables compatibility with Pandas, PyArrow, TensorFlow, and many other libraries.
But before we dive into lakefs-spec, let's take a step back and talk about what LakeFS is and why it's so useful. LakeFS is an open source project that sits on top of your object store and provides Git-like capabilities such as commit, merge, revert, tag, and so on. If we take a look at your current data ecosystem, working with LakeFS is almost identical to working with S3. When accessing the data, the only difference is that now, in addition to the path, you need to reference the branch you are working on.
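Roughly, the paths compare like this (repository, branch, and object names are placeholders):

```python
# Plain S3: bucket + object path
s3_path = "s3://my-bucket/datasets/lakes.parquet"

# LakeFS: repository + branch + object path; the branch is the only addition
lakefs_path = "lakefs://my-repo/main/datasets/lakes.parquet"
```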
LakeFS is accessible by UI, a CLI called lakectl, and clients in Python, Java, and many more.
Let's take a high level look at how it works, mainly to emphasize that there is no need to copy data. A commit is a list of pointers to objects in your object store. Say we edit one file and commit it. As you can see, the next commit will point to all the unchanged data from before, but instead of pointing to object 790, we now point to object 214. So if we were working on a dataset of maps that is a few terabytes and we want to test our code, all we need to do is create a new branch, and we get a dev environment in no time, with no duplication of data. Or, if we trained our model a few months ago and we had some change in our code, we can just go back in time, check out from a commit back then, and check our updated model on the exact same data.
Now, let's see what would be needed in order to implement fsspec for LakeFS. In order to implement fsspec, all we need to do is subclass AbstractFileSystem. The class AbstractFileSystem provides a template of methods that a potential implementation should supply, as well as default implementations of functionalities that depend on these. Methods that could be implemented are marked with NotImplementedError or pass. Pass is used for directory operations that might not be required for some backends where directories are emulated. Note that not all of the methods need to be implemented. For example, some implementations may be read only, in which case things like pipe, put, touch, rm, and so on can be left as not implemented. You might also choose to raise some read-only exception or a PermissionError. If your backend supports async operations, you can implement AsyncFileSystem, which extends AbstractFileSystem, and that way you offer asynchronous capabilities.
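As a small sketch of what a custom backend looks like, here is a toy in-memory example; "myfs" is a made-up protocol, and a real implementation would translate these methods into your storage system's SDK calls:

```python
import fsspec
from fsspec import AbstractFileSystem


class MyFileSystem(AbstractFileSystem):
    protocol = "myfs"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # A real backend would hold an SDK client here instead
        self._store = {"data/hello.txt": b"hello from myfs"}

    def ls(self, path, detail=True, **kwargs):
        # Translate a listing into fsspec's expected info dicts
        path = self._strip_protocol(path).strip("/")
        names = [k for k in self._store if k.startswith(path)]
        infos = [{"name": n, "size": len(self._store[n]), "type": "file"}
                 for n in names]
        return infos if detail else names

    def cat_file(self, path, start=None, end=None, **kwargs):
        # fs.cat() fetches bytes through this method
        key = self._strip_protocol(path).strip("/")
        return self._store[key][start:end]


# Manual registration, the alternative to entry points mentioned next
fsspec.register_implementation("myfs", MyFileSystem)
print(fsspec.filesystem("myfs").cat("data/hello.txt"))
```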
To register a new backend with fsspec, you can use entry points and setuptools. For example, to add a new file system protocol, myfs, you'd add these to your setup.py. Alternatively, you can register it manually, or submit a PR to the fsspec repository to include it in the known implementations.
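A sketch of that entry point registration in setup.py, assuming the hypothetical myfs module from the sketch above; fsspec discovers backends through the "fsspec.specs" entry point group:

```python
from setuptools import setup

setup(
    name="myfs",
    version="0.1.0",
    py_modules=["myfs"],
    entry_points={
        # fsspec scans this group to resolve unknown protocols lazily
        "fsspec.specs": [
            "myfs=myfs.MyFileSystem",
        ],
    },
)
```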
LakeFS has an fsspec implementation called lakefs-spec that was contributed by the community, specifically by a cool company called Material.ai. I won't go into the full implementation, but aside from optimization and caching, most of it is pretty straightforward. ls would call LakeFS's ls using the SDK, copy would call LakeFS's copy, and so on.
One particularly interesting aspect is how they handle transactions
using ephemeral branches.
When a transaction starts, an ephemeral branch is created.
Once the transaction is committed, we do a regular LakeFS commit and
merge it back using LakeFS merge.
What's really cool here is how naturally LakeFS and fsspec fit together. fsspec defines a file system interface, and LakeFS, with its commit-based model, happens to provide an elegant way to support atomic transactions.
It's one of those cases where the pieces just align perfectly.
By using ephemeral branches, every transaction starts in isolation, allowing changes to be made safely. And when everything is ready, we can merge, making the process atomic and reliable.
This means users get the benefits of versioning and consistency
without any extra complexity.
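Putting it together, here is a sketch of the transaction flow as lakefs-spec exposes it; the repository, file, and column names are placeholders, and the exact API details should be checked against the lakefs-spec docs:

```python
import pandas as pd
from lakefs_spec import LakeFSFileSystem

# Configuration is picked up from the environment / lakectl YAML
fs = LakeFSFileSystem()

with fs.transaction("my-repo", "main") as tx:
    # tx.branch is the ephemeral branch backing this transaction
    df = pd.read_parquet(f"lakefs://my-repo/{tx.branch.id}/lakes.parquet")
    canada = df[df["Country"] == "Canada"]
    canada.to_parquet(f"lakefs://my-repo/{tx.branch.id}/canada_lakes.parquet")
    tx.commit(message="Add Canada lakes")
# Leaving the block merges the ephemeral branch into main in one atomic step
```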
Now, let's see fsspec in action. So, let's move from the slides to the demo. The first thing we'll do is go to LakeFS. We have a cloud installation, under the Treeverse organization. We will need to log in. Here's my Google Workspace.
I already logged into Google Workspace, so it'll just take me there. And we will create our new repository. So you can see here a bunch of repositories. Let's create our own and call it fsspec-example. We need to provide a storage namespace. This is basically the place where the data will exist in S3. As I said, we don't store the data ourselves; it lives in our underlying storage, in this case S3. And we will add some sample data. Okay, so this is our repository. We have our sample data here, and there's the lakes.parquet file. We could take a look at it, and, yeah, here it is.
Now, let's go back to Pandas and use fsspec and the LakeFS implementation in order to read the data. The first thing we'll do is import pandas as pd and create a DataFrame with read_parquet. Let's go back and take the URI for the object and use it here. And let's take a look.
That's it. We basically used Pandas, and Pandas uses fsspec, and fsspec has a LakeFS implementation, so once we gave it the lakefs:// scheme, it all works.
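In code, the demo step amounts to something like this (the repository name and object path follow the demo and are placeholders; lakefs-spec must be installed for the lakefs:// scheme to resolve):

```python
import pandas as pd

# Pandas -> fsspec -> lakefs-spec, driven entirely by the URI scheme
df = pd.read_parquet("lakefs://fsspec-example/main/lakes.parquet")
print(df.head())
```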
Now, let's take a look at what a transaction would look like. Let's say we want to make a few changes: we want to create a branch, do our changes, and eventually merge them all back in. So first of all, we would need to use lakefs-spec, so we will import the LakeFS file system and initialize it. We don't need to pass it anything, because it will read the configuration from, in this case, the lakectl YAML file. And now let's do a transaction: fs.transaction. We provide the repository we're working on, fsspec-example, and the branch. Let's see what this did.
It's closed.
we started a transaction, and now let's read the data frame.
it's basically this, but instead of that, we will use the transactionSuppositor.
Instead of branch, we will use the transaction, sorry,
the transactionBranchId.
Mmkay.
now, let's filter out the lakes from Canada.
Okay, what's it?
Dixie.
And once we've filtered out the lakes from Canada, Yeah, this looks Okay.
Yeah, this looks good.
we take the Canada lakes, write them to pocket, write them on the
repository, on the branch, call it Canada lakes, dot pocket, and that's it.
And now we can commit: tx.commit, and we add "Add Canada lakes" as the message.
Now let's do the same thing for Germany: we'll take the Germany lakes, the country will be Germany, and here we'll call it germany_lakes.parquet, and commit that too.
Once we run this, if everything is correct, we will create a new branch as a transaction, read the data from our ephemeral branch, filter out the lakes from Canada, write them to our branch, and commit it. We will do the same thing for Germany and commit that as well. And once the transaction is over, it will merge all the data in one atomic action into main.
Let's see, let's move here. Let's look if we have any branches, if we're lucky. Yeah, we can see the transaction branch. So as you can see, the transaction branch has two commits. If I refresh, we won't have it anymore, because it was an ephemeral branch and it was deleted. Let's go back. Yeah, as I said, it was deleted. Now we have only one branch, and as we can see, we have the data here.
We could also take a look at it. So here are the Canada lakes. You can see it, okay.
We could also see when we inserted the data. So we could do a blame and look at the commit. So this is the "Add Germany lakes" commit. We could see what was added here, see the changes, and see the parents. The one before was "Add Canada lakes". We could also look at all the commits, and we could see that the last commit was basically a merge commit that merges the transaction branch into main.
Yeah, that's about it. To sum it up, we've explored the evolution from working with local files to handling data in the cloud, and how tools like fsspec bridge the gap, providing a unified interface for working with various storage systems. By implementing fsspec for LakeFS, we've seen how it enables seamless integration with Python libraries like Pandas and TensorFlow. lakefs-spec, contributed by the community, not only makes LakeFS more accessible, but also introduces powerful transaction support using ephemeral branches, ensuring atomic and reliable operations. As data continues to scale, solutions like LakeFS and fsspec help teams work efficiently, combining the flexibility of cloud storage with the simplicity of local workflows. If you're looking to bring version control and transactional consistency to your data workflows, LakeFS with fsspec is a great tool to explore.
Thank you.
Happy to take any questions.