Conf42 Python 2025 - Online

- premiere 5PM GMT

Unlocking Your Data's Potential with Python

Abstract

The quick and easy way to supercharge your applications with many sources and types of data, including unstructured, semi-structured, and structured. I will show you how with Python and some other interesting tools. I will also touch on Apache Iceberg, Apache NiFi, Snowflake, and AI.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, Tim Spann here, Senior Solutions Engineer. Today I'm going to be talking about unlocking your data's potential utilizing Python. Let's get to it. I've been doing data, AI, IoT, and edge stuff for a number of years at some different companies. Right now I'm at Snowflake, and it's pretty awesome. We have some amazing technology that's really easy to use, really easy to use with any kind of data, and easy to work with, plus lots of different open source technologies that we support: Apache NiFi, Apache Iceberg, the newest and very interesting Apache Polaris, and Streamlit, which lets you build some cool apps. On the side I'm also doing some Flink and Kafka, because those are a lot of fun in the streaming space. So let's get into it. We'll do the intro, do an overview, and look at some super friends: the where, what, why, all the things. When you're going to unlock a lot of data, you need a team, and that team can be a combination of people, AI, and technology. We're going to cover some of the technologies that intersect with Python and let you do some interesting data stuff. That's Snowpark, of course Python, Polaris, Iceberg, Streamlit, NiFi, and at the core of it all, Snowflake, which makes things easy, fast, and cheap, and supports all this great open source.

So let's first look at the different types of data out there. There are lots of ways to define it, but we're going to look at the data that makes sense to process today: structured, semi-structured, and unstructured. So, cats, let's take it away. First we'll look at semi-structured. There's a lot of it. Oh, I have a kitty here. Maybe he won't come on camera; actually, I don't know if you can see him. He is a good man. His name is Mr. Splunk. He probably wants some kind of snack, that's what he's into. So, semi-structured data. It's very common for open data, like the data from OpenAQ: air quality data with things like location, time, and sensors. There are lots of different formats out there, like Apache Avro, Apache Parquet, and Apache ORC. These can also be used for more structured data when we layer Iceberg on top. JSON and XML we'll work with today; actually, today we're going to work with Protobuf, convert it into JSON, and use that. This is where you'll see things like hierarchical data and arrays: data that has a structure, but it's not easy to handle in SQL, it's a pain. You probably want that more structured so it's faster and easier to work with. Things like logs and key-value data are a source of a lot of the raw data out there, and a good source of data. We're never going to forget about that.

Unstructured data, in the past, you just put it on the side, maybe used a specialized search engine against it, didn't do too much with it, maybe pulled some text out. We've been doing that for the last couple of years with NiFi. There are a lot of formats: text, documents, PDF, images, videos, audio, email, and variants. Now fortunately, with the new tools out there, using Apache NiFi and Python, there are a lot of libraries that let us extract and convert things: turn a Word document into Markdown, grab that text, use it for different RAG approaches, use it for AI. We're doing really good stuff with Snowflake Cortex and NiFi here.

Now, structured data we've been using for a very long time. It's commonly stored in things like Snowflake tables, Snowflake hybrid tables (which let you do some interesting stuff), Iceberg tables, and the standard relational tables that can sit in any of your databases; Postgres is a good example.
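To make the semi-structured versus structured distinction concrete, here is a minimal sketch of flattening a nested JSON record into the kind of flat, SQL-friendly rows a table or CSV expects. The record shape below is invented for illustration; it is not the actual OpenAQ schema.

```python
# A minimal sketch: flattening a nested, semi-structured JSON record into flat rows.
# The field names are invented for illustration and are not the real OpenAQ schema.
import json

raw = """
{
  "location": {"city": "Boston", "coordinates": {"lat": 42.36, "lon": -71.06}},
  "measurements": [
    {"parameter": "pm25", "value": 7.1, "unit": "ug/m3"},
    {"parameter": "o3", "value": 0.03, "unit": "ppm"}
  ]
}
"""

record = json.loads(raw)

# Turn the hierarchical object plus its array into flat, table-shaped rows.
rows = [
    {
        "city": record["location"]["city"],
        "lat": record["location"]["coordinates"]["lat"],
        "lon": record["location"]["coordinates"]["lon"],
        "parameter": m["parameter"],
        "value": m["value"],
        "unit": m["unit"],
    }
    for m in record["measurements"]
]

for row in rows:
    print(row)
```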
Also, CSV and tab-delimited or fixed-width files are very structured, as they're not hierarchical. There are no arrays in there; it's flat, and it maps to fields really well. Of course, in a CSV you can't enforce types or other things, but it's very structured.

Now, next up, Iceberg. If you haven't used Iceberg yet, you really should. It is high performance, it is open source, there's a great community there, and it works out really well for extremely large analytical tables. SQL support is very strong. There are tons of different engines, which is nice: this lets you set things up the way you want and makes data available for different tools, whether it's Snowflake, Trino, Flink, Presto, or Hive. Tons of different things can query this and run your applications. It's a great way to make data available for different tools, and I like it as a good way to back up your data into a data lake when you have your production data in Snowflake. What's nice is that Iceberg supports lots of different data catalogs, and the API for that is very open. The two big ones are Polaris, which is the shiny new open source one with some interesting features, and Nessie, which has been around for a while, has an awesome logo, and does some cool Git-style stuff. They may start getting closer together, as there are a lot of people working in common on those, but pick whichever one works best in your environment. What's also nice is that if you're using Snowflake, it handles this for you, so you don't have to worry about it.

A couple of features in Iceberg are pretty awesome. There's support for full schema evolution, so things can change as data changes over time; that's the real world. You get time travel: you can go back and look at older data. This is great if you accidentally deleted something you shouldn't have and need to go back, or if you want to compare against the current state, say you do a load and look at it afterwards. Partitioning, which used to be super hard back in the old days of Hive, is managed for you. It's hidden, it works, you can make changes to it, you don't have to think about it, you don't have to see it, and there are no big directory structures lying everywhere; this is much better with metadata. You can also roll back, like we mentioned, and compact your data over time so it's not big and sparse. That's the problem with appending data on object storage like S3: it can get very sparse. You remember the old hard drives we had to keep defragging? With compaction, Iceberg is doing that for you, so that's pretty nice. Lots of engines again: I like Flink and PyIceberg and Trino and Snowflake for this. A catalog with some RBAC gets you to your Iceberg on whichever cloud you're running on. Very easy to do.

Now, to get data into Iceberg, we can append to it with NiFi. There's a processor for that, a couple actually, depending on whose NiFi you're using. And you could use Snowpark Python to just write some data frames there and save them as a table. Pretty easy; we've got a link down there. And yes, after that you can have the data, which is nice.

We'll look at Snowpark. Again, it's a Python library, and I can code from anywhere. I don't have to code from Snowflake, but if you've got Snowflake, just use their notebook; it's a little easier, but whatever. So you've got this engine. It's elastic and supports other languages. We'll pretend Python is the only language in the world, but there are others. There are libraries in there for data engineering, anything you need to do there, and lots of machine learning and AI stuff.
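As a rough sketch of that "write a data frame, save it as a table" step with Snowpark Python: the connection parameters and table name below are placeholders, and newer Snowpark releases also document options for writing Snowflake-managed Iceberg tables, so check the save_as_table documentation for that variant.

```python
# A minimal Snowpark sketch: build a DataFrame locally and save it as a table.
# All connection parameters and the table name are placeholders for illustration.
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<your_account>",
    "user": "<your_user>",
    "password": "<your_password>",
    "warehouse": "<your_warehouse>",
    "database": "<your_database>",
    "schema": "<your_schema>",
}

session = Session.builder.configs(connection_parameters).create()

# A tiny example DataFrame; in practice this could come from files, pandas, etc.
df = session.create_dataframe(
    [("sensor-1", 42.36, -71.06), ("sensor-2", 40.71, -74.01)],
    schema=["sensor_id", "latitude", "longitude"],
)

# The work is pushed down to the Snowflake engine rather than running locally.
df.write.mode("overwrite").save_as_table("SENSOR_LOCATIONS")

session.table("SENSOR_LOCATIONS").show()
session.close()
```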
Streamlit lets you build cool, extremely simple apps. I wrote one a couple of months ago and thought, let me just put these three things there, and it was a full app. I used that for my ghost hunting app that stored things into vectors. Data access is super fast, faster than almost any library out there, with full security. You get data frames, you get pandas; it looks like what you'd expect when you're doing PyData stuff. Why should you use it? It's very easy to build these scalable data pipelines. You build them for models, apps, any kind of data processing you want to do, and you still get all the governance and security, which is nice. And again, you can write this custom code from any notebook or IDE, like Visual Studio Code, and have it automatically push down into the compute engine. You don't have to worry about whether it's going to be performant or do anything else; it just works, and works fast. Really cool. And like I said, it works from Jupyter notebooks, from SQL worksheets or Snowflake notebooks, or from Visual Studio Code, which is nice; that's what I run. You get your APIs out there for machine learning and for data pipelines. Run this on top of your virtual warehouse. And again, you could also do this in Java or Scala if you want. And you can run this inside a container and use GPUs if you need them or have them; if you have them, that's pretty awesome, and pretty easy to do. You should store some data, which is a good idea.

Let's look at NiFi. Now, Apache NiFi in the past has been thought of as a Java tool, or a tool that had nothing to do with Python, or very little. Fortunately, starting with the 2.0 releases, and now we're even further along, we added the ability to run your Python code not just as a little script somewhere, not just behind a REST endpoint, but as code inside of our flow, which makes it very powerful and very easy to use. And I've written a number of components to try different things out. I've got one that'll pull company names out of chunks of text. It uses Hugging Face, spaCy (which is underrated), and PyTorch to pull these out, and I use it when I need to grab the company name someone's talking about, maybe to do a lookup on that with a wiki, or grab stock quotes, whatever. I've got one to caption images. Now, again, we talked about that unstructured data. With Apache NiFi, I can work with images, so as images come in, I use the Salesforce BLIP image captioning model, which I might swap for an updated one in the future, but this one's pretty good, and it gives me captions for my images, which is nice as part of the flow when I'm working with that. Then I've got another one doing ResNet-50, probably some upgrades I could do there, and it gives me classification labels for anything I'm working with. You don't have to download the images; it happens inline. The next one is important if you're letting people upload images, using a model I found on Hugging Face. If you haven't looked at Hugging Face, sign up. It's got so much good information on models; Snowflake has some cool Arctic models there. Everything's there. It's awesome. This one will give you a score on the safety of an image so you don't accidentally upload bad ones, and if you're not sure, if it's close, put it off to the side and let it be looked at later. I've got another one working on images that looks at facial emotions and does some image classification with that; that one's pretty easy.
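Going back to that first component, the company-name extraction: the talk doesn't share the processor's source, but the core idea with spaCy's named entity recognizer looks roughly like this. It's a minimal sketch, assuming the small English model has been downloaded; the actual processor also pulls in Hugging Face models and PyTorch.

```python
# A minimal sketch of pulling organization names out of a chunk of text with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")


def extract_company_names(text: str) -> list[str]:
    """Return the distinct ORG entities spaCy finds in the text."""
    doc = nlp(text)
    return sorted({ent.text for ent in doc.ents if ent.label_ == "ORG"})


if __name__ == "__main__":
    sample = "Snowflake announced new Arctic models, and Apache NiFi added Python processors."
    print(extract_company_names(sample))
```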
The next one is very helpful when you're trying to do geo stuff, and I like it because once I convert an address to latitude and longitude, I can start doing some interesting geo stuff with Snowflake, which is cool. This library, Nominatim (no relation), from OpenStreetMap, does a pretty good job. I played around with four or five different libraries, and among the free ones this is probably the best. It'll get you pretty close, and it doesn't need a full address, which is not a bad thing if you don't know exactly where you are and can't grab that from something else. This is nice; I like this library.

Now, we showed you some of the Python components that I've written. You can write anything you want; you can take any of your Python and put it in here as horsepower to process these flows. But what's a flow? What is NiFi? It is a tool for data ingest, movement, and routing, and it does this at whatever scale you need. Start small; it scales out. That might sound familiar if you use Snowflake; it's some of the same ideas, using the advantages of clusters and clouds here. It guarantees you get your data, and the buffering and built-in back pressure are really awesome: if things slow down, you're not going to lose anything, it'll queue up and wait. You can adjust how much latency or throughput you want, and while you're doing that, you could hook it to auto scaling if you need to scale higher. Now, data provenance, or lineage, is really awesome if you want to do auditing or just figure out what's going on with your distributed processes. It keeps track of everything going on, every record through the system, everything that's done. And you can do things that are push or pull, whatever it is. There are a lot of processors out there: a bunch come with the default install, there are a ton more that are specific to Snowflake, and there are a bunch more in open source, like the ones I've written that you can use, and you can write your own. It's pretty easy with Python or Java. Everything's visual. It's got every kind of security pluggable in there. It's very easy to extend things out, very easy to do version control and everything else you expect when you're developing code. And you can move binary data, unstructured data, zip files, images, data that looks like tables (structured data), and semi-structured data. You can enrich it, use graphical controls to visually process things, and do simple event processing; we're not replacing Flink here. Route things where they need to go: if they need to go to 20 different databases, I can do that. Send things into some central messaging like Kafka or Pulsar. It supports almost any modern protocol out there. Again, that Python in there is super important. Everything is parameterized. For my Python friends, you probably don't care about JDK 21, but what it means is that NiFi is written in Java, and this makes it very modern and able to work with Python and other things really well. Things run really fast. There's a rules engine in there to help you out. You can use database tables as your schema source, AWS Glue is supported, OpenTelemetry is supported, and you can integrate with a lot of cool SaaS and online things like Zendesk and Slack. It is, again, scalable, and things are stored so you don't lose them. We're not going to go into all of that. We'll go through some demos really quickly. I just want to show you this before I move over to running code and forget all about these slides, which you can get through Conf42.
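Before the demos, here's that address-to-coordinates step sketched in Python. The talk doesn't name the exact client the processor uses, so treat geopy's Nominatim wrapper below, and the sample address, as assumptions for illustration.

```python
# A minimal geocoding sketch against the OpenStreetMap Nominatim service,
# using the geopy wrapper (pip install geopy). The address is just an example.
from geopy.geocoders import Nominatim

# Nominatim's usage policy requires an identifying user agent.
geolocator = Nominatim(user_agent="conf42-python-geo-demo")

location = geolocator.geocode("1 Guest St, Boston, MA")
if location is not None:
    print(location.address)
    print(location.latitude, location.longitude)
else:
    print("No match found")
```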
I do a weekly newsletter, and it is everywhere: it's on LinkedIn, it's on my blog, it's on Medium, it's on Dev.to, it's on Hashnode, it's on Substack, everywhere you want, or just look at it on GitHub. I cover a lot of cool open source stuff, a bunch of Snowflake stuff, and everything in unstructured, structured, and semi-structured data. Lots of open source stuff. If there's any kind of news or cool tools you want to get mentioned, send them to me and I'll put them out. I do that weekly.

Let's start automating stuff. Let's get things running. Let's see if we've timed out; we might have. Maybe I should zoom in a little bit, it might be a little small here. So this is Apache NiFi, and we can look at the version here: this is 2.1, running on my laptop. We can take a look: I've got six gigs of RAM here, I can see how much storage I have, I'm running on my little server here, and you can see I'm using JDK 21 and which version. So you can see that things are running, and I can see what's going on in the system in real time. What's cool about NiFi is that everything in this UI can also be reached through a REST endpoint, so you don't have to use the GUI if you don't want to. But this is how I code: I drag and drop things on here. If I wanted to add one of the Python processors that I just added, I could just pick one like that. Or I could look at other things, say I want something from AWS: what are the AWS-specific things, what are the Azure-specific things, what's the Google stuff? I could do things like grab files from Google Drive. I could also look at a lot of different file systems and other things like MQTT, WebSockets, Dropbox, Box, Google Drive, JSON, all kinds of stuff. Listen to FTP, TCP, SNMP, UDP, any attributes you want there. It's very good for working with logs, very good for working with files and network stuff. Anything you've got to wait on, it can wait for the data to come.

But let me give you an example. I'm going to just run this once. This is my Python custom processor. It grabs GTFS data from the Boston transit system. Now, GTFS is the General Transit Feed Specification, a Google-originated transit format that's delivered in Protobuf. My Python code gets it, makes sure it's valid, and then returns it as JSON, so we get JSON out of that. And then here, this is that provenance I talked about: I can look at the data that came through, see it was a 400K file, see some of the metadata and where it came from, and some unique identifiers on it. I could also look at the raw data if I wanted. So that data is in the system and has been processed. I split it up into individual records, and then I check to see if there's a special term in the data: "multi-carriage" in GTFS data means that there are multiple buses connected together. Those records are weird, so I process them separately over here, where I pull out the extra fields, split them up, and then put them back with the other ones, because I want to pull out special fields, and I can do that with JSONPath; again, not writing code. Then I'm going to build a new file, add a timestamp to it, say that this data is JSON, and use our QueryRecord processor to convert it. So what I'm going to do is make sure the route isn't null, and use that SQL to convert JSON to JSON.
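As a side note, the Protobuf-to-JSON step from this demo can be sketched outside of NiFi in a few lines of Python with the gtfs-realtime-bindings and requests packages. The MBTA feed URL below is an assumption for illustration; point it at whatever GTFS-realtime feed you have access to.

```python
# A minimal sketch: fetch a GTFS-realtime Protobuf feed, validate/parse it, emit JSON.
# Requires: pip install gtfs-realtime-bindings requests
import requests
from google.protobuf.json_format import MessageToJson
from google.transit import gtfs_realtime_pb2

# Assumed example feed URL; substitute your own GTFS-realtime vehicle positions feed.
FEED_URL = "https://cdn.mbta.com/realtime/VehiclePositions.pb"

response = requests.get(FEED_URL, timeout=30)
response.raise_for_status()

feed = gtfs_realtime_pb2.FeedMessage()
feed.ParseFromString(response.content)  # raises if the Protobuf payload is invalid

# JSON out, ready to split into individual records downstream.
print(MessageToJson(feed)[:500])
```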
But if I decided I wanted to output this in another format, I could go, okay, let me create a new one: I can make it a comma-separated file, or I could just create a new service here. And as you can see, I can make Avro, I can make my own freeform text, JSON, XML. I could also look this up and have it decided dynamically based on some attributes, which is cool. But we're not going to do that. We'll just start that up again. Cool, splits, and here I've got a bunch of splits. What this one's going to do is call a database to find the route data, get that route data from another table, and combine it with this record. So that's cool.

Also, let's go to this other bit of code. I am calling a live feed of planes that I've got running on a server over there, and it's tracking all the planes near me using ADS-B transponders. So I'm grabbing that data; let's make sure we grab one more of these. I split that data out, get the fields I want, add a timestamp to the record plus a unique ID, filter it, and then I can push that right to Snowflake. Very simple, however you want to do that. And then I'm also splitting that data out, pulling out the fields I like, and pushing them to Slack. Pushing things to Slack is really easy: I use the token, put in the channel I want, and then just format the data, with whatever data I want as my output, and we send that to Slack. When I'm done posting over here, I can push it into an S3 bucket. Let's see here. Here are a couple more; this one just got sent in. You've got to keep that running. Keep that running, send a couple more, and you can see we've got new ones coming in. It's just the different data here: altitude, the timestamp. Yeah, I'm sending too many; they don't like when I send that many. I have the free Slack account. Sorry, Slack, I won't send too many messages. But yeah, I can just send anything I want over there: what planes, what their altitude is, that sort of thing, just to give you an example while the code is running. It's pretty easy to do: drop your Python in there, and it doesn't look like anything different. Like I said, in this one we don't have one, but we could just decide, okay, let me add a new Python processor, this fake record one, add it here, connect it, set some parameters, and you're good to go. Thanks for watching my talk.
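One quick footnote on that Slack step: with the official slack_sdk client, posting a formatted message looks roughly like this. The environment variable name and channel are assumptions for illustration, not the exact flow configuration from the demo.

```python
# A minimal sketch of posting a formatted message to Slack with slack_sdk.
# Requires: pip install slack-sdk, plus a bot token with the chat:write scope.
import os

from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])  # assumed env var name

plane = {"flight": "JBU123", "altitude_ft": 12500, "timestamp": "2025-01-15T17:05:00Z"}

try:
    client.chat_postMessage(
        channel="#adsb-planes",  # assumed channel name
        text=f"Flight {plane['flight']} at {plane['altitude_ft']} ft ({plane['timestamp']})",
    )
except SlackApiError as err:
    # Free workspaces rate-limit aggressively, so log and move on rather than crash.
    print(f"Slack post failed: {err.response['error']}")
```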
...

Tim Spann

Senior Sales Engineer @ Snowflake



