Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, I'm Timothy Spann. I'm a developer advocate.
I work mostly in streaming and in open source.
That might seem a little strange in the machine learning track, but I'll
show you a lot of different things you might not have thought of
before on how to get data,
manipulate data, run different processes
that you need to run to be able to do your machine
learning and your training and classification.
Lots of different things there. So this is for data engineers, this is
for programmers, this is for data scientists. Anyone who
wants to work with data and do it fast, not have
to wait for a batch or something to load.
Let's just get that data, run it as fast as possible,
and do it in a way that's scalable, open source,
and easy to use. Welcome to my talk here. This is Hail
Hydrate! From Stream to Lake. Now, I'm going to show you
different ways that you could work with data lakes.
Also any other source or sink of data.
So just a couple of quick things on me. You'll have these slides,
so you can go through all these links and see source code, articles,
deep dives into different technologies, whether it's Apache NiFi,
Apache Spark, Apache OpenNLP,
Apache Pulsar, Apache Kafka, Apache Flink.
I also have some things on TensorFlow. I do
a lot of different devices. I've got NVIDIA Jetsons, Raspberry Pis,
tons of different content. And if you're missing something or going,
I wish Tim would do this or that, drop me a line, whether on Twitter,
LinkedIn, wherever you see me, comment in this conference,
I am always looking for more things to work on, more apps
to build more real time data pipelines.
Show me something you're missing and let's do it. One thing you
might notice is, I like cats. I have some cats, so you might see
some in the pictures. Those pictures are perhaps unrelated.
So the agenda today, in this 40 or so minutes that
I'm going to do, is as follows. My use case, the primary
one is to populate a data lake. This could be in any cloud,
this could be from any vendor, whether it's Snowflake,
Cloudera, Amazon, Microsoft. You know, you could build your
own open source and put it wherever you need it to be. It could be
in a cluster you've built of Raspberry Pis,
some obscure cloud, a VM,
whatever. We'll go over some of the challenges. What are the different
impacts they have on what you're trying to do. Give you a solution,
show you what the final outcome of that is. Talk to you
about a couple of different streaming technologies in the open source,
which means there's a great community behind it and you don't
have to pay. But once you get into production,
you probably want some people to support you. I'll show you my successful architecture,
spend as much time in that demo as possible and have a couple of
next steps so you know where to go next. So use case
primary one, I'm going to show you a couple because it's that easy.
We're going to do IoT ingestion. So I've got a
device running here on my desk here. If we were
in person, it'd be sitting in front of me in the podium,
which is pretty cool. So we've got high-volume
streaming sources, different types of messages,
different protocols, many vendors. So there could be
a lot of challenges around that. There's no cookie
cutter solution there. Things are different from one environment
to the next; you never know. So first thing
that comes up, like we mentioned: lots of different protocols,
different vendors, people like to do things their own way, but I want
to be able to see what's going on right away. I don't want to wait
for an hour, I don't want to wait for the end of the day.
I need this data now because I'm going to be continually training
my machine learning models, or I need to feed data
into something that maybe is using that model,
doing some classification and giving you some analysis
to say, hey, there's a fraud condition, there's an alert,
something's too high, something's been damaged,
you should buy a stock, the weather's changing, you better bring things
inside, you better start selling more ice cream because
it's 100 degrees Fahrenheit, those sorts of things.
What's nice as well, we'll show you: with these open source tools,
I can have visibility, which is often really
hard, on everything going on in the stream,
because think about it, there's devices on the edge. I could
be using various cloud servers, maybe I have
a server that's sitting in a gateway in the back of a moving truck.
Lots of things can go wrong. There could be bottlenecks, things could
slow down, messages can get lost. I'll show you how we could keep
an eye on all that, automate that, make that simple.
You could start today taking real time
data, marrying it with machine learning, and doing that
all in real time. Pretty easy. Now, one thing
that's difficult here is sure, maybe you could
find a solution. Maybe you wrote a ton of Python scripts,
maybe there's a shell script. You've got code all over the place.
That sprawl is hard to handle. Could also be difficult.
You're hiring maybe consultants, maybe a bunch of different developers.
You have to maintain that. Lots of different tools,
nothing standardized. Can't find all the source code, don't know
where it is running. Is it running over here? I shut down
a VM, I shut down a Kubernetes pod.
Now things aren't working. Now I got to learn another tool,
another language, another package. That's painful.
It takes you forever to write this code. Those delays are not going
to make anyone happy. Whether it's who's going to use the final
code, or apps, or just your developers,
your data scientists, your data analysts can't develop things quickly
if they don't have the data, if they can't put their models out there
quickly, if I can't get things to run my classifiers,
you're losing money. You're losing who knows how many things.
You need data fast. You need to be able to develop fast.
Things need to be open,
accessible, and have a community. So if you get stuck, you could
keep moving on quickly. So the main solution here
is using something like Apache NiFi. That is an open
source tool that handles a ton of different sources.
You don't have to learn a new tool for a new source.
Supports lots of different message formats and protocols and
vendors, easy to use, drag and drop,
huge community and fits in well with everything else
in streaming. Supports any type of data. It doesn't matter if
it's XML, if it's binary,
a Word document, PDF, CSV, tab-delimited,
JSON, Avro, Parquet. I could say
all kinds of words. Some of them you may know, some of them you may not.
Doesn't matter. Tons of different data. Sometimes the data is big,
sometimes it's small, sometimes it's fast, sometimes it's slow.
Doesn't matter. We could do a lot of that with little to no code.
That makes it awesome. I can clean up my data, get it into a
format. So if I have to do something more complex, I could do
that with a Pulsar function. I could do that with Flink or Flink SQL.
Really great. NiFi also gives me really rich visibility.
It's got something called data provenance, which is a
really robust lineage. And you could see everything about your
data, all the metadata, all the metrics. You could see everything end
to end. This is awesome. And you'll see how powerful
that is in the demo. So now, instead of spending time trying to
write one input app, I could build new use cases,
new apps, get that data to the analysts and the data scientists who could
do more models, more apps, more solutions.
Things are cheaper now. I could develop more things
for less cost, with smaller teams doing this part.
Let people do the fun part of building apps, building modules,
solving problems with machine learning and deep learning. It makes
you a lot more agile. I'm not spending weeks deploying things.
I'm not spending weeks trying to figure out how to deal with
new data when it comes in. To do things in a matter of minutes
or hours makes you very agile when you need to change.
Because tomorrow I have to use a different provider
for weather. I have a new IoT device. Instead of
spending weeks or months trying to customize
it to whatever that format is, I could adapt on
the fly. Makes it very powerful. Now, I
call my stack a couple of different things.
This stack of open source Apache streaming tools out there,
sometimes I'll call it FLiP, sometimes I'll call it FLaNK.
FLiP is when I'm using Flink and Pulsar,
usually NiFi and a couple other tools as well.
The focus is on cloud data engineers, especially around
machine learning, to help your data scientists. But this could be in many
other platforms. Cloud tends to be the way we're going now,
but it could be in any environment, whether it's on
premises, VMs, Docker containers, Kubernetes pods,
wherever. With this stack I could
support as many users as you have different frameworks,
different languages, any of those clouds, you have lots
of different data sources, lots of clusters. So if you have any
experience with Python or Java, maybe a little SQL, maybe at
least you have the concept of streaming, maybe a little ETL,
you're ready to go using this. If you're a cat,
you could be involved here. Even a cat could use NiFi.
The cat's going to question how much you're spending on cloud, but who doesn't?
And if you happen to have machine learning code there,
I could run it whether I'm in NiFi, Pulsar Functions, Flink,
or MiNiFi agents at the edge, wherever you need to run
that code, whether it's sentient or not. We could do that.
My FLiP stack, just to show you what it is: it uses this really
cool Pulsar-Flink connector that's been developed
by StreamNative in the open source. This makes it very easy for
me to connect between Flink
code and Pulsar, whether it's a source or a sink or both.
Both tends to be common because I put something into a queue
or a topic, I want to process it and send
it down its way. Maybe it's cleaned up, maybe it's joined together, maybe it's aggregated.
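Just as an illustration of what that looks like from the Flink SQL side (the exact option names depend on which version of the connector you're running, so treat this as a sketch, not gospel), you register a Pulsar topic as a table and then it can be a source or a sink:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming TableEnvironment; the StreamNative pulsar-flink connector JAR
# must already be on the classpath for this to resolve.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a Pulsar topic as a table (illustrative names and options).
t_env.execute_sql("""
    CREATE TABLE sensor_events (
        device_id STRING,
        temp_f DOUBLE,
        event_time TIMESTAMP(3)
    ) WITH (
        'connector' = 'pulsar',
        'topic' = 'persistent://public/default/sensors',
        'service-url' = 'pulsar://localhost:6650',
        'admin-url' = 'http://localhost:8080',
        'format' = 'json'
    )
""")

# The same kind of table can be the target of an INSERT INTO,
# so Pulsar acts as both the source and the sink of the Flink job.
```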
We'll show you a little bit there. I touched on NiFi as a
big part of this. Why? It's a scalable,
real-time streaming platform. You can collect,
curate, use it as a universal gateway, could run my analytics
there, I could run my machine learning models there. Does a
lot of things. It does it very easily. And it could
be so many different sources of data, whether they're
cloud based, relational databases, NoSQL stores like
Mongo, Elastic, Solr; logs,
text files, PDFs, zip files. Pull things
off of government sources or from a Slack channel,
hash it up, encrypt it, split text files
apart, put them together, take data out of a
zip file, look at the tail end of a log file,
route things different ways, add metadata,
use metadata, all that while being very scalable
out to millions of events a second. Run on
as big a cluster as you need, as many nodes, whether you're running it in
Kubernetes, whether you're running it in VMs, in any
of the clouds, on your laptop, on a desktop,
wherever that is. Hundreds of pre-built components for you to drag
and drop and build your flows very easily
with full back pressure in there,
security, all those features that you'd expect,
guaranteed delivery. This is stable,
scalable and a great solution regardless of how
big the application is, how big your data streams are,
what the data looks like, wherever it's running. You could do that here.
Pulsar. Pulsar is a great way to
do distributed messaging, regardless of what cloud it's running on or
on premises. Again, the same idea as with NiFi:
open source Apache, huge community,
lots of different options for sources and sinks.
Lots of capabilities here. It's a no brainer for putting
your different event data and ML data in there,
stream your data around, do that in a secure,
durable manner. It's georeplicated, it's great.
It's even better than my fluffy cat.
Any kind of pub/sub, so you don't just have to replace workloads
that are common in big data. This can also be in different things
where you might be doing JMS or MQTT, which we see
a lot in IoT; JMS, Kafka, any of
those different protocols supported by this one messaging system.
Makes it pretty easy. You can run functions very well
integrated here. That makes it easy for you to write
and run some of your machine learning against it like you'd
expect. Very scalable. One thing that's really cool here,
that's unusual, is tiered persistent storage.
So as you put data into this messaging
system, it could store it out into the cloud, in say
S3 buckets, without you having to write special code,
without having to do the heavy lifting that you might
have to do elsewhere. Full REST API for managing and
monitoring, everything you need. A command line interface too, so you can make
this easy, hook it up to your DevOps tools. And there's all
the clients you'd expect. Whether you write Python
or you write Java, whatever it is, there's a client
out there for you. So you could easily consume and
produce messages. This makes this pretty great.
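For example, with the Python client (pulsar-client on PyPI), producing and consuming looks roughly like this; the broker URL, topic, and subscription names are just examples:

```python
import json
import pulsar

# Connect to a local broker (example URL).
client = pulsar.Client('pulsar://localhost:6650')

# Produce one sensor reading as JSON to an example topic.
producer = client.create_producer('persistent://public/default/iot-sensors')
producer.send(json.dumps({'device': 'jetson-1', 'temp_f': 72.4}).encode('utf-8'))

# Consume it back on a named subscription and acknowledge it.
consumer = client.subscribe('persistent://public/default/iot-sensors',
                            subscription_name='ml-feature-feed')
msg = consumer.receive()
print(json.loads(msg.data()))
consumer.acknowledge(msg)

client.close()
```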
I touched on Flink. I'm going to show you that in the application.
If you haven't used Flink before, it's something
you really need to start looking at. Flink scales out tremendously
well, integrates with all the different frameworks you have
typically written in Java. But with the
enhancements that are in there now for Flink SQL,
instead of having to write these complex distributed apps by hand,
compile them, test them locally, and deploy them out
to my clusters, I write a SQL statement. Like I
mentioned before, if you know a little Python, maybe a little Java, a little SQL,
all of a sudden now I'm writing massively distributed real
time pipelines for machine learning.
Pretty straightforward. This is amazingly scalable,
used by some of the largest companies in the world. Has everything you need
for fault tolerance, resiliency, high availability,
runs in YARN, runs in Kubernetes, all those
features that you want. Pretty awesome there.
Now, this is a machine learning talk.
Let's get back to this. For some reason it's not showing my image here;
I lost my image. We'll show you that later.
But basically what I have here is I've
included in my NiFi distribution a couple
of open source components I've written so that I
could do deep learning as part of that stream.
And these are built using Apache MXNet, which is Java,
and DL4J. This is a deep learning
Java library that lets me write
in Java and then deploy it as a final model
in, say, TensorFlow or MXNet or PyTorch.
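The processors themselves are Java, but just to sketch the kind of inference they wrap, here's roughly what it looks like in Python, using GluonCV's pretrained MXNet model zoo as a stand-in; the model name and image file are only examples:

```python
from gluoncv import model_zoo, data

# Pretrained SSD detector with a ResNet-50 backbone (downloaded on first use).
net = model_zoo.get_model('ssd_512_resnet50_v1_coco', pretrained=True)

# Load and resize one image the way the model expects.
x, img = data.transforms.presets.ssd.load_test('frame.jpg', short=512)

# Run detection: class IDs, confidence scores, and bounding boxes.
class_ids, scores, bboxes = net(x)

# Keep just the top five detections, like the processor does,
# and report label, probability, and box coordinates.
for i in range(5):
    cid = int(class_ids[0, i].asscalar())
    score = float(scores[0, i].asscalar())
    if cid < 0 or score < 0.5:
        continue
    print(net.classes[cid], round(score, 3), bboxes[0, i].asnumpy().tolist())
```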
Really powerful. And sometimes I need to do some natural language
processing. Apache provides a great library
for that, OpenNLP. I put that into NiFi, so you
could use that as part of your flow very easily. You don't have to call
third party services, don't have to call another
library. Pretty easy. This is
my solution here. This is our architecture. Got a lot of different files;
let me show you a couple different ones. XML and JSON
come in from different sources. As they come in, it validates them,
cleans them up, routes them where they need to go. We get that through our
messaging system and then land it in the cloud. And then you can write
analytics on it. And I'll show you some analytics there.
And I'll show you some continuous queries in Flink, just
to show you the ideas of what you can do with data while
it's happening. Real-time events: an event happens,
you're working on it at the same time, there's no delay there. As soon as
the network can get it to you, you're doing something with it, whether it's
a continuous SQL query, some kind of decision, whatever it
may be. This is that IoT data I was
talking about. You've got things like temperature, you've got things like different
sensor readings, important things if you're running
any kind of edge application: real-time
vehicles, maintenance, all kinds of devices out there.
I also have ones for weather, and I've got links to
some more content you can go into there. That's it.
Thank you. We are done with the slides. I think
everyone's happy to be done with the slides. I'm happy
to be done with the slides. Let's show you real code
running. That's probably more interesting than what you've seen before.
So right now I am in Apache NiFi. We talked
about it enough. This is it. This is not the flow
diagram for it. This is not some
documentation for it. This is my running system.
There is a GUI environment that lets you build,
monitor, deploy this. This could be locked down and
secure. We could hide this UI if you don't want anyone ever to see
it. But this is running. And if you see here, I've got a real time
stream of data coming in. I have edge devices that
are making HTTPS calls in, sending me
data, and I have it queued because I didn't
want to run it until I could show it to you. So right now I'm
going to start something. This is pretty amazing. I could start and stop,
and I could do this either through the UI, through a REST call,
or the command line interface, any part of the system.
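Just as a rough sketch, that REST call from Python looks something like this; the host and process group ID are placeholders, and a secured cluster would also need a token or certificates:

```python
import requests

NIFI_API = "https://nifi-host:8443/nifi-api"  # placeholder host

def set_process_group_state(pg_id: str, state: str) -> None:
    """Schedule or stop every component in a process group ("RUNNING" or "STOPPED")."""
    resp = requests.put(
        f"{NIFI_API}/flow/process-groups/{pg_id}",
        json={"id": pg_id, "state": state},
    )
    resp.raise_for_status()

# Pause ingestion until there's cloud budget again, then resume it.
set_process_group_state("my-ingest-group-id", "STOPPED")
set_process_group_state("my-ingest-group-id", "RUNNING")
```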
So if I want to pause something, maybe I
ran out of cloud money this month. I'm going to pause it here, queue it
up until I can get myself some more cloud availability,
or maybe spin up some new infrastructure, or maybe
something's having problems downstream. I can pause it,
never lose data and have no issues there.
So I'm getting data, I'm routing energy data in,
sending that to my messaging system. I'm running
some real-time queries on some of this
data as it comes in. This query I have in
my parameters, so I could isolate that out when I deploy
my code using DevOps tools. This is
my current query: if the temperature in Fahrenheit is
over 60, we may be getting too hot. Obviously you
set these yourself. Maybe I could have machine learning figure
out what's the new normal for temperature for that sensor reading or
the current weather. That's up to you. Maybe I compare it against weather,
maybe I compare it against other sensors of the same class.
You get the idea of various things you could do there, stream the data
in and then when I'm ready I split it up
into individual records so I'm not sending
too big a file out, pushing it to a couple different messaging
systems so that I can push that up to the cloud.
If you look here now, I'm in an Amazon-hosted cluster;
it doesn't look much different to you. I have permissions in both.
Here I've got this router stopped, so no
data was processing. I do a refresh and that's all
been processed. Hundreds of records on a single node. Very easy
to do. I add some metadata here,
things like table name, what I'm doing with
it and I'm going to send it to another messaging queue for some
further analytics with Flink. Here I'm sending
this to a cloud data store, and if you want
to see how hard it was to write that code: point it to the name
of the server, give it a table. And I'm defining that dynamically
so that nothing's hard coded. And I'm saying it's JSON.
That's it. Oh, and I want to do an upsert. This one supports upsert;
it could have been update or insert, whatever. So if you notice
something: where are the fields? I don't have to
handwrite any SQL. I don't have to write any mapping
code. My record reader technology looks
at it, sees if you have a schema; if you've got a schema defined
somewhere, I'll use that. So I know exactly the field names,
field types, exactly what you want it to have, and then I'll just
match that up to the table. Very easy. Now sometimes
your data might change a lot, and you might just have
the code infer it for you. So it'll look at a
number of records and say, okay, these are the fields I see in this
JSON or XML or Avro
or different types of files. Let me
show you how those readers work. So I could just pick a different one.
I could do Avro, comma-separated values. Grok will
read logs or any kind of semi-structured text.
IPFIX files, Parquet, syslogs,
Windows events, XML. Get the idea?
Pretty easy to do that. Don't have to learn anything else.
If you have any custom weird formats, it's open source,
take a look. Someone else might have written it, you can write it yourself,
or get a consultant to do it. Pretty easy to do.
These little boxes, I've written about 50 of them.
It's Java. Really simple, kind of, if you've ever done Spring
code, it's like that. You write it once, build a JAR
file, put it on a cluster, it's ready to go.
And I've got links to some of mine that you can download and use in
your system. So that was IoT data.
That's a great type of data. I want to show you a different type of
data since we've got some time here. This is weather data.
If you're in the United States, there's an organization for
the US government called the NOAA and they
put out forecasts from different weather stations and
current readings from weather stations all around the country.
And amazingly enough, every 15 minutes,
not really streaming, but pretty quick batch, they put
out a zip file of every reading in
the country. That's pretty awesome. So I'm going to download that
zip file so I could run this once just to give you
an idea. Get that flowing. Already have it. I'll uncompress
that zip file, pull out all the individual files
in there and then route them, and make sure that they're an airport.
There are some that aren't airports. For my customers,
they care about weather conditions around airports;
that's really the main parts of the US.
So I have all these XML files coming in.
I want to convert them to JSON and I'll run a little query on there,
make sure they have location data. I don't want junk coming out.
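Outside of NiFi, that same fetch, unzip, and convert step looks roughly like this in Python; the NOAA URL here is the commonly used current-observations feed, so double-check it before relying on it:

```python
import io
import json
import xml.etree.ElementTree as ET
import zipfile

import requests

# NOAA publishes a zip of current observations for every station
# (URL is an assumption; confirm the current location on weather.gov).
ZIP_URL = "https://forecast.weather.gov/xml/current_obs/all_xml.zip"

resp = requests.get(ZIP_URL, timeout=60)
resp.raise_for_status()

readings = []
with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
    for name in zf.namelist():
        if not name.endswith(".xml"):
            continue
        root = ET.fromstring(zf.read(name))
        # Flatten the per-station XML into a plain dict (one level deep).
        record = {child.tag: child.text for child in root}
        # Keep only records that actually have location data, like the flow's validation step.
        if record.get("latitude") and record.get("longitude"):
            readings.append(record)

print(len(readings), "stations")
print(json.dumps(readings[0], indent=2)[:300])
```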
So the end result of this is a whole bunch
of JSON, very easy to read. Could have kept
it in XML. I don't really like XML. I'm going to get rid of
that as soon as possible. So here we take a look.
I could see all that provenance and lineage data,
telling me what server it ran on. It gives it a unique ID,
what's the size, all the different attributes,
where it ran, what type it is. Here's that file.
So that's the airport, KO5. It's XML,
all kinds of metadata, how it downloaded that from
the website, and then the actual content.
And I could see, oh yeah, here it is. There's all that JSON data.
That was XML data before.
Now it's JSON data. So I took that
data in, parsed it apart, and I put it in Kafka.
And we could show you that in a minute, because I wanted to distribute
it. I usually put it in Pulsar; this one I put in Kafka.
Sometimes I'll put it in JMS. Lots of different messaging options.
Pulsar is probably your best
option there, but I could do that from any of
those different clients out there. So I'm consuming those messages
back. And let's show you that whole thing we talked about,
stream to lake. This is the stream coming from Pulsar or
Kafka. Here is my lake.
I'm creating ORC files, which is a
type of file that's used by Hive. And I'm putting that in
a directory on S3 automatically.
Don't have to do anything. This has, again, something that reads those
JSON files, writes them there. I do the same with Parquet,
same with Kudu. That's as easy as it is
to take that stream and get that into my data lake as fast as I
need it to be.
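NiFi does that with record writers and a put processor, but just to sketch the underlying idea in Python (the bucket and path are made up), writing a batch of those JSON records as Parquet on S3 is only a few lines with pyarrow:

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# A small batch of weather records like the ones shown earlier
# (already plain dicts after the XML-to-JSON step).
records = [
    {"station_id": "KEWR", "temp_f": 71.0, "latitude": 40.68, "longitude": -74.17},
    {"station_id": "KJFK", "temp_f": 69.0, "latitude": 40.64, "longitude": -73.78},
]

table = pa.Table.from_pylist(records)

# Credentials come from the usual AWS environment or instance profile;
# the bucket name and key prefix here are hypothetical.
s3 = fs.S3FileSystem(region="us-east-1")
pq.write_table(table, "my-data-lake/weather/readings-0001.parquet", filesystem=s3)
```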
And that same data is also flowing through my messaging system so that I
can write some more advanced analytics on here.
So before I send all that data out, I run some
validation on here. I check it against the schema that I have
to make sure that nothing's weird, because sometimes they give us bad
data, government data, you don't always get good data. I don't want that broken
data. I could store it, I could put it in
a directory, maybe. It's usually junk. That's up to you.
If you see value in it, maybe you could data mine that later. And then
I just put it into the messaging system to do
some more queries. Now, I mentioned I had some custom
processors to do deep learning. This is one right
here, and this is as hard as it is for you: you pick your data set,
and there's a couple of parameters there. This is a ResNet-50 for
doing this. And here I'm just running a couple at a time, just
to show you the results. And if
we take a look here, you can see here by the name
that these are images. NiFi works on
images. See, 700K versus those little files we had before.
And inside the metadata, I put the results
of my deep learning classification. Here's a bounding
box that I could draw around what I found in the
image, which you see here is a person.
I could have up to five, I could do more, but I limit it to
five because it gets a little hectic for my processing
here. So I found results: it was a person.
There are the image's height, min, max, those sorts of things.
What's the probability that it's actually a person?
They have pretty good confidence there. So let's see what that image actually
is. It's the side of my head. I'm a person. I'm very happy.
Sometimes I'm not a person. Today deep learning says I'm
a person. Everyone rejoice.
Yeah. AI has not taken over yet because sometimes I'm not a person.
Sometimes the cat's labeled a person.
You never know what you're getting. So we have all those messages here.
I've got over 25,000 in there already
and these are just going through my message queue
and I'm going to read them with Flink SQL
and do some different analytics on them. That IoT
stream and this weather stream, pretty straightforward, but it gives
you an idea of what you could do with different data as
it's coming in. You watch it over time.
Here's that IoT sensor data. A lot of data has
gone through there. Same with the weather. So this is how I'm
running my Flink SQL. There are lots of different consoles
out there. StreamNative has one, Ververica has one,
Cloudera has one. There's one with Apache Flink;
that one's a command line one. Whatever you're writing, there are lots of
different ways to run Flink SQL, and it's very scalable.
Now, you could also wrap it in your own Java code if you want to
manage the deployment yourself. Maybe you're doing it all open
source and you don't have an advanced environment to do
that. Here I'm showing some of the results
of my continuous SQL. So when I'm building this query, I could
see the results. And this one, if you notice, if you've worked with SQL
before, looks pretty familiar. I'm grabbing a location;
this is wherever that airport was that they
took the weather. And I want the max temperature in Fahrenheit,
average temperature, minimum temperature, and I'm just displaying
them here. This is over a short period
of time. We can set up windows of time with these streaming systems.
So maybe I look at all the forecasts in the last 6 hours,
take the min and max, to give you ideas there, because remember, this
is not in my final data store. This is in stream.
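The query itself, sketched with made-up table and column names to match what I just described, is a plain tumbling-window aggregate; here's roughly how I'd submit it from PyFlink:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Continuous aggregate over the weather stream: min/avg/max temperature per location
# in hourly tumbling windows. 'weather' is assumed to already be registered as a
# streaming table with an event-time column named event_time.
t_env.execute_sql("""
    SELECT
        location,
        MAX(temp_f) AS max_temp_f,
        AVG(temp_f) AS avg_temp_f,
        MIN(temp_f) AS min_temp_f,
        TUMBLE_START(event_time, INTERVAL '1' HOUR) AS window_start
    FROM weather
    GROUP BY location, TUMBLE(event_time, INTERVAL '1' HOUR)
""").print()
```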
While this happens, another record shows up. It's added
to this SQL; this is continuously running and
it just keeps going. We've got another one here where
I'm not doing any aggregates, I'm just looking at every record where
the location is not null. I mean, we did the validation,
but sometimes something doesn't show up. Some of these readings
at some of these weather stations are done manually; someone's typing
in the weather forecast or the
current conditions. So sometimes a little bit of human
error gets in there. But you can see some of the fields here.
And nice thing is we wrap this in a materialized
view so that now I have a REST endpoint that people
can query, and this is what it looks like. You get a
JSON array of all that weather data. I could
pull that into a Jupyter notebook, pull that into an application,
do what you need to do there.
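Pulling that materialized view into a notebook is then just an HTTP GET; the endpoint URL and columns below are made up for illustration:

```python
import pandas as pd
import requests

# Hypothetical materialized-view endpoint exposed by the Flink SQL console.
resp = requests.get("http://flink-host:8083/api/v1/views/weather_summary", timeout=30)
resp.raise_for_status()

# The endpoint returns a JSON array of records, so it drops straight into a DataFrame.
df = pd.DataFrame(resp.json())
print(df.head())
```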
I've got another Flink SQL here that's joining together two streams. These are
two different topics. I've got one for energy data,
one for my sensor data. I join them together.
This could have been a full outer join, left outer join,
right outer join. Again, if you've done any ANSI SQL-92,
Flink SQL is going to be pretty familiar.
It uses Apache Calcite, which is used in a ton of open source
projects like Phoenix. So you get used to this SQL
once and you pretty much get the syntax. There's some
extra things for doing really interesting complex
event processing and windowing, but it's pretty much SQL.
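Here's a rough sketch of that kind of two-stream join, again with made-up table and column names, submitted the same way:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Join the energy topic and the sensor topic on device id within a short time window.
# 'energy' and 'sensors' are assumed to be registered streaming tables with
# event-time columns; an outer join would just swap the JOIN keyword, as mentioned above.
t_env.execute_sql("""
    SELECT
        s.device_id,
        s.temp_f,
        e.watts,
        s.event_time
    FROM sensors AS s
    JOIN energy AS e
      ON s.device_id = e.device_id
     AND e.event_time BETWEEN s.event_time - INTERVAL '5' MINUTE
                          AND s.event_time + INTERVAL '5' MINUTE
""").print()
```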
You don't have to write any Java or any custom apps here,
but if you've got operators and functions running within
your queuing system, those can be executed before they get here
or after. Gives you some power. Again, another materialized
view here, so that people who don't have the
libraries can just call this REST endpoint,
get a bunch of JSON
and process it as they want to. And I've got a whole bunch
of different jobs running here. I've got
four different Flink applications that I could see in the dashboard
and I could dive into them and see what's going on, see how many
records, different things going on.
This one's interesting because you've got two different source tables and a
join, and you can see the number of records processing through there.
Pretty basic, but it gives you the idea. Sometimes you want to
see data stored in the cloud. We said lake;
well, here's my lake. This is a table
on top of Amazon
S3. And it looks like a
regular table. I could see the location and the details,
where it's stored. And it makes it pretty easy for
me to do what I need to do here. And it just acts
like a database for me. And I have all that permanent data,
so I have all the readings that have ever happened here. I could
store them, do whatever I need to do with that. Same thing
with the sensor readings. Pretty straightforward, regardless of
where I want to store that, whatever my lake is, like I said, it could
be Cloudera, it could be Amazon, Microsoft,
Redshift, Snowflake, whatever that next
Dremio is, whatever's your next data lake,
I could put it there. I create a little dashboard on it.
You can see there's a lot of different readings across the country.
These are some that are close to me. This is a local airport
by me, and I could see all the data there and download
it if I wanted to. And that has things like lat and long, so I could
put it on a map here. And if you look, there's a lot of different
airports where they're doing weather data in the United States.
You zoom out enough and it's just a nice,
pretty design there because there's thousands of records there.
Makes it very easy. Something else I can do is I
can send real time alerts to slack. This is great for DevOps,
but this also may be helpful for your data scientists and
analysts to see: okay, there's some new data coming in. Maybe we do
a sampling of it. Someone says, okay, there hasn't been data in a
while; now I see temperatures at an airport. Maybe that gives
me an idea for what I can do next. Lots of options there, pretty straightforward.
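In NiFi that's just a Slack processor at the end of the flow; from plain Python, the same alert is one POST to an incoming webhook (the webhook URL below is a placeholder):

```python
import requests

# Placeholder incoming-webhook URL; create one in your Slack workspace's app settings.
SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"

def alert(text: str) -> None:
    """Send a one-line alert into the channel the webhook is bound to."""
    requests.post(SLACK_WEBHOOK, json={"text": text}, timeout=10).raise_for_status()

alert("New weather readings landed: KEWR 71F at 14:45 UTC")
```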
I think now we're at the end. Hopefully there's
time during the real time event for questions.
If there isn't, please reach out to me.
I've got all my contact information here. Whether you
contact me at @PaasDev on Twitter, see me on my website,
open a pull request, you know, however you
want to contact me. I'm always interested in talking
about streaming and getting data to a data lake.
Really easy, even if it's for machine learning or deep learning or whatever
you need it to be for. Straightforward thing there.
Thanks for coming to my talk. Hope you learned something.
If you're looking to learn a little more,
definitely follow me. We have deeper dives.
We could do whole day events. So reach out.
Thanks a lot.