Transcript
This transcript was autogenerated. To make changes, submit a PR.
What if you could work with some of the world's most innovative companies,
all from the comfort of a remote workplace?
Andela has matched thousands of technologists across the globe to their
next career adventure. We're empowering new talent worldwide,
from Sao Paulo to Egypt and Lagos to Warsaw.
Now the future of work is yours to create.
Anytime, anywhere. The world is at your fingertips.
This is Andela.
Hey, friends. My name is Joe Karlsson, and this is
a gentle introduction to building data-intensive applications.
So, first of all, do you know how much data is created every day
in 2022? Well, it looks
like about 80 zettabytes of data. And by 2025, we're looking
at about 175 zettabytes. That's over double
the amount of current data we're processing today.
In just, was it three years? Three years from now?
That's coming up fast. Right?
So I want to talk to you about what it
takes to build a data-intensive application,
because knowing how to do this is going to be critical
for building applications that can scale in the future. And it's hard,
right? But let me whet your palate a little bit.
I want to show you today how building
data-intensive applications is a little bit like SpongeBob
SquarePants. And yes, I'm serious.
We're about to dig in, right? SpongeBob is a
reliable friend. He can absorb things
like a sponge and scale to huge sizes.
He's able to withstand nautical nonsense even
as our servers and users flop like a fish.
And most importantly, he can be simple. Right? Let's dig
in and see how SpongeBob can help us learn how to
build data-intensive applications. So my
name is Joe Karlsson. I work for a company called SingleStore,
and I make a lot of TikToks and Twitter content. So if you like what you
see today, be sure to check that out. You can go to joekarlsson.dev/links
for everything discussed here in this talk today, as well as all my socials.
So go check that out. Okay, so before we begin,
I'll be checking out the chat if anyone has any questions. And credit to Martin
Kleppmann for pioneering a lot of this content in his book,
Designing Data-Intensive Applications, which you should totally go check out. A lot
of this is from his content. Today we're
going to be introducing concepts about data-intensive applications
and how you can get started building them. And this is for developers
who maybe have some comfort with building some
simple applications and are looking to scale up.
Any knowledge about SQL or RDBMSes is also really useful,
too. Okay, so we're going to be discussing what data-intensive
applications are, then we'll be going into
the fundamentals of how to design and build a scalable data-intensive application.
The three key tenets are reliability, scalability, and maintainability.
And then, I know this isn't live, so we're not going to do a Q&A today,
but let's get started with our intro content.
Clearly we live in a world of data-intensive applications, and if we're
not all building them currently, they're going to be taking up more
and more sectors very soon. So you may be asking
yourself, what exactly is a data-intensive application? And in
my humble opinion, it is an application that has one of these
five core tenets: it handles a massive
amount of data, or data is streaming
in really quickly, or low-latency
queries on your database are critical,
or you have a complex series of databases with
joins or whatever, or you have massive parallelism
or concurrency in your application. The bottom line,
though, is that a data-intensive application is one where data
is the main constraint or bottleneck. I think
previously CPU cycles were one of the main constraints,
but in my humble opinion, the advent of the cloud
made that much less of an issue. Right? With a Kubernetes
cluster or whatever in the cloud, we can just start scaling up servers and new
nodes to handle our applications. And in my opinion,
as developers, one of the things we're going to struggle with the most
is designing applications that scale up with their data intensity,
not with their compute cycles, which is why this is important.
So anyway, glad to be here.
So obviously, businesses are becoming more and more
data-intensive. We're seeing massively rising complexity
in data analysis and data science, machine learning models consuming
mass amounts of data in real time, and users
who are demanding more and more
as their applications scale.
We want the apps we use to be fast. We want them to be real
time. We want those real-time alerts. Waiting 24 hours for
a batch processing job no longer cuts it.
And our applications are struggling to keep up,
and for good reason. Like building these apps is
hard, it takes time and it takes experimentation,
and it usually involves hybrid approaches in order to scale our
applications up. That's okay.
I like giving some more concrete examples here, and I've broken these up by industries
and verticals, with some use cases for each of them. This is just
a small sample of what's possible with data-intensive applications
here: dashboards, real-time ML, streaming,
ads optimization, real-time... what's
the word? Recommendations. IoT devices
streaming mass amounts of data into databases, real-time data
dashboards, whatever. There are so many things you can possibly do.
And again, this is just a small smattering.
Okay, so I'm obviously biased here, but SingleStore
is the single database for data-intensive applications,
and I'm going to tell you why today. SingleStore handles
this and makes it easier to build these applications, because they can get
incredibly difficult. You need a database that's going
to help you solve those problems as easily as possible.
We're going to try to keep our architecture as simple as possible,
and SingleStore can help you do that. So if you're going to take anything
home today, just know SingleStore is the best database for handling
data-intensive applications. We have some comparisons
here too. We scale better, we're faster, we have the lowest latencies,
and we handle more data models and more parallelism than
any other database. So go check those out. We're great for data analytics, real-time
machine learning, real-time recommendations, whatever. We handle all
of those amazingly well. Okay, let's jump into it. What does
an architecture for a data-intensive application look like?
Again, great question, dear listener.
They tend to be pretty complicated. You have lots of different
database pieces kind of fitting together, shuffling,
transforming, and saving data in lots of places as
it flows through. We're seeing a lot of microservices kind of
decoupling databases, which is fine, but it does make our
applications more complicated,
which, I don't know about you, but I think developers have a tendency to overcomplicate
things to appear smarter. We are handling harder
problems, and I think we're kind of in this age of trying to understand and
simplify these things as we scale. I think Kubernetes is an
example of something that's going through that right now, getting easier and
better to use as it matures. But data-intensive
architectures and databases just aren't quite there yet.
But again, SingleStore does make that a lot easier.
We simplify that by handling more data types with better
performance than the alternatives. Lots of users are coming and using us
to replace lots of different pieces of their data-intensive applications, because we
can handle that scale, and it simplifies their architectures.
Okay, much, much better. Stripping it all down. Right,
thank you, SpongeBob. All right, so let's get into the nuts
and bolts of designing a data-intensive application. So the
three big ideas here today are reliability, scalability, and
maintainability. These are the three things that make up
a scalable, data-intensive application. Let's dig into all three
of them and kind of explore each one with some real-world
examples. First up, reliability.
We've got SpongeBob with his good old reliable friend,
Patrick Star, right? Best buddies there to
the end, through thick and thin, just like our
applications and the databases supporting those applications,
right? So typically, as developers building a data-intensive
application, we expect it to perform as expected.
Duh. We want it to be able to tolerate any errors
that a user might make. Okay? Yes, good luck predicting those.
We want fast performance, and we want it to prevent any abuse
or insecurity or any leaking of data or whatever,
right? We don't want any user secrets kind of getting out into the ether.
This all seems pretty straightforward, right? But it gets
complicated actually trying to make sure all these things we want
actually happen, right? So the first thing with reliability
is hardware and software faults. For
a long time, especially for databases, redundancy was
really difficult, even frowned upon in a lot of situations. You saw
a lot of single-node databases popping up.
But we're moving towards a future of systems
that tolerate losing more machines, right? And we need that
to include our databases too. Netflix, for example,
has a tool called Chaos Monkey that randomly shuts down servers
and databases to ensure that the other systems keep working
and everything is backed up, which is really helpful for their
scalable architectures. But you have software faults
too. It's unlikely, again, not impossible,
and you want to plan for it, that large numbers of hardware components fail
at the same time. Typically it's software, or
human errors related to software, that is the main cause of something
shutting down, and having redundancy helps us there.
And some other stuff we'll talk about in a little bit, like testing and
good systems and safeguards, helps with that as well.
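To make that Chaos Monkey idea concrete, here's a minimal sketch in
Python (everything here is made up for illustration; this is not
Netflix's actual tool): run a pool of worker processes, kill one at
random, and check that the survivors keep making progress.

```python
import multiprocessing
import random
import time

def worker(counter):
    # Simulate a node doing work by bumping a shared counter.
    while True:
        with counter.get_lock():
            counter.value += 1
        time.sleep(0.01)

if __name__ == "__main__":
    counter = multiprocessing.Value("i", 0)
    nodes = [multiprocessing.Process(target=worker, args=(counter,))
             for _ in range(4)]
    for n in nodes:
        n.start()

    time.sleep(1)
    victim = random.choice(nodes)   # the "chaos" step
    victim.terminate()
    print(f"killed node {nodes.index(victim)}")

    before = counter.value
    time.sleep(1)
    # If the counter is still climbing, the surviving nodes
    # tolerated the failure.
    assert counter.value > before, "system stopped after losing a node"
    print("survivors still making progress")

    for n in nodes:
        n.terminate()
```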
SingleStore, for example, has data
copied horizontally across leaf nodes in order to
improve redundancy. Each leaf has its data replicated so
that if one of the nodes goes down, you don't lose your data and you don't
have any downtime. The system promotes a new leaf node
so that you don't lose any of that data; it stays online in
case something goes down. All of the data is replicated across
the leaf partitions, and your data is still there.
You can control how many leaf nodes it's replicated
to, and what kind of write or read consistency
you want, so that you get the performance
trade-offs you want.
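Here's a tiny, illustrative Python sketch of that failover idea
(hypothetical names only, not SingleStore's actual internals): each
partition has a primary leaf node plus replicas, and when a node dies,
a surviving replica gets promoted.

```python
# Illustrative only — not SingleStore's real internals. Each partition
# lives on a primary leaf node with replicas elsewhere; when a node
# fails, a replica is promoted so the data stays online.
partitions = {
    "p0": {"primary": "leaf-1", "replicas": ["leaf-2", "leaf-3"]},
    "p1": {"primary": "leaf-2", "replicas": ["leaf-3"]},
}

def handle_node_failure(dead_node: str) -> None:
    for name, part in partitions.items():
        # Drop the dead node from any replica lists.
        part["replicas"] = [r for r in part["replicas"] if r != dead_node]
        # Promote a surviving replica if the primary died.
        if part["primary"] == dead_node and part["replicas"]:
            part["primary"] = part["replicas"].pop(0)
            print(f"{name}: promoted {part['primary']} to primary")

handle_node_failure("leaf-1")  # p0 moves to leaf-2; no data lost
```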
So, human errors. Humans tend to be the least reliable component of any
system. It's true, right?
Computers go down, but humans go down way more often:
configuration errors, outages,
whatever. Humans are the leading cause of outages.
That's just the way it is, right? So as developers,
we want to design systems that minimize the opportunities for human error
in our systems. That means we want admin interfaces
that are designed well, so you don't make mistakes
and can't do things
that people shouldn't be doing on your system.
Automated testing can be really helpful for this,
and deploying fully featured sandbox
environments, locally or in the cloud, to do testing on can be another way
to help mitigate some of these issues. This is a minor sidebar, but databases
frequently get left out of the DevOps discussion.
And I get it, it's hard to do. But data
tends to be the most important and critical part of our application, and leaving
it to a manual process is going to leave
you vulnerable to human error
being introduced into your data-intensive application.
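As one small sketch of that sandbox idea (hypothetical schema, with
SQLite standing in just for illustration): give every test run a
throwaway in-memory database, so a bad migration fails in CI instead
of in production.

```python
import sqlite3

def apply_migration(conn: sqlite3.Connection) -> None:
    # The schema change we want to ship.
    conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

def test_migration_preserves_rows() -> None:
    # A disposable sandbox database: created, migrated, and thrown
    # away on every run, so mistakes never touch production data.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
    conn.execute("INSERT INTO users VALUES (1)")
    apply_migration(conn)
    rows = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    assert rows == 1, "migration dropped data"

test_migration_preserves_rows()
print("migration test passed")
```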
Okay, great. TL;DR: just have
good practices to protect your system from us humans. Next up, scalability.
This is how we cope with increased load. We need to
be able to scale the system up in order to handle it.
This is the second tenet, right? Scalability: handling the increased load on
our systems, right? Whether that's increased throughput
to our database, or adding more parallel users accessing
data, or whatever. Right. And I want to use an example
here to illustrate what I mean by scalability. So let's talk
about Twitter, right? The core operations of Twitter: we want to post a tweet, and
we want each of our users to read tweets from a
timeline. So there are two key methods you
could use to implement this.
Let's start with posting tweets.
We're seeing about 4,600 tweets posted per
second, 12,000 at peak, and about 300,000 home timeline reads per second.
This might be different now; this is data from a couple of years ago. But you get
the point, right? We want to design the system for this. So the first
way you might do that is by building a couple of tables:
we have a follows table and
a users table that let us find all of the
users someone follows and the tweets they're posting, and we can pull that into a feed,
right? We can do a couple of joins and pull that in.
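Here's a minimal sketch of that first approach (a hypothetical schema,
using SQLite in Python just for illustration): the home timeline gets
computed with joins at read time.

```python
import sqlite3

# Approach #1: joins at read time. Fine at small scale, but the joins
# get expensive as tweet and follow counts grow.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE follows (follower_id INTEGER, followee_id INTEGER);
    CREATE TABLE tweets  (id INTEGER PRIMARY KEY, sender_id INTEGER,
                          body TEXT, posted_at TEXT);
    INSERT INTO users   VALUES (1, 'spongebob'), (2, 'patrick');
    INSERT INTO follows VALUES (1, 2);  -- spongebob follows patrick
    INSERT INTO tweets  VALUES (1, 2, 'is mayonnaise an instrument?',
                                '2022-01-01');
""")

home_timeline = conn.execute("""
    SELECT t.body, u.name, t.posted_at
    FROM tweets  AS t
    JOIN users   AS u ON t.sender_id = u.id
    JOIN follows AS f ON f.followee_id = u.id
    WHERE f.follower_id = ?          -- the viewing user
    ORDER BY t.posted_at DESC
""", (1,)).fetchall()
print(home_timeline)
```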
And Twitter actually did this, right? And if you
were going to build this at home, like your little toy Twitter app at home,
which, by the way, was the first web app I ever built, just
to learn web development. A little Twitter clone. But with
this approach, the system struggled to keep up with the load of the home
timeline queries. That's because joins are expensive in both memory
and time. Doing all those joins within the system was really,
really expensive and slow. And as the users went
up, posting more and more and more,
and reading more and more and more, those joins became a major blocker
for the system. Yes.
So that didn't work. So what's the second method
that Twitter used? This is a fan-out
approach. So think of it as like a mailbox, right?
A user posts a tweet and they put it
in their mailbox, and then we do a fan-out:
a service reads that letter in your mailbox,
copies it, and sends it out to a bunch of other people's mailboxes.
So every user that follows you gets it in their timeline,
which lets you do way fewer joins;
you're just inserting the tweet into each follower's timeline cache.
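A minimal sketch of that fan-out idea in Python (made-up names, with
in-memory dicts standing in for the real caches):

```python
from collections import defaultdict

# Who follows whom, and a per-user "mailbox" (timeline cache).
followers = {"patrick": ["spongebob", "squidward"]}
timelines = defaultdict(list)

def post_tweet(author: str, body: str) -> None:
    # Fan-out on write: one insert per follower. Cheap for most users,
    # very expensive for a 30-million-follower account.
    for follower in followers.get(author, []):
        timelines[follower].append((author, body))

def read_timeline(user: str) -> list:
    # Reading is now just a lookup — no joins at read time.
    return timelines[user]

post_tweet("patrick", "is mayonnaise an instrument?")
print(read_timeline("spongebob"))
```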
There are a bunch of benefits to this, but there is one downside.
The downside of this approach is that posting a tweet now requires
a lot of extra work. Some users might have like 30 million followers,
like a Justin Bieber or Kim Kardashian or something, right?
That means that a single tweet from one of these power users would result
in over 30 million writes to other mailboxes,
which is a lot, right. This approach does
work really well, but there are some downsides to it, right? And it's a lot
more complicated, especially when you're talking Twitter scale.
So what did Twitter do? They actually approached
this using both. So they used a hybrid approach.
Tweets continue to be fanned out for most users,
but for a small number of power users,
they still use the first method: those tweets
get pulled in at read time. So people go and read the message from
the person they're following, which is wild.
There's a little bit of both. I think that's kind of genius, though. And that
tends to be the approach with a lot of these systems. Right. You kind of,
like, see what happens. You try to approach it, and you have to try to
do some interesting massaging to get it to fit your system.
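Continuing the toy example (again, names made up for illustration),
the hybrid read path might look something like this: fanned-out
tweets come from the timeline cache, and power-user tweets get merged
in at read time.

```python
# Hybrid read path (illustrative only). Most tweets were fanned out
# into the cache at write time; power users' tweets are pulled here.
POWER_USERS = {"celebrity"}

timelines = {"spongebob": [("patrick", "is mayonnaise an instrument?")]}
recent_tweets = {"celebrity": [("celebrity", "hello, 30M followers")]}

def read_timeline_hybrid(user: str, following: list) -> list:
    timeline = list(timelines.get(user, []))      # fan-out results
    for followee in following:
        if followee in POWER_USERS:               # pulled at read time
            timeline.extend(recent_tweets.get(followee, []))
    return timeline

print(read_timeline_hybrid("spongebob", ["patrick", "celebrity"]))
```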
Okay, there we go. By the way, SingleStore also
does this too. We use a horizontally scalable
database to increase parallelism and concurrency,
which increases throughput for our databases. But there are
lots of ways you can do it. You can scale out horizontally, like SingleStore
does, or scale up vertically, which typical SQL databases do. I know you
can do it with a lot of other ones, but that's just buying a bigger
server to handle the increased load.
Yeah, those are kind of the ways systems handle load.
In my humble opinion, going with the node-based horizontal approach is
one of the most scalable and effective ways moving forward.
Okay, the last core tenet here is maintainability.
The majority of our software cost is actually not from
the development of the initial product; it's the ongoing maintenance. Us engineers
are very expensive, and it's important for us to
build systems that are maintainable long term.
That means there are three things we need to keep in mind here:
operability, simplicity, and evolvability of our systems.
So operability does not refer to performing
operations or surgery. Operability has to do with
the team that's responsible for the ongoing operation of
the code base that's running: tracking down
problems with the infrastructure or software,
anticipating future problems or scaling problems,
monitoring for anything that goes down in any of the systems,
performing complex maintenance tasks and security audits of
the system, a bunch of these things. Basically, good operability
means you're making routine tasks easy for your scalable
system. Okay, simplicity. This is about
managing complexity and making things easier. Obviously,
building complex systems is really, really hard. And the
more moving parts you have in a system, the harder it is to troubleshoot and
build. Thinking long term about your product,
you want to make sure that it is easy enough for new people to
come in and start using it. I've seen this a lot,
too. I feel like as engineers, we get really into something, we go crazy about it,
and then we realize it's hard to maintain. I know personally, for me,
I've favored low-maintenance, scalable solutions,
because I would rather be building new stuff than maintaining old stuff long term.
And focusing on simplifying our architecture allows
us to do that.
Again, I'm going to say this just one last time here, but with SingleStore, people
are replacing lots of different databases because it's much
simpler and more scalable to do that. If you have three different databases,
like an in-memory database, a NoSQL database,
and a SQL database like Postgres or whatever, you can simplify
those down into one data service that does all those things. That's going to make
your ongoing maintenance so much simpler. SingleStore can
do that, which is amazing. Check it out. And lastly,
evolvability. The only thing that's certain in software is change.
The only certainty is change. Something like that,
right? We want to make sure your systems are
built to evolve and change, which, I will say,
data makes hard. Data is sticky.
You can't predict everything that's going to happen,
but you can try your best to make it easy to change in
the future as new requirements come up in your system.
Okay, so quick recap here. If you're building out a
scalable, data-intensive application, you need
a database and system that scales with your usage. As it grows,
your data usage will grow. That's just how it goes. Data is sticky,
and unless you have a strong governance policy, it's probably
going to be staying there and growing. So you want to make sure your system
can grow for at least the next five years. Honestly,
make sure you're securing your data and you can ensure privacy
for your users. You need to make sure it can handle load today as
well as your anticipated load, again, within five years, which is hard to
predict. It needs to be capable of delivering
analytics. I would recommend making
sure it can handle real-time analytics, because if that's not
a current need, you should be anticipating it as a potential future
need for your system. You want to make sure that there's no noticeable
lag, especially for the end users of your system.
This is all admittedly a tall order, but SingleStore
can do all this and way, way more. It's a fast, unified database
system that's ACID-compliant,
all that good stuff. You should definitely go check that out. Okay,
so, questions? I'm in the comments if anyone wants to chat.
Otherwise, as a next step I would recommend checking out Designing
Data-Intensive Applications by Martin Kleppmann. I
also recommend setting up your own
project. I think the best way to learn something is to build it yourself.
I have some examples that I can share in the chat here:
building your own stock ticker or stock scraper,
or streaming in real-time Twitter data and doing analysis
on it with machine learning models. There are tons of huge data
sets you can play around with. The best way to do that is to find
a system that is free, with like a developer
tier, and try to build something. And I think go
is a great place to do that too. In fact, I've got a great shipping
logistics demo I'm going to share in the comments here.
Definitely learn by reading. There are a bunch of ways to do that here. And if
you want to get started with SingleStore and try it out, it's a great,
free, easy way to do that. We have a managed service. You get $500
in free credits today, no credit card needed. You can just go try it out.
It's amazing. Go to singlestore.com/managed-service-trial.
Okay. And here is some additional reading if you want it as well,
too. I'm going to flash this up. Great.
And thank you so much, everybody. This has been an absolute
blast. I am so honored to be here.
You all are amazing. Again, my name is Joe Karlsson. I'm a software engineer and
I work at SingleStore. If you want to follow me, check me out at
@JoeKarlsson1 on Twitter. That's in the lower right hand corner. If you want
all the other links, you can check out joekarlsson.dev/links.
All right, I'm heading out. Thank you so much.
Talk to you later.