Transcript
This transcript was autogenerated. To make changes, submit a PR.
What if you could work with some of the world's most innovative companies,
all from the comfort of a remote workplace?
Andela has matched thousands of technologists across the globe to their
next career adventure. We're empowering new talent worldwide,
from Sao Paulo to Egypt and Lagos to Warsaw.
Now the future of work is yours to create.
Anytime, anywhere. The world is at your fingertips.
This is Andela.
Hey, friends. My name is Joe Karlsson, and this is
a gentle introduction to building data-intensive applications.
So, first of all, do you know how much data is created every day
in 2022? Well, it looks
like about 80 zettabytes of data. And by 2025, we're looking
at about 175 zettabytes. That's over double
the amount of current data we're processing today.
In just, was it three years? Three years from now?
That's coming up fast. Right?
So I want to talk to you about what it
takes to build a data-intensive application,
because knowing how to do this is going to be critical
for building applications that can scale in the future. And it's hard,
right? But let me whet your palate a little bit.
I want to show you today how building
data-intensive applications is a little bit like SpongeBob
SquarePants. And yes, I'm serious.
We're about to dig in, right? SpongeBob is a
reliable friend. He can absorb things
like a sponge and scale to huge sizes.
He's able to withstand nautical nonsense even
as our servers and users flop like a fish.
And most importantly, he can be simple. Right? Let's dig
in and see how SpongeBob can help us learn how to
build data-intensive applications. So my
name is Joe Karlsson. I work for a company called SingleStore,
and I make a lot of TikToks and Twitter content. So if you like what you
see today, be sure to check that out. You can go to joekarlsson.dev/links
for everything discussed here in this talk today, as well as all my socials.
So go check that out. Okay, so before we begin,
I'll be checking out the chat if anyone has any questions. And credit to Martin
Kleppmann for pioneering a lot of this content in his book,
Designing Data-Intensive Applications, which you should totally go check out. A lot
of this is from his content. Today we're
going to be introducing concepts about data-intensive applications
and how you can get started building them. And this is for developers
who maybe have some comfort with building some
simple applications and are looking to scale up.
Any knowledge about SQL or RDBMSes is also really useful,
too. Okay, so we're going to be discussing what data-intensive
applications are, then we'll be going into
the fundamentals of how to design and build a scalable data-intensive application.
The three key tenets are reliability, scalability, and maintainability.
And then, I know this isn't live, so we're not going to do a Q&A today,
but let's get started with our intro content.
Clearly we live in a world of data-intensive applications, and if we're
not all building them currently, they're going to be taking up more
and more sectors very soon. So you may be asking
yourself, what exactly is a data-intensive application? And in
my humble opinion, it is an application that has one of these
five core tenets: it handles a massive
amount of data, or data is streaming
in really quickly, or low-latency
queries on your database are critical,
or you have a complex series of databases with
joins or whatever, or you have massive parallelism
or concurrency in your application. The bottom line,
though, is that a data-intensive application is one where data
is the main constraint or bottleneck. I think
previously CPU cycles were one of the main constraints,
but in my humble opinion, the advent of the cloud
made that much less of an issue. Right? With a Kubernetes
cluster or whatever in the cloud, we can just start scaling up servers and new
nodes to handle our applications. And in my opinion,
as developers, one of the things we're going to struggle with the most
is designing applications that scale up with their data intensity,
not with their compute cycles, which is why this is important.
So anyway, glad to be here.
So obviously, businesses are becoming more and more
data-intensive. We're seeing massively rising complexity
in data analysis and data science, machine learning models consuming
mass amounts of data in real time, and users
who are demanding more and more
as their applications scale.
We want the apps we use to be fast. We want them to be real
time. We want those real-time alerts. Waiting 24 hours for
a batch processing job no longer cuts it.
And our applications are struggling to keep up,
and for good reason. Like building these apps is
hard, it takes time and it takes experimentation,
and it usually involves hybrid approaches in order to scale our
applications up. That's okay.
I like giving some more concrete examples here, and I've broken these up by industries
and verticals, with some use cases for each of them. This is just
a small sample of what's possible with data-intensive applications
here: dashboards, real-time ML, streaming,
ads optimization, real-time... what's
the word? Recommendations. IoT devices
streaming mass amounts of data into databases, real-time data
dashboards, whatever. There are so many things you can possibly do.
And again, this is just a small smattering.
Okay, so I'm obviously biased here, but SingleStore
is the single database for data-intensive applications,
and I'm going to tell you why today. SingleStore handles
this and makes it easier to build these applications, because they can get
incredibly difficult. You need a database that's going
to help you solve those problems as easily as possible.
We're going to try to keep our architecture as simple as possible,
and SingleStore can help you do that. So if you're going to take anything
home today, just know SingleStore is the best database for handling
data-intensive applications. We have some comparisons
here too. We scale better, we're faster, we have the lowest latencies,
and we handle more data models and more parallelism than
any other database. So go check those out. We're great for data analytics, real-time
machine learning, real-time recommendations, whatever. We handle all
of those amazingly well. Okay, let's jump into it. What does
an architecture for a data-intensive application look like?
Again, great question, dear listener.
They tend to be pretty complicated. You have lots of different
database pieces kind of fitting together, shuffling,
transforming, and saving data in lots of places as
it flows through. We're seeing a lot of microservices kind of
decoupling databases, which is fine, but it does make our
applications more complicated,
which, I don't know about you, but I think developers have a tendency to overcomplicate
things to appear smarter. We are handling harder
problems, and I think we're kind of in this age of trying to understand and
simplify these things as we scale. I think Kubernetes is an
example of something that's going through that right now, getting easier and
better to use as it matures. But data-intensive
architectures and databases just aren't quite there yet.
But again, SingleStore does make that a lot easier.
We simplify that by handling more data types with better
performance than the alternatives. Lots of users are coming and using us
to replace lots of different pieces of their data-intensive applications, because we
can handle that scale, and it simplifies their architectures.
Okay, much, much better. Stripping it all down. Right,
thank you, SpongeBob. All right, so let's get into the nuts
and bolts of designing a data-intensive application. So the
three big ideas here today are reliability, scalability, and
maintainability. These are the three things that make up
a scalable, data-intensive application. Let's dig into all three
of them and kind of explore each one with some real-world
examples. First up, reliability.
We've got SpongeBob with his good old reliable friend,
Patrick Star, right? Best buddies there to
the end, through thick and thin, just like our
applications and the databases supporting those applications,
right? So typically, as developers building a data-intensive
application, we expect it to perform as expected.
Duh. We want it to be able to tolerate any errors
that a user might make. Okay? Yes, good luck predicting those.
We want fast performance, and we want it to prevent any abuse
or insecurity or any leaking of data or whatever,
right? We don't want any user secrets kind of getting out into the ether.
This all seems pretty straightforward, right? But it gets
complicated actually trying to make sure all these things we want
actually happen, right? So the first thing with reliability
is hardware and software faults. For
a long time, especially for databases, redundancy was
really difficult, even frowned upon in a lot of situations. You saw
a lot of single-node databases popping up.
But we're moving towards a future of systems
that tolerate losing more machines, right? And we need that
to include our databases too. Netflix, for example,
has a tool called Chaos Monkey that randomly shuts down servers
and databases to ensure that the other systems keep working
and everything is backed up, which is really helpful for their
scalable architectures. But you have software faults
too. It's unlikely, again, not impossible,
and you want to plan for it, that large numbers of hardware components fail
at the same time. Typically it's software, or
human errors related to software, that is the main cause of something
shutting down, and having redundancy helps us there.
And some other stuff we'll talk about in a little bit, like testing and
good systems and safeguards, helps with that as well.
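To make that Chaos Monkey idea concrete, here's a minimal sketch in
Python (everything here is made up for illustration; this is not
Netflix's actual tool): run a pool of worker processes, kill one at
random, and check that the survivors keep making progress.

```python
import multiprocessing
import random
import time

def worker(counter):
    # Simulate a node doing work by bumping a shared counter.
    while True:
        with counter.get_lock():
            counter.value += 1
        time.sleep(0.01)

if __name__ == "__main__":
    counter = multiprocessing.Value("i", 0)
    nodes = [multiprocessing.Process(target=worker, args=(counter,))
             for _ in range(4)]
    for n in nodes:
        n.start()

    time.sleep(1)
    victim = random.choice(nodes)   # the "chaos" step
    victim.terminate()
    print(f"killed node {nodes.index(victim)}")

    before = counter.value
    time.sleep(1)
    # If the counter is still climbing, the surviving nodes
    # tolerated the failure.
    assert counter.value > before, "system stopped after losing a node"
    print("survivors still making progress")

    for n in nodes:
        n.terminate()
```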
SingleStore, for example, has data
copied horizontally across leaf nodes in order to
improve redundancy. Each leaf has its data replicated so
that if one of the nodes goes down, you don't lose your data and you don't
have any downtime. The system promotes a new leaf node
so that you don't lose any of that data; it stays online in
case something goes down. All of the data is replicated across
the leaf partitions, and your data is still there.
You can control how many leaf nodes it's replicated
to, and what kind of write or read consistency
you want, so that you get the performance
trade-offs you want.
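Here's a tiny, illustrative Python sketch of that failover idea
(hypothetical names only, not SingleStore's actual internals): each
partition has a primary leaf node plus replicas, and when a node dies,
a surviving replica gets promoted.

```python
# Illustrative only — not SingleStore's real internals. Each partition
# lives on a primary leaf node with replicas elsewhere; when a node
# fails, a replica is promoted so the data stays online.
partitions = {
    "p0": {"primary": "leaf-1", "replicas": ["leaf-2", "leaf-3"]},
    "p1": {"primary": "leaf-2", "replicas": ["leaf-3"]},
}

def handle_node_failure(dead_node: str) -> None:
    for name, part in partitions.items():
        # Drop the dead node from any replica lists.
        part["replicas"] = [r for r in part["replicas"] if r != dead_node]
        # Promote a surviving replica if the primary died.
        if part["primary"] == dead_node and part["replicas"]:
            part["primary"] = part["replicas"].pop(0)
            print(f"{name}: promoted {part['primary']} to primary")

handle_node_failure("leaf-1")  # p0 moves to leaf-2; no data lost
```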
So, human errors. Humans tend to be the least reliable component of any
system. It's true, right?
Computers go down, but humans go down way more often:
configuration errors, outages,
whatever. Humans are the leading cause of outages.
That's just the way it is, right? So as developers,
we want to design systems that minimize the opportunities for human error
in our systems. That means we want admin interfaces
that are designed well, so you don't make mistakes
and can't do things
that people shouldn't be doing on your system.
Automated testing can be really helpful for this,
and deploying fully featured sandbox
environments, locally or in the cloud, to do testing on can be another way
to help mitigate some of these issues. This is a minor sidebar, but databases
frequently get left out of the DevOps discussion.
And I get it, it's hard to do. But data
tends to be the most important and critical part of our application, and leaving
it to a manual process is going to leave
you vulnerable to human error
being introduced into your data-intensive application.
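As one small sketch of that sandbox idea (hypothetical schema, with
SQLite standing in just for illustration): give every test run a
throwaway in-memory database, so a bad migration fails in CI instead
of in production.

```python
import sqlite3

def apply_migration(conn: sqlite3.Connection) -> None:
    # The schema change we want to ship.
    conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

def test_migration_preserves_rows() -> None:
    # A disposable sandbox database: created, migrated, and thrown
    # away on every run, so mistakes never touch production data.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
    conn.execute("INSERT INTO users VALUES (1)")
    apply_migration(conn)
    rows = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    assert rows == 1, "migration dropped data"

test_migration_preserves_rows()
print("migration test passed")
```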
Okay, great. TL;DR: just have
good practices to protect your system from us humans. Next up, scalability.
This is how we cope with increased load. We need to
be able to scale the system up in order to handle it.
This is the second tenet, right? Scalability: handling the increased load on
our systems, right? Whether that's increased throughput
to our database, or adding more parallel users accessing
data, or whatever. Right. And I want to use an example
here to illustrate what I mean by scalability. So let's talk
about Twitter, right? The core operations of Twitter: we want to post a tweet, and
we want each of our users to read tweets from a
timeline. So there are two key methods you
could use to implement this.
Let's start with posting tweets.
We're seeing about 4,600 tweets posted per
second, 12,000 at peak, and about 300,000 home timeline reads per second.
This might be different now; this is data from a couple of years ago. But you get
the point, right? We want to design the system for this. So the first
way you might do that is by building a couple of tables:
we have a follows table and
a users table that let us find all of the
users someone follows and the tweets they're posting, and we can pull that into a feed,
right? We can do a couple of joins and pull that in.
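Here's a minimal sketch of that first approach (a hypothetical schema,
using SQLite in Python just for illustration): the home timeline gets
computed with joins at read time.

```python
import sqlite3

# Approach #1: joins at read time. Fine at small scale, but the joins
# get expensive as tweet and follow counts grow.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE follows (follower_id INTEGER, followee_id INTEGER);
    CREATE TABLE tweets  (id INTEGER PRIMARY KEY, sender_id INTEGER,
                          body TEXT, posted_at TEXT);
    INSERT INTO users   VALUES (1, 'spongebob'), (2, 'patrick');
    INSERT INTO follows VALUES (1, 2);  -- spongebob follows patrick
    INSERT INTO tweets  VALUES (1, 2, 'is mayonnaise an instrument?',
                                '2022-01-01');
""")

home_timeline = conn.execute("""
    SELECT t.body, u.name, t.posted_at
    FROM tweets  AS t
    JOIN users   AS u ON t.sender_id = u.id
    JOIN follows AS f ON f.followee_id = u.id
    WHERE f.follower_id = ?          -- the viewing user
    ORDER BY t.posted_at DESC
""", (1,)).fetchall()
print(home_timeline)
```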
And Twitter actually did this, right? And if you
were going to build this at home, like your little toy Twitter app at home,
which, by the way, was the first web app I ever built, just
to learn web development. A little Twitter clone. But with
this approach, the system struggled to keep up with the load of the home
timeline queries. That's because joins are expensive in both memory
and time. Doing all those joins within the system was really,
really expensive and slow. And as the users went
up, posting more and more and more,
and reading more and more and more, those joins became a major blocker
for the system. Yes.
So that didn't work. So what's the second method
that Twitter used? This is a fan-out
approach. So think of it as like a mailbox, right?
A user posts a tweet and they put it
in their mailbox, and then we do a fan-out:
a service reads that letter in your mailbox,
copies it, and sends it out to a bunch of other people's mailboxes.
So every user that follows you gets it in their timeline,
which lets you do way fewer joins;
you're just inserting the tweet into each follower's timeline cache.
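A minimal sketch of that fan-out idea in Python (made-up names, with
in-memory dicts standing in for the real caches):

```python
from collections import defaultdict

# Who follows whom, and a per-user "mailbox" (timeline cache).
followers = {"patrick": ["spongebob", "squidward"]}
timelines = defaultdict(list)

def post_tweet(author: str, body: str) -> None:
    # Fan-out on write: one insert per follower. Cheap for most users,
    # very expensive for a 30-million-follower account.
    for follower in followers.get(author, []):
        timelines[follower].append((author, body))

def read_timeline(user: str) -> list:
    # Reading is now just a lookup — no joins at read time.
    return timelines[user]

post_tweet("patrick", "is mayonnaise an instrument?")
print(read_timeline("spongebob"))
```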
There are a bunch of benefits to this, but there is one downside.
The downside of this approach is that posting a tweet now requires
a lot of extra work. Some users might have like 30 million followers,
like a Justin Bieber or Kim Kardashian or something, right?
That means that a single tweet from one of these power users would result
in over 30 million writes to other mailboxes,
which is a lot, right. This approach does
work really well, but there are some downsides to it, right? And it's a lot
more complicated, especially when you're talking Twitter scale.
So what did Twitter do? They actually approached
this using both. So they used a hybrid approach.
Tweets continue to be fanned out for most users,
but for a small number of power users,
they still use the first method: those tweets
get pulled in at read time. So people go and read the message from
the person they're following, which is wild.
There's a little bit of both. I think that's kind of genius, though. And that
tends to be the approach with a lot of these systems. Right. You kind of,
like, see what happens. You try to approach it, and you have to try to
do some interesting massaging to get it to fit your system.
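Continuing the toy example (again, names made up for illustration),
the hybrid read path might look something like this: fanned-out
tweets come from the timeline cache, and power-user tweets get merged
in at read time.

```python
# Hybrid read path (illustrative only). Most tweets were fanned out
# into the cache at write time; power users' tweets are pulled here.
POWER_USERS = {"celebrity"}

timelines = {"spongebob": [("patrick", "is mayonnaise an instrument?")]}
recent_tweets = {"celebrity": [("celebrity", "hello, 30M followers")]}

def read_timeline_hybrid(user: str, following: list) -> list:
    timeline = list(timelines.get(user, []))      # fan-out results
    for followee in following:
        if followee in POWER_USERS:               # pulled at read time
            timeline.extend(recent_tweets.get(followee, []))
    return timeline

print(read_timeline_hybrid("spongebob", ["patrick", "celebrity"]))
```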
Okay, there we go. By the way, SingleStore also
does this too. We use a horizontally scalable
database to increase parallelism and concurrency,
which increases throughput for our databases. But there are
lots of ways you can do it. You can scale out horizontally, like SingleStore
does, or scale up vertically, which typical SQL databases do. I know you
can do it with a lot of other ones, but that's just buying a bigger
server to handle the increased load.
Yeah, those are kind of the ways systems handle load.
In my humble opinion, going with the node-based horizontal approach is
one of the most scalable and effective ways moving forward.
Okay, the last core tenet here is maintainability.
The majority of our software cost is actually not from
the development of the initial product; it's the ongoing maintenance. Us engineers
are very expensive, and it's important for us to
build systems that are maintainable long term.
That means there are three things we need to keep in mind here:
operability, simplicity, and evolvability of our systems.
So operability does not refer to performing
operations or surgery. Operability has to do with
the team that's responsible for the ongoing operation of
the code base that's running: tracking down
problems with the infrastructure or software,
anticipating future problems or scaling problems,
monitoring for anything that goes down in any of the systems,
performing complex maintenance tasks and security audits of
the system, a bunch of these things. Basically, good operability
means you're making routine tasks easy for your scalable
system. Okay, simplicity. This is about
managing complexity and making things easier. Obviously,
building complex systems is really, really hard. And the
more moving parts you have in a system, the harder it is to troubleshoot and
build. Thinking long term about your product,
you want to make sure that it is easy enough for new people to
come in and start using it. I've seen this a lot,
too. I feel like as engineers, we get really into something, we go crazy about it,
and then we realize it's hard to maintain. I know personally, for me,
I've favored low-maintenance, scalable solutions,
because I would rather be building new stuff than maintaining old stuff long term.
And focusing on simplifying our architecture allows
us to do that.
Again, I'm going to say this just one last time here, but with SingleStore, people
are replacing lots of different databases because it's much
simpler and more scalable to do that. If you have three different databases,
like an in-memory database, a NoSQL database,
and a SQL database like Postgres or whatever, you can simplify
those down into one data service that does all those things. That's going to make
your ongoing maintenance so much simpler. SingleStore can
do that, which is amazing. Check it out. And lastly,
evolvability. The only thing that's certain in software is change.
The only certainty is change. Something like that,
right? We want to make sure your systems are
built to evolve and change, which, I will say,
data makes hard. Data is sticky.
You can't predict everything that's going to happen,
but you can try your best to make it easy to change in
the future as new requirements come up in your system.
Okay, so quick recap here. If you're building out a
scalable, data-intensive application, you need
a database and system that scales with your usage. As it grows,
your data usage will grow. That's just how it goes. Data is sticky,
and unless you have a strong governance policy, it's probably
going to be staying there and growing. So you want to make sure your system
can grow for at least the next five years. Honestly,
make sure you're securing your data and you can ensure privacy
for your users. You need to make sure it can handle load today as
well as your anticipated load, again, within five years, which is hard to
predict. It needs to be capable of delivering
analytics. I would recommend making
sure it can handle real-time analytics, because if that's not
a current need, you should be anticipating it as a potential future
need for your system. You want to make sure that there's no noticeable
lag, especially for the end users of your system.
This is all admittedly a tall order, but SingleStore
can do all this and way, way more. It's a fast, unified database
system that's ACID-compliant,
all that good stuff. You should definitely go check that out. Okay,
so, questions? I'm in the comments if anyone wants to chat.
Otherwise, as a next step I would recommend checking out Designing
Data-Intensive Applications by Martin Kleppmann. I
also recommend setting up your own
project. I think the best way to learn something is to build it yourself.
I have some examples that I can share in the chat here:
building your own stock ticker or stock scraper,
or streaming in real-time Twitter data and doing analysis
on it with machine learning models. There are tons of huge data
sets you can play around with. The best way to do that is to find
a system that is free, with like a developer
tier, and try to build something. And I think go
is a great place to do that too. In fact, I've got a great shipping
logistics demo I'm going to share in the comments here.
Definitely learn by reading. There are a bunch of ways to do that here. And if
you want to get started with SingleStore and try it out, it's a great,
free, easy way to do that. We have a managed service. You get $500
in free credits today, no credit card needed. You can just go try it out.
It's amazing. Go to singlestore.com/managed-service-trial.
Okay. And here is some additional reading if you want it as well,
too. I'm going to flash this up. Great.
And thank you so much, everybody. This has been an absolute
blast. I am so honored to be here.
You all are amazing. Again, my name is Joe Karlsson. I'm a software engineer and
I work at SingleStore. If you want to follow me, check me out at
@JoeKarlsson1 on Twitter. That's in the lower right hand corner. If you want
all the other links, you can check out joekarlsson.dev/links.
All right, I'm heading out. Thank you so much.
Talk to you later.