Transcript
This transcript was autogenerated. To make changes, submit a PR.
Welcome to MongoDB schema design best practices.
Let's jump in here.
Oh, okay. Just got to click. There we go. Okay, so first
of all, why is schema design so important when you're working on a
database? Well, did you know that it's one of the most
critical parts about improving performance and stability of any database?
This is especially true, though, for MongoDB and just
a personal opinion, it's one of the things that people get wrong
the most when they're setting up a MongoDB database. All right,
cool. Well, we're going to jump more into that. Don't worry.
So, my name is Joe Carlson. I work for a
company called MongoDB and I'm a developer advocate and
software engineer. If you're at all interested in hanging out with me at
all ever again, you can totally hit me up on here. Twitter is the best
place to get a hold of me. And if you
want any of the links, video,
resources, slides, anything, you can find that at that
link below. Or anytime you see a QR code that will take
you to a page that shows you all of the resources for this talk.
And, oh, I didn't even say this here. So also, opinions are
my own. If I say anything weird, just know that I'm going to be putting
lots of my own personal opinions in this talk.
All right, so what are we going to be talking about today? Well, first of
all, we're going to be talking about traditional relational
SQL type databases and comparing those to MongoDB
databases and particularly from a schema
design approach. Next thing we'll be discussing today is
embedding versus referencing. What are the differences? And these are the two key
ways of organizing a schema with a MongoDB database.
And lastly, we're going to be discussing lots of different types of database
relationships and how to model those in MongoDB.
We're not going to cover all of them today, but this is an introductory
course to designing a schema. All right,
cool. So first off, relational versus
MongoDB schema design.
The thing I see the most when people
are coming from an SQL background to MongoDB is they
are designing their MongoDB schemas like they would with their traditional SQL
database. As a developer advocate, I see this
all the time, and a lot of times
I hear, hey Joe, why is my schema performing badly or my database is
getting slow? It's because they're designing their schemas in
the old way or in the SQL way.
Right. And that can lead to some performance issues in the future.
This is my reaction when anyone comes to me
with this type of database. Okay, so relational
schema design, what does that look like? So if
you're in charge of an SQL database and you're designing a schema for that relational
database, what you're going to be doing is designing
your schema independently of the queries you are going to
be making with the application using that data set or database.
The question most devs ask themselves when designing a schema for
a relational database is, what data do
I have? Did that go in there?
Okay, so typically we have a very prescribed
approach to doing that, and that's called normalization with a legacy SQL
database. And typically we normalize to the third normal
form. I think there's five forms. Four or five.
But traditionally most developers are normalizing to the third normal form.
So you don't have to know about normalization to understand this talk.
But I just want to tell you that normalization means you're trying to dedup
your data as much as possible by splitting it up into separate tables.
Cool. So with a relational database, this is a
typical user type normalization,
right? You have like a user table and
you might have other tables saving data, and you're linking that data
together using foreign keys you see there in the professions and
cars table. Both of those are using a user id foreign key
to match that to the user.
Okay, so those are the basics of SQL or
relational database schema design.
If you've never done it before, you're probably at least vaguely familiar that that's what
that kind of looks like, right? Rows and columns linking together with foreign keys.
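
A minimal sketch of the normalized layout described above, using Python's built-in sqlite3 module. The table names, column names, and sample rows are assumptions for illustration; the talk only mentions a user table plus professions and cars tables linked by a user id foreign key.

```python
import sqlite3

# In-memory database; all table names, column names, and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, surname TEXT);
    CREATE TABLE professions (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),  -- foreign key back to users
        profession TEXT
    );
    CREATE TABLE cars (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),  -- foreign key back to users
        model TEXT,
        year INTEGER
    );
""")
conn.execute("INSERT INTO users VALUES (1, 'Paul', 'Miller')")
conn.executemany(
    "INSERT INTO professions (user_id, profession) VALUES (?, ?)",
    [(1, "banking"), (1, "finance"), (1, "trader")],
)
conn.executemany(
    "INSERT INTO cars (user_id, model, year) VALUES (?, ?, ?)",
    [(1, "Bentley", 1973), (1, "Rolls Royce", 1965)],
)


def professions_for(user_id):
    """Join users to professions through the user_id foreign key."""
    rows = conn.execute(
        "SELECT p.profession FROM users u "
        "JOIN professions p ON p.user_id = u.id "
        "WHERE u.id = ? ORDER BY p.id",
        (user_id,),
    ).fetchall()
    return [row[0] for row in rows]


def cars_for(user_id):
    """Join users to cars through the user_id foreign key."""
    return conn.execute(
        "SELECT c.model, c.year FROM users u "
        "JOIN cars c ON c.user_id = u.id "
        "WHERE u.id = ? ORDER BY c.id",
        (user_id,),
    ).fetchall()
```

Reading one user's full record here takes a join per related table, which is the cost the rest of the talk compares against.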
Okay, great. So MongoDB
schema design, how do we do that?
Well, first of all, there's a couple of
things I should tell you. There's no rules to it.
And if you're used to relational databases,
that's going to be kind of tricky. There's no process and
there's no pre prescribed algorithm for how to split
up that data. Holy cow.
This can be freeing and kind of scary because
there's no rules here. There's no rules. We got no rules in
this house.
So if there's no rules, what do we do? Well,
schema design for MongoDB
is based on the needs of your application.
So instead of asking what data we have, we're asking how do we
want to use this data? And there's
a couple of things that we are mostly concerned about when we're designing our
MongoDB schema. The first thing is we're concerned with how
to store that data. Duh. Right. We're also concerned with the performance.
Right. We want to make sure that we're querying and updating and
maintaining the correct amount of performance that we need for our application.
And we also want to make sure that we're not using a ridiculous amount of hardware
and spending way too much money. Right. No one wants to spend too much money.
Our bosses don't, we don't. My side projects don't.
Right. We want to try to minimize costs
and optimize for performance.
So let's say we have the same user table and
we want to model the same exact data set in MongoDB.
How would we do something like that? Well, if we were going to do this
with MongoDB, we would of course be using MongoDB documents, and
for a lot of this stuff we would just be using keys
and values, right.
We just save that stuff as key value pairs, first name, last name, surname,
cell, whatever, location, whatever. Right. We're just saving as key value pairs.
But those other tables were saving more data. We might need to do
something a little bit differently. Right, because key values are one to one.
But if we have that professions and cars data, we need
to keep track of, and the user can have multiple professions and multiple cars,
we need to model that data a little bit differently. And of course,
MongoDB documents, we can save that like we would with any JSON document
with any nesting, with keys or with objects or with
arrays. So if we had our professions table and
a user could have two or more professions,
I'd probably just save that as an array, right? We could just have an array
of professions we could embed there. So that would show that a user would have
more than one profession. Or cars, right? Cars have
multiple pieces of data. We have the model and year we also need to keep track
of, so it makes sense to have an array of objects to track that data with.
My indentation got a little bit weird there, but you know what I'm saying.
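
As a rough sketch, the single document described above could look like the following, written as a plain Python dict so it runs without a database. The field names and values are assumptions for illustration; only the overall shape (key value pairs, an array of professions, an array of car objects with model and year) comes from the talk.

```python
import json

# One user document: simple fields are key value pairs, professions is an
# embedded array, and cars is an embedded array of objects.
# All names and values here are hypothetical sample data.
user = {
    "first_name": "Paul",
    "surname": "Miller",
    "cell": "447557505611",
    "city": "London",
    "professions": ["banking", "finance", "trader"],
    "cars": [
        {"model": "Bentley", "year": 1973},
        {"model": "Rolls Royce", "year": 1965},
    ],
}

# Everything about the user comes back in one read; no joins are needed.
print(json.dumps(user, indent=2))
```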
Okay, cool. So not too bad. Let's do a quick recap here.
So first thing we started talking about was starting from a base
of relational schema design. I think a lot of us are coming from that way
and trying to understand how to do that. I think it's a helpful place to
start. So we discussed relational schema design, and we
discussed how we traditionally model our data using
a normalization, and we're modeling our data independent of the queries we
need to actually be making. We are then also normalizing
to the third normal form. Right. We talked about the rows and columns we have.
And then we discussed MongoDB schema design.
Remember, there's no rules, no process, and no pre prescribed algorithm
for how to actually do that. The things that we're most concerned with when we're
designing our schema is how we're actually saving that data and query performance.
And of course we don't want to use too much hardware.
The most important thing you can ever remember when you're designing
a schema is you're going to be designing
your schema based on the needs of your application. And every application is
different and uses data differently. So we're looking at exactly what
your database needs and how your application is going to be using that data
and modeling our data to optimize for query performance.
And the two ways that we do that are through embedding
and referencing in MongoDB. So let's discuss what
each of them are and then we'll be discussing when to use each of them.
So embedding, of course,
refers to actually embedding that data within our
object, right? We can deeply nest arrays,
objects, keys, whatever in any structure that makes sense for us, but we can
embed that directly in the document.
Referencing, you might recall, is similar
to a join that we make with foreign keys on
a legacy SQL relational database.
So we're not embedding that data directly in there, we're actually referencing
based on keys to make queries that pull together from separate documents
or collections. So embedding,
why would we want to use embedding? Well, if we're
able to embed that document, all that data in a single document, we can
get that with a single query. If all the data we need is in one
place, we don't have to do any joins.
Joins, if you do not know, are very expensive.
It's a blocking operation. They tend to be time consuming, and what
the database is doing when you're making
a join in a legacy SQL database is bringing
all that data together in memory and then doing a filter or search
on that data once it's been joined in memory. This is time consuming,
expensive, and if you have huge data sets, this can use up an enormous
amount of energy or like computing power in order to get this data.
Also, by default, update operations
in MongoDB are atomic if you're updating a single document.
You can also have atomic, ACID-compliant transactions
for multiple documents, which we'll discuss later as well. Okay,
so embedding, what are some cons? Well,
if you're embedding all of your data within a document, that could be a lot
of overhead. And sending lots of data over the wire every single time
could be overkill. So the question you should be asking yourself is, do I
actually need all this data to be embedded within my document or not? And if
not, you might need to actually reference, which we'll talk about in a second.
Also, there is a 16 megabyte document limit
per MongoDB document, right? So you cannot
exceed 16 megabytes per document in MongoDB.
And again, if you're getting that close, probably time to start thinking about maybe referencing.
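
A rough way to keep an eye on that limit is sketched below. It uses the JSON byte length as a stand-in for the real BSON size, which is an assumption: the two encodings differ, so treat the number as an estimate only. The function names and the 50% warning threshold are hypothetical.

```python
import json

MAX_DOCUMENT_BYTES = 16 * 1024 * 1024  # the 16 megabyte per-document limit


def approx_size_bytes(document):
    """Estimate document size from its JSON encoding (not the exact BSON size)."""
    return len(json.dumps(document).encode("utf-8"))


def near_limit(document, threshold=0.5):
    """Return True when the estimate reaches the given fraction of the limit."""
    return approx_size_bytes(document) >= threshold * MAX_DOCUMENT_BYTES


small = {"name": "Paul", "professions": ["banking", "finance"]}
print(approx_size_bytes(small), near_limit(small))
```

A document that trips a check like this is a candidate for splitting part of its data out and referencing it.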
I get asked that all the time, like, hey Joe, my documents are huge.
What do I do? It's like, well, that's a code smell in MongoDB. We probably
want to look at how we can split that data up and reference it in
other documents. Okay, so what about referencing,
right? Just like you can do a join using foreign keys,
you can do the same thing. In MongoDB. Traditionally we use
a unique identifier or the object id, and we can
do queries that do joins for us, right? Just like you can
with a join in a SQL statement. We're splitting
that data up into separate documents and doing joins on our queries, updates or
whatever our crud operations, right? So why would we want to
use referencing over embedding? Of course we
can start splitting up our documents and making those smaller. If you're hitting that 16
megabyte limit, again, probably want to start referencing.
And just like you would with a dedupe
in normalization with a relational database,
you're going to be deduping your data or reducing duplication.
That's not to say that duplication is an anti-pattern,
either in SQL or MongoDB databases. Don't be afraid of
duplicating your data. A common way to increase
performance with an SQL database is to denormalize,
which means consolidating the data you would otherwise join
into a single table or collection. So you
don't have to be doing those expensive joins. It makes querying actually much faster.
So it's not even a problem in SQL, right? In either of
them, don't be afraid of duplicating your data.
Also, for example, if you have data that you're not accessing often or you don't
need every single time you query that document, probably want to
yank that off into a separate document and reference that with a reference
id. Okay, so we got some cons now,
right? So if you're referencing and you need to get that data. You will have
to do joins or lookups in order to retrieve all that data, which can slow
down your query performance.
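
A minimal sketch of that trade-off, with two collections simulated as Python dicts keyed by id so it runs without a database. The collection names, field names, and ids are hypothetical.

```python
# The user document stays small and only stores the ids of its cars.
# Both "collections" are plain dicts keyed by _id (an assumption for the sketch).
users = {
    "u1": {"_id": "u1", "first_name": "Paul", "car_ids": ["c1", "c2"]},
}
cars = {
    "c1": {"_id": "c1", "model": "Bentley", "year": 1973},
    "c2": {"_id": "c2", "model": "Rolls Royce", "year": 1965},
}


def get_user(user_id):
    """One lookup: cheap, and usually all the application needs."""
    return users[user_id]


def get_user_with_cars(user_id):
    """Two steps: fetch the user, then resolve each referenced car."""
    user = users[user_id]
    owned = [cars[car_id] for car_id in user["car_ids"]]  # the extra lookups
    return {**user, "cars": owned}
```

The second function is the extra work referencing costs you; the first is what you save when the car data is not needed.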
Okay, cool. So embedding and referencing, the two key ways of designing
your schema, that's the building blocks. And the question you should be asking yourself is,
should I embed this or should I reference it? So embedding
is where you embed that data directly in your document.
You do not need to have object id references to look up that data.
All the data you need is in one place and
makes it easy to look it up with a single query. You don't have to
do any expensive joins and it updates everything with a single
atomic operation. All right. On the flip side,
it also can cause problems if you have large documents.
And if you're hitting that 16 megabyte document limit, you need to make sure
we're handling that, in which case you want to probably reference
that with an object id. You're going to have smaller documents.
You're less likely to hit that 16 megabyte limit, which we'll talk about. Maybe you
could still, right? We're deduping our data, which is
not an anti pattern, but it is a consideration to make. And we
don't have to have all the data every single time, right? If you don't need
that data, let's not query it every single time. That's a waste of computing
power, space, data transfer, over the
wire, et cetera, et cetera.
Okay, but if you are referencing, just note that you will have to be doing
separate lookups in order to get that data, which is a consideration to make when
you're considering performance of your database
operations. Okay, so let's look at
types of relationships you would see when designing a schema.
I think it's helpful for us to start in a place where we are designing
relationships like we would with a legacy SQL database, which again,
I think a lot of us are. If you're not, great, you don't have all
that baggage, which is great. But it's still important to understand how
these relationships work and how we can model them
in MongoDB. We're going to start with the
simplest relationships and move our way up to more complicated and, in my
opinion, more interesting schemas. So the first one we
need to be aware of is one to one, right?
DJ Khaled would definitely be a big fan of this one, right? We're just adding
one single new piece of data to our
document. One to one is really easy for us
to create, right?
It's just key value pairs, right? Nothing too complicated.
If you have a single piece of data and a single
option for that, using a one to one relationship with
the key value pair is the way to go. Right.
Just use key value pairs. Okay. So that one's
pretty easy. But let's get to some more interesting stuff here. Right? We already previewed
this one too. But the one to few, right? One to few would be
modeled by perhaps doing an array of
some sort of documents in there. All right, we have our
data. One to few means that you're not going to have
mass amounts of them. So someone could have a couple of addresses, but someone probably
doesn't have 16 megabytes of addresses, which might be like
millions and millions of items
in there, right? So those are pretty safe to put in there because
it's probably not going to max anything out. So we're just going to embed that
right in there. We're going to prefer it, right. There could be cases
where you don't want to do that, especially if you're not going to be referencing
that very often, or don't need it on every single call. But prefer embedding
if you have only a couple of things you need to be tracking in a
document. Like, more than one.
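
A small sketch of one to few with embedding, using addresses as in the talk. The document is a plain Python dict, and every field name and value is a hypothetical example.

```python
# One user with a handful of addresses embedded directly in the document.
# The array is small and bounded, so embedding is the default choice.
user = {
    "_id": "u1",
    "name": "Kate Example",
    "addresses": [
        {"street": "123 First St", "city": "Springfield"},
        {"street": "456 Second Ave", "city": "Shelbyville"},
    ],
}


def cities(user_doc):
    """Read the embedded addresses straight off the user document."""
    return [address["city"] for address in user_doc["addresses"]]


print(cities(user))
```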
I told you there were no rules before. I understand the
irony of this, but I have a couple of personal rules
I follow when designing a MongoDB schema for a data
set or for an application. But I favor embedding.
That's my first go to thing, unless I can articulate
a reason why I do not want to embed it. So just
your go to should be embedding. But if you say like, I don't need this
every time, cool, that's a compelling reason. Or this is too huge,
cool, that's a compelling reason. Let's pull that out and reference it.
But prefer embedding if possible.
Rule two, needing to access an object on
its own is a compelling reason not to embed it.
So for example, take addresses. If another piece of your application is just going to
be using those addresses, even though that's a one to few, you might want
to just split that off. That might be a good call for your application.
That could increase performance and it could decrease the amount
of data being transferred over for that separate query on a different part of
your application. Okay, so let's
move on up to one to many relationships.
So let's say, for example, you're designing an application for
a product and your product that you're keeping
track of has lots of parts. So you want to keep track of what
these products are and all the separate parts and components that make up this
product that we're designing. Maybe we're manufacturing, maybe we're an ecommerce company,
maybe we're doing support and we need to understand the parts that we're supporting.
Right. Whatever it is. But we have products, and products are made up of lots
of parts, and we need to figure out how to design a schema for these
things. So we have one product that has many parts
to it. You see where I'm going with this?
And potentially that bicycle could have thousands of parts,
or if it's more complicated, like a car or a
tractor or Xbox or something. Right. These could have
thousands and thousands and thousands of parts. That's a many type relationship.
So this is the first one we're going to actually start considering doing
a reference. So we have our one product,
and the product is made up of many parts. In order to reduce
the amount of data
that's being tracked in our product, what I'm doing is referencing
each of those parts, but I'm still tracking each part's id within the
product. So the part can have lots of different things
in there. Right. We have quantity, price, cost, name, product number,
et cetera, et cetera. We probably don't need that in there. Also, if you're designing
an application, let's say you're designing an ecommerce site and you're keeping
track of your products and all the parts that make up that, chances are
you're going to be using that product information much more often than
the parts. Like, if someone needed support or wanted
more information about what made up this product on your ecommerce store, you can
make a separate query to go get that, but you probably don't need that every
single time. So we've made a call here that we don't actually need the parts
all that often, and we're worried about hitting that 16
megabyte limit. So we want to start splitting that up to kind of help
us mitigate the risk of going over that limit.
But it still works, right? We're matching the ids
from the product to the parts, and we can do that many, many times easily
and making queries and crud operations for all of that.
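
The product and parts layout described above can be sketched like this, again with collections simulated as Python dicts so the example is self-contained. Collection names, field names, ids, and values are all assumptions for illustration.

```python
# The product keeps only an array of part ids; the part details live in
# their own documents. Both "collections" are dicts keyed by _id.
products = {
    "prod1": {
        "_id": "prod1",
        "name": "bicycle",
        "price": 250,
        "parts": ["part1", "part2", "part3"],  # references, not embedded parts
    },
}
parts = {
    "part1": {"_id": "part1", "name": "front wheel", "qty": 1, "cost": 30},
    "part2": {"_id": "part2", "name": "rear wheel", "qty": 1, "cost": 32},
    "part3": {"_id": "part3", "name": "spoke", "qty": 64, "cost": 1},
}


def get_product(product_id):
    """The common query: product info only, no part details fetched."""
    return products[product_id]


def get_parts(product_id):
    """The rarer query: resolve the part ids into full part documents."""
    return [parts[part_id] for part_id in products[product_id]["parts"]]
```

The product page only ever calls the first function; the second runs when someone actually asks about the parts.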
We're going to prefer referencing for this instead of embedding,
especially if you're hitting thousands and thousands and thousands of sub things
in there. Right? That'd be the many. Right. Okay,
so rule three, I want you to avoid joins
and lookups if you can, but joins
are not an anti-pattern. And if you get a better
schema because it includes a reference,
go for it. Right. I prefer embedding, but if I can justify a
reason to split it up and use joins and lookups,
great, go for it. That's a great use case for your
application. It depends entirely on what you are building.
Now we're getting to some fun stuff here. One to squillions. So before with one
to many, we're talking like one to maybe a couple thousand
subparts in there and we're going to be referencing. But one to squillions is on
another level. So let's imagine you're building an application for logs,
right? And I'm not talking about timber
logs here. I'm talking about like you are being tasked with
building a log system for your server farm. And your server
farm, if it's blowing up, might be generating thousands or
potentially millions and millions and millions, especially over time. Right.
Depends how verbose your logging system you're designing is.
But you could have squillions.
I know it's a made up word. I don't even know where that word comes
from. But you could have potentially squillions of log
files for this burning server farm you're building. So how
do we build that? Because the problem we have
here is that we could keep track of an array of object ids like we
did with one to many. But an array growing
at an unbounded size, even if it's only tracking object ids,
could potentially run past your 16 megabyte limit, especially if
you're leaving your log system on for six months without clearing up that data.
That could be a problem. So how do we mitigate that risk?
Well, we have a one to squillions relationship you could be developing.
So we have a single host file, which would be like a single server instance
on your server farm. And we're using,
instead of keeping track of all the log files in an array in that
host, what we're doing is keeping track of the host in the
log. Right. You see what we're doing here? Each of the logs
here, it keeps
track of the host id in the log file. That way we don't have
to worry about running that unbounded array out to our 16
megabyte limit in our host object.
We don't have to worry about that. Right. We're doing a reverse reference
and we're keeping track of the one host in
the log message. We can start doing queries just on if
you need to get metadata about the host, or we can just do queries on
that object ids and group them. Right. We could do all that stuff,
no problem. But we're tracking
the host's
object id in each of the log messages.
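
A sketch of that reverse reference, with the host and log collections simulated in plain Python. The names, ids, and messages are hypothetical; the point is only that each log carries the host id and the host document holds no array of logs.

```python
# The host document holds metadata only. It never grows as logs arrive.
hosts = {
    "host1": {"_id": "host1", "name": "web-01.example.com", "ip": "10.0.0.11"},
}

# Each log message is its own document and points back at its host.
logs = [
    {"_id": "log1", "host": "host1", "message": "CPU is on fire"},
    {"_id": "log2", "host": "host1", "message": "disk almost full"},
    {"_id": "log3", "host": "host2", "message": "service restarted"},
]


def add_log(host_id, message):
    """Adding a log touches only the logs collection, not the host document."""
    logs.append({"_id": f"log{len(logs) + 1}", "host": host_id, "message": message})


def logs_for_host(host_id):
    """Find a host's logs by filtering on the reference stored in each log."""
    return [entry["message"] for entry in logs if entry["host"] == host_id]
```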
We're going to obviously prefer referencing here because this could be
growing massively. So rule four, and I think this
is honestly one of the most important ones you can have or one of the
most important rules. But if you have an array that
you think could be growing at an unbounded
size without any stop gaps, that is an anti pattern,
that is a code smell, you should probably try
to avoid that. So anytime you have an array that is growing at
unbounded size, like a log file, you want to be
referencing. And I prefer doing the one to squillions.
Right. So you want to make sure none of those arrays are ever going to
be growing unbounded.
Okay, cool.
This is the last
relationship we'll be discussing here in detail,
but I want to discuss a many to many relationship.
So follow me if you will, dear listener.
But let's say you are hypothetically designing a to do list,
right? And a to do item can have many
users and a user can have many to do items.
Let's say we're designing like a kanban type board and multiple people can
be working on a single item on that board at a single time.
How do we do this? Well,
this is where we use a many to many approach. So we have
users and we have tasks that we're keeping track of in our
system. So a user obviously can have an
array of tasks that they are responsible for
and these tasks would be correlated with different
task documents in there. So we're doing a reference, we're doing an array of
references to our tasks and we would also need
another that didn't show up on there. Oh, well,
let's see here. I'm just going to show you the point here. And then also
the tasks then have an array of owners that they're also
keeping track of. So with a many to many type relationship,
what you're doing is storing that many to many relationship in
each of the referenced documents, pointing at each other, so
that each side can have a many to many relationship.
So these tasks, for example, they only have one owner
and we're keeping track of that owner using that reference id.
But it is possible in the system to have many owners
responsible for a single task.
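
A minimal sketch of the two-way references, with users and tasks simulated as Python dicts. All names and ids are hypothetical, and the assign helper is an assumption about how an application might keep both sides in sync.

```python
# Users hold an array of task ids; tasks hold an array of owner ids.
users = {
    "u1": {"_id": "u1", "name": "Kate", "tasks": ["t1", "t2"]},
    "u2": {"_id": "u2", "name": "Sam", "tasks": ["t1"]},
}
tasks = {
    "t1": {"_id": "t1", "description": "Write blog post", "owners": ["u1", "u2"]},
    "t2": {"_id": "t2", "description": "Review slides", "owners": ["u1"]},
}


def assign(user_id, task_id):
    """Record the relationship on both sides so either can be queried."""
    if task_id not in users[user_id]["tasks"]:
        users[user_id]["tasks"].append(task_id)
    if user_id not in tasks[task_id]["owners"]:
        tasks[task_id]["owners"].append(user_id)


def owners_of(task_id):
    """Resolve a task's owner ids into user names."""
    return [users[user_id]["name"] for user_id in tasks[task_id]["owners"]]
```

Because the relationship is written in two places, the application has to update both together, which is the price of being able to query from either side.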
So this one's really important
too. How you model your data depends entirely on your application's needs,
right? There's no rules, but every application has separate needs.
And how you're using that data dictates how you're going to
be designing the schema. Right? So what may
work for someone else's project may not work for yours.
And it's up to you to make sure that you're
taking into consideration your performance needs for that application and
how you're going to be using that data. Okay, so let's go
through it. Types of relationships. The first one we went through today was one to
one, right? We're just preferring using embedding or
using key value pairs. Right. Or embedding that data right in the document. Piece of
cake. Not a problem. One to few, right? Easy.
We're just going to be embedding that data within our document,
right. We have a subarray with some data in there. It's not going to be
growing at a huge rate. You only have a couple of items in there.
Just throw it in there.
There we go. Computer is going slow here. And then
we have the one to many relationship, right? We're having a computer
pause. You would have your one
product keeping track of potentially thousands of subparts
on there. And we're doing that using, referencing and referencing
that object id of the subpart in another document. And we're embedding that
object id of that subpart into a parts array in our
one product, one to many. And then we discussed
one to squillions, which was really fun, right?
And we gave the example of our logging system. So we have
a single host document keeping track of all the metadata
for that server. And then each log file
is a separate document and the document keeps track of the host's
object id in it. So we don't have to worry about having an unbounded array
appear for us in that host object.
And then we had the many to many example, when I gave the example of
writing a to do list where multiple users can keep track of multiple items
or to do tasks and a task can have multiple owners. Right? So you can
have many things to do and many people can own a single task.
And we do that by doing arrays of sub reference
ids to all the tasks in the user object. And then in each of
the task objects, we're keeping track of all of the owners who
own that object in an embedded array.
All right? And then we had some rules here today too. So first thing is
favor embedding. That's your go to thing, unless you have a
compelling reason not to embed that data.
And basically everything here is about deciding when you want to
stop embedding that data, right? And needing to access that data on
its own is a compelling reason to not embed it
or to reference it in a separate document. You want to avoid joins and lookups
if they can be avoided, but it is not an anti pattern
to use them, right? It's not a bad thing to do joins and lookups in
MongoDB if you have a compelling reason to use them.
It's just in my experience, most people coming from a normalization approach just
by default will split it up in order to normalize their data like
they did in SQL. But the very nature of
a MongoDB document allows us to be more creative and do more
interesting things with our data set, including embedding it. Let's take advantage of it.
This is a unique way to save our data. We might as well use it
and take advantage of it, right? Note your array
should not grow without bound. If you have an unbounded array
anywhere in your data set, let's get rid of
that. That's a huge code smell
for me. And lastly, and most importantly, how you model your
data depends entirely on the needs of your unique application, right?
Everyone has different needs for the applications, but these are the most
important. Okay, here, home stretch here,
let's go through what we just talked about. One last recap.
So the first thing we discussed today was relational database
design versus MongoDB schema design. With a relational schema
design, what we're doing is modeling our data independently
of our queries. And typically we do that through normalization to
the third form and splitting that data up using foreign keys into separate
tables and columns and data sets. Right? And we're doing joins on
those foreign keys to bring that data together with MongoDB
schema design. No rules, no process,
no algorithms we can follow. We're just worried
about how we're saving that data and query performance based
on the needs of our application.
Okay, so there's two key ways of us designing
the schema for application. That's either embedding that data directly in the
document or referencing it with an object id. So if you embed it, obviously you're
just sticking that data directly in that document, right? That'd be the
equivalent of the result of a join in a relational database.
And you get all that data
in one place: single query, super fast, atomic operations.
But you have to be aware of data growing massively
and growing out of the bounds of that document or getting too much data
that you do not actually need. Right. And if that's the case, then you want
to make sure you're referencing that data using object ids and
using joins and lookups. So you get smaller documents,
you're deduping your data, which again is not a code smell, and
you're reducing the size of the
data going over so you're not over fetching any data for your users or for
your app, which would slow it down. But you do have to be aware that
you are going to be making queries and lookups which can slow down
and decrease performance of your application. Okay,
then the next thing we did was we discussed a bunch of SQL type
relationships that we can also use. So we have the one to one. Awesome.
We're going to use key value pairs just to keep track of all that data
together if you have a one to few. So that's just like
a subarray. We're just going to embed that in the data set. If it's not
too huge, just embed it. One to many, right? This is where
it starts growing large. And we gave the example of that product
having many subparts and we're using referencing on those
object ids to make sure that we're not going to be getting too big.
And for the needs of this application, we didn't actually need all that part data
every single time. It may only be a unique part of
our application. It may be overkill to get that every single time.
And then we gave the example of one to squillions with our log files and
many to many with our to do lists.
And we'll skip through this really fast here, but I just want to
point out this last part again,
if you're going to take anything away from this talk. What I want you to
remember is when you're modeling a schema for
MongoDB databases and for your application, just know that every database
schema design is different and it depends on the unique needs of your application.
Consider how you're going to be querying that data or using it,
and you need to figure out what
performance needs you need for your application and you're going to be designing
your schema based on those requirements. Then that's it, right? There's no pre prescribed
approach. Every need is different.
Okay, questions? I'm in the chat too. If you have any questions
and what's next? So if I've inspired you at all in this
talk and you want to get involved or learn more,
you should know that we have the MongoDB University, which you can totally check out
there. We have the MongoDB developer hub, developer.mongodb.com.
It's a place where you find amazing blog posts, articles, cool to
do things, examples,
getting started guides, quick start guide. It's amazing. And if you
want to take advantage of our MongoDB University, I'd recommend,
if you haven't started, the M001 or like intro to MongoDB
course. And if you want to learn more about MongoDB schema design,
because we just scratched the surface here today, I recommend
taking the M320 course on data modeling. It's a great place
to learn more about database design. Lastly too,
if I've inspired you all to want to get better at schema design or MongoDB
or just be a better developer, what I'd recommend is just getting out there and
doing it right. Just on your next project,
spin up a MongoDB database and just use it on your backend.
But practice it. Going to these talks is a good
way to learn if this is something you're interested in learning more about.
But in order to fully
grok a new piece of tech, I think the best way to do that is
to actually build it.
Figure out what errors are going to come up. Like just do it.
There's tons of resources for you from MongoDB, but get out
there and build something. Build something just for you, right? But try to use it.
And if you want $100 in
free MongoDB credits, use code Joe K 100.
Or you can scan that little QR code for a bunch of free credits.
If you want to work for MongoDB, we're always hiring, baby. We're always hiring.
Check out the MongoDB careers. That's careers mongodb.com
or go to Joecarlson dev Mongodbcareers.
Here's a bunch of resources I'd also recommend checking out again,
my name is Joe Carlson. I work for MongoDB and it has been
a pleasure chatting with all y'all today.
If you want to hang out with me ever again, best place to
do that is on Twitter. I also make dumb jokes on TikTok and I stream
on the MongoDB Twitch stream every Friday at noon eastern time.
Thank you so much everyone. You're the best. Ooh, I love you.
You're great. Oh so good. See you next time.