Transcript
Hi, viewers. I'm Santosh, a software engineering lead at ByteDance in the San Francisco Bay Area. Today I'll be talking about some of the architectural best practices for large-scale data systems. So what are these large-scale data systems? They are things we use on a daily basis, like Gmail, Yelp, Eventbrite, Booking.com, digital wallets, e-commerce systems, and Google Docs. I'll be going over some trade-offs you need to consider when making choices while architecting these large-scale data systems.
So firstly, to begin with, we'll go over the storage architecture. Now, what is storage architecture? Let's take an example. I mentioned a user, or a lot of users, accessing Gmail or Yelp or Booking.com or any of these services or websites. The request goes through a load balancer, then an HTTP API call to a gateway, then the microservices. All of these fun things happen in the backend system. What is ultimately happening is that all the data is written to some kind of place, and the data is also read from there when you have to retrieve it on the GET API calls. That particular place where all the data is stored, the blue box in the diagram I'm showing, is the storage layer, and the storage layer has different types of databases. In the current world we have been using a lot of NoSQL and MySQL. Whether data moves from MySQL to NoSQL, or is written directly into a NoSQL store like Cassandra or Redis, is a totally different topic which I won't be covering here.
MySQL is a typical relational database management system where the data is written. Now, for this storage layer we already have a lot of options in the current world; I just gave some examples. You don't need to implement the storage system from scratch. All we need to know is some technical detail of how the data in the storage system is saved and processed, so that we, as software architects, are able to make the right decisions when building a data-intensive system. So there are two things: select the appropriate storage engine based on the trade-offs, and then fine-tune that storage engine to perform well for our system's workload. For these two needs, we need to understand the underlying implementation of this storage layer.
Now let's look at some options. Ultimately, the storage layer I just talked about has some way of storing the data in the form of data structures. So when you say MySQL or Cassandra or any database, ultimately that database is storing your data, say your Gmail, or the restaurants in Yelp, or the Google Maps data, the places and all those things, in some data structure in the backend of these storage systems. That's what we'll be focusing on: understanding the different types, because which storage you should pick depends heavily on the backend data structure implementation.
Now let's look at some examples. Firstly, B-trees. This is a very classic data structure where you have a parent and then children branching off the parent; I'll be going into each and every data structure in the upcoming sections. Most relational database management systems use B-trees, and some examples of backend systems built on B-tree-based databases are Gmail, Yahoo Mail, or any mail system. There are other services too, but I picked email because it's familiar to all of us.
The next ones are quadtrees. Say we have services like Yelp, finding the nearby restaurants, or nearby places on Google Maps where you can just look around, or Booking.com, where you select a region and then look for all the hotels or anything you want to book. At the backend you are getting some data to be viewed, right? So where is that data coming from? That data is coming from a data structure called a quadtree. Why quadtrees are used here, and which data structure you should pick if you're building a new system altogether, is something that should be running through your mind while I'm giving these details.
Now, these are very interesting. The third type of data structure is the LSM tree. LSM trees are used for write-intensive applications because it's a very unique data structure where the data is first written into some temporary cache and then flushed to secondary storage like a disk. For example, RocksDB is a database that implements log-structured merge trees in the backend, and I'll be going into the details in the upcoming slides. A digital wallet is one service that uses them, because there are heavy writes. And of course, the Hadoop Distributed File System and Kafka also have write-intensive operations, so they use LSM trees in the backend storage.
The inverted index is another concept, although not strictly a data structure, that you need to consider when selecting and configuring these storage systems. In search engines like google.com or amazon.com, when you search for certain items, you immediately get results, right? In the backend, it is the inverted index that is doing the magic. So whenever you build any search engine, the inverted index is something very crucial that needs to be incorporated into your system design. Now let's get into the details one by one. Like I said, firstly, the B-trees.
As you can see in the picture, this is how the B-tree is structured. At the top you have some numbers, and anything to the left of a number is a node containing values less than that number: here, less than 100 are 48, 50, and 79. Between 100 and 155 we have the second node at the second level, with 128 and 140. Like that, anything greater lies on the right side of a node's value, anything less lies on the left side, and anything in between sits between the two: here at 155, on the left you have values less than 155 and on the right values greater than 155. This way the data is hierarchically well organized, which means it's efficient for lookups. Say you want to look for emails in a particular time range, like March 10 to March 15. That data lies in only one part of the B-tree, so you don't have to search through the entire database when you query.
It also doesn't do a lot of seeking. If you think about it, the data is ultimately stored in some kind of secondary storage, and this B-tree data structure enables us to use something like a disk. Why a disk? Because disk-based storage is always cheaper than SSDs. But disk I/O operations really add a lot of latency. To avoid that, the way the B-tree organizes the data in storage enables the RDBMS to search down the right subtree and prune the irrelevant branches, so you can go straight to a specific node and there is not much I/O happening, which means you're optimizing the disk access. And of course, you're getting the data very quickly: the time complexity is O(log n), as for any balanced tree, which is mathematically proven.
And of course, as I just mentioned, it supports range queries, which is one important thing. In Gmail or Yahoo Mail, looking for emails from a particular sender or time range is a very frequently used operation, so that querying is really easy. Read-heavy operations are really efficient with the B-tree, and cheaper as well with the disk-based storage optimization I just described.
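To make that range-query idea concrete, here is a minimal sketch in Java. It uses java.util.TreeMap, which is an in-memory balanced search tree rather than a true on-disk B-tree, but the O(log n) descent plus a scan of just one part of the tree is the same idea; the dates and subjects are made up for illustration.

```java
import java.time.LocalDate;
import java.util.TreeMap;

public class EmailRangeQuery {
    public static void main(String[] args) {
        // Keys are kept sorted, so a range scan touches only one part of the
        // tree, just like a B-tree index prunes irrelevant branches on disk.
        TreeMap<LocalDate, String> inbox = new TreeMap<>();
        inbox.put(LocalDate.of(2024, 3, 8),  "Team standup notes");
        inbox.put(LocalDate.of(2024, 3, 12), "Quarterly planning");
        inbox.put(LocalDate.of(2024, 3, 14), "Design review");
        inbox.put(LocalDate.of(2024, 3, 20), "Offsite logistics");

        // "Emails from March 10 to March 15": an O(log n) descent to the
        // start key, then a sequential scan of just the matching range.
        inbox.subMap(LocalDate.of(2024, 3, 10), true,
                     LocalDate.of(2024, 3, 15), true)
             .forEach((date, subject) -> System.out.println(date + "  " + subject));
    }
}
```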
Now moving on to the next data structure, which is quadtrees. Some databases use quadtrees; I gave examples here, MongoDB and PostGIS. Now, what is a quadtree? Let's look at an example; you can see the picture in this slide. The top node represents an entire two-dimensional map. That map can be the entire globe, or it can be a small region. Let's keep it simple and imagine the top node is the entire globe as a 2D map. Every map has four quadrants: the top left is the northwest, the top right is the northeast, the bottom left is the southwest, and the bottom right is the southeast.
Now, if you want to go to North America on the globe, it's in the northwest, so it goes to node A. Within node A, North America, there might be a lot of places, a lot of states, and when you have to zoom into some place, you drill down further from node A. Just like we have nodes C and D here in the northeast quadrant, which we can assume is, say, China or Japan, where you zoom into certain regions. If C and D represent certain countries, say C represents China, and inside China you want to look for certain places, then C again drills down into four more nodes as children, and you look for whatever places you want to see. That way the whole quadtree is structured so you can keep zooming in, or zoom out and look at the country level or continent level and so on.
and all those things. And one thing
to note here is every node has a value, and each value
represents like, the higher the value,
the more important the place is. For example,
if you are looking for, say, restaurants, the top
restaurant always has a high value and is always at the top of
the node in a particular region. Say you are zooming into North
America A and a represents inside going down
a, you have a one which represents California,
and in a one you probably pick a
top restaurants, like a brazilian steakhouse.
And that immediately is the first child of a
one. So when you're picking the algorithm,
how it works is the breakfast search
is done when an area is selected and
all the nodes which are at the periphery of the selected map
are picked. So, which means that the top places,
the top things are always retrieved first.
The reason the data is retrieved so quickly is that the quadtree supports something called spatial indexing. It is just like regular indexing, but applied to coordinate data, a 2D map, as I just explained. And of course it supports range queries, similar to the B-tree, where we go through certain nodes rather than searching all of them. Again, O(log n) complexity, really quick. The other important property is density. You can see in this picture that node A is left as is, with no children, but the second node, the northeast region, has four children again, and the southeast region goes down to level three, to the E and F nodes. So if a user just looks at the 2D map superficially, at a very high level like the country level, you're not going to see too many details. But if you drill down all the way to pinpoint locations, that's where the tree gets denser in that particular node, say the southeast in this example, and you can get as dense as you need. If there are a lot of things to be retrieved, that particular node can be very dense, and if there are not many places to be shown, it will be low density. So it's quite adaptable and flexible in that respect. That's why you should consider quadtrees for a proximity service or any service that involves coordinates, spatial indexing, or anything with maps and places.
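Here is a minimal point-quadtree sketch in Java to make the zoom-in/drill-down idea concrete. It is not a production spatial index: the capacity, the flat longitude/latitude plane, and the sample places are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal point quadtree: each node covers a square region and splits
 *  into NW/NE/SW/SE children once it holds more than CAPACITY points. */
class QuadTree {
    static final int CAPACITY = 2;
    final double x, y, half;                       // center and half-width of this region
    final List<double[]> points = new ArrayList<>();
    QuadTree nw, ne, sw, se;

    QuadTree(double x, double y, double half) { this.x = x; this.y = y; this.half = half; }

    void insert(double px, double py) {
        if (nw == null && points.size() < CAPACITY) { points.add(new double[]{px, py}); return; }
        if (nw == null) subdivide();
        child(px, py).insert(px, py);
    }

    private void subdivide() {                     // a dense area grows deeper subtrees
        double h = half / 2;
        nw = new QuadTree(x - h, y + h, h); ne = new QuadTree(x + h, y + h, h);
        sw = new QuadTree(x - h, y - h, h); se = new QuadTree(x + h, y - h, h);
        for (double[] p : points) child(p[0], p[1]).insert(p[0], p[1]);
        points.clear();
    }

    private QuadTree child(double px, double py) {
        return px < x ? (py < y ? sw : nw) : (py < y ? se : ne);
    }

    /** Range query: prune whole subtrees whose region cannot overlap the
     *  rectangle -- this pruning is what makes spatial lookups fast. */
    void query(double minX, double minY, double maxX, double maxY, List<double[]> out) {
        if (maxX < x - half || minX > x + half || maxY < y - half || minY > y + half) return;
        for (double[] p : points)
            if (p[0] >= minX && p[0] <= maxX && p[1] >= minY && p[1] <= maxY) out.add(p);
        if (nw != null) {
            nw.query(minX, minY, maxX, maxY, out); ne.query(minX, minY, maxX, maxY, out);
            sw.query(minX, minY, maxX, maxY, out); se.query(minX, minY, maxX, maxY, out);
        }
    }

    public static void main(String[] args) {
        QuadTree world = new QuadTree(0, 0, 180);  // root node: the whole 2D "globe"
        world.insert(-122.4, 37.8);                // a hypothetical San Francisco place
        world.insert(139.7, 35.7);                 // a hypothetical Tokyo place
        world.insert(-118.2, 34.1);                // a hypothetical Los Angeles place
        List<double[]> found = new ArrayList<>();
        world.query(-130, 30, -110, 45, found);    // "select a region" on the map
        System.out.println("Places in region: " + found.size());  // 2
    }
}
```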
Now, the next one is very interesting: the LSM trees I mentioned. So what are these? LSM trees are the data structure in storage systems used for write-heavy applications, say Kafka or the Hadoop Distributed File System, where there is lots and lots of data, like stream data, coming in. Imagine a lot of data coming in: you don't want to write every single write request to secondary storage immediately, because each write involves some I/O, going to the disk, writing, and coming back, and that isn't efficient. The more efficient way is to buffer it in a cache, as shown here. The incoming writes are first buffered in a small in-memory cache, a memtable, which is itself implemented with a backend data structure that is again a binary tree, and once it is full, the data is asynchronously flushed to the disk. Here, levels 0, 1, 2, and 3 are on disk, the secondary storage.
So when the cache is full, the data is flushed onto the disk, and the small entities you see at level zero are each called a sorted string table, the SSTables. They are immutable, they can't be changed, and the data is sorted whenever it is flushed. Now, when you have a lot of such individual SSTables, imagine someone wants to do a search. Writing is fine, right? We are just buffering into the memtable and flushing to the disk. But what if someone wants to read? Are you going to write logic that goes over all these SSTables? That's not efficient. So to make the system read-efficient, something called compaction happens asynchronously in the background. What it means is that at level zero, say two to four SSTables are clubbed together using the merge sort algorithm. Since they are already sorted, we use a merge to compact multiple SSTables together and flush the result to level one. So level one is ex-level-zero data that was merged and flushed down. Similarly, level two is ex-level-one data: at level one, all the SSTables are again merged, a classic merge sort you can look up online, and the merged data is flushed from level one to level two. That way the data keeps getting built up, and when you have all the sorted data clubbed together in one place, it is easy to search. So you get the read performance, and also the write performance from that memtable buffer. Compaction is an important point here that needs to be noted.
LSM trees are used for any write-heavy application for this very reason. Another important thing is that with LSM trees the configuration is customizable: we can set the buffer size of the memtable, how frequently the data is flushed from the buffer to disk storage, and what the size of the compaction should be. All those knobs are there.
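As a hedged illustration of that write path, here is a toy LSM engine in Java: a sorted in-memory buffer (the memtable) absorbs writes, full buffers are flushed as immutable sorted "SSTables", and compaction merge-sorts them back together so reads stay cheap. The tiny sizes and single-level compaction are simplifications; real engines like RocksDB are far more elaborate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy LSM-tree sketch: buffer writes in memory, flush sorted runs,
 *  and compact them so reads don't have to scan every SSTable. */
class TinyLsm {
    static final int MEMTABLE_LIMIT = 4;
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final List<SortedMap<String, String>> sstables = new ArrayList<>();

    void put(String key, String value) {
        memtable.put(key, value);              // cheap in-memory write
        if (memtable.size() >= MEMTABLE_LIMIT) flush();
    }

    private void flush() {                     // asynchronous, on disk, in real systems
        sstables.add(memtable);                // TreeMap is already sorted: an "SSTable"
        memtable = new TreeMap<>();
    }

    String get(String key) {
        if (memtable.containsKey(key)) return memtable.get(key);
        for (int i = sstables.size() - 1; i >= 0; i--)   // newest SSTable wins
            if (sstables.get(i).containsKey(key)) return sstables.get(i).get(key);
        return null;
    }

    /** Compaction: merge the sorted runs (a merge sort of sorted inputs),
     *  keeping only the newest value per key. */
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (SortedMap<String, String> table : sstables) merged.putAll(table);
        sstables.clear();
        sstables.add(merged);
    }

    public static void main(String[] args) {
        TinyLsm db = new TinyLsm();
        for (int i = 0; i < 10; i++) db.put("key" + i, "value" + i);  // triggers flushes
        db.compact();
        System.out.println(db.get("key7"));    // value7
    }
}
```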
Now moving on to the last one. This is, again, not exactly a data structure, but I mentioned the inverted index. Why an inverted index in a search engine? First of all, what is an index? An index is used to search in a DB: you create a map from a keyword to a reference, say the ID of a document, and with that ID you refer to some content. It exists for faster search in the backend or in the database. Now, what is an inverted index? Let's take an example. We have a lot of documents, and you can imagine these documents as the web links that appear when you search on google.com. When you search "who is the highest paid actor", you get a list of links, right? You can imagine all those links as these documents, like a web crawler would: document one is web link one, then web link two, web link three.
Now, how we build the inverted index is that you take all the searchable entities, say the words "a", "all", "brown", "day" in this particular table: I picked all the words from these documents and put them here. Then the value for each keyword is the set of documents that contain it. So it's the exact opposite of saying document one has these keywords and document two has those keywords. It's reversed: this keyword appears in documents one, two, and three, and that keyword appears in documents two, three, and four, something like that. And now when someone searches with these keywords, say "all brown day", then what is returned is documents three, one, and two.
That means that, based on the keywords, you're fetching the actual documents, just like on google.com: you search for the highest paid actors, all the keywords are used, and whichever documents or web links contain those keywords are retrieved and displayed. That's what the inverted index is about. So when you're designing or building a search engine, one of the most important storage architectural considerations is building an inverted index.
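A minimal sketch of that keyword-to-documents mapping in Java, with made-up documents standing in for crawled web links:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndexDemo {
    public static void main(String[] args) {
        // Hypothetical "documents" standing in for crawled web pages.
        List<String> docs = List.of(
                "a brown fox",        // document 0
                "all day long",       // document 1
                "a brown day");       // document 2

        // Build the inverted index: keyword -> ids of documents containing it.
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++)
            for (String word : docs.get(id).split("\\s+"))
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(id);

        // A query just unions (or intersects) the posting lists of its keywords.
        Set<Integer> hits = new TreeSet<>();
        for (String word : "all brown day".split("\\s+"))
            hits.addAll(index.getOrDefault(word, Set.of()));
        System.out.println("Matching documents: " + hits);  // [0, 1, 2]
    }
}
```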
Now, moving on to the second part. We're still in the storage world; it's a bit tangential, but the topic is partitioning the databases. Until now, we talked about the data structures behind the storage, where the data is stored. Now, with all these large-scale, data-intensive systems, you can't use just one database or one storage system; it's always distributed. We all know about primary indices, but I would like to focus on something called the secondary index. Let's understand what a secondary index is, why we need to know about it, and what architectural best practices you need to consider to use the right secondary index. Let's take an example. I took an online book management system here, and say a user is trying to find, as I wrote in red here, all books by a particular author, called David.
So what happens here is that the user sends that query, find all books by David, and it goes to the microservice, and the request goes to the backend storage. You see there is a primary index for fast retrieval of the data. First of all, this is how the data is partitioned. Let's assume you don't have one storage server or one backend store holding all the data; it's always distributed. You have partition one located in, say, Virginia, partition two located in Seattle, partition three located in, say, Europe. You have data scattered all across. Let's assume the data is partitioned based on genre, and now we have to find all the books by the author David. You need to go through each of the partitions: novel, science, and biography. And in each partition there is something called a secondary index. Why? Because just going into the partition won't help you; there might be a lot of other authors too, and you want to retrieve the books of only the one author being searched. Going down to the partition level is fine with the primary index, but within a partition you actually need a secondary index for faster retrieval and quick lookup. That's why you need a secondary index.
Now, what happens here is that when you go to the secondary index and search for the author David, you have author David in partitions one, two, and three. So the data is retrieved from all three partitions, and then it is aggregated in the microservice and returned to the user. That's a lot of activity happening there, right? Like aggregation. It sounds fine, but let me explain. This particular secondary index is located in each partition; that's why we call it a local secondary index, or, by another name, partitioning a secondary index by document. The querying process across multiple partitions is known as scatter and gather: the query is scattered, you're querying everything, and you're gathering the results together, aggregating and sending them to the user. This involves parallel queries to each partition to collect the data needed. Although parallelization can improve speed, scatter and gather can be resource-intensive, especially in these kinds of large databases. The main costs are associated with querying each partition separately and then consolidating the results, which often requires a lot of computational and network resources; you're actually making calls over the network to fetch all the data. You can definitely optimize it. Consequently, while local secondary indices enhance partition-specific query efficiency, as I just explained, they also necessitate careful consideration of the overhead involved in scatter and gather operations in distributed data systems like this.
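To illustrate scatter and gather with local secondary indexes, here is a hedged Java sketch: each partition holds its own author-to-titles index, so a query by author fans out to every partition in parallel and the results are merged. The partitions, authors, and titles are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Scatter-gather over local secondary indexes: every partition must be
 *  queried, then the partial results are consolidated. */
public class ScatterGatherDemo {
    public static void main(String[] args) throws Exception {
        // One local secondary index (author -> titles) per genre partition.
        List<Map<String, List<String>>> partitions = List.of(
                Map.of("David", List.of("Novel One")),    // partition 1: novels
                Map.of("David", List.of("Science Two")),  // partition 2: science
                Map.of("Alice", List.of("Bio Three")));   // partition 3: biography

        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        List<Future<List<String>>> futures = new ArrayList<>();
        for (Map<String, List<String>> p : partitions)    // scatter: query every partition
            futures.add(pool.submit(() -> p.getOrDefault("David", List.of())));

        List<String> books = new ArrayList<>();
        for (Future<List<String>> f : futures)            // gather: aggregate the results
            books.addAll(f.get());
        pool.shutdown();
        System.out.println("Books by David: " + books);
    }
}
```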
So what is the solution? What we need for this particular use case is not a local but a global secondary index. Here, in this architecture, the secondary index is not local, not pertaining to a specific partition, but sits at the level of the primary index itself. Now, when a user tries to find all the books by a particular author, we can directly let the microservice query through the secondary index. You can see here, when the user asks for the books by author David, you can fetch them from partitions one and two directly, and there is no aggregation needed; all the book data is simply returned. At the same time, if you want to retrieve all the books in a particular genre like novels, that goes through the primary index to, say, partition one, and all the books with the novel genre, with all their authors, are returned.
So the global secondary index is the more ideal choice in the case where you are querying a certain author's books across the genre partitions, with two levels of filters. This partitioning is called partitioning a secondary index by term. This approach offers a more efficient solution for queries that span multiple partitions, like searching by author across all genres. It reduces the overhead of scatter and gather, which I described, since the index is global and not confined to individual partitions. However, one more thing: this method might introduce challenges in maintaining the global index. You already have a primary index, so now you have to maintain a separate data structure, like a map, and decide where and how you want to store it, and in a large, data-intensive environment that becomes a challenge. So it's a trade-off between the ease of cross-partition queries, which you get with the global secondary index, and the complexity of maintaining a global index. After all, everything in life is a trade-off.
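For contrast, here is a sketch of the global variant under the same made-up data: a single author-to-locations map maintained alongside the primary index makes the read a single lookup, and the comment marks where the maintenance complexity just described lives.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Global secondary index sketch: one author -> locations map kept outside
 *  the partitions (illustrative data and names). */
public class GlobalIndexDemo {
    record Location(int partition, String title) {}

    public static void main(String[] args) {
        Map<String, List<Location>> globalByAuthor = new HashMap<>();
        // Every write must also update this global map -- that upkeep is the trade-off.
        globalByAuthor.put("David", List.of(
                new Location(1, "Novel One"),
                new Location(2, "Science Two")));

        // The read is now a single lookup: no fan-out to every partition.
        System.out.println("Books by David: " + globalByAuthor.get("David"));
    }
}
```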
Now let's look at the differences with some real-world use cases: a couple of examples of when a local secondary index is used and when a global secondary index is used. Local is about querying within a particular partition. Say on an e-commerce website, a buyer wants to look at all the electronics products from a particular company, say Sony. The first level is to go to the electronics partition, and then within the electronics partition you want to retrieve all the products that are Sony's. For that you need a secondary index: the company name is a secondary index within the electronics category, which is the partition. We only need the data from that one partition.
Now consider a use case more applicable for the global secondary index. Say a multinational firm has employees across different countries, like the US, UK, and Asia. Now the HR department wants to view all the managers across the company, irrespective of country. And let's assume the storage is actually partitioned, sharded per country: US employee data is in the US data center, UK data in the UK data center, and so on. For this particular query from HR, to view all the managers across the company, you don't need a local secondary index; you don't need to look into the details of one particular partition. Rather, you want to look across all the partitions, and this is where the importance of the global secondary index comes into the picture: the data is obtained from all partitions. So those are the two use cases. And this multinational employee database is a common thing at a lot of companies, so I'm sure many of you can relate to it.
Now we'll switch gears, moving on to the next part. We're still in the storage or database world; now I'll be talking about conflict-free replicated data types. So what are they? Before I talk about these data types, we should understand why we are even talking about them, and why the architecture might need something like CRDTs. Let's talk about a concept called replication. What is replication? As you can see, I'm directly mentioning multi-leader replication here. That means the backend storage, the data, is not present in one place but is copied over to multiple servers, multiple data centers, or multiple locations. Replication in general is needed in distributed systems for several reasons, and multi-leader replication on top of that. First and foremost, high availability.
System availability is improved by providing multiple independent leader nodes. If one leader node fails, like one database node failing or becoming unavailable, other leaders can continue to accept write requests, ensuring uninterrupted access to the data and services, so the user has a good experience. Another reason is fault tolerance: multi-leader replication especially improves fault tolerance by providing redundancy at the leader level itself. If one leader node fails, another leader node can continue to accept writes, preventing data loss, service disruptions, and things like that. Another big, important advantage is write scalability. This replication model allows write operations to be distributed across multiple leader nodes, enabling the system to handle large volumes of writes by parallelizing them across multiple nodes, improving the overall throughput and scalability. And again, there is geographic distribution in general. As I mentioned earlier with the employee database example, the data is often distributed geographically, allowing the leader nodes to be located in different regions or data centers. This enables the data to be replicated closer to the users, reducing latency and improving the overall user experience. So that's why we have multiple nodes, and the data is always copied from one node to the other node.
Now let's take a use case where two users are trying to edit a shared Google document. Let's assume the title of the document is A and the ID of the document is 123. In the red and purple flows, both step ones happen at the same time. User one says: hey, I want to update the title to B for this particular Google Doc 123. And user two says: I want to set the title to C. What happens at step two? In red and purple, the data is updated by each leader node internally: leader one changes the title from A to B, and leader two changes the title from A to C. Done. And don't focus on the follower here; that is just asynchronous copying of data from leader to follower for read operations, which is a different concept, the leader-follower model of general replication. Now let's focus on steps four and five. Like I said, replication involves copying data from one data center to the other asynchronously so that they are all consistent. As for the way the data is synchronized between the leaders, leader one, leader two, and there can be many leaders, there are a lot of topologies for it: star topology, ring topology, mesh topology. That's again a different concept I'm not going to go into, but the point here is that leader one and leader two should be syncing with each other. During this asynchronous operation, here is what happens. In the red flow, at steps four and five, leader one tells leader two: the old document title is A and I want to change it to B. But leader two says: what is A? I don't have A; it has already been changed to C. So there is a conflict; it can't be changed. And in the purple flow, at steps four and five, leader two goes to leader one and says: hey, I want to change the document titled A to C. Then leader one says at step five: what are you talking about? I don't have any document titled A; I only have B, because it was already changed to B. So there is the conflict. How can you resolve such a conflict during multi-leader data replication? That's where we shouldn't use the regular data types like ints, maps, or sets while actually doing the replication, while copying the data from leader one to leader two or leader two to leader one.
We need to use specific data types called conflict-free replicated data types. They are special data types. I gave an example here; I thought an integer counter was the easiest to explain, so I gave a code sample. You can see that it's not a regular integer that is declared; it is an atomic integer coming from the java.util.concurrent.atomic utilities. In the main block, two objects are instantiated; imagine this program running on leader one and leader two. Leader one increments the counter by one, and leader two also increments the counter by one, as you can see in the main block, and after that, each one has the value one. Now, the merge-leaders call is the actual replication concept we're talking about: merging is nothing but replication. In the merge logic, as you can see at the top in public void merge, you add the logic that ensures the data is consistent. You don't get any conflict in terms of inconsistent data; rather, you ensure the data converges across both nodes by merging it accordingly. So by using conflict-free replicated data types, we can ensure these conflicts don't occur in any replication, especially multi-leader replication.
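The transcript doesn't reproduce the slide's exact code, so here is a hedged sketch of the simplest CRDT counter, a grow-only G-Counter, in the same spirit as the AtomicInteger example described. Each leader increments only its own slot, and merge takes the per-slot maximum, so merging in any order, on any leader, converges to the same value with no conflicts.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** G-Counter CRDT sketch: one slot per leader; each leader increments only
 *  its own slot, and merge takes the element-wise maximum, so replicas
 *  converge no matter the order in which they sync. */
class GCounter {
    private final AtomicLongArray counts;
    private final int nodeId;

    GCounter(int numNodes, int nodeId) {
        this.counts = new AtomicLongArray(numNodes);
        this.nodeId = nodeId;
    }

    void increment() { counts.incrementAndGet(nodeId); }

    long value() {                                 // total = sum of all slots
        long sum = 0;
        for (int i = 0; i < counts.length(); i++) sum += counts.get(i);
        return sum;
    }

    /** Merge is commutative, associative, and idempotent: max per slot. */
    void merge(GCounter other) {
        for (int i = 0; i < counts.length(); i++) {
            long theirs = other.counts.get(i);
            long mine;
            while ((mine = counts.get(i)) < theirs
                    && !counts.compareAndSet(i, mine, theirs)) { /* retry */ }
        }
    }

    public static void main(String[] args) {
        GCounter leader1 = new GCounter(2, 0), leader2 = new GCounter(2, 1);
        leader1.increment();           // each leader accepts a local write
        leader2.increment();
        leader1.merge(leader2);        // replication in either direction...
        leader2.merge(leader1);        // ...converges without conflict
        System.out.println(leader1.value() + " == " + leader2.value());  // 2 == 2
    }
}
```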
Now we will switch gears again, move out of the storage world, and focus on some architectural considerations for the end-to-end design of the system. When considering the end-to-end design of a system, there are two approaches: inside-out and outside-in. Let me go into the details. As you can see, the pictures here are very general, like everything I've been talking about so far. Look at the outside-in architecture first: there's a user making some request, say on an e-commerce platform where he wants to make a payment after selecting a list of things to pay for. You have a user interface where he does that. Then the APIs are invoked, and at a particular endpoint the HTTP call reaches the microservices. Then internally there is some data manipulation; writing or reading happens; ultimately it's some CRUD operation: create, read, update, or delete. Step four is the classes, the abstraction defined on top of the database, and then you actually write to the DB. So this is the general flow.
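As a small, purely hypothetical sketch of those layers in Java (all names invented), the request flows from an endpoint handler through the service logic down to a repository class that wraps the database:

```java
/** Minimal sketch of the layers in the flow just described:
 *  controller (API endpoint) -> service (microservice logic) ->
 *  repository (abstraction over the DB). All names are hypothetical. */
class PaymentRepository {                       // step 4: classes wrapping the DB
    void save(String orderId, double amount) {
        // step 5: the actual write to the database would happen here
        System.out.println("INSERT payment " + orderId + " amount " + amount);
    }
}

class PaymentService {                          // step 3: the CRUD business logic
    private final PaymentRepository repo = new PaymentRepository();
    void pay(String orderId, double amount) { repo.save(orderId, amount); }
}

class PaymentController {                       // step 2: the HTTP endpoint
    private final PaymentService service = new PaymentService();
    void handlePost(String orderId, double amount) { service.pay(orderId, amount); }
}

public class CheckoutFlow {
    public static void main(String[] args) {
        new PaymentController().handlePost("order-123", 49.99);  // step 1: UI action
    }
}
```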
Now, what are outside-in and inside-out? There are two ways of designing a system. When there is a use case and you have to design a system, you can either define it starting from the user: the user interacts with the system in a certain way, opens a certain page on the UI, clicks this button, and that results in calling a particular API on the backend system. If you go by that user-centric, user-interaction approach, that is outside-in architecture. I would say it's more like a product-management-driven approach: when product managers come to a software engineering team and say, hey, this is what we want to build as a product, then you most likely tend to go with outside-in architecture. Inside-out architecture is more engineering-driven. That means you are not starting from the user's perspective; rather, you are changing the data models and data types, and then accordingly changing the abstractions, the backend classes like the data access layer. For example, you're splitting the database into multiple databases with foreign-key and primary-key linkages, then writing new classes on top of it, and then accordingly defining some new microservices. It is domain-driven. This is when the engineering team decides to make architectural enhancements within the system. You start off with the database, then the classes wrapping the database, then the microservices, then the API changes, and ultimately it propagates all the way out to the user. That is the inside-out architecture.
An example of inside-out architecture can be a monolith-to-microservices decomposition. An outside-in example can be product-driven or user-driven, like an e-commerce platform where the user goes and purchases things and how the user interacts with the system drives the design. Another example of inside-out can be a banking system. In developing a banking system, the core domain revolves around financial transactions, accounts, and customer relationships. All of these need to be defined as an entity-relationship model, and the domain-driven design starts by modeling these concepts and defining the business rules governing them. Once the domain is well defined, then infrastructure concerns such as storage, user interfaces, and all the external integrations are addressed, and the design propagates outward. That's inside-out. As for outside-in, I mentioned the e-commerce platform: imagine a company building an e-commerce platform that aims to provide a seamless shopping experience to its customers. To apply an outside-in approach to such a use case, development starts by identifying the key user interactions and requirements from the perspective of both customers and sellers. Then they prioritize features that directly impact the user experience, such as product search, browsing, purchasing, order tracking, and all of that.
That's what I have just described here; I gave some examples as point number four. For inside-out you can see monolith-to-microservices and domain-driven applications like a banking system, and for outside-in, user-centric and API-driven design, as I just explained. Inside-out, as you can imagine, is a push architecture or push strategy, because you're going from the database, which you can think of as inside the backend system, out to the user, so it's inside-out, or push. Outside-in is the opposite: it comes from the end user all the way into the system. So if you imagine yourself as the system, you're either pushing out or being pulled in. That's how the strategies are defined.
If you're doing something like a monolith-to-microservices re-architecture, which is inside-out, you know all the details of the system and what needs to be changed, but you don't know about the UI changes in that particular case. That's why you start with the internal things, and that's why I mentioned, at the second point, that you have to forecast the demands of the UI. Outside-in is exactly the opposite: you start from the UI, you know what the UI is demanding, and then you go inside. So these are the four topics I wanted to cover as part of the architectural practices: the important architectural trade-offs and design decisions one needs to consider, especially as a software architect designing these data-intensive systems, knowing the details of the backend to make the right decisions based on some crucial trade-offs.
There are other things, like consistency and availability and a lot of other aspects of the system, that can be considered and for which architectural practices need to be employed, but I'll keep those for some other talk. Thanks a lot for listening to my session today; I really appreciate it. Thank you.