Transcript
My name is Alex. Today I'm going to talk about practical strategies for navigating complexity in distributed systems. Here's a quick agenda of what I'm planning to talk about today. First of all, we'll take a look at what a distributed system is and what the main complexities of building such systems are. Then we'll cover core principles of systems engineering and cybernetics, and three practices for reducing complexity overhead.
So first of all, let me define what a distributed system is. A distributed system is simply a network of computers which communicate with each other to achieve a common goal. Each computer, or node, has its own local memory and CPU and runs its own processes. All of them communicate with each other over the network to coordinate their actions. The main difference between distributed and centralized systems is that in centralized computing systems, all computing is performed by a single computer in one location. Also, nodes of a centralized system all access the central node, which can lead to network congestion and slowness. A centralized system has, or rather is, a single point of failure, while a distributed system by design has no single point of failure.
But before we start looking at the real complexities of building distributed systems, I want to talk a bit about what complexity is. Complexity can be defined from different points of view, and in terms of software engineering there are two main definitions which are important for us. In systems theory, complexity describes how various independent parts of a system interact and communicate with each other, whereas from the software and technology perspective, complexity refers to the details of the software architecture, such as the number of components, how they define interactions between each other, how many dependencies they have, and how they interact within the whole system. I think monolithic architecture is a great example of a centralized system. It's represented as a single deployable and single executable component. For instance, such a component may contain the user interface and many different modules in one place. Although this architecture has been the traditional way of building software over time, and even today,
it has several important drawbacks we need to talk about. First of all, the main disadvantage of monolithic architecture is the inability to scale modules independently. Another problem is that over time it becomes much harder to control the growing complexity of this big system. The lack of independent module deployment also brings a lot of complexity, because you need to redeploy everything: it's a single executable component. Additional challenges fall on developers maintaining a single huge database without well-established rules, which is really complicated. And another problem, which is quite obvious, is technology and vendor coupling: you cannot just use several languages in one monolithic application, because technically they have different runtimes, and so on.
If we move further, we can take a look at the opposite: microservice architecture. Microservice architecture is an architectural style and one of the variants of service-oriented architecture, where we structure the system as a collection of loosely coupled services. For instance, in the same example, companies, accounts, customers, and the user interface are represented as separate processes deployed on several nodes. Each of these services has its own database. From time to time a shared database is used, but that is generally a bad practice, or an anti-pattern, for microservices architecture. So in this case I've demonstrated how a separate database per service is used.
But what do distributed systems give us? The main and, I think, most obvious benefit is that we get horizontal scalability. You can scale your database horizontally, you can scale your services horizontally; technically any infrastructure component can be scaled horizontally, just by cloning it, putting some kind of load balancer in front of all of that, and living without a lot of pain. Another thing which is quite important is high availability and fault tolerance, because whenever you have several clones, you can organize failover techniques that will help you avoid downtime in case of crashes, memory leaks, or service outages. Another quite important thing is geographic distribution. We all have customers all over the world, in the USA, in Europe, in Asia, and we want to bring the best experience to our customers. That's why we need to distribute these services across the whole world anyway and organize more complicated techniques for data replication to connect all of that together.
Another quite obvious benefit is technology choice freedom, because you can deliver one service with this tool and another service with another tool; you can experiment more and find the best solution for the current problem. Another thing which is quite important to note is easier architectural control. It really helps to control the system, because when there is only a single component, rules are violated time and time again and everything gets broken: you need to dig into this component, find some balance, restore order, and so on. When the system consists of several small services, somewhat isolated from each other, it is much less likely that these services will be messed up by changes from a big group of developers working on the same component.
But any distributed system comes with its own challenges, which we'll take a look at right now. Before we dive into the challenges, I just want to talk a little bit about the quality attributes we'd like to have in our system. There are three main quality attributes which any system has at some level. First of all, reliability. Reliability is the ability to continue working correctly even when things go wrong, meaning being fault-tolerant or resilient. Even when a system is working reliably today, that doesn't mean it will necessarily work reliably in the future. One common reason for degradation is increased load: perhaps the system has grown from 10,000 concurrent users to 100,000 concurrent users, or from 1 million to 10 million. Scalability is the term we use to describe a system's ability to handle an increased load. The final piece is maintainability, which is about making life better for the engineering and operations teams who need to work with the system. Good and stable abstractions really help to reduce complexity overhead and make the system much easier to modify and adapt for new features. So this is an example system I just want to show you, which has various components such as API gateways, in-memory caches, message queues, different databases, different microservices, full-text search, and so on. All of that, in order to evolve, should be reliable, scalable, and maintainable.
So far it sounds easy, but there is a famous law, Murphy's law, a very realistic one to be honest, and the best friend of distributed systems. In some formulations it's extended to: anything that can go wrong will go wrong, and at the worst possible time. It's not just because we're unlucky, or because the infrastructure is; the reality is simply like that. Now let's take a look at the real complexities of distributed systems. What are the main troubles?
First of all, networks are unreliable. We'll take a deeper look at this a little later, but from a high-level perspective, you cannot completely rely on networks because so many things are out of your control. You cannot say for sure that it will work in 100% of cases. Another thing which is quite important is unreliable clocks and problems with time synchronization. Another thing which is also quite painful for us is process pauses, and we'll take a deeper look at this one too. I think many of you who have worked with systems that have garbage collection have probably thought about this problem as well. Another quite obvious thing is eventual consistency. Whenever we scale our database horizontally, we have a leader and some followers, and whenever data is replicated from the leader to the followers there is some lag, which technically may affect some systems, some logic, the user experience, and so on. It's not a problem from one perspective, because it gives you a lot of availability benefits, but it can be a problem from a different perspective, where you need to achieve really strong consistency. Another quite important thing is observability. Whenever we build a system at this huge scale, with a lot of services, databases, and queues, we need to understand how all of it is working. We need to have insights into how it's working right now, how performance is tracked, whether we have any bottlenecks, errors, and so on. This is a big and broad topic, and we will take a look at some of the aspects which can be helpful in building such systems.
And another one is evolvability. It's a kind of extension of maintainability, but it technically describes how to evolve the system, or distributed systems, without a lot of problems and without a lot of pain, and how to make it really scalable for people, for engineers, for the business, so that we don't waste a lot of resources and get stuck because something was designed badly and cannot be scaled due to some issue in the system. This topic is also very broad, and we'll take a look at several very useful aspects.
First of all, as I said, let's take a look at unreliable networks. There are so many reasons why networks are not reliable. First, your request may have been lost, and you can do nothing about that because this is a problem of the network. Second, your request may be waiting in a queue and will just be delivered later on, and you also cannot rely on it being processed immediately. The remote node may have failed; perhaps it crashed or powered down, or some kind of memory leak happened, and so on. The remote node may have temporarily stopped responding; this is the same kind of problem as waiting in the queue, but it happens on the node itself. The remote node may have processed your request, but the response has been lost on the network, and you technically just don't know whether it was processed or not, and what you need to do now. And finally, the remote node may have processed your request and the response has merely been delayed, so you still have to wait for a while to get the response and then apply some logic on your side.
The obvious way to handle request loss is to apply timeout logic on the caller side. For example, if the caller doesn't receive a response after some timeout, we can just throw an exception and show an error to the user. In most cases that would probably be okay, mainly for back-office systems, I guess. But for customer-facing, highly available systems, where we don't want a poor user experience, we want to do something more to solve this problem. The obvious solution, because a network issue is very likely temporary, is to apply the retry pattern. If the response indicates that something went wrong, we can just retry it after a timeout. So technically, if a timeout happened, we treat it as a temporary issue and just retry to process it again.
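To make that concrete, here is a minimal caller-side sketch (the function names, timeout values, and backoff parameters are my own illustrative choices, not from the talk):

```python
import time
import random

def call_with_retries(operation, max_attempts=3, base_delay=0.5, timeout=2.0):
    """Call `operation` and retry on failure, assuming it raises on errors.

    Uses exponential backoff with jitter so many callers don't all
    retry at the same moment.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # The operation is expected to raise on timeout or network error.
            return operation(timeout=timeout)
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # give up and surface the error to the user
            # Exponential backoff: 0.5s, 1s, 2s, ... plus random jitter.
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(delay)
```

Note that blind retries are only safe if the operation is idempotent, which is exactly where the next technique comes in.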
But what if the request technically was processed by the server and only the response was lost? In this scenario, to be honest, retries may lead to severe consequences, like duplicated orders, payments, or transactions, and that experience will not be very nice for the users. The technique to avoid this problem is called idempotency. Idempotency is a concept where doing the same thing multiple times has the same effect as doing it just once. To achieve exactly-once semantics, we can leverage a solution that attaches a special idempotency key to the request. When retrying the same request with an identical idempotency key, the server will verify that a request with such a key has already been processed, and will simply return the previous response. This way, any number of retries with the same key won't affect system behavior negatively. That technique helps in most of the cases, and it is one of the main techniques you have to bear in mind while building distributed systems.
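As a minimal sketch of the server-side check (assuming an in-memory dictionary for processed keys; a real system would use a durable store and take the key from a request header):

```python
processed: dict[str, dict] = {}  # idempotency key -> stored response

def charge_card(payload: dict) -> dict:
    # Stand-in for the real, non-idempotent business operation.
    return {"status": "charged", "amount": payload["amount"]}

def handle_payment(idempotency_key: str, payload: dict) -> dict:
    # If we've already seen this key, return the stored response
    # instead of charging the customer a second time.
    if idempotency_key in processed:
        return processed[idempotency_key]
    response = charge_card(payload)
    processed[idempotency_key] = response
    return response

# A retry with the same key is harmless:
first = handle_payment("key-123", {"amount": 50})
retry = handle_payment("key-123", {"amount": 50})
assert first is retry  # same response, charged only once
```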
Another pattern which might be quite useful in preventing overload and completely crashing a server during a failover, an outage, or a temporary issue is the circuit breaker. A circuit breaker acts as a proxy in front of a system which is, say, under maintenance, likely to fail, or heavily failing right now. There are so many reasons why things can go wrong: a memory leak, a bug in the code, or an external dependency which is faulty. In such scenarios it's better to fail fast rather than risk cascading failures across the system. So technically, the circuit breaker is an intermediate component which checks some specific conditions and switches calls off and on when allowed, based on those conditions.
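A minimal circuit-breaker sketch (the thresholds and two states here are my own simplification; production implementations usually add a half-open state and thread safety):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        # While open, fail fast instead of hammering a struggling server.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # reset timeout elapsed, let calls through
            self.failures = 0
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success resets the failure count
        return result
```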
Concurrency is one of the most intricate challenges in distributed systems. Concurrency means that multiple computations are happening at the same time. So what happens if we try to update an account balance at the same time from different operations? If there is no defensive mechanism applied, race conditions will very likely happen, which will lead to lost writes and data inconsistencies. In this example, two operations are trying to update the account balance concurrently, and the last write wins, which leads to serious issues.
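To make the lost update concrete, here is a small sketch of my own (not from the talk) where two threads perform an unprotected read-modify-write; depending on timing, one deposit can disappear:

```python
import threading

balance = 100

def deposit(amount):
    global balance
    current = balance            # read
    # ... imagine a GC pause or network delay here ...
    balance = current + amount   # write based on a possibly stale read

t1 = threading.Thread(target=deposit, args=(50,))
t2 = threading.Thread(target=deposit, args=(30,))
t1.start(); t2.start(); t1.join(); t2.join()

# Expected 180, but if both threads read 100 before either wrote,
# the last writer wins and we end up with 150 or 130.
print(balance)
```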
And to avoid this problem, several techniques can be used. But before diving into the solutions, let's take a look at what ACID means. The ACID acronym stands for atomicity, consistency, isolation, and durability. All of the popular SQL databases implement these properties, but what do they mean? Atomicity specifies that an operation will be either completely executed or entirely failed, no matter at what stage a failure happens. It allows us to be sure that another thread cannot see the half-finished result of the operation. In very simple terms, consistency means that all defined invariants will be satisfied before a transaction successfully commits and changes the state of the database, or of a specific table or particular part of the data store. Isolation, in the sense of ACID, means that concurrently executing transactions are isolated from each other. The serializable isolation level is the strictest one, processing all transactions sequentially. But another level, named snapshot isolation, is widely used in popular databases like SQL Server, MySQL, PostgreSQL, and I believe many others, and we will talk about it very soon. Durability promises that once a transaction is committed, all its data is stored safely. That is a very important point for us, to be sure that we store everything without any losses.
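As a small illustration of atomicity (my own sketch using SQLite; any ACID-compliant database behaves the same way), either both legs of a transfer commit or neither does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 'alice'")
        raise RuntimeError("simulated crash between the two writes")
        # The second leg below never runs in this simulated crash:
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 'bob'")
except RuntimeError:
    pass

# Atomicity rolled the half-finished transfer back: Alice still has 100.
print(conn.execute("SELECT balance FROM accounts WHERE id = 'alice'").fetchone())
```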
Returning to the concurrency problem, the most practical solution is to use the snapshot isolation level, which most popular ACID-compliant databases provide. The key idea of this level is that the database tracks record versions and fails to commit transactions that modify records which were already modified outside of the current transaction. All read operations in a transaction see a consistent version of the data, as it was at the start of the transaction. For write operations, changes made by a transaction are not visible to other transactions until the initial transaction commits. That helps a lot, to be honest; it simplifies so many things during development, and it also guarantees that, if you manage it properly, you won't lose anything and you will stay quite consistent in your database and your data.
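For example, in PostgreSQL, REPEATABLE READ is implemented as snapshot isolation; in this sketch (using the psycopg2 driver with a hypothetical `accounts` table and connection string), a transaction that conflicts with a concurrent update fails to commit instead of silently losing a write:

```python
import psycopg2
from psycopg2.extensions import ISOLATION_LEVEL_REPEATABLE_READ

conn = psycopg2.connect("dbname=shop")  # assumed connection string
conn.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)

try:
    with conn.cursor() as cur:
        # All reads in this transaction see the snapshot taken at its start.
        cur.execute("SELECT balance FROM accounts WHERE id = %s", ("alice",))
        (balance,) = cur.fetchone()
        cur.execute(
            "UPDATE accounts SET balance = %s WHERE id = %s",
            (balance + 50, "alice"),
        )
    conn.commit()
except psycopg2.errors.SerializationFailure:
    conn.rollback()  # a concurrent transaction modified the row first: retry
```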
Another important thing to say is that most NoSQL databases do not fully provide ACID properties, choosing BASE instead, where BASE means basically available, soft state, and eventually consistent. In such cases, the compare-and-set strategy may be utilized. The purpose of this operation is, of course, to avoid lost updates, by allowing an update to happen only if the value has not changed since you last read it. If the current value does not match what you previously read, the update has no effect, and the read-modify-write cycle must be retried. For instance, Cassandra provides a lightweight transactions approach, which allows you to utilize various IF / IF NOT EXISTS / IF EXISTS statements to prevent concurrency issues. So please bear in mind that this strategy will be very helpful if you're using a NoSQL database and need to avoid this concurrency problem, which may lead to really problematic situations in your software, and in the business as well.
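A minimal compare-and-set loop, sketched in SQL via SQLite for illustration (Cassandra's lightweight transactions with `UPDATE ... IF` follow the same idea):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
conn.commit()

def deposit(account_id, amount):
    while True:  # retry the read-modify-write cycle until our update wins
        (old,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (account_id,)
        ).fetchone()
        # The WHERE clause matches only if nobody changed the value meanwhile.
        cur = conn.execute(
            "UPDATE accounts SET balance = ? WHERE id = ? AND balance = ?",
            (old + amount, account_id, old),
        )
        conn.commit()
        if cur.rowcount == 1:
            return  # the compare-and-set succeeded

deposit("alice", 50)
print(conn.execute("SELECT balance FROM accounts WHERE id = 'alice'").fetchone())
```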
Another technique that might be quite helpful is called a lease. Let's imagine we have some kind of resource we want to update exclusively. The lease pattern means that we should first obtain a lease with an expiration period for this specific resource, then update the resource, and finally return the lease. In case of failures or anything else, the lease will automatically expire, allowing another client to access the resource. This technique is very useful and will help you in most cases. However, there is a risk of process pauses and clock desynchronization, which may lead to issues with parallel resource access, and we'll take a look at that a little bit later on.
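One common way to implement a lease (a sketch assuming Redis and the redis-py client; key names and timings are illustrative, and the release check below is itself slightly racy, which production implementations solve with a Lua script):

```python
import uuid
import redis

r = redis.Redis()  # assumes a local Redis instance

def with_lease(resource, ttl_ms, update_fn):
    token = str(uuid.uuid4())
    # SET NX PX: acquire the lease only if nobody holds it, and let it
    # expire automatically after ttl_ms even if we crash mid-update.
    if not r.set(f"lease:{resource}", token, nx=True, px=ttl_ms):
        raise RuntimeError("resource is leased by someone else")
    try:
        update_fn()
    finally:
        # Release only if we still hold our own token, so we don't delete
        # a lease that already expired and was re-acquired by someone else.
        if r.get(f"lease:{resource}") == token.encode():
            r.delete(f"lease:{resource}")
```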
Now, the dual-write problem. The dual-write problem is a challenge that arises in distributed systems, mainly when dealing with multiple data sources or databases that we need to keep in sync or just update. In other words, let's imagine we need to store new data in the database and then send some messages to Kafka. Since these two operations are not atomic, there is a chance of failure during the publishing of the new messages. Based on this example, if the transaction fails to commit after we have already tried to send the data, an external system like Kafka won't get these messages, and another system may be really disrupted. Conversely, if we try to send the messages to Kafka directly during the transaction and only then commit, we may find ourselves in the worst situation, where external systems are already notified but we didn't commit the transaction on our side. Depending on the business and the criticality of this operation, we may find ourselves in really, really interesting situations, very likely leading to some very nice discussions with legal while trying to find a solution. Technically, this is a really bad practice, and you have to avoid it at any time and in any situation, so as not to be involved in so many war rooms solving problems.
One possible solution to this problem is to implement a transactional outbox. This involves storing events in an outbox table within the same transaction as the operation itself. Because of atomicity, nothing will be stored in case of a transaction failure. One more component is needed here as well: a relay, which polls the outbox table at regular intervals and sends the messages to their destinations. This approach allows us to achieve an at-least-once delivery guarantee. However, that's not a big problem, since all the consumers must be idempotent anyway due to possible network failures and general network unreliability.
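A sketch of the outbox write path and relay (SQLite again for illustration; `publish` stands in for the real Kafka producer):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total INTEGER)")
conn.execute(
    "CREATE TABLE outbox "
    "(id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, sent INTEGER DEFAULT 0)"
)

def create_order(order_id, total):
    # The business row and the outbox event commit atomically, or not at all.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"event": "order_created", "id": order_id}),),
        )

def relay_once(publish):
    # Poll unsent events and publish them; this gives at-least-once delivery,
    # so consumers must tolerate duplicates (i.e., be idempotent).
    for row_id, payload in conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall():
        publish(payload)
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))

create_order("o-1", 250)
relay_once(print)  # prints the pending event exactly as the relay would send it
```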
An alternative solution, instead of building a custom transactional outbox, is to utilize the database transaction log and custom connectors which read directly from this log and send the changes to the exact destinations. This approach has its own advantages and disadvantages: for instance, it requires coupling to specific database solutions, but it allows you to write less code in the application.
Another problem we have to look at is unreliable clocks. Time tracking is one of the most important parts of any software or infrastructure, since you always need to track durations to enforce timeouts and expirations, or even to gather metrics to understand how your system operates. Unreliable clocks are one of the trickiest problems in distributed systems, since time accuracy heavily depends on the machine: each machine has its own clock, which could be faster or slower than the others. There are two types of clocks used by computers: time-of-day clocks and monotonic clocks. A time-of-day clock returns the date and time according to some calendar, which can lead to clock desynchronization if the NTP server is out of sync with our machine. Monotonic clocks always move forward, and that helps in many cases for calculating durations: they are really useful when we need to calculate how much time something took on our machine, using this monotonically increasing value. But this value is unique per computer, so technically it cannot be used to compare dates and times across multiple servers.
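In Python, for instance, the two clock types are exposed separately, and durations should always come from the monotonic clock:

```python
import time

start = time.monotonic()  # monotonic clock: only meaningful for durations
total = sum(range(1_000_000))
elapsed = time.monotonic() - start
print(f"took {elapsed:.4f}s")

# time.time() is the time-of-day clock: it maps to calendar time, but it
# can jump forward or backward when NTP resynchronizes, so differences
# of time.time() should not be used to enforce timeouts or measure work.
print(time.time())
```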
There are not many easy solutions for achieving very accurate clock synchronization. In most cases you don't need it, to be honest. But in situations where it's required by specific regulations, or where we are building a very accurate system, for example for tracking transactions or showing charts with real-time data, the Precision Time Protocol can be leveraged, which, to be balanced, requires a huge investment.
Now let's take a look at the CAP theorem. The CAP theorem states that any distributed data store can satisfy only two of three guarantees: consistency, availability, and partition tolerance. However, since network unreliability is not something you or anyone else can significantly impact, in the case of a network partition we have to choose between availability and consistency. In this simple diagram, two clients read from different nodes: one from the primary node and another from a follower. Replication is configured to update the followers after the leader has been changed. But what happens if for some reason the leader stops responding? It could be a crash, a network partition, or any other issue. In highly available systems, a new leader has to be assigned. But how do we choose between the existing followers? To solve this problem, we need to use some kind of distributed consensus algorithm. But before we look at it, let's review several major consistency types.
There are two main classes of consistency used to describe guarantees: weak, or eventual, consistency, and strong consistency. Eventual consistency means that the data will become synchronized on all the followers after some time, if you stop making changes on the leader. Talking about strong consistency, I think it's better to explain it this way: in a linearizable system, as soon as one client successfully completes a write, all clients reading from the database must be able to see the value just written. This is a quote from Martin Kleppmann's book Designing Data-Intensive Applications, and it perfectly describes what strong consistency is; it is a synonym for linearizability. So how can we solve the problem of selecting a leader from the existing followers?
Returning to the problem I just mentioned: when the leader has crashed, there is a need to elect a new leader. At first glance this problem probably looks easy, but in reality there are so many conditions and trade-offs that have to be taken into account when selecting the appropriate approach. Per the Raft protocol, if followers do not receive data or a heartbeat from the leader for a specified period of time, then a new leader election process begins. Distributed consensus is an algorithm for getting nodes to agree on something; in our case, to agree on a new leader. When the leader election process starts, Raft specifies a lot of instructions and techniques for how to achieve consensus, including in the case of network partitions. Technically, each replication unit, whether a single write node or multiple shards, depending on the infrastructure, is associated with a set of Raft logs and OS processes that maintain the logs and replicate changes from the leader to the followers. The Raft protocol guarantees that followers receive log records in the same order they are generated by the leader. A user transaction is committed on the leader as soon as a majority of the nodes acknowledge the receipt of the commit record and write it to the Raft log. So technically, Raft specifies how we can really implement distributed consensus and solve the problem of selecting a new leader.
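A heavily simplified sketch of the election trigger (my own illustration of the idea, not a complete or correct Raft implementation; real Raft also handles vote requests, log comparison, and term conflicts):

```python
import random
import time

class Follower:
    def __init__(self, node_id):
        self.node_id = node_id
        self.current_term = 0
        # Randomized timeout so nodes don't all start elections at once.
        self.election_timeout = random.uniform(0.15, 0.30)
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self, term):
        # A heartbeat from a legitimate leader resets the election clock.
        if term >= self.current_term:
            self.current_term = term
            self.last_heartbeat = time.monotonic()

    def tick(self):
        # No heartbeat for a full timeout: become a candidate, increment
        # the term, and request votes; a majority of votes wins leadership.
        if time.monotonic() - self.last_heartbeat > self.election_timeout:
            self.current_term += 1
            print(f"node {self.node_id}: starting election, term {self.current_term}")
```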
There are many other algorithms, like Paxos; in this case I just highlighted Raft as an example. But the distributed consensus algorithm is very, very important, and without it, it's impossible to solve the problem of highly available software. And there are many more complexities. Let's imagine we have several data centers and we need to organize some kind of replication. One is located in the USA, another is located in Asia, and we don't want to ask users to make write requests from Asia to the USA. We'd like to have several data centers which can accept writes. To achieve that, we need some kind of multi-leader replication, and we need to apply a conflict resolution strategy which will help us achieve a consistent view of the data. Ordering is also quite a problematic thing: it requires generating monotonically increasing numbers which will be assigned to every message in this specific replication mechanism.
So let's take a look at eventual consistency. This is a classic example of eventual consistency, which demonstrates the situation where a user doesn't see their newly created transactions. In this case, first of all we insert some data on the leader node; then, of course, it's replicated to the follower. Whenever we make a second request, hitting the follower during the replication lag, we may find that the transaction seems not to have been created at all and be really confused. In this case it's probably not very critical, because eventually it will be synchronized. But in more complicated scenarios, when we need to apply some logic, for instance, when we cannot finance a company if it was added to some sanctions list, we cannot rely on eventual consistency, because it may lead to problems for the business. One of the possible solutions to apply here, and it's quite an effective and simple strategy, is to read from the leader for the user who has just saved new data, thus avoiding the replication lag. It helps in most cases. For more complicated scenarios, of course, it's better to look at the concrete cases and try to define the best strategy.
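A minimal sketch of that read-your-own-writes routing (the `leader` and `follower` store objects here are hypothetical; the idea is simply to pin a user's reads to the leader for a short time after they write):

```python
import time

RECENT_WRITE_WINDOW = 5.0  # seconds; should exceed the typical replication lag
last_write_at: dict[str, float] = {}

def write(user_id, leader, key, value):
    leader.put(key, value)
    last_write_at[user_id] = time.monotonic()

def read(user_id, leader, follower, key):
    # If this user wrote recently, the follower may not have caught up,
    # so serve their read from the leader instead.
    wrote_recently = (
        time.monotonic() - last_write_at.get(user_id, float("-inf"))
        < RECENT_WRITE_WINDOW
    )
    return (leader if wrote_recently else follower).get(key)
```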
Another problem is process pauses; let's take a look at them. This is quite a dangerous problem: a garbage collection stop-the-world pause may happen, or a virtual machine suspension, or a context switch may take more time than expected, or some delay in the network may lead to unexpected pauses in the system which affect the logic. If such a thing happens while we hold an acquired lease, for instance, which we discussed a couple of slides ago, there's a chance of accessing the same resource twice, which may lead to severe consequences. To overcome this, there's a safety technique called fencing that can be leveraged. Technically, before updating the resource, we secure the lease as we did previously, but beyond just acquiring the lease, we'd like to get some kind of token, and this token will be used on the data store side to prevent updating the data or resource in case such a token was already processed or has already expired. Technically it means we need to generate a monotonically increasing token (of course, we can attach some key here), and whenever we update the resource, we check that the current token is the latest token in the sequence. If some other token from the past has already been processed, or the incoming token has expired, the data store needs to throw an exception and reject the write. It's worth looking at the failure scenario as well: one client secures the lease, then for some reason a garbage collection stop-the-world pause happens, and the client is delayed until the lease expires. Another client then gets a new lease under the same key and starts doing some operation, and maybe manages to finish it. But when client one wakes up, it will also try to update the storage, because it believes it still holds the lease, and in this case we may get real issues and failures. So the fencing technique, having a monotonically increasing token and checking that no newer token has already been processed, should be applied to this problem. It may seem like a theoretical problem, but it can happen, and it's better to bear it in mind for really critical operations, when we want to avoid anything going wrong in transaction processing and so on.
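A sketch of the storage-side fencing check (my own illustration; a real data store would perform this check atomically):

```python
class FencedStore:
    def __init__(self):
        self.data = {}
        self.highest_token_seen = {}  # highest fencing token accepted, per key

    def write(self, key, value, fencing_token):
        # Reject writes carrying a token older than one already seen: such a
        # writer's lease has expired and was handed to someone else meanwhile.
        if fencing_token <= self.highest_token_seen.get(key, -1):
            raise PermissionError(f"stale fencing token {fencing_token} for {key}")
        self.highest_token_seen[key] = fencing_token
        self.data[key] = value

store = FencedStore()
store.write("config", "v1", fencing_token=33)  # client 1, holding its lease
store.write("config", "v2", fencing_token=34)  # client 2, after lease expiry
try:
    store.write("config", "v3", fencing_token=33)  # client 1 wakes up late
except PermissionError as e:
    print(e)  # the stale write is rejected instead of clobbering v2
```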
As I said in the beginning, observability is one of the most important things in distributed systems. In this example, you can see a distributed operation that spans multiple components, and we need an easy mechanism for getting this information, to effectively determine bottlenecks and have full visibility over the executed distributed operation. The best way to do that, or probably one of the best, is distributed tracing: an approach for tracking a distributed operation across multiple microservices, nodes, and components. The key idea is to attach a unique distributed operation id to the initial request, then propagate it down the road, and finally collect and aggregate by this unique id to build an analytical visualization of the whole operation. For instance, Jaeger keeps track of all connections and provides charts and graphs to visualize a request's path through the application. Technically, it has multiple components which collect these distributed operation ids and build visualizations, so you can effectively look into an operation and understand the bottlenecks, the problems, and so on. This kind of visualization helps to analyze bottlenecks: to understand where my request is slow, and where in the system I need to apply some optimization techniques. It's very useful and helpful whenever I need to debug complicated scenarios and to be sure I'm not missing something very important.
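The core mechanic is small; here is a sketch of propagating a trace id between services over HTTP headers (the header name and helper functions are my own illustrative choices, roughly what tracing libraries automate for you):

```python
import uuid

TRACE_HEADER = "X-Trace-Id"

def handle_incoming(headers: dict) -> str:
    # Reuse the caller's trace id, or start a new trace at the edge.
    return headers.get(TRACE_HEADER) or str(uuid.uuid4())

def call_downstream(trace_id: str, payload: dict) -> dict:
    headers = {TRACE_HEADER: trace_id}  # propagate the id down the road
    # An HTTP call to the next service would go here; every service logs
    # its spans tagged with the same trace_id, and the collector
    # aggregates them into one end-to-end view of the operation.
    return {"headers": headers, "payload": payload}

trace_id = handle_incoming({})          # the edge service starts the trace
call_downstream(trace_id, {"sku": 42})  # downstream calls carry it along
```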
Another approach that may help to improve a system's observability is using orchestration techniques rather than choreography. Orchestration in software engineering involves a central controller, or orchestration engine, that manages the interaction and integration between services. Choreography refers to a decentralized approach where services interact based on events. In this model, each service knows when to execute its operation and with whom to interact, without needing a central point of control. Orchestration is usually the preferable approach, to be honest, for controlling growing complexity, due to its increased visibility and simply having one place for control. Of course, the orchestrator must be highly available, because it can become a bottleneck or a single point of failure, so high availability techniques must be applied to avoid any problems.
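As a tiny sketch of the difference (my own illustration, with hypothetical service clients): an orchestrator makes the whole flow visible in one place, instead of the flow emerging from event subscriptions scattered across services:

```python
def process_order(order, payments, inventory, shipping):
    # Central orchestrator: the sequence, and its failure handling,
    # live in one observable place.
    payment = payments.charge(order)
    try:
        reservation = inventory.reserve(order)
    except Exception:
        payments.refund(payment)  # compensate the already-completed step
        raise
    return shipping.schedule(order, reservation)
```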
Now let's talk about evolvability and cybernetics principles. In the context of software engineering, cybernetics can be defined as the study and application of feedback loops, control systems, and communication processes within software development and operational environments. It focuses on how systems, software, hardware, processes, and human interactions as well, can be designed and managed to achieve desired goals through self-regulation, adaptation, and learning. Cybernetics in software engineering emphasizes creating systems that can adjust to changes, learn from interactions, and improve over time, ensuring reliability, efficiency, and resilience. To be honest, that explains what we are doing as software engineers or architects to manage the really complicated evolution of a system. Speaking about several principles, the first one is systems thinking. This concept focuses on the system as a whole rather than on individual parts. In software engineering, it technically means that how all the parts interact with each other is really important, and we need to have a holistic view of how they work together, how the system is evolving, how many dependencies we have, doing some kind of dependency review over time and checking all of that. Design decisions are made with this understanding applied to the system itself: instead of considering just one component, we consider the system as a whole.
For example, when service B handles an event published by service A, the outcome does not directly affect service A. However, the overall result of the operation matters significantly to the system as a whole. Another principle is feedback loops. This is a core concept of cybernetics, and it involves implementing, or just having, feedback loops to understand how to control and stabilize the system. In software architecture and systems, feedback loops may be represented in various forms: having retrospectives, monitoring system performance, collecting user feedback through different questionnaires, and so on, and also having mechanisms like continuous integration and continuous delivery and deployment, to get feedback earlier and make decisions about how to make things better. This principle is one of the most important principles in software engineering from the cybernetics perspective. Another principle is adaptability and learning. Cybernetics promotes the idea that systems should be capable of adapting to changes. This also relates to maintainability and evolvability, and it means that we need to apply different techniques to evolve our system without lots of obstacles, and have specific mechanisms incorporated inside the system to get feedback as well, learning from our mistakes and from our existing metrics. Technically, this is an evolution of feedback loops.
Talking about goal-oriented design: this is one more principle in cybernetics. Cybernetic systems are often defined by their goals. In the context of software architecture, this means that the system should be designed with clear objectives in mind. Domain-driven design is a software development methodology that focuses on building a domain model with a deep, comprehensive understanding of domain processes and rules. Technically, it helps to build software which is aligned with the real business domains, where business requirements drive architecture decisions. In this case, we can see a lot of different contexts or domains: one related to accounting, another for underwriting, fraud profile analysis, onboarding. We're technically designing a system which solves these real problems from the domain.
I believe everyone, to be honest, is familiar with the next pattern, or rather anti-pattern, and has been involved in building such software. This is an experience everyone has had, and I believe it's very helpful to have it, because it helps you understand how you shouldn't build a system. This anti-pattern drives building systems without any architecture: there are no rules, boundaries, or strategy for controlling the inevitably growing complexity. After some time it becomes a really painful thing, and the system becomes unmaintainable. Any change leads to so many painful problems and, of course, reduces revenue, or just brings a lot of bugs and failures into the system.
So we can apply some techniques to build software differently, and service types can really help. Systems are organized in hierarchies of subsystems, and software systems often have a hierarchical structure, with high-level modules depending on lower-level modules for functionality. This hierarchical decomposition really helps to manage complexity by breaking the system down into more manageable parts. But to understand it better, we need to look at something quite important. It's a quite popular opinion that all microservices are the same. Technically, from the deployment perspective, they are the same: everything is either deployed as a virtual machine, or is a Docker image deployed as a Docker container. But from the perspective of the system, and of decision making, we need to have some service types, to make decisions much simpler and simplify the thinking process. We can have multiple types of services, which will very likely have absolutely different rules and constraints. For instance, we have several services which need to communicate with the external world: I need to communicate with some frontend application which technically runs in the user's browser, and I need to accept a request and move it into the center of the system. In this case, I defined about six types of services which may be very useful. The first type is the web application or mobile application, the public API. Technically, this is what is used by customers: mainly it starts and runs in the browser, or it's a desktop application or mobile application. The next layer is the backend for frontend; I think this is the place where all incoming requests come in, and then we route these requests down into the system. The third layer is product workflows. Whenever we build a system which has some products, we need to structure the business logic which is really specific and detailed to each product; we cannot just generalize this stuff, because it's product-specific, it's what is unique about this product. Then we also have some reusable components we need to utilize in our domain, and it's likely we'll have these components, because all the time we are solving domain problems that are needed in several places: for instance, KYC, KYB, anti-fraud, or maybe just an internal payment service which helps us send money and get money back. Another type is gateways and adapters. All the time we need to get some data from an external system, we need to integrate with something from the external world, and to achieve that we call such a service, and this service is responsible for taking some data from, or sending some data to, the outside. At the bottom of all that, we always have some kind of platform components: identity and access management, core libraries, configuration, monitoring, even a workflow mechanism for managing really complicated distributed workflows. Another very important thing is that in distributed systems at scale we're starting to solve more and more problems, and we start to have some macro domain areas. In these areas we can collect the services, or even concepts or problems, together, and hide them behind some domain macro API, which will be implemented or represented as a central point for communicating with this subdomain. It technically helps, at scale, to control the growth of a lot of microservices and to manage them really effectively.
Talking about SRE: SRE is site reliability engineering, a foundational framework built by Google many years ago. The core principles of SRE are the following. First of all, embrace risk. We need to really understand the risk: whether an operation is critical or not, whether it is really mission-critical for the business and we need to apply all the reliability techniques. If for some reason some functionality is not very critical, probably we don't need all of that; probably we don't need to put in a lot of resources, creating more and more tasks, adding a lot of requirements, metrics, monitoring, and so on. Technically it means we need to understand the risk and manage the risk, and make decisions from this perspective, because of course the ideal state, where everything is covered by tests and everything is monitored, is impossible to achieve. So we need to prioritize where we want to put our resources based on what we have right now.
Another thing is SLAs, SLOs, and SLIs, used to define system reliability. SLA is a service level agreement, SLO is a service level objective, and SLI is a service level indicator. In simple terms, an SLA is our agreement with a customer about what we promise from our system. An SLO is an agreement inside the team about what we promise to achieve as a team. And an SLI is the technique, or just the measure, of how to calculate our SLOs and SLAs. These principles are very helpful for defining the reliability of the system, in order to communicate with the business about how reliable we are.
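For instance, a common SLI is the fraction of successful requests, measured against the SLO target (a sketch; the 99.9% target and the request counts are illustrative):

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    # SLI: the measured ratio of good events to all events.
    return (total_requests - failed_requests) / total_requests

SLO_TARGET = 0.999  # the team's internal objective: 99.9% availability

sli = availability_sli(total_requests=1_000_000, failed_requests=1_200)
error_budget_left = sli - SLO_TARGET  # negative means the budget is burned
print(f"SLI = {sli:.4%}, budget left = {error_budget_left:+.4%}")
```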
Another principle is to automate manual work. Ideally, we should avoid as much manual work as possible. Sometimes, of course, it's not really feasible, because the automation would cost more, so we need to be wise when making such decisions. But ideally we need to automate manual, repetitive work and try to avoid any kind of complicated work that has to be done by a person, by an engineer, and so on. Another principle is to monitor everything. Monitoring everything means we need to have visibility over everything, as we discussed: we can implement some kind of distributed tracing, and we need to put metrics in place to calculate and measure the performance of our microservices, systems, or any infrastructure components. And the last principle is to simplify as much as you can, because any complex system tends to fail, simply because it has a lot of components which may fail, and you don't always have control over everything. So if you can simplify, choose the simpler approach; it will help you over time.
Now let's take a look at something quite important: infrastructure as code. This is a principle which states that we need to define our infrastructure as code: just as I, as a developer, write a new microservice, I use the same approach for our infrastructure. One of the possible solutions is to have tools which allow us to define templates of resources, definitions of these resources, and scripts which will update the infrastructure. This follows the automate-manual-work guideline. For example, we can use a solution like Terraform: engineers write the definitions, they are validated, and then they are applied to our infrastructure. And of course it will be versioned in some version control system like Git, Mercurial, and so on.
Chaos engineering is also quite important, for testing as well. One tool we can utilize to check the safety of our system is Jepsen, a special tool and framework developed by Kyle Kingsbury to analyze the safety and consistency of distributed data stores and systems under various conditions, particularly focusing on how systems behave under network partitions and other types of failures. Technically, Jepsen provides some very useful things, like fault injection: it introduces specific faults into distributed systems and looks at how the system behaves under these conditions and what happens when the system is broken. This includes network partitions, where communication between nodes is disrupted, and we need to understand how our system handles this problem. Another thing is operations testing: it tests various fault scenarios for reads and writes on leaders and followers, checking how fast everything works and how our consistency is organized. And of course concurrency: this is a very important problem we need to cover, checking how our system behaves under load, to simulate real-world usage scenarios and understand the impact of such things.
Finally, I'd like to talk about simplicity and measuring complexity. It's really complicated to measure complexity, but it can be measured in certain ways. "A complex system that works is invariably found to have evolved from a simple system that worked": this is Gall's law, and it happens all the time. We should measure complexity whenever we have the possibility, and we can apply several techniques for that. We can use cyclomatic complexity, which helps track how complicated a function is right now and how you can make it simpler and easier for other developers to understand.
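As a small illustration (my own example): every decision point adds to a function's cyclomatic complexity, and a table-driven rewrite brings it back down:

```python
# Cyclomatic complexity ~5: one base path plus four decision points.
def shipping_cost(country):
    if country == "US":
        return 5
    elif country == "DE":
        return 9
    elif country == "FR":
        return 9
    elif country == "JP":
        return 14
    else:
        return 20

# Cyclomatic complexity ~1: the branching became data.
RATES = {"US": 5, "DE": 9, "FR": 9, "JP": 14}

def shipping_cost_simple(country, default=20):
    return RATES.get(country, default)
```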
Another important thing to track is time to train: whenever a new engineer is onboarding, you need to understand how much time you have to spend training them to work with a specific solution. If it's a lot, that's probably not something you want, but it's a very important thing to track. There is also explanation time: if you have a very complicated domain and it takes a lot of time to explain the details, you probably need to apply some techniques to simplify it and make it really easy for people to get on board and start understanding the details and techniques of the domain.
In conclusion, I just want to say that I covered several techniques related to distributed systems. Distributed systems are a really nice thing to have in order to achieve highly available and scalable software, but they bring a lot of complexity we need to bear in mind, and we cannot just avoid it. We need to really deal with this stuff; we need to understand the drawbacks and how to apply best practices to solve these problems. Thank you for attending this talk. Looking forward to…