Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello. Today I'm going to talk about
how we implemented complex and long running processes
using the domain model and finite state machines to organize
the logic within them. You will learn what state machines
are and why they are useful for backend development,
how to use domains to organize complex business logic,
how to implement parallel processes based on
state machines, and how to deal with common problems.
My name is Ilya Kaznacheev and I am a technical lead at the container managed Kubernetes service in MTS Cloud. I'm also a consulting cloud architect, so if you want to move to the cloud or to design a suitable architecture and set up processes, feel free to contact me. I'm a founder of several local communities, a podcast host, and a conference and meetup organizer. So what is
a finite state machine? If you haven't worked with
it, you probably met it at university or
in computer science literature. A state machine is a
behavior model. It consists of a finite number of states and transition rules between them. I often see state machines mentioned in the context of front-end or embedded development, but I rarely see examples of them used to implement backend application logic.
Today I will tell you how we did it.
So what is managed Kubernetes service?
In simple words, the user wants to get a Kubernetes cluster
in the cloud. Then magic happens and
then the client gets access to the cluster. In our
case, we had a service at the MVP stage whose architecture did not meet our production requirements. A massive refactoring process began, during which we did what I am going to talk about today. As a cloud provider, we can't just create a cluster via Terraform. We need much more control over the process to ensure speed and resilience.
We control the process at a lower level, which means a lot
of granular operations. So what
we had in the beginning: complete cluster creation was done in just three steps, separated by messages in Kafka. The code structure didn't allow us to implement good practices, and the application was not fault tolerant. A crash or restart interrupted the long process of cluster creation without any possibility to
recover. First, we decided to divide the
cluster creation process into a set of steps
combined into a pipeline. Each step
is performed in a separate session and the sequence
of steps is controlled over Kafka. If an error occurs at any step, the process can be repeated. The same applies to fault tolerance: if the service crashes, another instance reads the message from Kafka and handles it. It looked very good in theory,
but in practice there were many difficulties.
Some steps consisted of a set of parallel operations which also had to be managed. For many steps, the logic consisted of a complex chain of operations that depended on various cluster components.
The more complex the logic became, the worse
it fit into the existing model. So after a
while we decided to go further and dive deeper into domain-driven design. This is a simplified domain model of the cluster. It is actually more complicated, but I removed some of the elements for the presentation. So the Kubernetes cluster consists of the cluster domain itself, a set of load balancers, a set of nodes, and a set of node groups. In terms
of domain-driven design, domains that change together are combined into a domain aggregate. The overall context of changes is described by the aggregate boundary. A cluster is an aggregate root. This means that there will always be one cluster for one aggregate, and it can be referred to by the cluster ID.
A node can be a master or a worker. Workers are combined into node groups, while masters exist independently. This is an example of a domain aggregate instance, a real cluster: a cluster with three masters and five worker nodes, which are combined into two groups with different configurations.
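As a rough sketch, such an aggregate can be modeled with plain Go structs. The names are my assumptions for illustration, not the actual service code:

```go
package cluster

// Cluster is the aggregate root: every other entity is reachable from it
// and is referenced through the cluster ID.
type Cluster struct {
	ID            string
	LoadBalancers []LoadBalancer
	Masters       []Node      // masters exist independently
	NodeGroups    []NodeGroup // workers are combined into groups
}

// NodeGroup combines worker nodes that share the same configuration.
type NodeGroup struct {
	ID    string
	Nodes []Node
}

// Node is a single master or worker machine.
type Node struct {
	ID   string
	Role string // "master" or "worker"
}

// LoadBalancer is another entity inside the same aggregate boundary.
type LoadBalancer struct {
	ID string
}
```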
This model covered the need for complex logic,
as the logic of each entity is encapsulated
in a separate domain. This solved the problem of parallel
operations, because now each domain was responsible
for its own processes and it was easier
to implement parallel steps. But some of the steps actually consisted of their own sets of steps. Let's take a close look at
the worker node creation process. Each node
is created in four steps: creating a virtual machine, launching the operating system, configuring the environment and applications, and finally running. Moreover, an error can occur at any step, and different error handling may be required depending on where it happens. In the case of an error while starting the operating system, you must either perform a restart or delete the VM for a full rollback. In case of a configuration error, you may need to repeat the setup. In the case of an error when starting the applications, yet another kind of processing may be required. And these errors can happen in parallel in different states for different nodes, which makes it practically impossible to manage this process from one
with the pipeline. So at this point
we realized that it was impossible to cover the logic of all entities in the domain aggregate with a single state machine. Each domain needs its own state machine describing states and transitions between them, just for that one domain.
Let's check this out with a state machine
example for a node. So a node has an initial state, in which it is created in the database. The state machine accepts a success event, which will be handled depending on the current state of the domain. This can actually be different events for different situations, but for simplicity I have combined all success situations into one event. The transition to the next state sends a request to create a virtual machine. When the virtual machine is created, the transition to the next state happens along with the request to start the operating system. When that's okay, the next step, configuration, starts with a transition to the relevant state. When the configuration is done, the node goes into the running state with no additional actions performed. Similarly,
we can describe the process of the node removal.
Right now it looks like just two
pipelines joined for creation and deletion.
Or, if you like, it looks like a saga.
However, unlike pipelines and sagas, the state machine
allows you to solve one very important task,
error handling. Remember that an error can
happen at any step. For simplicity, I also have
combined the different types of errors into a single error
event. I don't include retriable errors, since they can be handled automatically. Here are the errors that cannot be fixed by repetition.
Note that the error handling actions are
different depending on the current state of the domain.
This eliminates the case where the application handles an error that does not match the current state, for example because it was received late from a message queue.
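Later in the talk the implementation is described as long switch cases in Go, so a node state machine along these lines could look roughly like this. The state and event names, and the exact error branches, are my assumptions for illustration:

```go
package node

import "log"

// State is the node's position on the state diagram, stored in the database.
type State string

const (
	StateNew          State = "new"           // created in the database
	StateVMPending    State = "vm_pending"    // virtual machine requested
	StateOSStarting   State = "os_starting"   // operating system booting
	StateSetupPending State = "setup_pending" // environment being configured
	StateRunning      State = "running"
)

// Event is an incoming message; success and error variants are combined here
// the same way as in the diagrams above.
type Event string

const (
	EventSuccess Event = "success"
	EventError   Event = "error"
)

// Actions abstracts the asynchronous commands sent to other systems.
type Actions interface {
	RequestVM()
	StartOS()
	Configure()
	RestartOS()
	DeleteVM()
}

// Handle applies an event to the current state and returns the next state.
func Handle(current State, event Event, do Actions) State {
	switch current {
	case StateNew:
		if event == EventSuccess {
			do.RequestVM()
			return StateVMPending
		}
	case StateVMPending:
		switch event {
		case EventSuccess:
			do.StartOS()
			return StateOSStarting
		case EventError:
			do.DeleteVM() // full rollback
			return StateNew
		}
	case StateOSStarting:
		switch event {
		case EventSuccess:
			do.Configure()
			return StateSetupPending
		case EventError:
			do.RestartOS() // or delete the VM for a full rollback
			return StateOSStarting
		}
	case StateSetupPending:
		switch event {
		case EventSuccess:
			return StateRunning // no additional actions
		case EventError:
			do.Configure() // repeat the setup
			return StateSetupPending
		}
	}
	// An event that does not match the current state is simply ignored.
	log.Printf("ignoring event %q in state %q", event, current)
	return current
}
```

The node removal process can be described in the same way, as another set of states and transitions handled by the same kind of switch.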
Back to the domain model. For each domain, we add its own
state machine, which describes the logic for that
particular domain. In this way, we can independently implement
the logic of parts of the system as complex
as we want. In the case of a specific cluster, a state field is added to each element, which defines its position on the state diagram of the corresponding
domain. Thus, the state of each item
of the cluster is stored in the database between
event handlings. Any entity in the domain aggregate
has a clearly defined state at any point
in time. Let's now see how it all works.
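Before the walkthrough, here is a rough outline of what handling a single event can look like in code. The repository port and its methods are hypothetical; only the idea of loading the state, running the state machine, and persisting the new state comes from the talk:

```go
package cluster

import "context"

// NodeRecord is the persisted view of a node: its position on the state
// diagram is kept in the State field between event handlings.
type NodeRecord struct {
	ID        string
	ClusterID string
	State     string
}

// Repository is a hypothetical persistence port of the hexagonal service.
type Repository interface {
	LoadNode(ctx context.Context, clusterID, nodeID string) (NodeRecord, error)
	SaveNodeState(ctx context.Context, clusterID, nodeID, state string) error
}

// ApplyNodeEvent loads the node, runs its state machine and stores the
// result, so the entity has a clearly defined state at any point in time.
func ApplyNodeEvent(ctx context.Context, repo Repository, clusterID, nodeID, event string) error {
	rec, err := repo.LoadNode(ctx, clusterID, nodeID)
	if err != nil {
		return err
	}
	next := nextNodeState(rec.State, event)
	return repo.SaveNodeState(ctx, clusterID, nodeID, next)
}

// nextNodeState stands in for the per-state switch shown earlier.
func nextNodeState(current, event string) string {
	return current
}
```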
Suppose we have a cluster, again a simplified one, in the creation process, where one node group is ready and the other is still being created. Worker node one receives a success message and its state changes to running. The node then escalates an event about its state change to its parent domain, the node group.
The node group then performs state transition validation.
The validation condition is that all nodes in
the group are in the running state, but one node is still in the setup pending state, so the transition will not take place. Next, worker node two receives a success message and its state changes to running. The node then escalates an event about
its state change to the node group. The node group performs
the state transition validation. Now all child nodes
are in state running, so the node group will change
the state to running too, and then it escalates
an event about its state change to the cluster domain,
its parent. The cluster checks if all node
groups are in running state. As soon as the condition
is met, it also changes state to running.
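A sketch of how the transition validation at the node group level could look; the names are assumptions again, and the cluster domain runs the same kind of check over its node groups:

```go
package cluster

// nodeGroup is a simplified view of the parent domain: it only sees the
// states of its own child nodes.
type nodeGroup struct {
	state      string
	nodeStates []string
}

// onChildStateChanged is triggered when a node escalates its state change.
// The group validates the transition: it becomes running only when every
// child node is running; otherwise nothing happens yet.
func (g *nodeGroup) onChildStateChanged(escalateToCluster func(newState string)) {
	for _, s := range g.nodeStates {
		if s != "running" {
			return // e.g. one node is still in setup_pending
		}
	}
	g.state = "running"
	escalateToCluster(g.state) // propagate the change one level up
}
```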
So the logic of each domain remains within that domain itself. The logic that affects multiple domains is propagated within the domain aggregate through events. Let's take a look at what happens
in case of an error. For example, we have
the same cluster in the same state, but then one
of the nodes receives an error message which is handled
as an event. The node starts the error processing matching its current state, as we saw in the state diagram before. At the same time, the node cannot decide what to do with other nodes. A node domain can only control its own state, so the node escalates the error event to its parent, the node group domain. The node group then has enough information about the nodes in the group to make
an appropriate decision. For example, it might try
to create a new node to replace a bad one or delete
the other nodes in the group for full rollback.
However, this may not be enough, in which case
the node group escalates the error even higher to cluster
level. In this case, the decision on error handling will
be made at the entire cluster level, where it
is possible to analyze the situation at all levels.
The logic for each domain is still encapsulated
within the domain. After the decision is made, the parent domain tells the child domains what to do by triggering the relevant events.
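Continuing the same sketch for the error path; the concrete decision rule here is purely illustrative of "replace a bad node or escalate", not the real policy:

```go
// onChildError is triggered when a node escalates an error event. The node
// group has enough information about its own nodes to make a decision; when
// that is not enough, it escalates the error further, to the cluster level.
func (g *nodeGroup) onChildError(failedNodes int, escalateToCluster func(reason string)) {
	if failedNodes == 1 {
		// Enough context at this level: try to replace the bad node.
		g.state = "repairing"
		return
	}
	// The decision has to be made at the level of the whole cluster.
	escalateToCluster("node group cannot recover on its own")
}
```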
So during the refactoring process we
faced technical and architectural problems.
Next, I will mention some interesting problems and their
solutions. As soon as parts of processes
become parallel, there is a danger of race condition.
In concurrent event processing, different entities
in the same domain aggregate may conflict
for resources or for process control.
To avoid this, we use a database-level root lock. Any change to the domain aggregate data happens in a single ACID transaction, so any conflicts are eliminated. No matter which domain handles the event, the lock is always set on the root element, in our case the
cluster domain. However, this leads to
lower performance in parallel processes. This is not a problem
in our case, but could be a problem in yours.
In this case, you can set the lock not on the root domain, but on the domain being processed or its parent. Here the bottom-up rule applies: you can put the lock on a parent domain but not on a child. Otherwise you might get a deadlock.
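With Postgres, such a root lock can be taken as a row lock on the cluster record inside the event-handling transaction. A minimal sketch assuming a clusters table and database/sql; the real schema and retry logic are omitted:

```go
package cluster

import (
	"context"
	"database/sql"
)

// withClusterLock runs fn inside a single transaction that first locks the
// aggregate root row, so concurrent event handlers for the same cluster
// cannot conflict, no matter which child domain they touch.
func withClusterLock(ctx context.Context, db *sql.DB, clusterID string, fn func(tx *sql.Tx) error) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if the transaction was already committed

	// Lock the root element of the aggregate; other transactions on the
	// same cluster wait here until this one commits or rolls back.
	if _, err := tx.ExecContext(ctx,
		`SELECT id FROM clusters WHERE id = $1 FOR UPDATE`, clusterID); err != nil {
		return err
	}
	if err := fn(tx); err != nil {
		return err
	}
	return tx.Commit()
}
```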
Another problem is inconsistency in
asynchronous messaging. We use CQRS, and all commands are executed asynchronously. When processing an event, the state transition logic can send messages to Kafka. Then the domain state changes, which is persisted in Postgres, and the entire event processing takes place in a single ACID transaction.
So what happens in case of error?
Domain changes will not be saved in the database because of the transaction rollback, but the messages would already be sent to Kafka, which would lead to inconsistency in the data model. The service database expects the domain to be in the state it was in before the command was sent, that is, the state before the transaction began, but in fact the command has already been sent, which corresponds to a different domain state. To avoid this, we adopted
the outbox pattern. The messages are first stored
in the database within the same transaction
as the rest of the changes. A separate job then reads the data from the outbox table and sends it to Kafka. In case of an error, no messages from this transaction will be saved in the database, so the job will not read them from the outbox table and send them to Kafka.
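A minimal sketch of the outbox write under the same schema assumptions: the state change and the outgoing message are written in one transaction, and only committed rows are ever picked up by the job that sends them to Kafka:

```go
package cluster

import (
	"context"
	"database/sql"
)

// saveStateWithOutbox persists the new domain state and the outgoing message
// in the same transaction. A rollback discards both together, so the database
// and the message stream cannot diverge.
func saveStateWithOutbox(ctx context.Context, tx *sql.Tx, nodeID, newState string, payload []byte) error {
	if _, err := tx.ExecContext(ctx,
		`UPDATE nodes SET state = $1 WHERE id = $2`, newState, nodeID); err != nil {
		return err
	}
	_, err := tx.ExecContext(ctx,
		`INSERT INTO outbox (topic, payload) VALUES ($1, $2)`,
		"cluster-commands", payload)
	return err
}
```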
Next, some operations, such as virtual
machine creation, are not always practical to do
one at a time. There may be dozens of virtual machines
in the same cluster, and creating them one at a time reduces the availability of infrastructure for other services and makes the process longer.
This can be fixed by introducing batch operations. The rest of the logic remains the same, but the moment the process reaches a batch operation, instead of an event trigger for each domain in the group, there is a special event trigger for batch processing. The same changes are made, but messages to other systems are not sent directly; they are sent in a different way, as a batch. In this implementation,
the domain logic encapsulation begins to leak a little bit. The business logic remains within the domain, but the logic for sending messages is moved out of the single domain to the level of the whole list of domains. But this is a tradeoff for batch operations.
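A sketch of the difference, with assumed names: the per-node transition still happens inside the domain, but the outgoing commands are collected and sent as one batch instead of one message per node:

```go
package cluster

// batchNode is a minimal view of a worker node for this sketch.
type batchNode struct {
	ID    string
	State string
}

// requestVMsAsBatch applies the same transition to every node in the group,
// but instead of sending one "create VM" command per node, it hands a single
// batch command to the sender.
func requestVMsAsBatch(nodes []*batchNode, sendBatch func(ids []string)) {
	ids := make([]string, 0, len(nodes))
	for _, n := range nodes {
		n.State = "vm_pending" // the per-node logic stays inside the domain
		ids = append(ids, n.ID)
	}
	sendBatch(ids) // one message to the infrastructure instead of dozens
}
```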
The last problem that state machines solve especially well is complex error handling. I call it state-matched error processing.
Assume an error event comes into the system. This can be some general event or a more specific one. In any case, we need to handle the error. Normally, we would identify the type of error and handle it accordingly. However, it could happen that the error messages get mixed up, or a message arrives late because of a slow queue.
There may be a bug in the application sending the error that
causes the wrong type of message to be sent.
This would normally result in error handling that does
not correspond to the actual state.
But now, in the case of state-matched error processing, when we use the state machine, error handling is always matched to the state of the domain that is handling the error. At the same time, different errors can be handled differently within the same state, but the processing will always be relevant to the current state of the domain and not to any other. If the current state does not provide for any error handling, then the event will simply be ignored. So the error processing is always matched with the current domain state. And if it was some kind of false error, or an error that doesn't mean anything for the current state, it will just be ignored. You can also log it or add it to tracing, et cetera, but it will not disturb the data with false error processing.
So today, the domain model with state-machine-based logic works very well within our hexagonal services. The approach performed very well and fit into our CQRS communication model with synchronous queries and asynchronous commands. It's quite simple to implement in Golang: we just use long switch cases to describe each state machine. It's easy to read and easy to troubleshoot, but there is still room for improvement. In the future, we want to further refine our approach to implement distributed transactions based on domains with state machines. This already
on domains with the state machines. This already
works in some parts of our infrastructure, but the
formal approach has not yet been described.
We also want to generate state diagrams based
on code. Right now we use Mermaid.js to describe state diagrams in the service documentation,
but we want to automate the process.
So that's all for today. I look forward to talking
with you. Send me your questions and ideas. Also,
don't forget that if you want to run your services in the cloud,
I can help with that. Feel free to contact me with any questions related
to cloud architecture and to organizing development processes in a cloud environment.
Thank you for your attention and goodbye.