Transcript
This transcript was autogenerated. To make changes, submit a PR.
I'm excited today to talk about policies and contracts in distributed systems. My name is Prathamesh and I work as a developer evangelist at Last9. This is my Twitter handle. You can find me posting interesting things about distributed systems, time series databases and so on. So if you want to follow, go ahead.

As a software developer or engineer, we want to write code. We want to fix all the bugs. We ideally want to write code which is bug free, right? We don't want to be the author who writes bugs every time. We want to use all the latest and greatest tools; that is the north star. If there is anything new coming out in the market, I will definitely want to try it, be it Copilot or even ChatGPT for that matter. I want to integrate the best in class tools into my development workflow so that I get the best that is out there. That is my aim. But is it always possible?

As a DevOps engineer, I want to make sure that my infrastructure scales. I want to make sure that the application utilizes resources efficiently, that there are no extra resources just hanging around and costing me money. I want to make sure that my cost is always optimized and under my control, that it is not exploding massively and unnecessarily. All of these are things I will always strive for. But as software engineers, or even DevOps engineers, whether it is possible or not is the question.

As a team lead, engineering manager, or DevOps lead, I have similar objectives. I want to make sure that the deadlines are met, and that the features work as expected within performance criteria that give customers an experience they are satisfied with. I want to make sure that my code and the infrastructure running the application are performant enough, and that the tech debt is not going out of control, so that I don't have to cater to it when I really want to ship some features. At the same time, I want to make sure that my team is motivated, because otherwise there is no use for everything else if my team is not willing to work happily with the rest of the team members. All of these are my objectives, but they are also not always possible.

As a product leader, I want to make sure that the team velocity is not slowing down, and that the external customer commitments and service level agreements are getting met. The product getting adopted is also one of the constraints that I would like to put on my team. At the same time, I want the feedback from customers and their expectations to be considered by my product and engineering teams and to get incorporated, so that my customers are happy. This is all I want, but all of these are constraints, and sometimes they run into each other. For example, if you want to ship a feature but the engineering team is struggling with tech debt from the last sprint, their deliverables in this sprint will get affected. At the same time, if there is not enough marketing effort from the product marketing side or the sales side, then even if we have the best product out there, we won't have any customers willing to use it. With all of these human contracts involved, we always have these north star objectives, but they are not always achievable, because they always run into each other as constraints.
Rasmussen, a long time ago, developed a model of the theory of constraints that describes very nicely how accidents happen. Basically, there are three axes to every software system, or any other system as well. On one side you have the boundary of economic failure, beyond which, if you go, there are chances that your company will shut down, right? If we keep increasing the resources in our organization, then our cloud costs can go out of control and there is a chance that we'll have to shut down the company. A very trivial example, but you get an idea of how the boundary of economic failure works.

At the same time, there is a boundary of unacceptable workload. I have a few team members in my team, but if I continue to ask them to work every day, 24x7, then at some point they will say, okay, I want to just leave now, I don't want to work here anymore, because the workload is completely unacceptable. So there is a threshold up to which the workload can be acceptable, and beyond which it is not sustainable.

At the same time, there is also an experience or performance expectation boundary. My application should load within a second. If it is a payment transaction, then it should always succeed. Or if it does not succeed, then at least my money should not get stuck in between; it should get credited back to my account. So there is a performance or safety regulations boundary as well.

And the point of equilibrium should be in the middle of this circle, right? If you try to push it from one side, let's say I keep increasing the workload from the bottom side, then the gradient, or my point of equilibrium, will keep moving towards an edge, and there is always a boundary of failure beyond which, if I try to push harder, accidents can happen. And this is applicable to all three axes. So if I try to push any one axis too far, there is a chance that an accident can happen if I keep pushing it. As software leaders, SREs, DevOps engineers and people who manage these software systems, we want to make sure that this boundary of acceptable behavior is not broken, that we always stay within the constraint and the gradient is not pushed out across the red boundary of accidents. Rasmussen basically developed this for control theory, but it is equally applicable to today's modern software systems. If we don't respect this boundary and we keep trying to push harder, then boom, we run into accidents. That's where we see missed sprints.
We see effort that is not aligned with the rest of the organization. We see tech debt increasing and effectively causing more failures. In the long term, failures will happen anyway, but they will essentially happen because each and every stakeholder in an organization has a limit. They have a boundary beyond which things cannot be pushed. It can be an economic boundary, or a workload related boundary, or a performance related boundary, as we saw in Rasmussen's model. If one of the stakeholders pushes the boundaries too much, then accidents can happen. Business pushing for feature rollouts instead of worrying about tech debt is one example of this scenario.
On the other hand, engineering can also keep chasing perfection; they can keep looking for the best solution, the best product out there, instead of shipping what they have, which eventually causes problems with the velocity of the software as well as the delivery consistency. So these failures will happen if we don't mind these constraints and boundaries.

A boundary is nothing but something which limits or puts an extent on a criterion. It fixes a threshold for the objective that we are trying to achieve. There can be team constraints: I only have five people, one of my team members is on leave for a personal reason, and that person has all the access to my AWS cloud account, right? And I'm a startup, so I don't have a lot of people to manage all of these things. Quality of work can also be another boundary; we cannot have a very unoptimized code base which keeps failing all the time. And time to delivery, or time to market, is also one more constraint that everybody is always concerned about. There are other constraints and boundaries as well, such as perfection, cost, pricing of the software, time to market, and so on. These are all examples of boundaries in modern software systems.

Now, the way we deal with boundaries is via negotiations. People say, we'll be able to release this, but there can be a few bugs, are you okay with this? We will be able to ship 80% of the functionality, but certain bugs can be present. It is like the way I am doing this talk: Mark had told us that your video should be in by the 28th, and I'm making sure that I'm submitting the video recording of the talk before the 20th of April, because otherwise I'll break my promise to him. We can do certain deployments, but our AWS bill will increase for two or three weeks; when we get time for optimization we'll be able to do the optimization, but until then we'll have to bear the brunt of the increased bill. Things like, we'll be able to roll out to certain customers, but teams will have to work overnight and over the weekend, and they have already worked the previous weekends as well, so do you really want to do it? And so on. These kinds of negotiations we are used to doing in our day to day job.

These negotiations effectively lead to contracts, which are written or spoken enforceable agreements. When we negotiate something with our own colleagues, or within our own organizations, or even with outside customers, we arrive at a contract that all of us depend on and that all of us follow. The delivery will not happen today, but it will happen on Monday at 11:00 a.m. Once we clarify this contract, it becomes an agreement between the two parties, and then we follow that agreement as we go forward.

If we think about how this relates to a programming concept, I would like to correlate it with an API interface. When we talk about APIs and their documentation, it is nothing but a written agreement about what the endpoint promises to return. Let's take the example of the users endpoint. The users endpoint returns HTTP status 201 in the success scenario, it returns a bad request when the request does not have the correct input, and then it can have different statuses for an unauthorized request, or even when the resource is not found. This is the contract, or the documentation, of our users endpoint, which can be publicly documented and given to the rest of the team members to follow, so that they can work according to this agreement when they develop the rest of the code that consumes this particular endpoint.
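To make that concrete, here is a rough sketch in Python of what such a documented contract could look like next to the code. The handler, field names and messages are purely illustrative; only the status codes come from the example above.

```python
# Illustrative contract for a hypothetical POST /users endpoint.
# The mapping itself is the written agreement consumers can code against.
USERS_ENDPOINT_CONTRACT = {
    201: "Created: the user was created successfully",
    400: "Bad request: the input payload is missing or malformed",
    401: "Unauthorized: the caller is not authenticated",
    404: "Not found: the referenced resource does not exist",
}

def create_user(payload: dict, authenticated: bool) -> tuple[int, dict]:
    """Toy handler that honours the documented contract above."""
    if not authenticated:
        return 401, {"error": USERS_ENDPOINT_CONTRACT[401]}
    if not payload or "email" not in payload:
        return 400, {"error": USERS_ENDPOINT_CONTRACT[400]}
    # ... persist the user somewhere here ...
    return 201, {"status": USERS_ENDPOINT_CONTRACT[201]}
```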
Now, when we talk about these programmable interfaces, what is the equivalent of that in our day to day life? It is the runtime interfaces which we deal with every day when we deal with other people. It is a written agreement about how the endpoint will behave at runtime. It can be about the uptime of the endpoint: this particular API is available 90% of the time. If you, as the consumer of this particular service, expect that the POST users endpoint should always succeed, no, that is not possible, because the agreement is that it is only available 90% of the time. 10% of requests are allowed to fail every weekend, because we don't work on weekends, right? This can be an enforceable agreement which both parties have to follow. During peak hours the latency can vary within certain limits; this is also an example of a promise that the API author is making to their consumers. So all of these promises will effectively define how the consumer can expect this particular service or API to behave. And there are no chances of confusion, and no chances of accidents, because both parties are aware of what those constraints are while deciding about the consumption of this particular API endpoint.
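As a sketch, such a runtime agreement can itself be written down as data that both the producer and the consumer of the API read and check against. The 90% availability figure is the one from the example above; the latency bound and the request counts are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class RuntimeContract:
    """A written, checkable agreement about how an endpoint behaves at runtime."""
    endpoint: str
    availability_target: float  # fraction of requests that must succeed
    p99_latency_ms: float       # latency bound the producer promises

    def is_met(self, successful: int, total: int, observed_p99_ms: float) -> bool:
        availability = successful / total if total else 1.0
        return (availability >= self.availability_target
                and observed_p99_ms <= self.p99_latency_ms)

# The consumer must design for 10% of requests failing, per the agreement.
post_users_contract = RuntimeContract("POST /users", availability_target=0.90,
                                      p99_latency_ms=500)
print(post_users_contract.is_met(successful=940, total=1000, observed_p99_ms=420))  # True
```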
When we talk about these runtime interfaces in the real world, they are effectively the objectives that we were talking about so far. And there is a beautiful concept for this in the site reliability engineering, or observability, world, which is service level objectives. Using service level objectives, we basically define the criteria for a particular indicator, a health indicator of a service or an API or a function, over a period of time. An example of this can be that the availability of my service will be greater than 99.99% over a period of one day. An example of a service level objective in the case of this talk is that I promised Mark that I will submit this talk. He selected my talk, I promised him that, okay, I will upload this talk today, and then he promised me that once I upload it, he will publish it on 4 May. That is a layman's example of how service level objectives can be defined.
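In code, an objective like "availability greater than 99.99% over one day" is simply an indicator computed over a window and compared against a target. A minimal sketch, with made-up request counts standing in for what an observability tool would report:

```python
# Hypothetical request counts for a one-day window.
total_requests = 1_200_000
failed_requests = 90

availability = (total_requests - failed_requests) / total_requests
slo_target = 0.9999  # 99.99% over one day

print(f"availability over 1 day: {availability:.5%}")
print("SLO met" if availability >= slo_target else "SLO violated")
```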
Similarly, we can define objectives for key indicators such as latency and uptime over a period of time, and these give enough visibility to all team members, all functions, all stakeholders about how a particular system, service, or important infrastructure component is going to behave, so that they can build redundancy, and build parameters to consume this information or this service in a way that is consistent across the organization.

There are other examples of service level objectives as well. What is my error rate on this particular payment checkout flow? Can we promise 99.9% availability to this enterprise customer? And if we can't, then what are the areas that we want to improve upon? Is it the tech debt that is stopping us, or is it some hardware that we need to invest in to get to this particular availability? Because we have to remember that not every nine is free. As we go beyond three nines to four nines to five nines, we'll have to invest more in terms of time, money, resources, and people. Objectifying it and making it a contract helps us identify where we have to spend more, or whether we even have to spend more, to reach that particular level.
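The "not every nine is free" point is easy to see with a little arithmetic: every extra nine cuts the allowed downtime by a factor of ten, which is what makes each step progressively more expensive in time, money, and people. A quick sketch:

```python
# Allowed downtime per 30-day month for different availability targets.
MINUTES_PER_MONTH = 30 * 24 * 60

for target in (0.99, 0.999, 0.9999, 0.99999):
    allowed_downtime_min = (1 - target) * MINUTES_PER_MONTH
    print(f"{target:.3%} availability -> {allowed_downtime_min:.1f} minutes of downtime per month")
```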
It also helps answer questions such as: should we prioritize tech debt over new features? Because if we know that the availability of a service itself is 80%, then shipping new features may not even be productive for our team members, because these new features will also run into the same challenges. Instead, we can first prioritize improving the reliability from 80% to a level that is acceptable across the organization, and then work on new features. Making those decisions then becomes extremely easy, because everybody is aligned on the same objective and the same goal.

These runtime promises are nothing but service level objectives. As we saw, these runtime promises can be codified as documents, can be run as service level objectives if you're using an observability tool, or they can be recorded as decisions in your decision log, where everybody can see them over time. And they are essentially runtime; they are not static, because if you start with a particular objective, you can always increase or decrease it and adapt to the next nine based on the performance that you are seeing right now. So instead of forcing these promises top down, where the engineering leaders say, okay, we want to start with three nines or four nines, the teams can start with what they have right now and use adaptive service level objectives to improve their reliability goals over time, based on their current benchmark, the current baseline of their service level objectives.
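One way to picture an adaptive service level objective is as a ladder: start from the measured baseline and only tighten the target once the current one has been comfortably met. This is a simplified sketch of that idea, not how any particular tool implements it; the ladder steps are arbitrary.

```python
def next_slo_target(current_target: float, recent_availability: list[float]) -> float:
    """Climb one rung of the reliability ladder, only when the current objective is met."""
    ladder = [0.90, 0.95, 0.99, 0.995, 0.999, 0.9999]
    if not all(a >= current_target for a in recent_availability):
        return current_target  # keep the current objective and fix reliability first
    higher_rungs = [rung for rung in ladder if rung > current_target]
    return higher_rungs[0] if higher_rungs else current_target

# Start from the measured 90% baseline instead of a top-down "four nines" mandate.
print(next_slo_target(0.90, [0.97, 0.96, 0.98]))  # 0.95, one rung up
```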
These promises can then effectively be codified into policies, where my P zero service or P zero API will have 99.99% availability, versus my P three, which will have 90% availability, and this can be enforced across the organization.
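Codified as a policy, that tiering can be as simple as a lookup table the whole organization shares. The P zero and P three numbers are the ones from the example above; the middle tiers are filled in only to complete the illustration.

```python
# Organization-wide availability policy per service tier (illustrative).
AVAILABILITY_POLICY = {
    "P0": 0.9999,  # critical, e.g. the payment checkout flow
    "P1": 0.999,
    "P2": 0.99,
    "P3": 0.90,    # lowest priority, failures are tolerable
}

def required_availability(tier: str) -> float:
    return AVAILABILITY_POLICY[tier]

print(required_availability("P0"))  # 0.9999
print(required_availability("P3"))  # 0.9
```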
These help in setting the right expectations about what's possible. It also helps multiple stakeholders understand these contracts at the same time, and effectively this becomes a framework of communication between customers, internal stakeholders, other team members, and so on. It also helps in making decisions such as build versus buy. For example, if I don't have enough team members or resources to improve the reliability of my infrastructure component, it will help me take a decision: okay, I need this particular level of reliability, I don't have enough resources as of now, so I'll go for a build decision or a buy decision in such cases. It can also help us with tiered services, like I discussed earlier, where I can categorize my services into critical, normal, and, in certain cases, can be ignored, and so on. This can help us document that not everything is a priority. I can take decisions based on whether a customer is a paid customer or just a pilot customer, whether a customer is an enterprise customer, or whether a bug is happening only in an alpha release versus a release that is generally available, and so on.
Because the most important thing to understand here is that in today's world, time is the biggest constraint that all of us have. If we can focus our energies on specific things based on the objectives that we have decided upon as an organizational policy, it helps us make those decisions faster and prioritize the right things instead of just trying to fix everything.
It helps us climb the ladder of reliability. You cannot improve what you can't measure; we already know that. So the way to go about this is to always baseline first and then go one rung of the ladder at a time, in an adaptive way as we discussed earlier, instead of going big bang from 90% to five nines, which will lead us to failure.
So, recapping Rasmussen's model of how accidents happen: there are essentially three boundaries, a boundary of economic failure, a boundary of workload, and a boundary of expected performance or safety regulations. If the point of equilibrium, or the gradient, is within this circle, within these three boundaries, then the system is performing at its optimal level. But that is not at all the reality. At some point you will have one axis pushing against the other two axes, and then there is a chance of the gradient moving beyond the boundary of acceptable failure, where accidents will start happening. So we have to push back from the other sides to keep the gradient inside and make sure that accidents don't happen. The boundaries still exist even if you use service level objectives or policies, but there is a tension that keeps them in balance. That tension comes via these service level objectives and policies, where every organizational function is aware of them and works in tandem with the others according to those objectives, instead of working against each other. That results in fun and profit. That's all I have. Thank you.