Transcript
Welcome to my talk. I'm recording this video from Spain where
I'm based. I hope you enjoy it, I hope you're having a nice day, and I would still very much like to interact with you. So if
you have any questions, comments, ideas, or just
want to say hi, feel free to reach out to me after the talk.
I'll share my contact details right at the end.
So let's go. Move fast without breaking things probably
you heard the opposite of this phrase from
here, where it started, like, I think maybe
ten or 15 years ago, where it became Facebook's.
One of mottos of Facebook. And this is
funny, actually, because I remember I first seen this phrase in
my first office, one of my first offices in Istanbul,
Turkey, where I was working in one
of the largest banks in Turkey as a software developer. And this phrase was written
on top of the wall right next to my desk.
And it was a bit confusing because things bank is run with strong hierarchy,
stability and quality is everything.
Like perception of quality is everything, both individually and both of the systems we
build. And to be honest, we deal with money
and it's a banking system, so there's not much room to.
Room for error or to break things. So it was a bit confusing for me,
But in the case of Facebook, it actually makes a lot of sense. It's a great vision, which is what great leaders should do: they set great visions. I think move fast and break things states that pace is the priority, even at the cost of breaking things. They wanted their employees to feel safe, to not fear breaking things, because they wanted to innovate. Facebook created new user habits for us, invented new user habits for us, and they wanted to go further. They wanted to invent newer, better, faster ways of interacting on the web. So that's why they went with it, up until it became this: not too long ago, maybe a few years ago, Facebook adopted move fast with stable infra. They're both valid approaches, but different times call for different measures. So why would any company prefer moving fast while still having stable infrastructure and stable performance, over pure pace? Of course, there's the obvious reason that the user base now expects a certain threshold of performance, and you want to comply with that standard of quality. But there are a few other, maybe less obvious things I want to mention here in favor of aiming for good performance.
The first of those is experimentation. I've seen this in my own workplaces: we shipped a new feature idea as an experiment, to get feedback and see how it would perform. But what we missed was that this specific feature's performance and quality were a little worse than the rest of the product, because we just wanted to be fast. The experiment results said that users didn't like the idea. Soon after, though, we found out that it wasn't that users didn't like the idea; it was that the feature was slow, it wasn't performing well, it wasn't at the quality users expect, so they didn't use it and dropped out. So during experimentation, it's really crucial that you meet your current quality standards, so that users, even if they don't notice it consciously, don't just drop out or get annoyed with your product. And the next one is morale.
When you ship things that break every once in a while, you have to go and fix them. And when you ship things that break often, you have to go fix them quite often and spend a lot of time on it. Every once in a while is fine, but if it happens frequently over a long period, then in the mid and long term it will really bring down the morale of your teams. This is also something you want to consider. And the last thing I want to mention is something I really want to emphasize: more often than not, spending 10% extra time and effort will prevent 90% of the issues you would otherwise face. In my personal experience this is almost always a free win, and the content I'm about to share supports this idea. Maybe now is a good time to explain why I'm talking about this rather than something else, and why it's me talking about it rather than someone else: a bit about my past and my experience.
I've been working in tech for the past ten years as a software engineer. I call myself a product management engineer: I love solving users' problems, and I've grown from a junior engineer to technically leading teams. I mostly worked at Thoughtworks in the past, where I started in Turkey and then worked in Germany, India, and finally Spain, where I settled; I've been here for the past five years. I've worked in various domains: banking, e-commerce, pricing. Then I moved to New Relic, into the observability domain, where I had an amazing few years, and since the beginning of this year I've been working at Shopify. So I've had the luck to build that muscle, and I'm really interested in complex systems; I'm fascinated by their characteristics, their complexity, and especially how people struggle with it or manage to deal with it.
I've led, or been part of, teams that own and build really high-scale systems that demand really high quality. For example, New Relic processes and serves telemetry data; they run perhaps the biggest Kafka cluster in the world, and it requires excellent operational quality, because that's the product other companies need when things are going tough, or on critical dates like Black Friday and Cyber Monday, which is happening now, as I'm recording this. Now I'm building at Shopify, as part of the Shop app. My team is responsible for helping our users track their packages and their orders, and we process more than 20 million status updates, order tracking updates, to be able to tell our users the latest status of their orders. And while managing this complexity, my teams have always been under pressure to move fast. I want to share my experience of how we balanced those two things.
So it all starts with planning, right? If you want to go really fast, but also keep a high level of fidelity, you need to plan accordingly. The first thing I've been really happy about when we achieved it was setting our priorities, really setting our priorities straight. A day will come, for more than one person on the team, when they need to make a decision between time, performance, scope, quality, security, things like this; we need to choose one or the other. There's a trade-off coming, and a day will come when you have to make a choice within that trade-off. I think a team should be doing the most important thing at any given time; that's my motto. And to be able to do that, consider a Wednesday morning: you show up to work and there are two things you, or someone on your team, could do. You could do something to make your application more secure, or you could do this other thing, a performance optimization. Which one would you do? Instead of relying on individuals making the right choice every time, I think we should set our priorities straight, make them known, and broadcast them clearly, so that the choice is straightforward, so that the choice is almost already made. Setting your priorities clearly at the beginning will, as I wrote right here, make your team work on the most important thing at any given time.
And how do we achieve this? There are workshops like trade-off sliders and such, but you don't have to go for a full-on formal practice; it could be just a 20-minute talk. Just make sure it reaches the stakeholders.
And once that's done, of course, we have a problem, we have an idea, and we're going to design. In big tech, mostly, we already have a complex architecture up and running, and we're going to add to it: some logic, possibly some infrastructure. How do we go about this when we still want to move really fast? Many things can impact our architectural decisions: if we believe the system is going to grow in a certain way, if we need to scale in a certain way, if we believe we may add logic in a certain way, they all factor into this decision. But there is one concept I want to talk to you about, which was actually coined by my former employer, Thoughtworks. It's called evolutionary architecture.
When you put the adjective evolutionary in front of architecture, I think it implies a few things. One of them is that it makes architecture a living entity. We accept that architecture is a living entity: it changes, it evolves. And then we have a choice. We can let it drift away and evolve naturally, or we can change it consciously instead of letting it drift over time. How does this get applied? Thoughtworks defined it this way: you create a fitness function. By the way, you don't have to apply this formal practice or adopt it at all, but I think it's really good to understand the concept. A fitness function is a term borrowed from evolutionary biology, from survival of the fittest: it describes the likelihood of a species to survive. Here, the fitness function represents our priorities. We can apply this function to our architecture and see whether it matches our priorities or not. For example, in the example I put here, high throughput is more important than low latency, or data security is more important than usability. Once you define this, and make sure your team is clear about it, you can design your architecture accordingly.
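To make this concrete, here is a minimal sketch of what a fitness function can look like when it's automated as a test, for the example priority above where high throughput matters more than low latency. The thresholds, the measure_* placeholder helpers and the pytest-style layout are my own illustrative assumptions, not something prescribed by Thoughtworks or by this talk.

```python
import os

# A minimal sketch of an architectural fitness function, expressed as tests.
# The two measure_* helpers are placeholders: in a real setup they would query
# your load-test results or your telemetry backend.

THROUGHPUT_FLOOR_RPS = 5_000   # priority: high throughput comes first
LATENCY_CEILING_MS = 800       # latency matters, but we tolerate more of it

def measure_throughput_rps() -> float:
    # Placeholder: read the latest load-test throughput from wherever you store it.
    return float(os.environ.get("LOAD_TEST_THROUGHPUT_RPS", "0"))

def measure_p95_latency_ms() -> float:
    # Placeholder: read the latest p95 latency from your telemetry.
    return float(os.environ.get("LOAD_TEST_P95_LATENCY_MS", "0"))

def test_throughput_is_protected():
    # The priority we refuse to regress on.
    assert measure_throughput_rps() >= THROUGHPUT_FLOOR_RPS

def test_latency_stays_reasonable():
    # A looser bound: low latency is secondary to throughput in this trade-off.
    assert measure_p95_latency_ms() <= LATENCY_CEILING_MS
```

The point is that the function encodes the trade-off: if a change improves latency but drops throughput below the floor, it fails the check.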
The next thing I want to mention is what I think is a good heuristic for designing a base architecture, because, as we said, we now accept that our architecture is a living entity, especially when we're moving fast, and it's going to grow and evolve. So I think a really good heuristic is to start by keeping complexity low. And I want to draw your attention to the right side of the screen, where there's a chessboard. Why is that? I put it there because if you imagine looking at a chessboard, say a photograph taken in the middle of a game where you're not one of the players, then, especially if you're not too experienced with chess, it's really hard to understand the strategies of each player, the reason and purpose of each piece on the board, and what's going to happen next. It's really hard to understand the behavior, what comes next, unless you're a really experienced chess player. And that means there's hidden complexity right there: we don't know what to expect from the system, we don't know what's next. This is exactly what we want to avoid in our systems. We want our architecture to express its behavior, not hide it, to express its behavior as explicitly as possible, so that we can actually grow within it. That's why I feel you should start from a base level of low complexity.
So if you have to make a trade-off when designing your architecture, between complexity and performance, or the amount of resources, I think it's a really good heuristic to start the base architecture with lower complexity, so that it's faster and easier to grow and let the architecture evolve. And I can connect this back to the quote I gave you earlier: more often than not, spending 10% extra time, perhaps to build the architecture with lower complexity, or to use more resources, or to accept more latency, will avoid 90% of the issues you would otherwise face later, and will let you grow and evolve your architecture faster.
Next: you have this big design, and now you need to split it into parts, into chunks that people can work on, which will complete the puzzle and become your product. So how do we go about this? I think a great rule of thumb is to first do the things that have high value and high complexity; you should prioritize those parts. That basically translates into the parts that connect the pipes and build a walking skeleton, the parts that are complex to build, the risky parts that hold unknowns. So again, a rule of thumb is to face the tough things first, the parts that may uncover risky things: maybe your team lacks a skill, maybe it's complex logic you need to build, maybe it's complex infrastructure. Face those things first. And also, to go back to the first point, instead of building the parts of the body separately, imagine building the parts of a car separately and connecting them in the last week, at the last minute. I think you should really consider having a walking skeleton from week one, so that you get feedback, you experience running it, you know how it feels, and you may uncover things you wouldn't otherwise. I cannot emphasize the benefits of this enough.
The next part gets into more practical things: integrations. In a big, complex system, integrations are usually the most critical and delicate parts, because they're the parts where most incidents and issues occur, and they need to be treated with real care and secured. We can divide how to secure integrations in two: downstream integrations versus upstream integrations, and there are different patterns you can apply to each. For downstream integrations, you can apply timeouts, retries, backoff policies, and circuit breaking. I'm not going into the details of each of them; I'll mention a few books where you can learn about them in depth and get experience with them. But the main idea is simple, and it's a system design principle: if a part of your system breaks, the rest should keep working as well as possible, and to do that, you need to be able to isolate the failure.
So if a downstream service is failing, instead of waiting 30 seconds for that service, you should time out and use that resource to do other work instead of just waiting. Or take circuit breaking: if a downstream service is having a really hard time under high load, instead of hammering it and getting errors all the time, you should open the circuit and give that downstream service time to breathe, and then you can try it again later.
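As a rough illustration of those downstream patterns, here is a minimal sketch combining a timeout, retries with exponential backoff, and a small hand-rolled circuit breaker. The URL, the thresholds and the use of the requests library are illustrative assumptions for the example, not details from the talk.

```python
import time
import requests

class CircuitBreaker:
    """Tiny circuit breaker: opens after repeated failures, then cools down."""

    def __init__(self, max_failures: int = 5, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        # Closed circuit: allow the call. Open circuit: only allow again
        # once the cool-down period has passed.
        if self.opened_at is None:
            return True
        return time.monotonic() - self.opened_at >= self.reset_after_s

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

breaker = CircuitBreaker()

def fetch_order_status(order_id: str):
    if not breaker.allow():
        return None  # fail fast instead of hammering a struggling service
    for attempt in range(3):
        try:
            resp = requests.get(
                f"https://downstream.example.com/orders/{order_id}",
                timeout=2.0,  # bounded wait instead of 30 seconds
            )
            resp.raise_for_status()
            breaker.record(ok=True)
            return resp.json()
        except requests.RequestException:
            time.sleep(0.2 * (2 ** attempt))  # exponential backoff between retries
    breaker.record(ok=False)
    return None
```

The numbers here (three attempts, a two-second timeout, five failures before opening) are only placeholders; the point is that you tune values like these while you build, not a week before release.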
And for the upstream direction, it's basically a similar idea, a similar principle: if a part of the system breaks, you shouldn't let that failure leak into the rest of the system.
There are patterns called bulkheads, load shedding and rate limiting. Bulkheads, for example, compartmentalize your resources so that a failure doesn't leak into other resources; if some issue, like a spike in load, is eating up resources, you compartmentalize it so that it doesn't consume everything and impact other parts of your system. Load shedding and rate limiting, similarly, make sure your system still performs at its maximum capacity even under high load; they keep your system from being hammered, basically.
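As a rough illustration of the upstream side, here is a minimal token-bucket rate limiter that sheds excess requests. The rate, the burst size and the handle_request shape are illustrative assumptions.

```python
import time

class TokenBucket:
    """Admit requests up to a sustainable rate; shed the rest."""

    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens according to how much time has passed, up to the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed this request instead of overloading the system

limiter = TokenBucket(rate_per_s=100, burst=20)

def process(request) -> str:
    return "ok"  # placeholder for the real work

def handle_request(request):
    if not limiter.allow():
        return {"status": 429, "body": "Too Many Requests"}  # load shedding
    return {"status": 200, "body": process(request)}
```

In a real system you would typically rely on your gateway's or framework's rate limiting rather than hand-rolling it, but the principle is the same: excess load gets rejected early instead of degrading everything.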
One thing I feel is really important to mention here: implement these patterns at the beginning, while you build your integrations. Don't add them later. I've seen the habit of adding these things while productionizing the new implementation, one week before shipping, before releasing it. I think that's an anti-pattern. You should add them as soon as you build the integration, so that you can actually test them, tune them, and learn how they react, how the system reacts, because these things are really hard to predict as complexity increases in a bigger system. So face the error scenarios, face the tough situations, again, while building, as early as possible.
The next thing is testing. I'm going to skip things like the test pyramid and how you write tests; we'll approach this at a slightly higher level, in terms of how you test your product. It's great if you have a staging or testing environment that you can ship to while building, so that before you release to your users you can get feedback and see how your system behaves. There are two things we need to make sure we cover here: load and diversity. Load, as in we need to make sure we test our system with the load it's going to see once we release it, because we need to face those scenarios before our users do.
And the other one is diversity. If you're dogfooding, if you're testing your features yourself, don't test with just one user; try to create scenarios as diverse as real life. One really good practice you can apply here is shadow releasing. Shadow releasing means you release a feature, and I think big tech does this all the time, but only to a portion of your users, without making a big announcement or marketing it. You just release it to a portion of your users so that you cover the most scenarios you can, get the most feedback, and find edge and error cases. This is a really good way of testing that gives you the load and, most importantly, the diversity you need to make sure your product is working well.
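One simple way to implement that kind of partial release is a deterministic percentage rollout; here is a minimal sketch. The 5% figure, the feature name and the hashing scheme are illustrative assumptions, not anything specific to the Shop app.

```python
import hashlib

ROLLOUT_PERCENT = 5  # expose the shadow release to 5% of users (illustrative)

def in_shadow_release(user_id: str, feature: str) -> bool:
    # Hash user id + feature name so the same user always gets the same
    # decision, and different features roll out to different slices of users.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < ROLLOUT_PERCENT

# Usage: if in_shadow_release(current_user_id, "new-tracking-page") is True,
# serve the new code path; otherwise serve the existing one, and compare how
# the two behave before widening the rollout.
```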
So the next one is building for resilience. This is a really good heuristic, and I tried to put as much content as I could on this slide, so maybe you can even take a screenshot. Building for resilience, what does it mean? I think it's a really good practice to map out possible problematic and error scenarios. What could those scenarios be? For example, there could be a sudden increase in the ingress load, or your database may become a bottleneck. You should map these out and imagine how they would play out. Or one of your downstream API calls starts being throttled, what happens next? Or your cloud provider is having issues, or your caching cluster is unavailable. It's really crucial and super helpful to decide how to react to these before they actually happen, rather than at 2:00 a.m. on a Saturday night, with one person on call who just woke up making the decision alone. You should make these decisions as a team, beforehand, and then perhaps build the tooling around them so that you can overcome these problems as well as possible, and document it. Have a runbook.
Other things I can mention about building for resilience, other good patterns, are auto-scaling and warm-up. Most infrastructure providers support some kind of auto-scaling; if you know how to use it consciously, it can be really helpful. And warming up: for example, if you know that every Monday morning, or let's say every Saturday night, you get a huge load, you can warm up your infrastructure ahead of time so that you're resilient when that high load arrives.
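As a sketch of what scheduled warm-up could look like, here is a small example that scales a worker up ahead of a known peak. The scale_service call, the service name and the schedule are entirely hypothetical placeholders for whatever scaling API and scheduler you actually use.

```python
import datetime

BASELINE_REPLICAS = 4
PEAK_REPLICAS = 16
PEAK_HOURS_UTC = range(18, 23)  # e.g. a Saturday evening rush (illustrative)

def scale_service(name: str, replicas: int) -> None:
    # Placeholder for a real call to your cloud provider or orchestrator.
    print(f"scaling {name} to {replicas} replicas")

def warm_up_if_needed(now: datetime.datetime) -> None:
    # Pre-scale before the known peak instead of waiting for auto-scaling
    # to catch up once the load has already arrived.
    is_saturday = now.weekday() == 5
    peak = is_saturday and now.hour in PEAK_HOURS_UTC
    scale_service("order-status-worker", PEAK_REPLICAS if peak else BASELINE_REPLICAS)

warm_up_if_needed(datetime.datetime.now(datetime.timezone.utc))
```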
So the next two things, I think, are really important, and we should opt in to these concepts whenever we can. The first of them is immutability, a concept that has been getting very popular over the last ten years as data storage, data analytics and data processing work has become more and more common.
One important thing that immutability does is that it lets us retry parts of our flow. If a part of your flow is immutable and an error happens, or there's high load, or some other issue, you can just retry it later, and that gives you a lot of power to overcome issues when they happen, and most of the time it's not a matter of if, it's a matter of when.
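Here is a tiny sketch of that retry property: because the incoming update is an immutable value, a failed processing attempt can simply be run again with the exact same input. The StatusUpdate shape and the notification step are illustrative, not the real schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StatusUpdate:
    order_id: str
    status: str
    occurred_at: str

def notify_user(order_id: str, status: str) -> None:
    print(f"order {order_id} is now {status}")  # placeholder side effect

def process(update: StatusUpdate) -> None:
    # Any failure here can be retried later with the same frozen input;
    # nothing can have been mutated half-way through.
    notify_user(update.order_id, update.status)

update = StatusUpdate("A1", "out_for_delivery", "2024-11-29T10:00:00Z")
for attempt in range(3):
    try:
        process(update)
        break
    except Exception:
        continue  # safe to retry: the input cannot have changed underneath us
```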
The next one is compartmentalizing. Compartmentalizing lets us deprioritize less important, non-time-sensitive tasks, for example. The other thing is that you can scale them separately. For example, if your system reads a message and then does some job with it, and you compartmentalize reading the message from the different jobs you need to do, for example notifying your users, then at a high-load time when you need to send 50 million notifications, you can scale just that part separately without touching your whole app. It's really powerful.
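Here is a minimal sketch of that idea, using in-process queues purely for illustration; in practice this would usually be separate message queues, topics or consumer groups. Ingestion and notification work live in separate compartments, so the notification side can be scaled on its own.

```python
import queue

status_updates = queue.Queue()     # ingress: raw order status updates
notification_jobs = queue.Queue()  # compartment: user notifications only

def send_notification(job: dict) -> None:
    print(f"notify user {job['user_id']}: {job['status']}")  # placeholder

def ingest(update: dict) -> None:
    # Fast path: record the update, then hand the slower, less time-sensitive
    # notification work off to its own compartment.
    status_updates.put(update)
    notification_jobs.put({"user_id": update["user_id"], "status": update["status"]})

def notification_worker() -> None:
    # Run as many of these as the notification load requires, independently
    # of the ingestion path.
    while True:
        send_notification(notification_jobs.get())
```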
And the last one is runtime configuration management. This is more or less well adopted and quite a straightforward approach, but it really helps to have configuration that your system can read without needing to deploy new code. For example, you can have kill switches, you can change your infrastructure, even the number of replicas, or you can change certain thresholds at runtime depending on the environment the system is in: high load, low load, whether you have an issue or not. I think this is a really good tool during tough situations as well.
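Here is a minimal sketch of that kind of runtime configuration, assuming a simple JSON file as the config source purely for illustration; in a real system this is usually a config service or a feature-flag store. The keys and defaults are made up for the example.

```python
import json
import os

CONFIG_PATH = os.environ.get("RUNTIME_CONFIG_PATH", "runtime_config.json")
DEFAULTS = {"notifications_enabled": True, "max_updates_per_batch": 500}

def runtime_config() -> dict:
    # Re-read the config on every decision, so operators can flip a kill
    # switch or change a threshold without deploying new code.
    try:
        with open(CONFIG_PATH) as f:
            return {**DEFAULTS, **json.load(f)}
    except FileNotFoundError:
        return dict(DEFAULTS)

def handle_batch(updates: list) -> None:
    cfg = runtime_config()
    if not cfg["notifications_enabled"]:  # kill switch flipped at runtime
        return
    for update in updates[: cfg["max_updates_per_batch"]]:
        print("processing", update)       # placeholder for the real work
```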
So the next one is observability. I'm going to move myself just here. Perfect. The first thing about observability: add it while you build. This is really crucial. Again, don't add it while productionizing your product in the last week; add it while you build, because otherwise it's really easy to miss things. I think it's really straightforward to add it while you build, instead of, with respect, being a bit lazy and adding it later. Because later, in the last week, if you're trying to add metrics to a flow that was implemented a few months ago, it's far easier to miss things, and it's really painful to realize during an incident that a metric you wish was there is missing; you can't just go back and add it after the fact. So my first recommendation is: add it while you build.
And the other good recommendation: start with a question. How do we know if our product is working well? What data would show us, prove to us, that our product is working well? This could be, for example, that the success rate of an API call is above some level, that the p95 response time for any given user request is below some level, or that the number of requests per second is around some value. It could be things like that. You start with a question, but also know that you may not know all the questions you'll need, so emit data generously, really generously; there are efficient libraries for this, and I think that's a good heuristic. And start alerting from day one: start getting those alerts so that you can tune them, test them, and know how to react to them.
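To make that concrete, here is a minimal sketch of emitting a success-rate counter and a latency histogram, using the Prometheus Python client as one possible library. The metric names, the placeholder lookup and the port are illustrative assumptions.

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("order_status_requests_total", "Order status requests", ["outcome"])
LATENCY = Histogram("order_status_request_seconds", "Order status request latency")

def get_order_status(order_id: str) -> str:
    start = time.monotonic()
    try:
        status = "out_for_delivery"              # placeholder for the real lookup
        REQUESTS.labels(outcome="success").inc()
        return status
    except Exception:
        REQUESTS.labels(outcome="error").inc()
        raise
    finally:
        LATENCY.observe(time.monotonic() - start)

start_http_server(8000)  # expose /metrics so dashboards and alerts can be built on it
```

From data like this you can answer the questions above (success rate above a level, p95 below a level) and wire alerts to them from day one.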
So this is about modes of a system. I'm going to talk about modes of systems, where systems behave differently in different environments, and the modes we put our system into are sometimes really hard to predict because of the complexity these systems contain. So it's really good to put our system under harsh conditions before those conditions happen on their own, when we're not expecting them. Performance testing is great for this: put your system under high load, high stress, so that you see how it behaves. And if performance is something critical for you, start testing early and put it in place as a gateway, so that after every change to your logic or your infrastructure you can make sure you're not regressing.
And another superpower that I cannot recommend enough is running game days: put your system under harsh conditions, as if your cloud provider is unreachable or your cache cluster is gone. Simulate these things so that you face them before they actually happen in the real world.
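Here is a minimal sketch of a performance check that could run as such a gateway: fire concurrent requests at a staging endpoint and assert that the error rate and the worst latency stay within agreed bounds. The URL, the volumes and the thresholds are illustrative assumptions, and a real setup would more likely use a dedicated load-testing tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "https://staging.example.com/health"  # illustrative endpoint
REQUESTS_TOTAL = 200
CONCURRENCY = 20

def one_call(_):
    start = time.monotonic()
    try:
        ok = requests.get(URL, timeout=5).ok
    except requests.RequestException:
        ok = False
    return ok, time.monotonic() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(one_call, range(REQUESTS_TOTAL)))

error_rate = 1 - sum(ok for ok, _ in results) / len(results)
worst_latency = max(latency for _, latency in results)
assert error_rate < 0.01, f"error rate too high: {error_rate:.2%}"
assert worst_latency < 1.0, f"slowest request took {worst_latency:.2f}s"
```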
So let's recap what I've said. Set your priorities clearly, broadcast them, make them known. Your architecture will evolve, so adopt evolutionary architecture, and a good heuristic is to initially optimize for less complexity. Have a walking skeleton from week one and face the tough tasks first. We need to secure integrations; they're important parts of our system. We should map out our incident scenarios and create a runbook with a detailed explanation of what to do during tough times, during problems. Build optimizing for resilience; immutability and compartmentalizing are our friends. We should have observability from day one, with alerting included. And performance testing is probably a really good idea, even if performance is not a highly critical priority of yours.
Thank you so much. I tried to go a bit fast, and I may have skipped over some bits, so feel free to reach out to me by email or via LinkedIn. I hope you enjoyed it, and I hope you got something out of this that you can apply in your day-to-day job. I'm really happy for the chance, and I hope to hear from you.