Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, Conf42 Site Reliability Engineering
2024. My name is Dave Argent.
I work for Salesforce and I'm going to talk about how to
avoid becoming an agile victim.
A brief agenda: an introduction, a little
bit about what Agile is, how to fail, thinking before you
code, balancing tactics and strategy, why code isn't your only deliverable,
and the big take-home lessons. Since we've got about 30 minutes,
let's get moving. By way of introduction,
who am I? I'm a veteran of Microsoft
and Amazon, currently working for Salesforce. I have more than two decades
of experience in online services delivery, primarily as
an SRE and a TPM. I am also
a battle-scarred veteran of Agile gone wrong, and I'm eager
to help others like you avoid being victimized in the same
way I was. I'm currently working as
a senior-level SRE at Salesforce,
and by way of a brief war story I'm going to talk for a few
moments about the control plane that didn't. So, hypothetically,
I worked on a service that, depending on who you talk
to, is the largest NoSQL instance in the world.
And it was. It had a control plane, which was good.
However, the control plane was only capable of replacing
one host at a time, and when we were talking about 10,000-node
databases, that was a little bit less fine.
So you can see the pitfalls of not
designing things according to how they're actually
going to need to be used over the longer term.
So a very brief review of agile. So what is all the fuss
about anyway? So it's easier to start with what
Agile really isn't. Agile isn't a framework,
it isn't a methodology, it isn't a process,
it isn't a set of rules, and it is certainly not prescriptive.
So what does that mean? So what can agile possibly be if it's
none of those things?
What Agile really is, is a set of principles which can help guide
your software development, and those principles set priorities.
It prioritizes working software over comprehensive
documentation: the built software stands in
as your spec, rather than trying to document every little nuance.
That means you should probably comment pretty well.
It prioritizes individuals and interactions over processes and
tools: it calls for self-defined
teams, and it emphasizes frequent face-to-face communication
as opposed to heavy-duty processes and
heavy-duty tools. It looks to customer
collaboration over contract negotiation: customer needs
can change, and we need to flexibly adjust to
that reality. We learn things by iterating,
and that goes into our last point, where we respond to change instead
of rigidly following a plan. So this is an iterative
approach. You build small chunks of deliverable functionality at a time,
you learn from each iteration, and you change course as necessary to
get what you and the customer actually need, as opposed
to what you thought you needed early on.
But I just said that we want to respond to change over following
a plan. Well, for certain things, like architecture, you need
a certain amount of planning, because you can't simply build
each iteration with no overarching plan. And the
same thing really holds true for software development.
So that's really our segue into lesson
one, how to fail, which is: just because you have priorities in
your principles doesn't mean that you can ignore everything
else. So,
Failure 101: no definition of success.
How do you know if your product is successful? If you can't measure it,
how can you determine that you've been successful in delivering what the
customer needs?
Understand what is going to make people use your product.
For online services, it's probably going to be some combination
of features, reliability, performance, security,
privacy, and a litany of other things.
But you should also understand why people would not want to
use your product. To some extent, it's going to
be a lack of those same items. The very same things
which can make people want to use your
product, if you do them badly, are going to be the most potent
ways to convince them not to use it. And you
should also understand what would cause you to stop developing this product.
Some examples might be: it's too costly to deliver, it becomes
too difficult to maintain, the user base is too small,
or there's no longer any need for it.
All of these things together help
you define what success looks like.
However, once you understand what success looks like,
the next likely issue in the failure process is you didn't plan
for success. Functionality is really only
one part of your successful product.
There are very few incremental features that exist in a vacuum.
There's complex interplay between features and they need to play nicely
together. There are overarching requirements that are going to be
necessary to meet your definition of success, and there's going to be a
lot of invisible items that are needed to deliver functionality and
delight your users. And while this list isn't exhaustive, you can
see it's already pretty long beyond the easy-to-remember stuff
like code testing (unit, functional, UI,
etcetera). You've got data integrity and
security. You have availability and reliability. You have a downtime
profile that you need to obey for maintenance or deployment of new
code, which is often going to be zero downtime. You're going to have
certain minimum levels of performance and scalability that are going
to be required in order to deliver this functionality. To deliver it reliably,
you're going to need to have monitoring in place so you can understand when it
breaks, because it will. You need to have incident
management and documentation so that you understand what to do when it breaks.
And you need to have something like disaster recovery or georedundancy
because, effectively, Murphy is the
enemy of all online services. In
fact, I did a different conference presentation, entitled "Achieving
Service: how to ensure Murphy doesn't always win," where I
go into a lot more of those details.
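As one small illustration of the monitoring piece mentioned above, a health signal ultimately boils down to something like comparing a measured error rate against a threshold. This is a toy sketch, not a real monitoring API; the threshold and function name are invented for illustration:

```python
# Toy health check: compare a measured error rate against a
# threshold and decide whether the service looks healthy.
# The 1% threshold is invented; real systems derive it from SLOs.

def health_status(requests_total, requests_failed, max_error_rate=0.01):
    if requests_total == 0:
        return "unknown"          # no traffic at all is itself a signal
    error_rate = requests_failed / requests_total
    return "healthy" if error_rate <= max_error_rate else "unhealthy"

print(health_status(10_000, 37))   # 0.37% errors -> healthy
print(health_status(10_000, 250))  # 2.5% errors  -> unhealthy
```

The point is only that "understand when it breaks" requires a number you measure and a threshold you chose on purpose.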
Speaking of Murphy, in case I didn't already
make it clear, things will go wrong.
You'll note it's in all caps, and if
I believed in the flash tag, it would have that too.
You can't avoid the idea that things will go wrong.
People are imperfect. That's element number one.
So expecting everything to go perfectly when the people running
it aren't themselves perfect is an unrealistic expectation.
The next thing is up, again in all caps, because I think it's that important:
there is no such thing as a safe change.
I've been beaten down by this before, in
circumstances where a change was supposed to be perfectly
and utterly safe because it was directed against non-production.
But a default configuration value filtered through to production from
that config file, and boom, production went
down when there was no known change
going out to production. So really,
since there's no such thing as a safe change, change is going to cause things
to go wrong. You need to be able to diagnose your failures quickly. You need
to be able to automate responses to bad deployments. You need to be
able to do things like reduce the blast radius.
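To make the blast-radius idea concrete, here's a hypothetical sketch of a staged rollout with an automated health gate. The `staged_rollout` function, the `healthy` callback, and the stage fractions are all illustrative, not a real deployment API:

```python
# Hypothetical staged rollout: deploy to progressively larger slices
# of the fleet, check health after each stage, and stop automatically
# if a stage looks bad, capping the blast radius at that slice.

def staged_rollout(hosts, stages=(0.01, 0.10, 0.50, 1.0), healthy=lambda hs: True):
    """Deploy to 1%, 10%, 50%, then 100% of hosts, aborting on bad health."""
    deployed = []
    for fraction in stages:
        target = int(len(hosts) * fraction)
        deployed.extend(hosts[len(deployed):target])  # pretend-deploy this slice
        if not healthy(deployed):        # automated check, no human page needed
            return ("rolled_back", len(deployed))     # blast radius capped here
    return ("succeeded", len(deployed))

print(staged_rollout(list(range(1000))))  # ('succeeded', 1000)
```

The interesting property is that a bad change never reaches the whole fleet before something automated notices.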
All of those things that are associated with good deployment
practice, you need to actually plan
for, because you're going to need them. Everything in your
service must accommodate failure. A lot of the
things that are going to fail are things that you have absolutely no control over.
Your partners are going to fail you, your network providers are going to fail
you, your hardware is going to fail at some point in time, and in your data center
infrastructure you're going to lose power and you're going to lose cooling.
These things are going to go wrong. And that doesn't even include
bad actors who are simply trying to rip your service down because they're
spiteful. So, ultimately,
it's not enough to plan for failure; you also
need to know how to recover from a complete outage, because coming up
cold is, for many services, not nearly
the same thing as recovering from a partial outage.
And understanding how you respond to these
huge failures is the sort of thing that you need to
do before, not during, the outage. I'll use
Prime Day 2018 as an example. Not
that I was the person carrying the pager for a certain database
service.
And it turns out that coming back from
an outage of that nature,
dealing with all the caches and dealing with everything else that needed to
come back up nicely while still trying to serve the existing
traffic is not the same as your typical outage.
So just take it from experience. You don't
need to learn it yourself; learn it from me.
And that brings us into our next section, which is really, you should think before
you code. The cheapest place to make changes is
when you're designing and you haven't yet written a line of code. It's like this
for architecture. It's like this for any number of things that aren't
software engineering.
It's kind of the measure twice, cut once philosophy. Understand what you're
doing before doing it.
So, in order to make your designs good, here's step
one: anticipate failures. Again, you're going to hear me say the word
failure over and over in this
presentation. That's because failures are pretty much the enemy
of online services, and they are unavoidable.
Failure is the only truly reliable thing in online services: they will
fail. It's death and taxes, only
it's probably actually worse than that.
So since you're going to have failures, you need to understand and define
what is acceptable. Don't design for greater reliability
than needed. Each nine is somewhere in the
ballpark of an order of magnitude more expensive than the one that came
before it. So if what you need are four
nines, don't design to five. It's only
going to cost you an awful lot of money that you probably don't have.
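The "each nine costs an order of magnitude more" point has a mirror image in the error budget: each extra nine also cuts your allowed downtime by a factor of ten. A quick back-of-the-envelope calculation:

```python
# Allowed downtime per year for a given number of nines of availability.
# Each extra nine shrinks the budget by 10x, which is a big part of
# why each nine costs roughly an order of magnitude more to deliver.

MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes_per_year(nines):
    availability = 1 - 10 ** -nines     # e.g. 3 nines -> 0.999
    return MINUTES_PER_YEAR * (1 - availability)

for n in range(2, 6):
    print(f"{n} nines: {downtime_minutes_per_year(n):8.1f} min/year")
```

Four nines is roughly 53 minutes of downtime a year; five nines is roughly 5. Designing to five when you only need four buys you 47 minutes a year at enormous cost.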
And similarly, don't design for greater performance than you need.
If, for example, your service needs
to respond in 200 milliseconds,
you probably don't need to design
the entire service so that it successfully replies
in 25 milliseconds. Greater performance than you need
is going to be more expensive:
it adds a lot of cost, and generally it's of limited benefit to
your customers if you've defined what is acceptable in a reasonable fashion.
So since we've said that things will fail,
design for fast recovery from failures. And I'm going to say monitor and
measure, because if you can't measure it, it's very difficult to tell when it went
wrong; similarly, without measurement you don't have a
baseline for even understanding what wrong looks like.
You want to automate those responses where possible, because automation
is almost always faster than getting a human to answer a page.
Plus, if you get paged often
enough, and heaven knows I have in various positions,
my sleep schedule suffers and I get cranky. And cranky
SREs are usually not the best thing for running services; they're just miserable to
be around. So how about we avoid that and don't
wake them up unless you actually have to. You need to include
being hard down 100% in your recovery scenarios. Caches will not always
be warm. You will have to survive a warm-up of
your service, if that's the nature of your service. Similarly,
this applies to databases and their caching algorithms and everything else.
Know how you have to do a cold start. In addition,
not every failure is a black and white failure.
Sometimes they're gray. So embrace
that gray. Build degraded modes of operation for when you or your
dependencies fail. Even in the absence of
full functionality, you should be able to, in many cases, support some user
scenarios. Don't be afraid to do that.
It is often better to serve some traffic than no
traffic. And again, that's a degraded mode of operation.
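A degraded mode can be as simple as catching a dependency failure and serving generic content instead of an error. This sketch is purely illustrative; the function names and the recommendations scenario are invented:

```python
# Hypothetical degraded mode: if a dependency (here, a recommendations
# backend) fails, fall back to cached or static content rather than
# failing the whole page.

def fetch_recommendations(user_id):
    raise TimeoutError("backend unavailable")   # simulate a failed dependency

def homepage(user_id, fallback=("top-sellers",)):
    try:
        recs = fetch_recommendations(user_id)
        mode = "full"
    except Exception:
        recs = list(fallback)   # degraded: generic content beats an error page
        mode = "degraded"
    return {"mode": mode, "recommendations": recs}

print(homepage(42))  # serves *something* even though the dependency is down
```

The user still gets a working page, which is the whole point of serving some traffic rather than none.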
And lastly, you want multi-layered security.
Anything is a potential single point of failure.
There are bad actors out there; make it
harder on them. At least don't give them just one
hurdle to clear. Make them jump through all manner of
hoops before they can get to the goodies, before
they can take your services down, or, even worse, steal your data.
You also need to anticipate success.
It is less well understood
that the most dangerous thing for an online service is
being embarrassingly successful. So you're
probably going to want to architect to be very scalable,
in case you actually need it. Now, this doesn't apply to
every single online service, but if there's the possibility that
your popularity could blow up and you could have to support,
for example, a couple orders of magnitude more traffic than you were initially expecting,
leave the hooks in place so that you know how to do it.
And doing it is frequently more efficient
if you avoid monolithic structures. Monoliths are
fairly well known for not being able to scale important subsystems independently,
so you end up having to over-scale to compensate.
Microservices and similar architectural ideas
allow you to scale components independently,
which lets them scale more efficiently.
So, in general,
monoliths are less and less considered a
good thing moving forward. If you're going to adopt a monolithic structure,
understand that these are its weaknesses, and know
that you're going to have to be able to work around them.
You really want to avoid processes that scale linearly with people.
For example, if customer onboarding requires manual steps,
this can become a bottleneck if you're wildly successful,
since while your service may certainly be able to scale, your staff may not.
And at all times you want to know how to protect your service from
excess traffic. As I said before, serving up some of your traffic
is usually better than serving up none of it.
You can't control client behavior.
Despite what may be very good intentions, clients will occasionally
throw either bad traffic or excessive traffic at you.
And even if we were only talking about well-meaning clients, there are bad actors
out there who are going to do the exact same thing for less generous
motivations.
Lastly, identifying your high-value traffic, and
servicing that when resources are strained, is a really good degraded mode to anticipate.
Again, this isn't an exhaustive list, mostly just things
to think about overall.
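One way to think about that high-value-traffic degraded mode is priority-based load shedding. Here's a minimal sketch, where the request priorities, the capacity number, and the `shed_load` helper are all invented for illustration:

```python
# Toy priority-based load shedding: when the service is over capacity,
# drop low-priority requests first so high-value traffic (e.g. checkout)
# keeps working.

def shed_load(requests, capacity):
    """Keep at most `capacity` requests, highest priority first."""
    ranked = sorted(requests, key=lambda r: r["priority"], reverse=True)
    return ranked[:capacity]

incoming = [
    {"op": "checkout", "priority": 3},
    {"op": "browse",   "priority": 1},
    {"op": "search",   "priority": 2},
    {"op": "checkout", "priority": 3},
]
served = shed_load(incoming, capacity=2)
print([r["op"] for r in served])  # ['checkout', 'checkout']
```

Under strain, browsing degrades but revenue-bearing traffic survives, which is usually the trade you want.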
And you're going to need to anticipate change. My crystal
ball doesn't work, and if you have a better one that accurately tells the future,
good, good for you. Hide it or market it or do
something, because I don't have one. But this means
that since I can't accurately tell the future, I try to
leave as many possible futures open as is reasonable.
That doesn't mean I'm going to redesign
Emacs, which is an operating system and an editor. But the
principle is reasonably good to design for flexibility
and future possibilities. You don't always know what the customer will
want or need next year. The customer doesn't always know what they're going to
want or need next year. So design
with APIs rather than direct calls, which enables you to change
underlying business logic without rewriting every component
that calls it or relies on it.
In short: use abstractions where possible.
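As a small sketch of what "use abstractions" can look like in code, callers here depend on a minimal interface rather than a concrete store. The class and function names are invented for illustration:

```python
# Callers depend on a small interface, so the backing implementation
# can be swapped without rewriting them.

from abc import ABC, abstractmethod

class UserStore(ABC):
    @abstractmethod
    def get_email(self, user_id): ...

class InMemoryStore(UserStore):           # today's implementation
    def __init__(self, data):
        self._data = data
    def get_email(self, user_id):
        return self._data[user_id]

def notify(store: UserStore, user_id):    # caller sees only the interface
    return f"mailto:{store.get_email(user_id)}"

store = InMemoryStore({7: "dave@example.com"})
print(notify(store, 7))  # mailto:dave@example.com
```

Swapping in a SQL- or NoSQL-backed `UserStore` later requires no change to `notify()` or any other caller, which is exactly the flexibility being argued for.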
In a lot of cases, you're going to want to understand how to
do a zero downtime software upgrade. It's true
that not all services will truly need this, but if you design from
the perspective that it must be possible, it does safeguard
your future. And most large online services are
less and less tolerant of actual
downtime. In the spirit of anticipating change,
learn from experience and let it inform you.
I at least try to think that I'm going to be smarter tomorrow than
I am today, so I need to be able
to listen to tomorrow-me, who actually has experiences that
I don't have today, and leverage them. Change includes changing
your direction or your plan in response to new information.
So don't be afraid to change tack when what
you're doing isn't right.
Couple loosely. Loosely coupled systems are usually easier to change
and generally more resilient. Are they necessarily 100%
as performant? No, not necessarily, because you're frequently dealing with
asynchronous processes instead of synchronous processes.
Understand where you can get away with coupling loosely, and understand where
you have to couple tightly, but coupling
loosely will usually be the better idea
when you have the luxury to do so.
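The shape of loose coupling is often just a buffer between producer and consumer. This toy sketch uses Python's in-process `queue.Queue` to stand in for what would, in a real system, be a durable message broker:

```python
# Loose coupling via a queue: the producer hands work to a buffer and
# moves on; the consumer drains it asynchronously. If the consumer is
# slow or briefly down, the producer is not blocked.

import queue

work = queue.Queue()

def producer(orders):
    for order in orders:
        work.put(order)        # fire and forget; no direct call into consumer

def consumer():
    processed = []
    while not work.empty():    # drains whenever it gets scheduled
        processed.append(work.get())
    return processed

producer(["order-1", "order-2"])
print(consumer())  # ['order-1', 'order-2']
```

The cost is that delivery is asynchronous, which is exactly the performance trade-off mentioned above; the benefit is that neither side needs to know, or be up at the same time as, the other.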
So now we're coming to balancing tactics and strategy.
One of the big things with Agile is that it has a tendency to
concentrate on short-term deliverables and
not nearly so much on the long-term vision. That's just
the nature of the way Agile works: its iterative processes, the way the sprints
are constructed. All of those things tend to
lead to overindexing on short-term
deliverables. So in terms of
tactics, you want to delay non-critical decisions
as long as possible. In this case, cold feet are an asset.
You don't want to commit to a path and then discover that
you're wrong; it's expensive.
You want to allow yourself to learn more about your problem space before making
those decisions, especially the ones that you can't easily walk back.
And again, you want to leverage the agile principles. You want to embrace
the iterative process and learn from experience and learn how
to do things better, or learn what the right things are to
do. And you want to actually get the right people
in the design process. Coders are very good at writing
code. Architects are very good at designing services.
SREs are usually more familiar with how you actually run online
services in the real world, so they offer a perspective that's often
lost in the design process. And product
owners are usually the voice of the customer and they
help define the requirements. All too often
it's coders and architects and sometimes only
the coders who are engaged in the design process. So you tend
to lose out on perspectives which are necessary tactically.
And the other thing on tactics is you really want to get the critical architecture
right the first time. Architecture decisions are often
very expensive, or nearly impossible, to fix later without a complete
and total rewrite. So do your research,
understand the requirements, and plan out your
very-difficult-to-change elements, like the architecture.
Everything else will usually align around
those expensive architectural decisions with much greater flexibility.
So again, take the stuff that you desperately need to get right.
Make sure it's solid. The rest is probably going to iterate around
it. Some of the things that you really need to nail down
are scalability, availability, performance, security and data
integrity. Without at least most of those, you don't actually have
a service. And some of those
things are negotiable, and you can figure out how to do some of them later.
But if you don't actually have those ideas in mind from
the start, it can be difficult to add them later. And security
is the poster child for being bolted on at
the last second. You want to design initially to
your non-negotiable requirements. There are going to be some things which are hard
stops, and some things which are nice to have. Design
to the things that you absolutely have to have first.
And you want to adapt features and customer scenarios to your architecture.
Because if it's obvious that your architecture can't support your features
and your customer scenarios reasonably, it's time to redesign
the architecture until it can. So if you have
customer scenarios and your architecture literally can't support them, you don't
have an architecture yet. And you need to go back to square
one and figure out, okay, are those customer scenarios as
vital as we think? And if the answer is yes, then your
architecture needs to change. Now let's look at strategy,
the longer-term vision. Two-way doors: so, what's a two-way
door? Two-way doors are decisions which can be easily reverted.
One-way doors put you on a set path with no easy way to
backtrack. And in the software engineering business, as in
many other things in life, sometimes you need to take a step back to
take three steps forward, and that's okay.
You're leveraging one of the strengths of agile, where you learn from each iteration
and you apply that learning. But it does mean that
it's worth it to try to make your decisions reversible, so that taking that
step back is easy: you're not blocked from doing so, and you don't
have 39 dependencies on the element that you would like to
fix but now can't. And as much as you need to design for
flexibility, you also need to be able to plan flexibly.
Because, again, the odds are you're going to wake up smarter one day than
you are today. At least I try to make that a habit. Sometimes it works,
sometimes it needs coffee, but there you go.
You need to be able to listen to and be able to implement the ideas
of that smarter you. And you need to
be able to scrap and rework as needed without being embarrassed.
What's important is that you get to the right end point, that you
get the desired result. Sometimes the path to get there is going
to be a little bit crooked. And again, that's okay.
In Agile software development, it's good to learn
from experience and to figure out better ways of doing
things. So if the straight-line path isn't the one that you
end up taking, as long as you get where you need to go, that's the
important part.
Which brings us to really how much planning is enough.
And the answer is, even in agile, planning is necessary. But this
isn't the same as waterfall. We don't want to
plan every little thing. We want to embrace
the iterative process. We want to allow ourselves to learn from
experience as we go so that we can determine what the right things to
do are and how best to do them. You will probably
change significant elements over time based on what you learn,
and that's a positive: that's a feature, not a bug. For
your planning, you need to understand what you actually need for long-term success,
and you can't compromise on being able to deliver it.
You need to know the shape of what it is that you want to deliver,
how you're going to deliver it, and how you're going to support it during
its lifetime. Again, this dovetails back into planning
for success, planning for failure, and understanding
what it is you're trying to build and deliver,
and making sure that you actually have a life cycle which can work
in the real world. And lastly, you're probably
going to be wrong about all of these things at least once.
And if you're me, frequently. So it means
you really need to be prepared to adapt to real world circumstances. And again,
embrace the idea that we're going to learn, we're going to do things better
tomorrow than we did them today. Be ready for
it, don't be embarrassed. Don't feel like you have to fall on your sword for
it. This is part of good,
positive software engineering using Agile.
Now, something that commonly gets overlooked in
most development methodologies, but especially in Agile, is that the code
is not your only deliverable. Online
services really are more than just code. They include data, monitoring,
testing, documentation, redundancy, availability, disaster recovery,
performance budgeting, and more other things than I really want to shake a stick at
in a short presentation. So, Deliverables 101:
the service code is actually the easy part because, you understand,
there is a concentrated team writing it.
It is the thing that gets most software engineers promoted, so good
software engineers have a tendency to know how to write reasonably good service
code. Look at all the things that
aren't service code: monitoring, alerting and incident response;
documentation and runbooks; any external-facing documentation for the
customer; your network; your security; testing
and your test framework; deployment tools; SLAs,
OLAs and SLOs; administrative tools; reporting. The list goes on
for things that you need to successfully deliver online services that
aren't actually part of your service code.
So what do we need to do? We need to integrate non-code deliverables into
our planning and execution cycles. We need to add non-code items into
the backlog. Since the backlog tends to be king
in most Agile houses, make sure that your non-code items
are there. You need to understand what you need to successfully release:
what's going to block, and what non-code
collateral is necessary to release? And sometimes the right answer here is to create
a release sprint. Just because you complete a
sprint doesn't mean you're actually releasing the products of that sprint.
Sometimes you are, sometimes you aren't. But frequently,
especially if you're dealing with major changes,
there are things you're only going to need to do if you're actually releasing,
and much of that is non-code work. Your processes are
often going to need to change around major releases, since a much larger
percentage of time is spent doing bug fix work from
newly discovered bugs that aren't in the backlog. And you're going to need to allocate
your time in a very different way when
you're getting to the heart of a release than you are when
you're just cruising along in a standard sprint. You're going to
ignore your non-blocking backlog items, and you're going to concentrate on
bug fixing and on fixes in non-code deliverables. You are going to test,
you're going to test, and you're going to test some more. This hopefully
is going to include game days and breaking non production to ensure that
your runbooks, your monitoring and your incident response are all solid
and you're going to test some more, and you're going to do deployments so
that you know what to expect before, during and after.
And you're going to need to train your operational staff and vet the documentation carefully.
All of these are deliverables that aren't actually your service code.
You need to plan for them and account for them, both in
time and effort, and preferably in rewards and acknowledgement
for software engineers who are engaged in
writing non-code collateral.
So, in summary: even with the principles of Agile,
you need to do some planning to ensure the long-term success of your
product. Don't compromise on those elements.
Your long-term success is crucial in nearly every
product you're going to design. Architectural defects
and deficits in the areas of redundancy, availability,
monitoring and performance can absolutely destroy trust in
your product weeks, months or years from now,
and they can be very difficult and expensive to fix.
So this is why having a certain amount of rigor in
planning, understanding what's important to plan,
what's less important to plan, and all of these
sorts of things can come together. So if you learn nothing else,
it's this: don't over-plan, don't be afraid
to learn from experience as you move forward, and
lock in your expensive-to-change things as early in the
process as possible. Because, again, if the cost of change is high,
you really want to make sure you don't have to pay it.
Thank you for listening. Again, my name is Dave Argent.
My email is dargentmail.com.
I work at Salesforce. I also have a LinkedIn. I believe
I'm the only David Argent out there, so if you
need to contact me, feel free to contact me that way. And again,
thank you everyone and I hope you have a good rest of the conference.