Conf42 Site Reliability Engineering (SRE) 2024 - Online

How to Avoid Being an Agile Victim

Abstract

Agile’s planning cycles can easily emphasize short-term deliverables at the expense of long-term success. Take a tour of how to avoid running into that major pitfall, and get the benefits of Agile without sacrificing your long-term vision.

Summary

  • Dave Argent talks about how to avoid becoming an agile victim at Conf42 Site Reliability Engineering 2024. The agenda runs from what Agile is through to why code isn't your only deliverable and the big take-home lessons.
  • I'm a veteran of Microsoft and Amazon, currently working for Salesforce. I am also a battle-scarred veteran of agile gone wrong. You can see the pitfalls of not designing things according to how they're actually going to need to be used over the longer term.
  • Agile is a set of principles which can help guide your software development. It prioritizes working software over comprehensive documentation. We learn things by iterating and that goes into our last point where we are responding to change instead of rigidly following a plan.
  • Failure 101: How do you know if your product is successful? Understand what is going to make people use your product. Also understand what would cause you to stop developing this product. Once you understand what success looks like, the next likely issue is you didn't plan for success.
  • There is no such thing as a safe change. Everything in your service must accommodate failure. It's not enough to plan for failure, but you also need to know how to recover from a complete outage.
  • Don't design for greater reliability than needed. The cheapest place to make changes is when you're designing and you haven't yet written a line of code. The most dangerous thing for an online service is being embarrassingly successful.
  • Agile has a tendency to concentrate on short-term deliverables. In terms of tactics, you want to delay non-critical decisions as long as possible. You really want to get the critical architecture right the first time.
  • Two-way doors are decisions which can be easily reverted. One-way doors put you on a set path with no easy way to backtrack. Sometimes you need to take a step back to take three steps forward. What's important is that you get to the right end point.
  • Even in agile, planning is necessary. But this isn't the same as waterfall. Code is not your only deliverable; online services really are more than just code. Integrate non-code deliverables into your planning and execution cycles.
  • Dave Argent: My email is dargentmail.com. I work at Salesforce. I also have a LinkedIn. I believe I'm the only David Argent out there. And again, thank you everyone.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, Conf42 Site Reliability Engineering 2024. My name is Dave Argent. I work for Salesforce, and I'm going to talk about how to avoid becoming an agile victim. A brief agenda: an introduction, a little bit about what Agile is, how to fail, thinking before you code, balancing tactics and strategy, code isn't your only deliverable, and the big take-home lessons. Since we've got about 30 minutes, let's get moving. By way of introduction, who am I? I'm a veteran of Microsoft and Amazon, currently working for Salesforce. I have more than two decades of experience in online services delivery, primarily as an SRE and a TPM. I am also a battle-scarred veteran of agile gone wrong, and I'm eager to help others like you avoid being victimized in the same way I was. I'm currently working as a senior-level SRE at Salesforce, and by way of a brief war story, I'm going to talk for a few moments about the control plane that didn't. Hypothetically, I worked on a service that, depending on who you talk to, is the largest NoSQL instance in the world. It had a control plane, which was good. However, the fact that the control plane was only capable of replacing one host at a time, when we were talking about 10,000-node databases, was a little bit less fine. So you can see the pitfalls of not designing things according to how they're actually going to need to be used over the longer term. Now, a very brief review of agile. What is all the fuss about, anyway? It's easier to start with what Agile really isn't. Agile isn't a framework, it isn't a methodology, it isn't a process, it isn't a set of rules, and it is certainly not prescriptive. So what can agile possibly be if it's none of those things? What Agile is, really, is a set of principles which can help guide your software development. And it prioritizes working software over comprehensive documentation.
The built software stands in as your spec, rather than trying to document every little nuance. That means that you should probably comment pretty well. It prioritizes individuals and interactions over processes and tools. So it calls for self-defined teams, and it emphasizes frequent face-to-face communication as opposed to heavy-duty processes and heavy-duty tools. It looks to customer collaboration over contract negotiation. Customer needs can change, and we need to flexibly adjust to that reality. We learn things by iterating, and that goes into our last point, where we are responding to change instead of rigidly following a plan. So this is an iterative approach. You build small chunks of deliverable functionality at a time, you learn from each iteration, and you change course as necessary to get what you and the customer actually need, as opposed to what you thought you needed early on. But I just said that we want to respond to change over following a plan. And for certain things like architecture, you need a certain amount of plan, because you can't simply build each iteration with no overarching plan. And, well, the same thing really holds true with software development. So that's really our segue into lesson one, how to fail, which is: just because you have priorities in your principles, it doesn't mean that you can ignore everything else. So, failure 101: no definition of success. How do you know if your product is successful? If you can't measure it, how can you determine that you've been successful in delivering what the customer needs? Understand what is going to make people use your product. For online services, it's probably going to be some combination of features, reliability, performance, security, privacy, and a litany of other things. But you should also understand why people would not want to use your product. To some extent it's going to be a lack of those same items that made people want to use your product.
The very same things which can make people want to use your product, if you do them badly, are going to be the most potent ways to convince them not to use it. And you should also understand what would cause you to stop developing this product. Some examples might be: it's too costly to deliver the product, it becomes too difficult to maintain, the user base is too small, or there's no longer any need for it. All of these things together help you define what success looks like. However, once you understand what success looks like, the next likely issue in the failure process is that you didn't plan for success. Functionality is really only one part of your successful product. There are very few incremental features that exist in a vacuum. There's complex interplay between features, and they need to play nicely together. There are overarching requirements that are going to be necessary to meet your definition of success, and there are going to be a lot of invisible items that are needed to deliver functionality and delight your users. And while this list isn't exhaustive, you can see it's already pretty long. There's the easy-to-remember stuff like code and testing: unit, functional, UI, et cetera. You've got data integrity, availability and security. You have availability and reliability. You have a downtime profile that you need to obey for maintenance or deployment of new code, which is often going to be zero downtime. You're going to have certain minimum levels of performance and scalability that are going to be required in order to deliver this functionality. To deliver it reliably, you're going to need to have monitoring in place so you can understand when it breaks, because it will. You need to have incident management and documentation so that you understand what to do when it breaks. And you need to have something like disaster recovery or geo-redundancy, because effectively Murphy is the enemy of all online services.
And in fact, I did a different conference presentation entitled Achieving Service how to ensure Murphy doesn't always win, where I go into a lot more of those details. Speaking of Murphy, in case I didn't already make it clear: things will go wrong. You'll note it's in all caps, and if I believed in the flash tag, it would have that too. You can't avoid the idea that things will go wrong. People are imperfect. That's element number one. So expecting everything to go perfectly when the people running it aren't themselves perfect is an unrealistic expectation. The next thing up is again in all caps, because I think it's that important: there is no such thing as a safe change. I've been beaten down by these before, in many circumstances where a change was supposed to be perfectly and utterly safe because it was directed against non-production, and there was a default configuration value that filtered through to production from that config file, and boom, production went down when there was no known change going out to production. So really, since there's no such thing as a safe change, change is going to cause things to go wrong. You need to be able to diagnose your failures quickly. You need to be able to automate responses to bad deployments. You need to be able to do things like reduce the blast radius. All of those things that are associated with good practice in terms of deployment logic, you need to actually plan for, because you're going to need them. Everything in your service must accommodate failure. A lot of the things that are going to fail are things that you have absolutely no control over. Your partners are going to fail you, your network providers are going to fail you, your hardware is going to fail at some point in time. Your data center infrastructure is going to fail: you're going to lose power, you're going to lose cooling. These things are going to go wrong. And that doesn't even include bad actors who are simply trying to rip your service down because they're spiteful.
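As an illustration of the deployment practices mentioned above (reducing the blast radius and automating the response to a bad deployment), here is a minimal Python sketch. It is not from the talk: the names `staged_rollout`, `deploy`, `is_healthy`, `rollback` and `stage_fractions` are hypothetical, and a real system would plug in its own deployment and health-check tooling.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],       # hypothetical: pushes the new version to one host
    is_healthy: Callable[[str], bool],   # hypothetical: health check for one host
    rollback: Callable[[str], None],     # hypothetical: restores the old version on one host
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.00),
) -> bool:
    """Deploy in growing waves so a bad change only reaches a small slice first.

    After each wave, every host touched so far is health-checked. On any
    failure, all touched hosts are rolled back automatically and the rollout
    stops. Returns True if the rollout completed, False if it was rolled back.
    """
    deployed: List[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should have the change after this wave.
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(deployed):target]:
            deploy(host)
            deployed.append(host)
        if not all(is_healthy(host) for host in deployed):
            for host in reversed(deployed):
                rollback(host)
            return False
    return True
```

The design choice here is that the first wave is deliberately tiny, so the cost of being wrong about a "safe" change is limited to a handful of hosts rather than the whole fleet.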
So ultimately, it's not enough to plan for failure. You also need to know how to recover from a complete outage, because coming up cold is, for many services, not nearly the same thing as recovering from a partial outage. And understanding how you respond to these huge failures is the sort of thing that you need to do before, not during, the outage. I'll use Prime Day 2018 as an example. Not that I was the person carrying the pager for a certain database service. And it turns out that coming back from an outage of that nature, dealing with all the caches and dealing with everything else that needed to come back up nicely while still trying to serve the existing traffic, is not the same as your typical outage. So just take it from experience. You don't need to learn it yourself; learn it from me. And that brings us into our next section, which is really: you should think before you code. The cheapest place to make changes is when you're designing and you haven't yet written a line of code. It's like this for architecture. It's like this for any number of things that aren't software engineering. It's kind of the measure twice, cut once philosophy. Understand what you're doing before doing it. So in order to make your designs good, here's step one: anticipate failures. Again, you're going to hear me say the word failure over and over in this presentation. It's because failures are pretty much the enemy of online services, and they are unavoidable. Failure is the only truly reliable thing in online services. They will fail. It's death and taxes, only it's probably actually worse than that. So since you're going to have failures, you need to understand and define what is acceptable. Don't design for greater reliability than needed. Each nine is somewhere in the ballpark of an order of magnitude more expensive than the one that came before it. So if what you need are four nines, don't design to five.
It's only going to cost you an awful lot of money that you probably don't have. And similarly, don't design for greater performance than you need. If, just as an example, your service needs to respond in 200 milliseconds, you probably don't need to design the entire service so that it successfully replies in 25 milliseconds. So don't design for greater performance than you need, because it's going to be more expensive. It adds a lot of cost, and generally it's of limited benefit to your customers if you've defined what is acceptable in a reasonable fashion. So since we've said that things will fail, design for fast recovery from failures. And I'm going to say monitor and measure, because if you can't measure it, it's very difficult to tell when it went wrong. And similarly, you don't have an ideal baseline for even understanding what wrong looks like. You want to automate those responses where possible, because automation is almost always faster than getting a human to answer a page. Plus, if I get paged often enough, and heaven knows I have in various positions on things I've worked on, my sleep schedule suffers and I get cranky, and cranky SREs are usually not the best thing for running services. They're just miserable to be around. So how about we avoid them, and don't wake them up unless you actually have to. You need to include being hard down, 100%, in your recovery scenarios. Caches will not always be warm. You will have to survive a warm-up of your service if that's the nature of your service. Similarly, this applies to databases and their caching algorithms and everything else. Know how you have to do a cold start. In addition, not every failure is a black-and-white failure. Sometimes they're gray. So embrace that gray. Build degraded modes of operation for when you or your dependencies fail. Even in the absence of full functionality, you should be able to, in many cases, support some user scenarios.
Don't be afraid to do that. It is often better to serve some traffic than no traffic. And again, that's a degraded mode of operation. And lastly, you want multi-layered security. Anything is a potential single point of failure. There are bad actors out there. Make it harder on them. At least don't give them just one hurdle to get over. Make them jump through all manner of hoops before they can get to the goodies and before they can take your services down or, even worse, steal your data. You also need to anticipate success. It is a less understood point that the most dangerous thing for an online service is being embarrassingly successful. So you're probably going to want to architect to be very scalable in case you actually need it. Now, this doesn't apply to every single online service, but if there's the possibility that your popularity could blow up and you could have to support, for example, a couple of orders of magnitude more traffic than you were initially expecting, leave the hooks in place so that you know how to do it. And doing it is frequently more efficient if you avoid monolithic structures. Monoliths are fairly well known for not being able to scale important subsystems independently, so you end up having to over-scale to compensate. Microservices and similar architectural ideas allow you to scale components independently, which allows them to scale more efficiently. So in general, monoliths are less and less considered a good thing moving forward. If you're going to adopt a monolithic structure, understand that these are weaknesses and know that you're going to have to be able to work around them. You really want to avoid processes that scale linearly with people. For example, if customer onboarding requires manual steps, this can become a bottleneck if you're wildly successful, since while your service may certainly be able to scale, your staff may not.
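Returning to the degraded-mode point at the start of this passage, the idea of serving something useful when a dependency fails can be sketched in a few lines of Python. This is an illustrative assumption, not the speaker's implementation: `get_recommendations`, `fetch_personalized` and the `fallback` list are hypothetical names.

```python
from typing import Callable, Dict, Iterable, Sequence


def get_recommendations(
    user_id: int,
    fetch_personalized: Callable[[int], Iterable[str]],  # hypothetical dependency call
    fallback: Sequence[str] = ("top-sellers",),          # hypothetical static default
) -> Dict[str, object]:
    """Serve personalized results when possible, a generic list when not.

    The response is flagged as degraded so callers and monitoring can tell
    the difference between full and reduced functionality.
    """
    try:
        return {"items": list(fetch_personalized(user_id)), "degraded": False}
    except Exception:
        # Broad catch is deliberate in this sketch: any dependency failure
        # falls back to reduced functionality instead of failing the request.
        return {"items": list(fallback), "degraded": True}
```

Flagging the response as degraded matters as much as the fallback itself, since it lets you measure how often you are running in the gray.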
And at all times you want to know how to protect your service from excess traffic. As I said before, serving up some of your traffic is usually better than serving up none of it. You can't control client behavior. Despite what may be very good intentions, clients will occasionally throw either bad traffic or excessive traffic against you. And that's only talking about clients; there are bad actors out there who are going to do the exact same thing with less generous motivations. Lastly, identifying your high-value traffic and servicing that when resources are strained is a really good degraded mode to anticipate. Again, not an exhaustive list, mostly just things to think about overall. And you're going to need to anticipate change. My crystal ball doesn't work, and if you have a better one that accurately tells the future, good for you. Hide it or market it or do something, because I don't have one. But this means that since I can't accurately tell the future, I try to leave as many possible futures open as is reasonable. That doesn't mean I'm going to redesign Emacs, which is an operating system and an editor. But the principle is reasonably good: design for flexibility and future possibilities. You don't always know what the customer will want or need next year. The customer doesn't always know what they're going to want or need next year. So design with APIs rather than direct calls, which enables you to change underlying business logic without rewriting every component that calls it or relies on it. So use abstractions where possible. In a lot of cases, you're going to want to understand how to do a zero-downtime software upgrade. It's true that not all services will truly need this, but if you design from the perspective that it must be possible, it does safeguard your future. And most large online services are less and less tolerant of actual downtime.
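The point above about identifying high-value traffic and serving it first when resources are strained can be sketched as a simple priority-based admission step. This is a minimal Python sketch under stated assumptions: the `Request` type, its `priority` field and the `admit` function are hypothetical, and a real service would derive both priority and capacity from its own signals.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Request:
    client: str
    priority: int  # assumption: 0 is the highest-value traffic; larger values are shed first


def admit(requests: Sequence[Request], capacity: int) -> Tuple[List[Request], List[Request]]:
    """Split a batch into (served, shed) when only `capacity` requests fit.

    The sort is stable, so requests with equal priority keep arrival order.
    """
    if capacity < 0:
        raise ValueError("capacity must be non-negative")
    ranked = sorted(requests, key=lambda r: r.priority)
    return ranked[:capacity], ranked[capacity:]


batch = [Request("a", 2), Request("b", 0), Request("c", 1), Request("d", 0)]
served, shed = admit(batch, capacity=2)
print([r.client for r in served])  # the two priority-0 requests
print([r.client for r in shed])    # lower-value traffic, rejected or deferred
```

Shedding the lowest-value requests explicitly is one way to keep serving some traffic rather than none when clients, or bad actors, send more than you can handle.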
In the spirit of anticipating change, learn from experience and let it inform you. I at least try to think that I'm going to be smarter tomorrow than I am today, so I need to be able to listen to tomorrow's me, who actually has experiences that I don't have today, and leverage them. Change includes changing your direction or plan in response to new information. So don't be afraid to change tack when what you're doing isn't right. Couple loosely. Loosely coupled systems are usually easier to change and generally more resilient. Are they necessarily 100% as performant? No, not necessarily, because you're frequently dealing with asynchronous processes instead of synchronous processes. Understand when you can get away with coupling loosely. Understand where you have to couple tightly. But coupling loosely will usually be the better idea when you have the luxury to do so. So now we're coming to balancing tactics and strategy. One of the big things with agile is that it has a tendency to concentrate on short-term deliverables and not concentrate nearly so much on the long-term vision. That's just the nature of the way agile works. Its iterative processes, the way the sprints are constructed, all of those things tend to lead to a concentration and over-indexing on short-term deliverables. So in terms of tactics, you want to delay non-critical decisions as long as possible. In this case, cold feet are an asset. You don't want to commit to a path and discover that you're wrong. It's expensive. You want to allow yourself to learn more about your problem space before making the non-critical decisions, especially those that you can't easily walk back. And again, you want to leverage the agile principles. You want to embrace the iterative process and learn from experience, and learn how to do things better, or learn what the right things are to do. And you want to actually get the right people in the design process.
Coders are very good at writing code. Architects are very good at designing services. SREs are usually more familiar with how you actually run online services in the real world, so they offer a perspective that's often lost in the design process. And product owners are usually the voice of the customer, and they help define the requirements. All too often it's coders and architects, and sometimes only the coders, who are engaged in the design process. So you tend to lose out on perspectives which are necessary tactically. And the other thing on tactics is that you really want to get the critical architecture right the first time. Architecture decisions are often very expensive or nearly impossible to fix later without a complete and total rewrite. So do your research, understand the requirements, and plan out your overall very-difficult-to-change elements, like architectures. Everything else will usually align around those expensive architectural decisions with much greater flexibility. So again, take the stuff that you desperately need to get right. Make sure it's solid. The rest is probably going to iterate around it. Some of the things that you really need to nail down are scalability, availability, performance, security and data integrity. Without at least most of those, you don't actually have a service. Some of those things are negotiable, and you can figure out how to do some of them later. But if you don't actually have those ideas in mind from the start, it can be difficult to add them later. And security is the poster child for being bolted on at the last second. And you want to design initially to your non-negotiable requirements. There are going to be some things which are hard stops, and there are going to be some things which are nice to have. Design to the things that you absolutely have to have first. And you want to adapt features and customer scenarios to your architecture.
Because if it's obvious that your architecture can't reasonably support your features and your customer scenarios, it's time to redesign the architecture until it can. So if you have customer scenarios and your architecture literally can't support them, you don't have an architecture yet. And you need to go back to square one and figure out: okay, are those customer scenarios as vital as we think? And if the answer is yes, then your architecture needs to change. Now, looking at strategy for a longer-term vision: two-way doors. So what's a two-way door? Two-way doors are decisions which can be easily reverted. One-way doors put you on a set path with no easy way to backtrack. And in the software engineering business, as in many other things in life, sometimes you need to take a step back to take three steps forward, and that's okay. You're leveraging one of the strengths of agile, where you learn from each iteration and you apply that learning. But it does mean that it's worth it to try to make your decisions reversible, so that taking that step back is easy. You're not blocked from doing so. You don't have 39 dependencies on the element that you would like to fix, so that now you can't. And as much as you need to design for flexibility, you also need to be able to plan flexibly. Because again, the odds are you're going to wake up smarter one day than you are today. At least I try to make that a habit. Sometimes it works, sometimes it needs coffee, but there you go. You need to be able to listen to and implement the ideas of that smarter you. And you need to be able to scrap and rework as needed without being embarrassed. What's important is that you get to the right end point, that you get the desired result. Sometimes the path to get there is going to be a little bit crooked. And again, that's okay. In agile software development, it's good to learn from experience and to figure out better ways of doing things.
So if the straight-line path isn't the one that you end up taking, as long as you get where you need to go, that's the important part. Which brings us to how much planning is really enough. And the answer is, even in agile, planning is necessary. But this isn't the same as waterfall. We don't want to plan every little thing. We want to embrace the iterative process. We want to allow ourselves to learn from experience as we go, so that we can determine what the right things to do are and how best to do them. You will probably change significant elements over time based on what you learn, and that's a positive good. That's a feature, not a bug. For your planning, you need to understand what you actually need for long-term success, and you can't compromise on being able to deliver it. You need to know the shape of what it is that you want to deliver, how you're going to deliver it, and how you're going to support it during its lifetime. Again, this dovetails back into planning for success, planning for failure, and understanding what it is you are trying to build, what you are trying to deliver, and making sure that you actually have a life cycle which can work in the real world. And lastly, you're probably going to be wrong about all of these things at least once. And if you're me, frequently. So it means you really need to be prepared to adapt to real-world circumstances. And again, embrace the idea that we're going to learn, that we're going to do things better tomorrow than we did them today. Be ready for it. Don't be embarrassed. Don't feel like you have to fall on your sword for it. This is part of good, positive software engineering using agile. And something that commonly gets overlooked in most development methodologies, but especially in agile, is that code is not your only deliverable. Online services really are more than just code.
They include data, monitoring, testing, documentation, redundancy, availability, disaster recovery, performance budgeting, and more other things than I really want to shake a stick at in a short presentation. So, deliverables 101: the service code is actually the easy part, because there is a concentrated team writing it. It is the thing that gets most software engineers promoted. So good software engineers have a tendency to know how to write reasonably good service code. Look at all the things that aren't service code: monitoring, alerting and incident response; documentation; runbooks; any external-facing documentation for the customer; your network; your security; testing and your test framework; deployment tools; SLAs, OLAs and SLOs; administrative tools; reporting. The list goes on for things that you need to successfully deliver online services that aren't actually part of your service code. So what do we need to do? We need to integrate non-code deliverables into our planning and execution cycles. We need to add non-code items into the backlog. Since the backlog tends to be king in most agile houses, make sure that your non-code items are there. You need to understand what you need to successfully release. What's going to block? What is my non-code collateral that's necessary to release? And sometimes the right answer here is to create a release sprint. Just because you complete a sprint doesn't mean you're actually releasing the products of that sprint. Sometimes you are, sometimes you aren't. But frequently, especially if you're dealing with major changes, there are some things you're only going to need to do if you're releasing, and much of that is non-code. Your processes are often going to need to change around major releases, since a much larger percentage of time is spent doing bug-fix work on newly discovered bugs that aren't in the backlog.
And you're going to need to allocate your time in a very different way when you're getting to the heart of a release than you do when you're just cruising along in a standard sprint. You need to ignore your non-blocking code items in the backlog, and you're going to concentrate on bug fixes and non-code deliverables. You are going to test, you're going to test, and you're going to test some more. This hopefully is going to include game days and breaking non-production to ensure that your runbooks, your monitoring and your incident response are all solid. And you're going to test some more, and you're going to do deployments so that you know what to expect before, during and after. And you're going to need to train your operational staff and vet the documentation carefully. All of these are deliverables that aren't actually your service code. You need to plan for them, and you need to account for them both in time and effort, and preferably in rewards and acknowledgement for software engineers who are engaged in writing non-code collateral. So in summary, even with the principles of agile, you need to do some planning to ensure the long-term success of your product. Don't compromise on those elements. Your long-term success is crucial in nearly every product you're going to design. Architectural defects and deficits in the areas of redundancy, availability, monitoring and performance can absolutely destroy trust in your product weeks, months or years from now. And they can be very difficult and expensive to fix. So this is why having a certain amount of rigor in planning, and understanding what's important to plan and what's less important to plan, matters: so that all of these things can come together. So if you learn nothing else, it's this: don't over-plan, don't be afraid to learn from experience as you move forward, and lock in your expensive-to-change things as early in the process as possible.
Because again, if the cost of change is high, you really want to make sure you don't have to change. Thank you for listening. Again, my name is Dave Argent. My email is dargentmail.com. I work at Salesforce. I also have a LinkedIn. I believe I'm the only David Argent out there, so if you need to contact me, feel free to contact me that way. And again, thank you everyone, and I hope you have a good rest of the conference.
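As a worked footnote to the talk's advice not to design for more nines than you need, the arithmetic below shows how the allowed downtime shrinks tenfold with each added nine. It is a minimal Python sketch assuming a 365-day year; the function name `downtime_budget_minutes` is hypothetical, and the cost claim itself is the speaker's rule of thumb, which this calculation does not measure.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year, in minutes, for an availability of N nines.

    For example, 4 nines means 99.99% availability, so 0.01% of the year
    may be spent down.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return MINUTES_PER_YEAR * 10 ** (-nines)


for n in range(2, 6):
    availability = 100 * (1 - 10 ** (-n))
    print(f"{n} nines ({availability:.3f}%): {downtime_budget_minutes(n):8.2f} minutes/year")
```

Four nines leaves roughly 52.6 minutes of downtime per year, while five nines leaves only about 5.3, which is why the extra nine is worth paying for only when the requirements genuinely call for it.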
David Argent

Senior SRE @ Salesforce

David Argent's LinkedIn account
