Conf42 Platform Engineering 2024 - Online

- Premiere: 5PM GMT

Building Developer-Driven Platforms: Automation and UI Excellence


Abstract

Unlock the secrets to seamless developer experiences with insights from Facebook/Meta and Cloudera. This session will reveal strategies for intuitive UI design and effective automation, covering tools like deployment pipelines and orchestration frameworks.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Welcome to the conference, and welcome to Building Developer-Driven Platforms. I'm going to share some experiences, some things I've learned along the way, and I'm really excited about it. I'm hoping those lessons can help shape the automation and UI for your platform. But before I try to convince you of everything I've learned, let me tell you about myself and how I learned it.

I'm Elliott Clark, founder and CEO of Batteries Included, an engineer at heart who's building a source-available product on top of Kubernetes that we'll be releasing out of stealth soon; I hope to have more to tell you about that shortly. Before founding Batteries Included, I did a lot of things and helped a lot of engineers build a lot of systems, which gave me a lot of good lessons. I worked at Facebook, where I worked on ads, dev efficiency, and distributed data stores. Before that I worked at Cloudera, where I was an open source committer on Apache HBase; I got to see open source adopted by enterprises and helped them pull in open source and really understand it. I've done a few startups, and before those I helped other developers at Microsoft, building C# tools and WPF. So I've really had a career of helping engineers build systems, mostly distributed systems, and I've collected a lot of good experience of what makes the most productive systems in the world.

While we're thinking about a developer-driven platform and what a good one looks like, it helps to have some goals: things to strive for, things we want to do better. Your goals, and what you want from your technology, will obviously be personal; they'll be something your company, your team, and your project feel individually. But almost every platform will, at some point, have these as goals, or will agree these should be part of its goals, so we'll use them as our generic goals, and I hope you'll see how they mostly align with your own experience.

So for this platform, what's the goal? Let's strive for greatness. Let's help our users, our developers, build sustainable and maintainable software: software that's good not just now but good for the business in the future. Let's help them speed up, so the time from a thought, a crazy idea they think might work, to it actually being out in the world gets shorter. But let's do that in a way that doesn't break things, that keeps everything reliable. And any time things do break, let's help them fix it fast, so we maintain that reliability and uptime. If we do that as a platform, I think we've done well: we've sped everyone up, we've helped them, we've done good things.

These goals, though, are often seen as being in conflict. "How do I speed up developers?" is often in direct conflict with "How do I make sustainable software?" I don't want to add metrics; I want to go faster. Keeping things from breaking also often means extra work, which can be seen as slowing down the time from idea to production. I think it's helpful to look at the times when things go wrong as a way to learn where we can make better platforms, platforms that really lead the way.
So if we think about it, what's the number one cause of incidents? What do platforms let through? What causes downtime? Overwhelmingly, there's one source, one big, giant source, that's the root cause of most downtime and most errors. There's one thing that changes and causes most of the incidents: developers. Developers are the root cause. We're the ones changing the software. The hardware will add one plus one correctly almost forever; the software is where humans get involved, and it's where things often go wrong. So if we want to understand and build a good platform, we have to dive deeper into developers, how that risk comes with them, and how we make it better.

I think of the things that slip through the cracks, the issues that come from developers, as falling into two buckets: not understanding, or not testing. Let me explain what I mean by both of those. For me, understanding is the grokking, the deep conceptual knowledge, where you can explain every principle of something down to the scientific level where you can prove it. We don't often get to understand the business problem, the distributed system, or the edge cases in that kind of depth. So often, instead of fully understanding and enumerating every possible outcome and input, we write tests that say: that test is green, so I'm pretty sure everything is going to work. So there are two sides, two ways to convince yourself that you've written good software, and there's some give and take between them. You can't write a test without some understanding, and in writing a test you often gain more understanding; there's a back and forth between the two. So it's often really interesting to bucket what went wrong into one of these two causes, so you can go back and ask what you should do.

Testing is a lot more than just "you didn't write a unit test." It's also "you didn't have an integration test that exercised those systems together and showed what happens when things go wrong." It's not checking that you can actually restore your backups. How recently have you made sure you can fail over to your redundant systems? There's a lot of testing needed to make sure the things you'll never fully understand are still working and operating as you expect, and that testing falls through the cracks. But as developers, we're often too eager to say "I can just write one more test" when the real problem is that we didn't understand. If you're holding the tool upside down, unit tests after the fact won't help you. If you don't know what winning and losing look like, it's really hard to write a unit test that says whether you got a good result or a bad one. Often it's much more than that: knowing that we didn't even have to try this very hard solution, that we could have tried the much easier thing over there, is far more impactful. We didn't understand that this path wasn't the one to take ten choices ago, or we didn't understand where this path would lead us when we scaled up or down.

So with those two buckets, let's go through an exercise with one specific example. We'll walk through what happens, figure out the causes and the things we could do better, and then use that to learn how to make a better platform.
For this example, take just about any enterprise setting where a new product team is coming on board. The new product team is going to write some new data to an existing database, and they need some help. So a junior or mid-level DevOps engineer joins the effort and writes some code. In that code, they add a user to the database permissions list, and they think: I know you're going to be writing more data, so let's also increase the size of production storage. Thinking ahead, I like it. They're doing great things; they're making sure production will be stable in the future. But in doing that, they make a mistake in the production environment: they change the size using the wrong units. Rather than asking for a few million bytes, they're asking for a few trillion gigabytes.

But you say: don't worry, we have a CI pipeline. So they test their change locally and push the pull request out. All of the test environments pass, because all of the test environments use a different size; production is the only one that has all of the data. Everything's green, so an engineer on the project approves the pull request. They don't fully read the code; they look at the green check marks. Off it goes through the pipeline, and it gets deployed to production. In the best case, this fails our production pipeline and everything stops: no more developer changes can make their way through, and someone has to go debug the issue, revert the change, and get other changes flowing again. In the worst case, pods actually go down and capacity problems follow. But we tested everything. We used automated CI. So what's the cause? What went wrong?

I think that's a really interesting question. Lots of things went wrong. One of the most fundamental is that we had two different changes bundled together. Why? Because it's much easier for developers to put two changes together so that the friction of the testing and deploy tools is only paid for once. They also didn't fully understand the schema: when does a value need a unit, when doesn't it, what does the unit mean, and when is it applied? Our system didn't test against real environments; it tested production-like environments, but not exactly production. Other signals that had nothing to do with changing the size of production storage were trusted far more than actually reading the code, because the green check marks said everything was good. The people who did the code review did their best, but they're not experts in YAML, or platforms, or what will happen if this value is wrong, nor should they be. And after everything went wrong, we didn't have a clear understanding of what to roll back: which change caused the problem, which problems were part of this change, and which change was fine for other parts of the code. Our system didn't understand.

With all of that going wrong, it feels really easy to go write a bunch of tests and say this won't happen again. But I'd say there's something better we can do. There were two changes they wanted to make that were hard and scary. Can we make them less hard and scary? Yes, we can: we can make a tool for that. Can we make a tool to add users to databases? Sure. I can make a UI that loads the YAML, knows the schema, validates the input, and ensures we test size changes in the correct environment.
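To make that guardrail idea concrete, here's a minimal sketch of a size-change validator, assuming Kubernetes-style quantity strings. The suffix table, growth limit, and function names are illustrative assumptions for this talk writeup, not anything from the talk or a particular product.

```python
# Hypothetical guardrail sketch: check a storage-size change before it is
# ever applied. Suffixes, limits, and names are illustrative assumptions.

SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "K": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}

def parse_quantity(quantity: str) -> int:
    """Turn a quantity string like '500Mi' into a byte count."""
    for suffix, mult in sorted(SUFFIXES.items(), key=lambda kv: -len(kv[0])):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * mult
    # A bare number is ambiguous; that ambiguity is exactly the slip in the story.
    raise ValueError(f"{quantity!r} has no unit suffix; refusing to guess")

def validate_resize(current: str, requested: str, max_growth: float = 4.0) -> None:
    """Reject resizes that shrink the volume or grow it implausibly fast."""
    cur, req = parse_quantity(current), parse_quantity(requested)
    if req < cur:
        raise ValueError("storage cannot shrink in place")
    if req > cur * max_growth:
        raise ValueError(
            f"requested {requested} is {req / cur:,.0f}x current {current}; "
            "growth beyond the configured limit needs an explicit override"
        )

validate_resize("500Gi", "800Gi")        # a reasonable bump passes quietly
try:
    validate_resize("500Gi", "800Ti")    # the unit slip from the story
except ValueError as err:
    print(f"blocked: {err}")
```

Wired into the UI or the pipeline, a check like this turns the unit slip into an immediate, explainable rejection instead of a production incident.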
This will never happen again, not for a size change and not for an add-a-user change, because we've written an automated tool that won't let it. We can repeat that over and over to build ourselves a set of tools, because we know the inputs that are required, we know which changes are scary, and we've constrained them so they're not nearly as scary. I think this is a really powerful way of taking the incidents we find and building new guardrails.

But you can't go crazy with it. You can't build guardrails until no developer can ever make a change. So we need a mental model for when we need a tool that adds automation and structure, and when we let developers move things around freely. For me, the mental model is rock climbing. If you're a rock climber and you want to test yourself, test your grip strength, you go bouldering. You go to your local bouldering gym and climb one, two, three meters into the air. If you fall, there's nothing to catch you, but you're falling a small distance onto a padded floor, so there's only a small chance of anything bad. That works great: no safety nets, nothing to slow you down, no harness to keep you from moving the way you want. But once you start climbing the mountain, once you start going into production, you don't want to make changes that freely. You don't want to jump around; you want safety harnesses so that everything is a little more controlled. The scary thing is doing the freewheeling stuff in production. Don't do that in production.

If we take all of these things that have gone wrong, these places where developers need help, where they could go faster, where we could build a tool to make things better or easier for them, we can all learn together. We can take the patterns that show up, the things that are common, and say: those are industry-wide, those are humanity-wide. How can we learn from them? For me, those are the really interesting part. So I have four quick patterns. They won't work for everybody, but hopefully they'll work for you.

The first: don't use modals where you care about accuracy. Don't use pop-ups where you care about accuracy, and don't use them where you want the user to think in depth. When you see a pop-up, your initial reaction is to make it go away; you want it to be done. So if we're designing platforms and we put up a pop-up that says "Are you sure you want to delete this database?", we're trying to make things safe, but the user's reaction is going to be: please make this pop-up go away. If you want to encourage safe UI, don't use a modal for that. Use a full page. Ask them to read something. Ask them to scroll. It works very well.
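As a minimal sketch of that pattern, assuming a simple command-line flow (the function names and wording are hypothetical, not from the talk), the confirmation can force the operator to read the consequences and retype the resource name instead of clicking a button:

```python
# Hypothetical sketch of the "no modals for destructive actions" pattern:
# a deliberate confirmation step that requires reading and retyping,
# rather than a dismissible yes/no dialog. Names are illustrative only.

def confirm_destructive(kind: str, name: str) -> bool:
    """A slow, full-attention confirmation step instead of a pop-up."""
    print(f"You are about to delete the {kind} '{name}'.")
    print("All of its data will be removed and cannot be recovered.")
    typed = input(f"Type the {kind} name to continue: ")
    return typed == name

def delete_database(name: str) -> None:
    if not confirm_destructive("database", name):
        print("Confirmation did not match; nothing was deleted.")
        return
    print(f"Deleting database '{name}' ...")  # the real deletion would go here

if __name__ == "__main__":
    delete_database("orders-prod")
```

The same idea translates to a web UI as a dedicated full page rather than a dismissible modal: the extra reading and typing is the point.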
The second big pattern is moving validation earlier and earlier in the cycle, which means validation gets skipped less often. Skipping or ignoring errors is one of the biggest causes of problems and tech debt, so if you want people to react, and react well, to feedback, putting that feedback earlier in the development cycle works really well. Rather than an after-commit hook telling them a field was wrong, add a linter that runs at code-change time. The earlier you can run the validation, and the more you can put it into a UI that shows the problem, the better.

Third: while you're building or operating platforms, things are going to change and react to new inputs. Being able to go back and say who changed what, and when, is super critical. Showing people that users X, Y, and Z changed these three things makes for an amazing debugging experience, and it also helps new developers on your platform see what's happened before.

Fourth: showing some of the inner workings of a platform helps, because users never have to break into that layer of abstraction; they understand just enough to know that the abstraction is working well.

So those are some of the patterns I've seen that really work. I'd love to hear what UI patterns you've seen, what you've seen that makes production better and development faster. And if any of these solutions or patterns sound good to you, we're building Batteries Included, a fair source, all-inclusive infrastructure platform that, like I said, is hopefully releasing soon. You can sign up for early access on the website right now, or email me for a personal demo and early access. Thank you very much for coming to the talk and to the conference. I hope you've had a good day. Bye bye.

Elliott Clark

CEO @ Batteries Included

Elliott Clark's LinkedIn account


