Conf42 DevOps 2023 - Online

Platform Engineering is all about product

Abstract

Platform Engineering is the latest buzzword in the modern software engineering world. It is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era. The holy grail for platform engineering today is to achieve the most effective “Internal Developer Platform” (IDP) that enables the rest of the developers in the company to be as effective as possible. Can this job be accomplished with engineering skills alone? Platform intersects with product in two ways: first, the platform must be optimized for supporting the development of the company-specific product. Second, the platform has to be built with a product mindset and practices for it to be adopted by its users - the developers. In this session, we will discuss how to build an engineering platform that your engineers will actually want to use. We will go over standard product practices to use when creating the developer platform, and also the importance of making sure your IDP actually helps developers build the company’s products faster and better. We will define the role of the platform product manager (PPM), and the important part he plays in ensuring our platform is not a glorified Rube Goldberg machine.
In this session you will learn:
  • What is platform engineering? Is it just a new name for DevOps?
  • What makes an IDP and a platform team successful?
  • Who is the PPM? Why is he important? How do I convince my head of product we need one?
  • Practices you can use to build a successful platform, and pitfalls to avoid.

Summary

  • Gal Bashan: Today I want to convince you all that platform engineering is all about product. He says that only around 3% of organizations were able to achieve the DevOps ideal. Bashan asks how we avoid ending up with the same silos two years from now.
  • Epsagon was a company that was building a lot with serverless technologies. Its entire backend was based on Lambda, and it had a lot of Kinesis streams. A senior developer wanted to build a tool that would help developers in the company. But it wasn't adopted because it didn't solve a real problem.
  • The platform has to solve a problem that your developers have. Just because we can solve something doesn't mean that we should solve something. Our developers know what troubles them in their day to day. When you're building the solution, you should focus on value.
  • The takeaway here is that agile is still valid, even when we're building a platform and not a user-facing product. We should always build an MVP and let the user try it before building the next phase.
  • In order to build a good IDP, we need to first validate the problems that our developers have. Because it is a product, we should have a product manager that leads it and makes sure that it is valuable for our internal users. This product manager has to be measured against the success criteria.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, my name is Gal. And today I want to convince you all that platform engineering is all about product, and that product is an important part of building a successful internal developer platform. Now before I do that, I want to take you all on a stroll down memory lane. We all know the stories about throwing things over the fence, but I just want us to remember what organizations used to look like back in the 90s or in the early 2000s. These organizations were made of two groups. Well, a lot of other groups, but two groups that I want to focus on today, which are the dev group and the Ops group. The dev group was basically in charge of building features, and they wanted to build as many features as they could as fast as they could. And the Ops group was in charge of making the system run in production. They were tasked with the system being up, being stable, being reliable. And these two groups were siloed. And the silos basically caused a lot of miscommunication between them, which slowed down the business and hurt it. So somewhere around 2008 we all came together and said, okay, let's think of a new cultural shift and we'll call it DevOps. Now this cultural shift sounds brilliant. We're going to break down the silos between the two organizations. We're either going to have one organization or two organizations that are functioning very well in harmony together. The developers are going to own production, and you are going to be able to deliver software faster, more reliably, and in a better way. But the problem is that it's not as easy as it seemed. It turns out that only around 3% of the organizations were able to achieve the DevOps ideal, where every developer, if he builds something, can run it and operate it. Basically, the other organizations fell into two, or maybe a few more, but two main patterns.
One is where they just added another DevOps team, which tried to make connections between the dev and the Ops teams and basically just created a third silo. Another type of organization let go of Ops entirely. They said our developers are able to run Ops themselves, when in fact they couldn't; they just disregarded the art and the effort that is needed to run production. And the result was that there was kind of a shadow Ops operation going on. The senior developers are doing the Ops. No one is exactly sure what is going on, and only a few know how to operate production, which actually caused even more silos. Basically what we ended up with in either case is again two groups of people. One is the devs and one is the Ops. But instead of Ops we just call them DevOps now. And the silos continued because not everyone can adopt this complex cultural shift. Then ten years later, somewhere around 2018, it was probably before that, we all said, okay, Google developed something pretty cool back in the early 2000s and they called it SRE. So maybe instead of DevOps, we can do SRE. Let's take our DevOps group and embrace the practices that Google created, and now let's do SRE within our company, and that will break down the silos and make us work more effectively. And you guessed it, being Google isn't that easy. Most companies that try to adopt the SRE mentality don't follow all the rules and all the methodologies that come with it, and they end up with two groups that are siloed from each other: one is dev, and one used to be called Ops, then was called DevOps, and is now called SRE. The devs are focused on building features, the SREs are focused on holding production, and we're not getting to deliver value faster, we're not getting to build features faster, we're not getting to help our business in the way that we want. Fast forward five years, and now we're all saying, let's do platform engineering.
Now before we dive into that, let's talk a bit about what platform engineering is, right, what platform engineering is offering us. Platform engineering is basically saying, okay, let's build an internal developer platform that will help our developers move faster. It will simplify the workflows and give them paved roads, some golden paths, so they can focus on bringing value into our business. We will have a group, the platform engineers, who will build this platform, and the rest of the engineers will use this platform in order to create value for our business. And now my question that is going to follow this entire presentation is, how do we avoid this? How, two years from now, do we not end up with a bunch of organizations that look exactly like this? And this is what I'm going to try to answer in this session. Let me just give a brief introduction of myself. Hi everyone, I'm Gal Bashan. I'm the head of engineering at Epsagon, which was recently acquired by Cisco. Before that, I did a lot of cool cybersecurity stuff at the IDF. I love building, and that's my Twitter handle if anyone wants to give me a shout out. So as I said before, our goal here is enablement. We want our platform, our internal developer platform, to enable our developers. And what does that mean? Basically, every developer is different. We have the senior developers who can tweak the Helm charts and can do all the configuration that is needed, and that is good. But we also have the junior developers who don't care if the application is running on ECS or EKS, and the platform that we build needs to enable both of them. So the goal here is to create some common use cases or some golden paths or some useful tools that will help the majority of our developers develop faster, operate faster, and bring value in a better way. So usually we measure our IDP by three main aspects. Does it help our developers build in a more secure fashion?
Does it help our developers build in a more cost-effective fashion? And do our developers have a better developer experience? Now the first two are pretty self-explanatory. If by using the platform I'm getting automatic security features, the code that I write is more secure, and I don't need a lot of boilerplate to bake security in, then the platform is probably helping me get security baked in. If the platform optimizes the instance size to the requirements of my application, it's probably saving me some money and it's helping me with costs. Developer experience is something that's a bit harder to measure. A lot of people try to measure developer experience by cycle time: how long does it take from when I start working on an issue until it is in the air and in production? Another way is to measure MTTR, mean time to resolve: how long does it take from the second that I found a problem within my system until it is resolved? We can also have a look at other metrics, like how often a PR gets merged or how often a release is triggered. But really, if you want to know if you have a good developer experience in your company, you should look at your attrition. How many of the developers want to stay within your company? For how long does a developer stay within your company? Of course, there are a lot of other factors that affect attrition, but if you have a good developer experience for your developers, you won't have high attrition rates. So now that we know what we're aiming for, why we want to build this IDP, I want to talk about a few of the important aspects of how to build this IDP, and I want to start out with a story. This is a story that we've had at Epsagon. Basically, it was way before we had platform engineering in place. We had a senior developer with some spare time, and he wanted to build a tool that would help developers in our company. Now, Epsagon was a company that was building a lot with serverless technologies.
So our entire backend was based on Lambda, and we had a lot of Kinesis streams. And if you're not familiar with Kinesis, then, and I know the data engineers here will be angry with me, but Kinesis is basically managed Kafka. It's a streaming service by Amazon that helps you connect different asynchronous services using streams, which is similar to the Kafka streams that you're probably familiar with. And we were working a lot with Kafka because we were streaming. Our solution was an observability solution, and we were streaming a lot of traces that our customers were sending. And what this developer thought is that our team could benefit from the ability to stream those traces not in the cloud, not into Lambda, but directly to our local machines, in smaller quantities using sampling, just so we can debug what's going on in the cloud. So he actually went out and wrote this tool and solved it. But after a month or two, we noticed that no one really used it. And the problem is that it wasn't adopted because this wasn't a real problem. It turns out that our developers were perfectly fine with just debugging on Lambda or writing this code themselves; getting traces from Kinesis into your local computer was relatively simple. It was like ten lines of code. So just because a solution is there and it is cool, it doesn't mean that this is a pain point that our developers really had. So this leads us to our first takeaway. The platform has to solve a problem that your developers have. Just because we can solve something doesn't mean that we should solve something. So what problems can our developers have? They can actually have a lot of problems in a lot of different areas. They can be wasting a lot of time on infra. A lot of the time can go into just building Helm charts, or configuring every single one of their pods or their instances. And if we can provide them with some templates, we can save time. There can be a lot of boilerplate code.
Maybe in order to set up a service, they have to copy and paste from a bunch of different services. And if we create a tool that helps them create a service, that will save them a lot of time. They can be struggling with security. Maybe we should bake in some automatic KMS solutions, like key management solutions, just so they can use them more effectively and not have to look it up every time they need it. Maybe they're missing observability alerts for their service. They want to be automatically alerted on RED metrics instead of having to set it up for each of their individual services. Maybe it's code ownership. They don't know who is in charge of this library. It can be cognitive load. They can be in charge of too many things. It can be quality. Maybe it's hard to write tests. There are a lot of different aspects. So how should we know where to focus? And the second takeaway is that our developers know what problems they have. While the problem range is very big, our developers know what troubles them in their day to day. And if we just go and ask them, we'll know where we should start looking. So we should interview our developers. We should use those interviews to collect data. We should sit in retros and see what comes up. We should look at the recent bugs that we have and understand what the latest root causes were. And we should use all this data in order to choose what problems we want to solve first. And again, don't start with the solutions. Start with the problems that you want to solve. Next, even if we found the right problems to solve, it is easy sometimes to just focus on building a cool technology and somehow solve this problem instead of just solving this problem in an effective way. The problem is, if we're not focused on the need of the user, we end up doing probably one of three things. We just build every specific thing that our developers want from us.
And this is not a very efficient way to build a platform, right? Because if we satisfy the needs of one user, we're probably not satisfying the needs of another 99 users. The second thing is that we just find a cool technology that we want to mess with, and then we spend six months trying to make this technology solve our pain. This is something that we often do as developers: just try to insert a cool technology that we want to use into our problem space. And the third thing that we may end up doing is come up with a solution that is good for us. Because I'm a developer and I can imagine myself having the problem that the developer that I interview is having, I can just imagine, okay, this is the solution that would work for me, and build it. But it actually may not be the solution that works for the developer or that is convenient for him. Before I go ahead and build a solution, I have to validate that the solution is also good for the user, in my case the other developer in my company that I'm building it for. So the takeaway here is that when you're building the solution, you should focus on value. Don't go for impressive technology. Don't go for what would be easy for you. Ask the user, the developer in your company, what would be valuable for them, and then build that. The next thing I want to tell you is a story from back in my army days, and because some of it is restricted, I'm going to change the product domain a bit, but you'll get the big picture. So in my army unit, we were focused on the, let's say, bagels product domain. We were working with many vendors that provided different bagels. Some of them were coated, some of them had salmon, some of them were round, some of them were half, some of them were full. And we wanted to take data from all of those different vendors. And we built systems of our own using this data, manipulating this data, storing this data.
So we worked with a lot of proprietary bagel formats, but we had to store the data in a central location for several products that shared those different vendors. So we came up with the solution of, let's write a very generic bagel library that every project can later on use. So we went and understood the specs of the bagel, and what is the dictionary definition of a bagel, and how we should treat a bagel, and how we should abstract the cucumbers in the bagel. We spent around six months creating this library, and the developer, of course, was from one project, and it was a perfect fit for this project. After those six months, we released this library and we asked the other projects, hey, do you want to use this library? And they gave it a go. And after two weeks they understood that it was just not usable for them. It was too complex, there were too many options, it was too generic, and they were just not able to use it. And this library was basically neglected. So the takeaway here is that agile is still valid, even when we're building a platform and not a user-facing product. We should always iterate quickly, we should always give the user a taste of what he is getting. We should always build an MVP and let the user try it before building the next phase. So prioritization is important. We need to understand where we're starting: what is the most important pain point that we want to solve from the pain points that we identified, and what is the easiest solution from the solutions that we identified, and then execute the MVP of it. Give it to the customer, which is an internal developer in this case, but it's still a customer. Then get his feedback, and only if you understand that you're on the right track, proceed to the next step. So just because it is an internal platform and not an external product, it is not an excuse to go into the lab, sit for six months, build out this gigantic Rube Goldberg machine, and then just launch it at the internal developers of our company and have it fail.
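The prioritization step described here (pick the most important pain point and the easiest solution, then ship an MVP) can be sketched as a simple ranking. This is only an illustrative sketch: the `PainPoint` fields, the sample numbers, and the scoring formula (developers affected × hours lost ÷ effort) are assumptions made for this example, not something the talk prescribes.

```python
from dataclasses import dataclass


@dataclass
class PainPoint:
    """One problem raised in developer interviews (all fields are rough estimates)."""
    name: str
    developers_affected: int    # how many interviewed developers raised it
    hours_lost_per_week: float  # estimated time lost per affected developer
    effort_weeks: float         # estimated effort to ship an MVP of the easiest solution

    @property
    def score(self) -> float:
        # Hypothetical heuristic: total weekly pain divided by the cost of an MVP.
        return self.developers_affected * self.hours_lost_per_week / self.effort_weeks


def prioritize(pain_points):
    """Return pain points ordered from most to least worth solving first."""
    return sorted(pain_points, key=lambda p: p.score, reverse=True)


# Made-up sample data, loosely based on problem areas mentioned in the talk.
PAIN_POINTS = [
    PainPoint("local trace streaming", developers_affected=1, hours_lost_per_week=0.5, effort_weeks=6),
    PainPoint("service scaffolding", developers_affected=20, hours_lost_per_week=2, effort_weeks=2),
    PainPoint("helm chart templates", developers_affected=10, hours_lost_per_week=3, effort_weeks=4),
]

for point in prioritize(PAIN_POINTS):
    print(f"{point.name}: {point.score:.2f}")
```

The exact formula matters less than the habit: the inputs come from interviews, retros, and recent bugs rather than from what the platform team finds interesting, and only the top item gets an MVP before the list is revisited with feedback.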
So let's recap what we've talked about so far. We've talked about the fact that in order to build a good IDP, we need to first validate the problems that our developers have. We need to understand what problems they have and how we can solve them. Then we need to validate the solutions that we have in mind. We need to understand that the vision that we have for a solution is valuable for our internal developers. After that, we need to iteratively bring those solutions to those developers as fast as we can and validate that we're on the right track. We still have to use agile in order to make sure that we're not building something that is not usable. Another thing we didn't touch on is that we have to go to market. We have to convince developers to use this. We have to make sure that they understand that it will give them value and help them be better developers. Now this job description sounds kind of familiar. And it is familiar because it exists. This is the job description of a product manager. We have to look at our developer platform as a product. And because it is a product, we should have a product manager that leads it and makes sure that it is valuable for our internal users. Now when you're going to pitch this idea to your head of product, you are probably going to hear one of these four things. First of all, I can't bring on another product manager because it's expensive. It's another headcount. I don't have the budget for another headcount. Then you should go to your manager and ask him what is more expensive: hiring one product manager, or having an entire platform engineering team spend its time building something that no one will want to use because we didn't validate that what they're building is actually useful internally in the company. Another thing that you may hear is that it's an internal tool, so engineering managers can manage it. There's no need for product managers. Product managers only deal with outside-facing customers.
Now in my book, that's just disrespectful to PM skills. A PPM should be able to talk to users and understand their needs, but a user doesn't mean someone outside of the company; it just means someone who uses a product. And this is a very hard skill, just to interview someone and do it effectively, in a way that you understand what you can do in order to solve their problem. And if you're just saying that any engineering manager can do it without training, I think that's kind of disrespectful to product managers. Also, engineering managers have a lot to take care of. They have to take care of the development of their people, both personal and professional. They have to think about project management, like the execution part of the job. Just throwing the additional product management responsibility on them is kind of irresponsible. Another thing I hear pretty often is that developers know what they want. Look, you're a developer, you use platforms. So why can't you just build the platforms that developers want? To that, the answer is: why does Elasticsearch have product managers? Elasticsearch, we all know it. It's a product that is used by developers. So if developers know what developers want, why does Elasticsearch need product managers? Again, it goes back to the fact that just because I'm a developer and I know what I want, it doesn't mean that I know what every developer wants. And a good PM can talk to a lot of developers, synthesize the real need, and understand what we should build. And the last, most annoying excuse is that the platform usage is mandatory, so we don't need a PM for it because everyone is going to use it anyway. If that is your company's approach, then you are going to have a bad time, because no one wants to use an internal tool that is very, very hard to use. The platform team will have a bad time because no one will want to use the product.
The developer team will have a bad time because the platform will actually slow them down, because it's not very useful or handy. So mandating usage of the platform is usually a very bad idea. You should have the PM and the platform team make the platform so useful that developers actually want to use it. The takeaway here is that your internal developer platform is a product. It should have a PM, and the PM should make sure that the platform team is building something that the rest of the developers in the company want to use. Otherwise you're just going to end up with two organizations. One of them is dev, one of them is platform engineering. They're going to be siloed, and the platform engineering organization is not going to be a valuable addition to your company. This product manager has to be measured against the success criteria that we talked about in the beginning: does the platform help the developers build a more secure application? Is it more cost-effective? Is the developer experience better? Those are the things that this PM should be measured against. To sum it up, we need to build an IDP that enables developers. We have to come with a problem-first mindset. We need to solve a problem, not just build a cool solution. If we want to know what problem to solve, we should just ask our developers. They know what problems they're facing and they know what area we're lagging in the most. When we're building the solution, we need to make sure that we're building a solution that is valuable, not just one that is cool or complex or uses the most advanced technology. When we're building this solution, we should iterate fast, we should use agile methodologies, and we should double check all the time that we are on the right track. And finally, all of this should be led by a product manager that knows what he's doing and has a clear vision of where he wants to take this platform. Otherwise we'll just end up with a fancy Rube Goldberg machine.
That's it everyone. If you have any questions you can find me on Twitter or just mail me. I hope that this was informative, and I hope that you'll all go back to building a useful platform for the developers in your company. Thank you.
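The two developer-experience metrics named in the talk, cycle time and MTTR, can be made concrete with a small calculation. The following is a minimal sketch under stated assumptions: the record layout (`started_at`, `deployed_at`, `detected_at`, `resolved_at`) and the sample data are hypothetical, and in practice the timestamps would come from an issue tracker and an incident tool. Using the median for cycle time and the mean for MTTR is a choice made for this example.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical sample records; field names are assumptions for this sketch.
ISSUES = [
    {"started_at": datetime(2023, 1, 2, 9), "deployed_at": datetime(2023, 1, 4, 9)},
    {"started_at": datetime(2023, 1, 3, 9), "deployed_at": datetime(2023, 1, 4, 9)},
    {"started_at": datetime(2023, 1, 5, 9), "deployed_at": datetime(2023, 1, 10, 9)},
    {"started_at": datetime(2023, 1, 9, 9), "deployed_at": None},  # still in progress
]
INCIDENTS = [
    {"detected_at": datetime(2023, 1, 4, 12), "resolved_at": datetime(2023, 1, 4, 13)},
    {"detected_at": datetime(2023, 1, 8, 8), "resolved_at": datetime(2023, 1, 8, 11)},
]


def median_cycle_time(issues):
    """Median time from starting work on an issue until it is in production."""
    durations = [
        issue["deployed_at"] - issue["started_at"]
        for issue in issues
        if issue["deployed_at"] is not None  # skip work that has not shipped yet
    ]
    return median(durations) if durations else None


def mean_time_to_resolve(incidents):
    """Mean time from detecting a problem until it is resolved (MTTR)."""
    durations = [
        incident["resolved_at"] - incident["detected_at"]
        for incident in incidents
        if incident["resolved_at"] is not None  # skip open incidents
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


print("Median cycle time:", median_cycle_time(ISSUES))
print("MTTR:", mean_time_to_resolve(INCIDENTS))
```

Tracked before and after a platform change, numbers like these give the platform product manager a baseline to be measured against, alongside softer signals such as attrition.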

Gal Bashan

Head of Engineering @ Cisco
