Conf42 Site Reliability Engineering (SRE) 2024 - Online

The reality of Platform Engineering at Enterprise Scale

Abstract

How we embarked on a platform engineering journey & used a product management mindset to scale to almost every team in Dynatrace.

Summary

  • My session is all about platform engineering at Dynatrace: the journey that we've been on, and continue to be on, to enable 1,000 developers. We'll talk about some of the discoveries and lessons learned, and where we are going next as an organization.
  • Adam Gardner is a CNCF ambassador and a developer relations expert at Dynatrace. He put together a hands-on platform engineering demo for the Perform conference in Las Vegas. Everything is available for you to spin up in a GitHub Codespace. The demo also includes a 360 feedback form for the platform team.
  • Complexity in platforms means bloat, which leads to security, performance, compliance and maintenance issues. The smaller you can keep the product, the better your outcomes. Keep it simple. Avoid gold plating. Don't repeat yourself.
  • MVSP stands for Minimum Viable Secure Product. It is a checklist that goes through what it means to develop an enterprise-ready, secure product. Dynatrace is using this activity as a testing ground to scale platform engineering across the organization.

Transcript

This transcript was autogenerated.
Hi everyone, and welcome to this session. First of all, thank you very much to the Conf42 organizers for allowing me this time to speak. Let's jump straight into it. My session is all about platform engineering at Dynatrace: the journey that we've been on, and continue to be on, towards enabling 1,000 developers with platform engineering at Dynatrace, some of what we've achieved, and some of the lessons we've learned so far.
So, a quick agenda. We'll do an introduction: who am I? I'll have to explain the Dynatrace Perform conference and the hands-on training session so that you understand how I got involved in all of this. We'll zoom out to the wider Dynatrace use cases across the organization. We'll talk about some of the discoveries, some of the lessons learned, and then, of course, I'll wrap up with what's next. Where are we going next as an organization?
So first up, who am I? My name is Adam Gardner. I'm a CNCF ambassador. I'm also a developer relations expert at Dynatrace, and I spent about nine months working on this area of platform engineering at Dynatrace.
This story begins when Andy (Andreas) Grabner, a fellow CNCF ambassador who also works at Dynatrace, came to me and said, "Adam, we want to do something for the Perform conference in Las Vegas." Now, Perform is the yearly flagship conference for Dynatrace in Las Vegas. And Andy said, "We want to create an observable platform engineering HOT session." What's a HOT session? It's the hands-on training. The day before the conference, prospects and customers come into the room and we enable them on some topics. There are about ten rooms. Some of them are on dashboarding, some are on alerting, some are on Kubernetes, OpenTelemetry, and so on. So of course we wanted to put together a platform engineering demo.
It was a little bit of a unique setup because we had the instructors, of course, who were playing the role of the platform team, delivering that software as a product to the application teams. And each of the users that came into the training session would model an individual app team. This is one of the sessions; I think we ran six of these sessions. That URL, by the way: everything I'm talking about is available for you to spin up in a GitHub Codespace. You can actually spin this demo up yourself if you want to have a play around with it.
So what was the platform? Well, the platform was based on GitLab. GitLab was on the cluster, being our source of truth. We did it that way because we didn't want or need the people walking into the room to have any sort of dependencies. Then we had Argo CD, Argo Workflows, and Argo Rollouts to deliver the GitOps side of the solution. Backstage, of course, is the platform UI. Keptn watches what Argo is deploying and actually generates DORA metrics, so the OpenTelemetry data for those deployments. Of course, we've got OpenTelemetry and the OpenTelemetry Collector on the cluster, and cert-manager to generate TLS certificates. OpenFeature is there for feature flagging; OpenFeature is a vendor-neutral specification for feature flagging. We've got CloudEvents, we've got NGINX ingress, obviously, to get into the cluster, and then we've got a whole host of security scanning, so products like kube-hunter, kubeaudit, kube-bench, and so on.
All of this observability data is getting fed back into an observability platform of your choice. In this case, we were using, of course, Dynatrace. Talking of the observability, it starts as soon as you spin up the environment, because we need to know whether the environment actually spun up successfully. And this is what you're looking at: an OpenTelemetry trace of the platform itself, signaling that it's healthy. Once the platform has spun up, this is what it looks like.
So Argo is managing everything else that's on the cluster, and then as users, we start to use the platform. In terms of the platform capabilities, you've got two. You can create or onboard a new application, and when you hit Choose, it walks you through a workflow that says: what's the application name? What version should it be? Do you want DORA metrics tracking this deployment? Do you want security scans, observability tooling like dashboards, and things? Next, next, next, and the platform goes and deploys your application.
The second one is more about the 360 feedback for the platform team. We, the instructors, of course need feedback to see whether the application teams are actually enjoying this whole thing or finding it useful. So when someone picks Choose on the feedback form, they get a form and fill it out. The platform packages up that feedback as a cloud event and then sends that into Dynatrace. And so we, as a platform team, can see that feedback in real time.
What about the use cases, if we zoom out across Dynatrace? Because a curious thing happened on day two. After we delivered all of those sessions, I was standing upstairs and someone came across and said, "Oh, I hear you're doing platform engineering. Can you tell me more about it?" I went through the spiel and they said, "Okay, that sounds good." And then someone else came up, and someone else came up. By the end of the day, I'd spoken to almost every other department: sales engineering, services, D1, the post-sales people, research and development (the devs), and also the support organization. Boiling it down, they could all do with platform engineering, and it really came down to these use cases.
Easily accessible disposable environments. This is where sales engineering and post-sales want to rock up at a customer site, demo something like OpenFeature or OpenTelemetry, show the customer, tear down the environment, and walk away. They don't want to worry about provisioning or anything like that.
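The feedback flow described above, where a submitted form is packaged as a cloud event, can be illustrated with a minimal sketch. It builds a CloudEvents 1.0 envelope as a plain dictionary using only the Python standard library. The `source` and `type` values, the field names inside `data`, and the 1 to 5 rating scale are hypothetical and not taken from the demo; sending the event to an observability backend is deliberately left out.

```python
import json
import uuid
from datetime import datetime, timezone


def build_feedback_event(rating: int, comment: str, team: str) -> dict:
    """Wrap one feedback form submission in a CloudEvents 1.0 envelope."""
    # Assumed rating scale for this sketch only.
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    return {
        # Required CloudEvents context attributes.
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": "/platform/backstage/feedback-form",  # hypothetical
        "type": "com.example.platform.feedback.submitted",  # hypothetical
        # Optional context attributes.
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        # Event payload: the form fields (hypothetical names).
        "data": {"team": team, "rating": rating, "comment": comment},
    }


if __name__ == "__main__":
    event = build_feedback_event(4, "Onboarding was quick", "app-team-7")
    print(json.dumps(event, indent=2))
```

Keeping the envelope separate from the transport means the same event could be posted to any backend that accepts CloudEvents in JSON form.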
Now imagine you're a developer who's just been hired by Dynatrace. If you're a Java developer, you might need a template for a Spring Boot application. Or if you're a Dynatrace app developer, you need a template for a Dynatrace app, to get you up and running as fast as possible. Then, of course, there are the basic, normal things like provisioning infrastructure: CPU, memory, network, clusters, storage buckets, that kind of thing.
And possibly the most interesting or novel use case I saw throughout the day was the support ticket. When a support ticket comes in, the support team needs a reproducible environment. But then, of course, when they've solved that support ticket, they just want to throw that environment away. So this is where the platform engineering aspect comes in: what can we layer on top of that to make all of these use cases accessible?
So, summing all of this up, this nine-month journey: what have I discovered? What have we discovered on our way to 1,000 developers using this sort of technology at Dynatrace?
Platforms are products. You're maybe not going to have money changing hands, but time is money, so you're going to have to convince someone that it's worth putting a team together to work on this stuff. So treat it like a product. Have a roadmap, have a product manager, have a sales tactic, an outreach program where you go out to other teams and talk about what you're doing, because people only have a finite amount of time and attention. Make sure you grab them and keep them.
Platforms are abstractions. They're a way to take the complex and make it a commodity. They're also opportunities to offer guidance and best practices, or to enforce policies. So if you've got something complicated, platform engineering allows you as an organization to scale that in the most efficient way.
But we couldn't talk about platform engineering without talking about complexity, because it can be quite easy and tempting. You'll see in a moment.
I fell for this as well: to just do everything, shove everything in there. But complexity in platforms means bloat. It leads to security, performance, compliance, and maintenance issues. The smaller you can keep the product, the better your outcomes and the easier it will be to maintain.
Talking about enforcing policies, guidance, and best practices: when you spin up the platform demo, this is what you get. You get a checkbox that says, do you want DORA metrics enabled or disabled for this deployment? Do you want to include security scans, or Dynatrace or observability configuration like dashboards? Now, you may have no idea how that works. You might not be a security expert or know how security tooling works at all, but because there is a nice, simple yes or no box, you're more likely to say, "Yeah, well, I'll have some of that." As a platform owner, as the platform team, I can preselect that so it is enabled by default, and so I can guide you down the right paths.
Wrapping all of this up, the lessons learned. DRY: don't repeat yourself. Don't do things more than once. This is a problem we're trying to solve with platform engineering at Dynatrace. We've got a thousand developers, we've got many teams, and they're all doing things in a slightly different way, in their own way, with their own stacks.
Keep it simple. I think I've already covered this to death, but it's really, really important. A feature that you implement today is a support ticket tomorrow. If it's not there, you can't get a support ticket for it. So really be clinical and critical about what you're including. Avoid gold plating; that goes to the same idea. Keep it simple.
The MVSP. We're all familiar with MVP, minimum viable product. MVSP is Minimum Viable Secure Product. I'll talk about that more in a moment. As I said, I fell victim to this as well, because as soon as we developed the MVSP for this demo, this happened.
I thought, oh, we could add feature flagging, and we can add Kyverno and chaos tools and policy and compliance tools, and on and on and on. And actually, there were no comments. Nobody wanted this stuff. So to me, that's a pretty good signal: don't build it. And I went and closed all the issues. It's very tempting, and it's very hard to say no.
Now, talking about the Minimum Viable Secure Product. This is actually an initiative that is a checklist, and it goes through all the things that it means to develop an enterprise-ready, secure product. It already has backing from some big names: Salesforce, Google, the US CISA organization, Okta, and so on. So I recommend checking out MVSP.
That said, though, there are some exceptions. If I were a product manager and I could meet these three indicators, I would say you don't need MVSP, you maybe just need MVP: where the product or the thing that you're building is temporary, or throwaway, and it doesn't touch important systems. Now, "important" is up to you to define. I define it as not touching customer data, not touching production data, not really touching any real stuff at all. Because if you can meet these three qualifications, you can assume a breach. You can basically say, "Okay, I know there is a bad guy looking at this data. It doesn't matter, there's nothing real in there."
Talking about things that you may want to read or research: the Cloud Native Computing Foundation app delivery group has put out two brilliant pieces of content, the Platforms white paper and the follow-up, which is the Platform Engineering Maturity Model. The white paper will give you a general introduction, and the maturity model will really give you that checklist of where you are today with platform engineering and what you need to think about to get to the next level.
Where do we go next in Dynatrace? What's next on this journey?
We're using this activity as a testing ground to scale this across our organization, as I've already said, across those teams to more than 1,000 developers. During this, we realized we needed a supported OpenTelemetry Collector distribution from Dynatrace, and I'm happy to say that has actually been delivered now.
Of course, we deployed Backstage, and I wanted to integrate that with Dynatrace. We do have a Backstage plugin, but as part of this work we're developing a new, improved, updated Backstage plugin.
We're working on cloud native standards for pipeline observability. So: a pipeline run started, a pipeline run finished, an error occurred, and so on, and how we standardize that across the company, but hopefully across the industry as well.
In this demo, we're using the Dynatrace Monaco tooling, but we're now looking at more cloud native ways, like Crossplane and/or operators, for configuration as code. So that work is currently in progress as well.
And we're looking at dev containers. This whole setup is based on the dev container standard, and we're looking at how we standardize our disposable environments using the dev container specification.
So with that, thank you very much for your time. I hope it was useful. Here are a couple of links. Of the QR codes, the left one goes to the hands-on demo. You can spin that up in a Codespace, play around, and destroy it when you're done. GitHub gives you, I think, 2,000 minutes for free per month. So feel free to play around with this. Connect with me on LinkedIn if you get stuck or need any help. The links at the bottom are all of the white papers, the MVSP, and the Backstage plugin that I've discussed during this session. Thank you so much for your time. Enjoy the rest of the conference, and good luck on your platform engineering journey. Thanks. I'll see you soon.
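The pipeline observability idea mentioned in the talk (a run started, a run finished, an error occurred) can be made concrete with a small sketch. This is only an illustration of what a shared event vocabulary might enable, not the standard being worked on: the event type strings, the `PipelineEvent` fields, and the `summarize_run` helper are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class PipelineEventType(Enum):
    # Hypothetical event names; a real standard would define its own.
    STARTED = "pipeline.run.started"
    FINISHED = "pipeline.run.finished"
    ERRORED = "pipeline.run.errored"


@dataclass(frozen=True)
class PipelineEvent:
    run_id: str
    type: PipelineEventType
    timestamp: datetime


def summarize_run(events: list[PipelineEvent], run_id: str) -> dict:
    """Reduce the events of one pipeline run to a status and a duration."""
    run_events = sorted(
        (e for e in events if e.run_id == run_id),
        key=lambda e: e.timestamp,
    )
    started = next(
        (e for e in run_events if e.type is PipelineEventType.STARTED), None
    )
    # The first non-start event (finished or errored) closes the run.
    ended = next(
        (e for e in run_events if e.type is not PipelineEventType.STARTED), None
    )
    if started is None:
        return {"run_id": run_id, "status": "unknown", "duration_s": None}
    if ended is None:
        return {"run_id": run_id, "status": "running", "duration_s": None}
    status = "succeeded" if ended.type is PipelineEventType.FINISHED else "failed"
    duration = (ended.timestamp - started.timestamp).total_seconds()
    return {"run_id": run_id, "status": status, "duration_s": duration}


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    events = [
        PipelineEvent("run-1", PipelineEventType.STARTED, t0),
        PipelineEvent("run-1", PipelineEventType.FINISHED, t0 + timedelta(seconds=90)),
    ]
    print(summarize_run(events, "run-1"))
```

Once every pipeline tool emits the same few event types, metrics such as run duration and failure rate can be derived the same way regardless of which tool produced them.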

Adam Gardner

Ambassador @ CNCF



