Conf42 Platform Engineering 2024 - Online

- premiere 5PM GMT

Bridging the Gap: Real-Time Topology as the Connective Tissue Between Platform Engineering and SRE


Abstract

Discover how real-time topology graphs can enhance platform engineering by creating a shared framework between SRE and engineering teams, improving reliability and providing crucial runtime context for internal developer platforms – plus what cybersecurity teaches us about closing the loop.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, everyone. I'm very glad to be here. I'm Amir, the CEO and co-founder at Senser. Senser is an AIOps company bringing together advanced observability elements, using eBPF, with machine learning stacks that use that data to look for deficiencies and service degradation. What we came to speak about today is a very interesting concept: bridging the gap, both technically and from an operational point of view, and how to utilize the advances that come in the form of an internal developer portal. Essentially, how to use the real-time topology, the dynamics of the environment, as the connective tissue between the platform engineering part and the DevOps and SREs. Going through the journey we'll cover today: we'll speak about IDPs in general and what's missing from the IDP. We'll see that we're not the first use case to suffer from these exact phenomena, and what we can learn from cybersecurity in that respect. We'll take a deep dive into how real-time topology can be used as a shared language to bridge between the IDP operation — what I want to do, the configured state of the services or system — and the runtime topology, which is an important part of connecting these two worlds. We'll speak about what you can do with a closed-loop mechanism: what happens if you feed data from the observability mechanism, the real-time topology, back into the IDP. And we'll finish the session with a few pointers on how to get started and which parts deserve the most attention. I hope you'll enjoy it. Let's start. IDPs, or internal developer portals, are incredible tools for bridging between the development part and the DevOps part — essentially, the ability for an organization to create
Both the schema and the rules of what is going to run in production: what type of artifacts need to be collected about services, what needs to be configured about them, and how to expose that to the developers as a means to save time for our platform engineering and DevOps teams. But pay attention that IDPs are very focused on what I'm trying to do, what I'm trying to achieve, what configuration state I want applied in the production environment. In this case we're looking at a Kubernetes environment, or Kubernetes topology, in which we have the physical resources, like the nodes, and the logical resources, like the Kubernetes clusters, the namespaces, the workloads, and the pods. We can see the schema element on top of them — the elements we're trying to bring into configuration through the IDP, all the way to the developer. On their own, though, they are one half of the equation, because they're sometimes missing or lacking critical context. That context stems from the difference between what happens at design time and what will happen in the runtime environment. Resources will behave differently. If we're looking at the pod or workload level, things like resource requests and limits are elements or artifacts I'm configuring, but the behavior at runtime is going to be different. It will be different because the node has different resources to give, and so on and so forth. The interactions between the different elements are also sometimes hard to encompass. Things like APIs, which are crucial for the interaction between elements, are much harder to model in a static environment compared with what they're really doing in production and how they manifest themselves there.
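The design-time vs. runtime gap described here can be sketched in a few lines. This is a minimal illustration, not a real integration: the workload names, numbers, and the 50% drift threshold are all invented; real values would come from the cluster's metrics pipeline.

```python
# Sketch: compare configured CPU requests (the IDP's design-time view)
# with observed runtime usage. All names and numbers are illustrative.

def drift_report(configured, observed):
    """Flag workloads whose runtime CPU usage deviates from the
    configured request by more than 50% in either direction."""
    report = {}
    for name, spec in configured.items():
        usage = observed.get(name)
        if usage is None:
            report[name] = "configured but never observed running"
            continue
        ratio = usage["cpu_millicores"] / spec["cpu_request_millicores"]
        if ratio > 1.5:
            report[name] = f"under-provisioned: using {ratio:.1f}x its request"
        elif ratio < 0.5:
            report[name] = f"over-provisioned: using {ratio:.1f}x its request"
    return report

# Design-time view (what was declared through the IDP).
configured = {
    "checkout": {"cpu_request_millicores": 200},
    "payments": {"cpu_request_millicores": 500},
    "legacy-batch": {"cpu_request_millicores": 100},
}
# Runtime view (what observability actually measured).
observed = {
    "checkout": {"cpu_millicores": 450},  # bursting well past its request
    "payments": {"cpu_millicores": 480},  # roughly as configured
}

print(drift_report(configured, observed))
```

Note how "legacy-batch" surfaces as declared-but-never-running — exactly the kind of fact the static schema alone cannot tell you.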
Another good example would be the use of third parties. Sometimes it's very hard to conceptualize even the schema layer: I'm using a certain third-party service which is an important part of the service I'm trying to deploy. How am I learning about that exposure? How am I learning about those dependencies, which are of course very important in the complete software lifecycle and deployment lifecycle we're trying to achieve? And of course there is runtime configuration: what eventually happened, what was allocated, what was configured on the different physical and logical layers, and misconfigurations, some of which can happen only at runtime. For example, someone bypassing the IDP and making a configuration change directly in the cloud environment can lead to a lot of interesting gaps between what the IDP is seeing and the real situation. The important aspect to understand here is that we're not the only domain facing this exact challenge. A very good analogy is the world of cybersecurity — and, more precisely, taking platform engineering and SRE as the comparison, the relationship between application security (AppSecOps) and DevSecOps: the ones responsible for application security going into production versus the ones who actually serve and maintain it in the runtime environment. Throughout the short but meaningful history of cybersecurity, bridging the gap between static analysis and runtime, to create an environment tailored to the shift-left movement, was an important part of making that movement a reality. In the static catalog, the artifacts would be things like vulnerabilities, OS components, authentication and encryption keys, the configuration of different frameworks, and secrets. In an analogy to an IDP, these would be part of the schema I'm trying to configure.
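The out-of-band change problem above — someone editing the cloud directly instead of going through the IDP — is essentially a diff between declared and live configuration. A toy sketch, with hypothetical keys and values:

```python
# Sketch: detect drift between what the IDP declared for a service
# and what is actually live in the cloud. Keys/values are invented.

def config_drift(declared, live):
    """Return every key whose declared and live values differ,
    including keys present on only one side."""
    drift = {}
    for key in sorted(set(declared) | set(live)):
        if declared.get(key) != live.get(key):
            drift[key] = {"declared": declared.get(key), "live": live.get(key)}
    return drift

declared = {"replicas": 3, "memory_limit": "512Mi", "public_ingress": False}
live = {"replicas": 3, "memory_limit": "1Gi", "public_ingress": True}

print(config_drift(declared, live))
```

Here the memory limit and the ingress exposure were changed outside the IDP — the second one being exactly the kind of surface-of-attack fact the cybersecurity analogy cares about.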
But the runtime environment yields a lot of important dynamics needed to tie the knot between them. The questions in production would be: is something really deployed? Where is it deployed? Is it internet-exposed, in terms of the attack surface? Is it behind some sort of firewall or filtering mechanism? Are there runtime misconfigurations? Do they have any relevance to something on the static catalog side? That bridging was very important in order to, first, get an accurate view of the actual attack surface, and to focus on and prioritize the real risks — not just every vulnerability, which would of course overwhelm the entire team because we'd try to fix everything. That formalization of what is important and what the priority is is crucial for these teams to function efficiently. And of course there's the embedding of policies and guardrails earlier in the SDLC, making sure that the elements I put in at configuration time, in the earlier stages of the software development and deployment pipeline, are really going to appear in the runtime environment. That's a very strong analogy to what's actually happening in the IDP world. So for platform engineering, this context comes from having a very good grasp, a very deep view, of the real-time topology of the services — what is actually going on in production. The benefits of being able to feed this back into the IDP are numerous; I'll mention just a few. The ability to get full, closed-loop visibility: what I tried to do, what I meant for things to look like, and what actually came about in reality. More efficient troubleshooting: of course, there is a vague line between where one responsibility, like the platform engineer's, ends and when
Another responsibility, like the SRE's, starts. There's a lot of back and forth between them, going all the way back to the developers, who are fed by the IDP; therefore this type of information is super crucial. And eventually, less IDP maintenance is required, for various reasons: being able to stay accurate on the status of the production environment, not being left guessing about things which are runtime-related rather than IDP-related, and so on and so forth. To be clear, we understand it's not a trivial move. People have attempted it — there are some runtime integrations into the IDP, like getting observability streams from your APM or infrastructure monitoring tools — but it's still hard. It's hard because the runtime dynamics are much more complex than the static ones. The static side can be somewhat compared to a configuration file or a configuration tree, something I can traverse easily. That's not the nature of the runtime environment. The runtime environment is a live, moving, changing graph, with a lot of inputs coming from the infrastructure layer, from the application layer, and from changes flowing through the software deployment pipeline. All of these together make the dynamics very active and very time-sensitive, so this type of bridging becomes harder to achieve than you might think at the beginning. So: topology enriches the IDP with runtime context on multiple layers. We're speaking here not only about how the production environment looks, but also about the context of the running applications and infrastructure, the network layer, and the APIs. We'll give a few examples as we go through the following slides. What you see here comes from our system, though similar results could be achieved elsewhere.
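The "live, moving graph" point can be made concrete: the runtime side is a set of observed caller→callee edges, and crossing it with the static dependency declarations surfaces edges the schema never knew about. A small sketch — the services and edges are invented:

```python
# Sketch: treat runtime as a graph of observed calls and surface
# edges absent from the static declarations. Names are illustrative.

def undeclared_edges(declared_deps, observed_calls):
    """Return observed caller->callee edges missing from the
    static dependency declarations."""
    declared = {(src, dst)
                for src, dsts in declared_deps.items()
                for dst in dsts}
    return sorted(set(observed_calls) - declared)

# Static view: dependencies declared in the IDP schema.
declared_deps = {
    "frontend": ["checkout"],
    "checkout": ["payments"],
}
# Runtime view: edges actually observed on the wire.
observed_calls = [
    ("frontend", "checkout"),
    ("checkout", "payments"),
    ("checkout", "inventory"),  # never declared in the schema
]

print(undeclared_edges(declared_deps, observed_calls))
```

The undeclared `checkout → inventory` edge is precisely the kind of runtime fact a configuration-tree traversal can never produce.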
First of all, there is the ability to automatically get the important information coming from the observability layers — which parts were deployed, what type of APIs are now being run — and that, as a starting point, gives a very nice glance at what was deployed in production. Moving forward, if we look at the hierarchy of the IDP schema, we'll most likely be looking at specific services: the deployments, the run status, the runtime configuration (which appears as runtime labels, and so on), and the performance of strategic services. If we look again here, at one of the services, within that context we can see things like the health behavior; the important metrics, which are crucial to understanding the health and performance of that specific service; the ability to track the deployment — what's meant to be happening and how it behaves at runtime; and of course the events, the things which accompany the operational status of the service in the production environment. Moving forward to the next level, and trying to combine what we spoke about earlier, this would be the third-party resources and the internal resources with regard to their actual properties — things like availability, latency, and status. The third-party resources could be public APIs or payment gateways, for example. If we look here again, now at the real-time Kubernetes topology, here are the third parties being used: things like the databases or queues, runtime caches, and APIs. And if we open the internal deployment, we can see the runtime dynamics. Each and every thing here is accompanied by a lot of different layers, but the important part is that you can really see the flow — you can really see how it's all tied into the bigger picture.
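One way to picture attaching runtime labels and health to catalog entries is a simple merge: each static entry gains a "runtime" section from whatever observability reported. Entry shapes and field names here are assumptions for illustration, not any particular IDP's schema:

```python
# Sketch: enrich static service-catalog entries with observed
# runtime facts. Field names and values are hypothetical.

def enrich_catalog(catalog, runtime_facts):
    """Merge observed runtime facts into each catalog entry,
    marking entries that were never seen running."""
    enriched = {}
    for name, entry in catalog.items():
        merged = dict(entry)  # keep the static fields untouched
        merged["runtime"] = runtime_facts.get(name, {"status": "not observed"})
        enriched[name] = merged
    return enriched

catalog = {
    "checkout": {"owner": "payments-team", "tier": "critical"},
    "reports": {"owner": "bi-team", "tier": "internal"},
}
runtime_facts = {
    "checkout": {"status": "healthy", "p99_latency_ms": 180, "pods": 4},
}

print(enrich_catalog(catalog, runtime_facts))
```

The IDP's own view (owner, tier) stays authoritative; the runtime section is overlaid, never substituted.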
And all this information is essentially very complementary, and very important, when given alongside the IDP data — the "what we're trying to achieve" data. Okay, so let's try to recap and understand the real-time topology at a high level. It allows us to enrich the service catalog: what we put in our schema, what we tried to do, against what actually happened or is represented in the runtime environment, with all the different contexts — the configuration, the dependencies. That's a very important goal we're trying to achieve. It also provides a common language. The IDP tries to encompass the platform engineering way of thinking about the right abstraction level of the production environment. That's something which clearly needs to be explained to, and used by, the developers. But on the other hand, it has implications for the runtime environment. So the common language here cannot go in a single direction — it's not only from the IDP up toward developers, but also, if you will, from the IDP down toward the runtime environment and, with that, the production personnel responsible for maintaining it; SREs and DevOps would be good examples. You have to close that loop, and if not, the IDP as a function will always be a center of attention, because it needs to comply with both these layers in order to work in reality. It also furnishes very crucial context for decision making. If I'm looking at something in the IDP that's exposed to my developers, that's not the only knowledge which is crucial for decision making throughout the software development lifecycle and the operational or maintenance lifecycle. More data needs to be exposed there, to the developers and to decision makers. So when issues occur, during maintenance, or when analyzing what should be done,
The entirety of the needed information will be there for the right decision making and a clear view of what's going on. So let's look at three real-world use cases. First, anticipating a reliability impact. One of the holy grails — one of the most wanted things to achieve — is to understand, when I'm about to do something, what the eventual implication will be, and to close that loop as fast as we can to verify that it's indeed working, so that an anticipated performance hit or runtime implication can be recognized very early. Of course, topology can enrich the tribal knowledge — what people think, what they know from past experience, what they already encountered in past incidents. With topology, that becomes not tribal knowledge but an actual set of data points which can be verified: understanding which services are affected, and whether, if there is a runtime implication, there is any impact on the service level objectives or service level agreements I've committed to for my users. And eventually it helps us answer real-world, important questions — for example, what does the service really cost from a resource point of view? Where were there changes in the CPU consumption envelope, or the memory consumption envelope? What is eventually the running cost of a service? The benefits are, of course, being able to define stronger guardrails within the IDP based on real-world data and data-driven checklists, and not based only on what we call tribal knowledge, or only on experience — which is, of course, bound to be broken. People make a lot of software changes. People make a lot of infrastructure changes. New deployments will fail; even existing deployments with updates will fail. And that is something which has to be as dynamic as the runtime environment.
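"Understanding which services are affected" is, on the runtime graph, a reverse-reachability question: everyone who transitively calls the degrading component is in its blast radius. A sketch with invented edges:

```python
# Sketch: anticipate reliability impact by walking the observed
# call graph backwards from a degrading component. Edges invented.
from collections import deque

def blast_radius(edges, failing):
    """Return all services that (transitively) call `failing`."""
    reverse = {}
    for src, dst in edges:
        reverse.setdefault(dst, []).append(src)
    seen, queue = set(), deque([failing])
    while queue:
        for caller in reverse.get(queue.popleft(), []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

edges = [
    ("frontend", "checkout"),
    ("checkout", "payments"),
    ("checkout", "inventory"),
    ("payments", "postgres"),   # infra / third-party dependency
]

print(blast_radius(edges, "postgres"))
```

A degrading `postgres` implicates `payments`, `checkout`, and `frontend` — and if any of those carry an SLO, the impact question answers itself from data rather than tribal knowledge.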
A real-world example of something like that could be a service deployed in a high-volume transaction environment. During peak hours, we would expect to see a lot of mechanisms come into play — things like autoscaling, or consuming many more resources — and that often leads to incidents by itself: incidents related to memory overload, or chained events impacting downstream services, things like dashboards, the front end, or any transactional environment. So being able to foresee that, to loop back from the runtime environment to what's being presented in the IDP, can give you the right data in real time, and the process of improvement allows that to be incorporated within the IDP. The guardrails then really represent what's happening in the production environment, and conform much more closely to what the guardrails should be in the actual deployment. Another element is how we use this to verify deployments in real time. As we said, IDPs often offer developers guardrails — things like limits on memory usage, CPU usage, a latency budget, or a bandwidth budget — to ensure the efficient and reliable running of deployments. These are there to make sure that even a faulty deployment won't bring down the entire system; it will be localized so it doesn't affect the rest of the services. But a lot of the time these are set during the development phase, and, more than that, they're sometimes based on assumptions or intuition — which is essentially the message we're trying to convey here. This is, a lot of the time, the source of many failures.
This cannot be based on intuition; it has to be learned from the runtime environment, which feeds the values back, pushed into the IDP or integrated into it, so that it represents what's really happening there. It can go in two directions: the value I set could be too low for the runtime environment, or too high — and that's another aspect of how efficient I want to be. Since topology can provide this runtime monitoring of performance — of deviations from means, or from any other metric — it can accelerate remediation by giving enough data to the developer who is trying to understand what's going wrong, or to the platform team trying to understand what's going on. And, as we said, the continuous process of improvement also allows these guardrails to converge toward reality. When that becomes a continuous process on the platform engineering side and the development side, it eventually creates a better methodology, making sure things run smoother and smoother. A real-world example here: when a new service is consuming more resources — the example here is excessive CPU, but it could be memory or any other consumable resource — the real-time topology, with regard to the metrics, the logs, and other evidence, can flag that immediately, and can flag it with the context of who is speaking to whom and what's going on within the dynamics. That's a valuable and crucial piece of information needed by the developers. The third use case: investigating and responding to issues. We understand that IDPs support, a lot of the time, the notion of incidents. They support, a lot of the time, integration with observability tools, either on the infrastructure layer or the observability layers.
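The deployment-verification step reduces to checking measured values against declared budgets. A minimal sketch — the guardrail metrics and all numbers are invented for illustration:

```python
# Sketch: verify a fresh deployment against its IDP guardrails using
# runtime measurements. Budgets and measurements are hypothetical.

def verify_deployment(guardrails, measured):
    """Return a list of guardrail violations; an empty list means
    the deployment is within every declared budget."""
    violations = []
    for metric, budget in guardrails.items():
        value = measured.get(metric)
        if value is not None and value > budget:
            violations.append(f"{metric}: measured {value}, budget {budget}")
    return violations

guardrails = {"p99_latency_ms": 250, "memory_mib": 512, "cpu_millicores": 300}
measured = {"p99_latency_ms": 410, "memory_mib": 350, "cpu_millicores": 290}

print(verify_deployment(guardrails, measured))
```

The closed loop goes the other way too: a metric consistently far under its budget (memory here) is a candidate for tightening the guardrail, which is the efficiency direction mentioned above.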
They support integration with third-party or higher-layer incident management platforms, so the data is contextualized; the data is already meant to be there. The service catalog is a very valuable asset for root cause analysis: it's essentially the set of building blocks I'm trying to look at, and the root cause or source of an issue is most likely going to be something within the catalog. There are caveats to that — the example we gave about third parties — but most likely, if it's an internal issue, it's going to be an element which is part of the catalog. The element which is missing is, again, the extension from what I'm trying to configure — my schema, what I'm trying to orchestrate into the runtime environment — to what actually runs. If we bring in the information from the runtime environment, we now have the matching between the catalog and the runtime catalog, if you will, and the crosses between them are a very important part of the root cause analysis, or issue analysis, that any engineering team runs through — essentially marking off elements which are not relevant until they converge on the element which is faulty. That is a very natural and native behavior for teams, and tools like the IDP have to support it and bring in the data to enable it. A real-world example here: a team needs to identify whether a service which is now faulty — degrading toward the customer, or toward the internal customer — is related to issues with a third-party resource. That would be really hard to do based only on the orchestration or configuration part. We need more data to understand the health of that service, and to understand the dynamics and use of that service with third-party or external dependencies. And that, again, should be presented to the platform engineering team and to the developers, so they understand
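The catalog-vs-runtime matching described here is a set comparison at its core: what is declared but not running, what is running but never declared, and the overlap. A toy sketch with invented service names:

```python
# Sketch: cross the declared service catalog with the set of
# services actually observed running. Names are illustrative.

def reconcile(catalog_services, running_services):
    """Split services into declared-only, running-only, and matched."""
    declared, running = set(catalog_services), set(running_services)
    return {
        "declared_not_running": sorted(declared - running),
        "running_not_declared": sorted(running - declared),
        "matched": sorted(declared & running),
    }

catalog_services = ["frontend", "checkout", "payments", "reports"]
running_services = ["frontend", "checkout", "payments", "shadow-cron"]

print(reconcile(catalog_services, running_services))
```

During an investigation, both mismatch buckets matter: a declared-but-absent service may be the failed deployment itself, and an undeclared runner is exactly the kind of element the catalog-driven elimination process would otherwise never consider.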
Not only where to look, but can put a focus on, and pinpoint, the exact location where to start. That's something which cannot be achieved just by looking at a static schema or just at the service catalog. It has to come from somewhere else, but it's very important to look at it with regard to, and in the context of, the elements in the system. Now, common pitfalls and how to avoid them. Maintenance: this is essentially true for a lot of things, but closing the loop between the service catalog and runtime is only a first step. Don't overlook the resourcing; don't assume it will be a minor or very small engineering effort to keep it properly maintained, to audit it, to service it. There's a lot of commitment needed here to make these things run smoothly and operate smoothly — but it is for sure an investment worth making. Incomplete runtime context: in order to do what we're speaking about, you need enough context. Depending on the complexity of the system and the number of, if you will, free variables in it — the resources, the APIs, the third-party services — this is sometimes hard to bring together, not only in terms of actually having the data, but in terms of configuring the runtime toolset, and in terms of how much investment is needed, from both a time perspective and a tool-price perspective, to have all the different connectors to the different layers in the runtime environment and bring it all into the IDP. It's worth mentioning that technologies like eBPF can enable auto-discovery of the runtime environment, which can ease the toil of exposing that data without manually configuring a million types of dashboards and a million connection points; that can be very helpful here. The last pitfall is the static snapshot.
To be useful, the runtime topology needs to be continuously and automatically discovered and updated. This is not only a way to collect and analyze the data; it also lets you avoid relying on a lot of documentation, which is most of the time already irrelevant on the day it's written. Being able to get this as a continuous process is super important and saves a lot of time compared with trying to do it any other way. Now, tips for getting started. Building your topology: of course, observability tools can help. A lot of them have elements like runtime topology, usually based on things like DNS or basic modes of networking, but they often require a lot of configuration around them to explore all the different layers. There are platforms that offer auto-discovery — real-time, real mapping, if you will, of the environment. This is, as we said earlier, often with the help of technologies like eBPF, which expose the dynamics not only from a configuration or dependency point of view, but stemming from real networking elements — real interactions between the different services. Next, the integration of the topology with the IDP: most IDPs already offer integration with observability and incident response platforms, so it would be very natural to integrate the topology part — with the caveat we spoke about earlier: it will only be as good as the topology exposed by the observability tool. But as a first step, even bringing it front and center for the users of the IDP is very important — put all the needed data in one place and have it reachable by the users of the IDP. Then, prioritization of use cases. Prioritization is naturally important in any engineering venue, but the key is being able to identify the quick wins. Where are the hotspots?
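To make the auto-discovery idea concrete: an agent that observes connections (eBPF-based tooling emits records of roughly this shape) can deduplicate them into topology edges and tag external destinations. This is a deliberately simplified sketch — the `.svc.cluster.local` convention for in-cluster targets and all sample connections are assumptions for illustration:

```python
# Sketch: turn raw observed connections into deduplicated topology
# edges, tagging third-party destinations. All data is illustrative.

def build_edges(connections):
    """Deduplicate (src, dst) connections into (src, dst, kind) edges,
    classifying dst as internal or third_party by naming convention."""
    edges = set()
    for src, dst in connections:
        kind = ("internal" if dst.endswith(".svc.cluster.local")
                else "third_party")
        edges.add((src, dst, kind))
    return sorted(edges)

connections = [
    ("checkout", "payments.svc.cluster.local"),
    ("checkout", "payments.svc.cluster.local"),  # repeated flow, same edge
    ("payments", "api.stripe.example"),          # hypothetical external gateway
]

print(build_edges(connections))
```

A real discovery pipeline would of course resolve IPs to workloads, window the observations, and age out stale edges — the continuous-update requirement above — but the declarative output (edges plus classification) is the part the IDP consumes.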
Where are the services which are more prone to failure, which cause more problems or more challenges in the production environment? Focus on them. You don't need to solve the entire system — you won't be able to, especially a scalable one — but you can narrow the scope: look at the important building blocks, whether they're internal, platform or infrastructure parts, service elements, or third-party elements which are super important, and prioritize based on them. Eventually, when the relevant team is looking at the IDP, or using the IDP, if we can put a quick win in front of that specific team, then in the grand scheme of things it's not only net positive but largely net positive, because from the beginning these are the elements most likely to require and use it. And the last element would be to scale and expand. This is very common when we speak about quick wins, but you need to prove to the users of the IDP that this is a valuable tool. So use this early win and go step by step, building additional opportunities to weave the runtime context into the IDP. These are not only valuable in themselves, but as a process of conveying internally — of proving internally — that this is a winning mechanism, they are super important. Using this strategy to build it in steps, prove it in steps, and eventually use it in steps is going to affect the software development lifecycle and the deployment lifecycle, improving them in ways that people can cope with. I really hope you enjoyed this as much as we did. We think there is very large room for collaboration
Between what's being done in the observability world — the advances in how real-time topologies can be auto-discovered, exposed, and maintained by the tools themselves — and integrating this data into the IDP. Eventually, it becomes only logical: when we speak about things like continuous integration and continuous deployment, and think about the software development lifecycle and the deployment lifecycle, these tools have to work in conjunction with one another. They complement one another, and each depends on the other — from the IDP, which encompasses the organizational knowledge about what should be defined, how it should be orchestrated, where it should be orchestrated, and which eventually commands the deployment to the production environment, all the way through to the other end, which is the deployment or runtime environment in reality. These two cannot live in a void. They're interdependent — not only in the teams, which need to work together to solve issues or add capabilities to the system, but also from a technical perspective: there is going to be a lot of convergence between the capabilities of the two, and the integration here is native. I really hope that you enjoyed it. We will be happy to answer any questions by email or through the website. I wish you a very happy, very productive day. Thank you.

Amir Krayden

Co-founder & CEO @ Senser



