Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, my name is Ajay Shankarwad.
I'm the CTO and Managing Director for Platforms and Products at Brilio.
Brilio is a 10-year-old digital transformation company that has been doing a lot of work within the platform engineering and Gen AI space.
My topic for today is going to be AI Augmented Platform Engineering, which brings together both of those ideas.
So let's just jump right in.
So what I wanted to talk about today was how do you actually apply some of the Gen AI principles to improve the platform engineering experience across the multiple vertical domains that you have within your organization. These vertical domains are typically things like banking, healthcare, or automotive, whatever industry you are in, where you are trying to build products and deliver them with efficiency and reliability.
So what I'm going to be talking about today: number one, I'm going to start with platform engineering as a whole. What does it mean, or how do I define platform engineering? Then we'll talk a little bit about Gen AI and the advances in Gen AI that we see these days. Then we'll jump right into LLMs and why epochs matter within training your models. Then we'll start talking about applying RAGs to LLMs in the context of platform engineering. To understand that better, I'll take a case study on BFSI, which is one of the verticals I was just talking about. And then I'll conclude with some discussion about how do you actually get started on this?
As you really start talking about this, the first thing to think about with platform engineering is why you are actually doing platform engineering, and why you need to improve that performance further. It's fairly simple, right? Think about time to market: any organization wants to have a faster time to market. This is where you need things like code generation and improving your developer productivity, everything that you need to start translating your product ideas into products that actually get deployed to production. As you do that, reducing cost is also a significant part of it, because you need to find a way to reduce the kind of repetitive tasks that you have.
You also need to improve your utilization of your existing resources, whether it is cloud resources or on-prem resources, whatever kind of resources you have. As you start using less of them, you're going to reduce the cost.
All these things lead into that developer productivity space. This is where you get the reduced cognitive load for your developers, making sure that the developers can actually get their work done faster and be happier.
So none of these things have anything to do with Gen AI or LLMs as a whole, right?
What we are really talking about is that if you were to apply Gen AI to some of these principles, we believe that it can actually yield better results in the end.
Having said that, one thing that we should be talking about is: what is Platform Engineering?
I don't think we should be really talking about that in 2024, but
at the same time, we do have this confusion all the time with respect
to what is Platform Engineering?
How does it fit into the larger ecosystem?
Let's just spend a few minutes trying to understand how I define Platform Engineering.
There are a few things that you see on the slide, so think of it like eight or nine different axes. The first thing to think about is what we always think about from a runtime point of view: your application runtime. You have a certain type of applications, architected in certain ways; maybe you have containers, and those containers require orchestration. Generally, think of it like your microservices are containerized. As you try to do that, you need to really look at your workload and figure out the best way to run it. And that process of translating your applications to the runtime of those applications is completely within the scope of platform engineering.
The other part of it is the service mesh. This is where you make sure that your microservices can talk to each other, with tools like Istio and all that, ensuring that there is efficient communication between your microservices and that your runtime is managed well.
Anytime you talk about platform engineering, the first thing that probably comes to everybody's mind is pipelines. Into those pipelines go things like compliance and governance that are built in as part of your pipelines.
Any organization will have its own set of service catalogs, its own set of services that it really wants to expose. Those things are, again, part of platform engineering.
Observability requires no further explanation.
We all understand what that is, but it's not just building the observability platforms and maintaining them. It's sometimes also about instrumenting and having the observability strategy for how you're actually going to observe your systems. Any application requires data, and managing that database and the data infrastructure is an integral part of platform engineering.
So is developer experience reporting. There are a lot of tools these days that provide you reports with respect to how your developer experience is improving, what kind of metrics you're tracking, whether these are the lagging or the leading metrics, and how you actually act on that. All those things are typically part of the platform engineering space.
But if you have all these 8 axes, the question really becomes, how do
you actually bring it all together?
How do you actually make sure that it all works as part of your path to production
and as part of your value chain?
This is where the internal developer platforms and the orchestrators around them come in. These orchestrators make sure that all these different classes of capabilities work together within the platform ecosystem.
So that's how I would define platform engineering.
So now let's also take a step back and really talk about platform engineering versus DevOps, SRE, DevEx, and all that. These are conversations I still have with a lot of people I talk to.
One of the easiest ways to look at this is that platform engineering is the core of all of these terms that we're talking about. If you think of DevOps as a cultural paradigm, a way of doing things that makes sure your development process is efficient enough and your developers can actually get the job done in the most efficient manner, platform engineering enables that.
At the same time, your developer experience requires platform engineering. Anything that we talk about with developer experience, whether it is ensuring that you're reducing the cognitive load of the developers, requires platform engineering. And it supports the whole SRE framework as well, SRE being Site Reliability Engineering; that really requires some of the platform engineering capabilities. Again, a very simplistic way of looking at it, but this ensures that there is no confusion with respect to what platform engineering is doing and what the scope of that is.
We can maybe also spend a couple of minutes on the actual evolution of this. We know that 10 or 15 years back, when public clouds weren't as popular as they are today, we used to have data centers and we used to have this idea of DevOps, where we would really try to ensure that developers were working as efficiently as possible. Over time, we started seeing larger sets of servers, scalability issues, and things like that. At that point, we started doing a lot more configuration management, and the early principles of platform engineering started coming in. As we started moving to cloud, where the whole idea is that cloud is not a location but a way of actually doing your work, a way of working, platform engineering became a lot more prevalent. And that is what has led to the current state, where you typically have a hybrid cloud, or at least a multi-cloud situation, in any organization, and they are really starting to see how you can make your platform activities a lot more efficient.
This is where both AI-led and AI-infused platforms have become very popular, in the larger context of domain-specific ways of looking at platform engineering. When I say domain specific, going back to the previous point, I mean those banking services or healthcare services or automotive services and those kinds of things. If you look at it, think of it like a Venn diagram, where one of the circles is the AI capabilities in platform engineering. These are any kind of AI capabilities, not particularly a predictive or a generative capability, but AI capabilities in general. Then you also have certain industry abstractions of those platforms. For any kind of domain-specific platform, you'll have certain industry-specific data that is available, and you need to extract that out, and applying the LLMs to these platforms will also improve that. But in order for all these things to work together, the thing we need to think about is that the organization should be able to support it. Anything that you're doing, the core of the whole success is going to be the culture that you bring about as part of your organization's DNA.
And as we just spoke about, there's the AI hype, even though that's been going on for a couple of years. We all know that we have been using AI in one form or the other for a while. We have had predictive AI for a while; we have been using forecasting, trend analysis, capacity management, and a lot of those kinds of things. We have been using AI for many years, but with the advent of Gen AI, we are seeing a lot more advances in code generation, test generation, all those kinds of things, which obviously need LLMs, these large language models like OpenAI GPT or Codex, which have really been a boon to the whole Gen AI development that we're seeing now.
Looking at some of the LLMs that are available today, I'm not going to go through each and every one of them, but as you can see, things like GPT-4 have helped a lot with code generation and general assistance with DevOps activities. And Codex has definitely helped with things like your IaC and API integration. So as you really start looking at what kind of models you need to improve your platform engineering experience, you have to first think about what the problem is that you're trying to solve, and how you can actually apply some of these very specific LLMs.
As you really start training your models, the epochs come into play. The epochs are part of the whole process of making sure that you iterate through your training so that you are converging on the right kind of data and the right kind of models, without getting into a situation where you either don't train them enough or overtrain them. This is where that convergence process is extremely important. In order to do that, you need to find the right number of epochs.
Let's look at a very simple example of how something like this could work. As you really start training your LLMs using epochs, the first thing you do is start with some kind of model initialization and tokenization. Once you set that up, you have to start looking at setting your hyperparameters. These hyperparameters are things like the number of epochs, the batch size that you're training with, as well as the learning rate and all that. Once you do that, you go into your datasets and start ensuring that you load your data and that you have the right kind of data coming out of the process. Then you go in and initialize the optimization process, using the model parameters that you set in the previous step, and set your learning rate. At that point, you can start going through an iterative process where you determine what the epochs are. For each epoch, once you actually go through the training process, make sure that after each batch is trained you check whether the optimization maturity has been reached. And once you have reached that, you end up saving the trained model and the tokenizer, making sure that you stop at the point where you have an optimal fit.
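To make that flow concrete, here is a minimal sketch of an epoch-driven fine-tuning loop. It assumes the Hugging Face transformers library, a small GPT-2 model, and a toy in-memory dataset; the hyperparameter values and the convergence threshold are illustrative placeholders, not numbers from the talk.

```python
# Minimal sketch of epoch-based fine-tuning (illustrative values, not from the talk).
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Model initialization and tokenization
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token           # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# 2. Hyperparameters: number of epochs, batch size, learning rate
num_epochs, batch_size, learning_rate = 3, 4, 5e-5

# 3. Load the (toy) domain dataset and tokenize it
corpus = ["pipeline stage: build", "pipeline stage: deploy", "observability: emit metrics"]
encodings = tokenizer(corpus, return_tensors="pt", padding=True, truncation=True)
loader = DataLoader(list(zip(encodings["input_ids"], encodings["attention_mask"])),
                    batch_size=batch_size)

# 4. Initialize the optimizer with the model parameters and the learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# 5. Iterate over epochs; stop once the loss looks "good enough" (the optimal fit)
convergence_threshold = 0.5                         # illustrative early-stopping criterion
model.train()
for epoch in range(num_epochs):
    epoch_loss = 0.0
    for input_ids, attention_mask in loader:
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        epoch_loss += outputs.loss.item()
    avg_loss = epoch_loss / len(loader)
    print(f"epoch {epoch + 1}: average loss {avg_loss:.4f}")
    if avg_loss < convergence_threshold:            # converged: neither under- nor over-trained
        break

# 6. Save the trained model and the tokenizer
model.save_pretrained("trained-platform-model")
tokenizer.save_pretrained("trained-platform-model")
```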
This is all great, right? So we looked at the LLMs and at how you can train them. Once you do that, one thing that you really have to think about, especially when you talk about domain-specific platform engineering, is the application of RAGs.
That is retrieval augmented generation. Essentially, think of RAG as something that bridges your training data gaps. You trained the model in the previous step, and once you have that, you obviously have a significant amount of knowledge. But you might still have some gaps if you haven't trained it to the level you want for your domains. RAG allows the LLMs to be a lot more relevant, a lot more accurate, and essentially achieve what you really want to achieve, which is to reduce the cognitive load of the developers and make sure that they are more effective.
Again, a very simple way of looking at it: on the left side, if you have an LLM without RAG, what you end up doing is that you write the query, you send it to the LLM, and you get the output. But what if you could actually send some domain- and organization-specific data along with it? As you do that, what you find is that you can pull in a lot more data sources, which eventually turn into an augmented query, a more efficient query that will yield better outputs. So that's a very simplistic, high-level view of how RAG can help.
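Here is a minimal sketch of that augmented-query flow. It assumes a hypothetical call_llm() function standing in for whatever model endpoint you use, and a naive keyword-overlap retriever in place of a real embedding index and vector store.

```python
# Minimal RAG sketch: retrieve domain snippets, augment the query, then call the LLM.
# call_llm() is a hypothetical stand-in for your actual model endpoint.

DOMAIN_DOCS = [
    "Account creation API requires KYC verification before activation.",
    "Payment processing must log every transaction for audit compliance.",
    "Loan management services expose amortization schedules via /loans/{id}/schedule.",
]

def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Score documents by naive keyword overlap with the query (a real system
    would use embeddings and a vector database)."""
    query_terms = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(query_terms & set(d.lower().split())), reverse=True)
    return scored[:top_k]

def answer(query: str) -> str:
    context = retrieve(query, DOMAIN_DOCS)
    augmented_query = (
        "Use the following domain context to answer.\n"
        + "\n".join(f"- {snippet}" for snippet in context)
        + f"\n\nQuestion: {query}"
    )
    return call_llm(augmented_query)   # without RAG you would send `query` directly

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: wire this to your LLM of choice.
    return f"[LLM response for prompt of {len(prompt)} characters]"

print(answer("How do I create a new account for a customer?"))
```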
So now let's look at a case study, specifically in the banking and financial services domain, as to how you could actually enhance your developer experience for this particular domain. Essentially, these engineers are building products within this domain, and you're going to integrate some traditional platform engineering activities along with the LLMs as you do this.
The approach I would first recommend is to start with identifying those critical BFSI APIs, and we'll talk about what those are. Then we really start thinking about the domain-specific RAGs that you would need. This is the most critical step, right? You're really going to figure out where you're going to get this data from. Keep all the terminology aside; it's just that you want to identify where you get the data from. Once you do that, steps 3 to 8, as you can see, are very generic to building any kind of platform activity. Whether it is an engineering platform or a business platform, it doesn't matter; those steps should remain the same. And we will actually go through the process to understand how that works.
So the first step is identifying your domain specific APIs.
So what are these domain specific APIs?
Fairly simple.
Again, if you are in the banking and financial services industry, the kinds of things that are important to you are things like account creation, account balance inquiries, transactions, and those kinds of things, as well as payment processing, loan management, and all that. You need to clearly identify what those APIs are, what those functions are. If you can't get that right, then anything that you do later isn't going to be useful. So it's very important for you to do some product management: involve your product management teams to make sure that you clearly understand what the business needs are. And even between two different banks engaged in similar activities, your requirements might be different depending on where your business priorities are.
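In practice, the output of this step can be as simple as a machine-readable catalog of the critical BFSI functions that the product teams have agreed on. The entries below are illustrative examples, not a definitive list.

```python
# Illustrative catalog of critical BFSI APIs identified with the product teams.
from dataclasses import dataclass

@dataclass
class DomainApi:
    name: str          # API identifier
    purpose: str       # business function it serves
    owner: str         # product team accountable for it

BFSI_API_CATALOG = [
    DomainApi("create_account", "Open a new customer account", "retail-banking"),
    DomainApi("get_balance", "Return the current account balance", "retail-banking"),
    DomainApi("list_transactions", "Fetch recent account transactions", "retail-banking"),
    DomainApi("process_payment", "Execute and settle a payment", "payments"),
    DomainApi("manage_loan", "Originate and service loans", "lending"),
]
```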
Once you do that, the question becomes: how do you develop domain-specific RAGs? As you try to do it, there are a few things that you should look at. Think of these as the data sources: your internal documentation, your regulatory guidelines, your ontologies and taxonomies. Pretty much anywhere you can actually get this data from would be extremely useful. There are things like, for example, the open banking APIs; you might want to have your APIs comply with those within the RAGs, so essentially you include that information as you develop the RAGs. Then you also need to look at it from the point of view of what models are available and what knowledge graphs you can incorporate. Again, this is not an exhaustive list; it's the kind of thing that you might look into, the kinds of data sources that you might look into. This might actually turn out to be a good guideline for you to start with, and keep in mind that this is very specific. In this case, it's very specific to the banking and financial services industry, but as you really start looking at healthcare, you can probably find some very similar equivalents for the things that you see here.
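Here is a sketch of how those data sources might be pulled together into a tagged corpus for the RAG step. The directory names and source categories are assumptions for illustration; in a real setup they would map to wherever your documentation, regulatory texts, ontologies, and open banking specs actually live.

```python
# Sketch: assemble domain data sources into a tagged corpus for RAG ingestion.
# The paths and categories below are illustrative assumptions.
from pathlib import Path

RAG_SOURCES = {
    "internal_docs": Path("knowledge/internal"),        # runbooks, design docs
    "regulatory": Path("knowledge/regulatory"),         # e.g. local banking regulations
    "ontologies": Path("knowledge/ontologies"),         # BFSI taxonomies / glossaries
    "open_banking": Path("knowledge/open_banking"),     # open banking API specifications
}

def build_corpus(chunk_size: int = 1000) -> list[dict]:
    """Read every text file under each source, split it into fixed-size chunks,
    and tag each chunk with its source category for later filtering."""
    corpus = []
    for category, root in RAG_SOURCES.items():
        for path in root.glob("**/*.txt"):
            text = path.read_text(encoding="utf-8")
            for start in range(0, len(text), chunk_size):
                corpus.append({
                    "source": category,
                    "file": str(path),
                    "text": text[start:start + chunk_size],
                })
    return corpus

if __name__ == "__main__":
    chunks = build_corpus()
    print(f"Collected {len(chunks)} chunks ready for embedding and indexing.")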
Now, once you do that, the next set of steps is going to be fairly straightforward if you know how platforms are built; I'm going to use some pseudocode to show this. The first step in the process would be to define your APIs for the common tasks. What are these common tasks in this case? All the things that we spoke about in step one. Then, once you have that, you essentially create those integration APIs, essentially define those APIs for your usage. And once you define them, go ahead and start exposing those APIs so that they can be used by your developers.
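As a minimal sketch of what "define and expose" could look like, here are a few of those integration APIs written with FastAPI. The framework choice, route shapes, and payload fields are assumptions for illustration, not the speaker's actual implementation.

```python
# Sketch: define and expose the common BFSI integration APIs.
# FastAPI is just one possible choice; the routes and payloads are illustrative.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="BFSI Integration APIs")

class AccountRequest(BaseModel):
    customer_id: str
    product_type: str   # e.g. "savings" or "checking"

@app.post("/accounts")
def create_account(req: AccountRequest) -> dict:
    """Step 1 API: account creation (would delegate to the core banking system)."""
    return {"status": "created", "customer_id": req.customer_id}

@app.get("/accounts/{account_id}/balance")
def get_balance(account_id: str) -> dict:
    """Step 1 API: account balance inquiry."""
    return {"account_id": account_id, "balance": 0.0}

@app.get("/accounts/{account_id}/transactions")
def list_transactions(account_id: str) -> dict:
    """Step 1 API: recent transactions."""
    return {"account_id": account_id, "transactions": []}
```

Exposing them to developers then becomes a matter of serving this app (for example with uvicorn) behind your gateway and publishing the generated OpenAPI spec in the portal described next.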
The next step in the process would be to set up a self-service portal. The self-service portal is, again, a fairly straightforward idea. Go ahead and initialize the portal framework; this is where we talked about that integration framework that brings together pretty much all the steps within the platform engineering ecosystem. You can go ahead and add the appropriate documentation, and you can see that this is actually pulling in the data that you collected from the various data sources. Then go ahead and add this into your sandbox environment and make sure that you deploy that portal.
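A sketch of that portal bootstrap step is below. PortalFramework and its methods are hypothetical placeholders standing in for whatever internal developer portal you actually use (Backstage, a commercial IDP, or something homegrown).

```python
# Sketch: bootstrap a self-service portal and register docs, APIs, and sandbox targets.
# PortalFramework and its methods are hypothetical placeholders, not a real library.

class PortalFramework:
    def __init__(self, name: str):
        self.name = name
        self.entries: list[dict] = []

    def register(self, kind: str, ref: str) -> None:
        """Record a catalog entry (docs page, API spec, environment, ...)."""
        self.entries.append({"kind": kind, "ref": ref})

    def deploy(self, environment: str) -> None:
        print(f"Deploying portal '{self.name}' with {len(self.entries)} entries to {environment}")

portal = PortalFramework("bfsi-developer-portal")
portal.register("documentation", "knowledge/internal/onboarding.md")   # from the RAG data sources
portal.register("api-spec", "openapi/bfsi-integration.yaml")           # the APIs exposed earlier
portal.register("environment", "sandbox")
portal.deploy("sandbox")
```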
The next step of the process would be to have your developer environments. As a developer, you need to make sure that you have the right kind of environments to continue your development process. You start by creating your dev containers, then publish those dev containers and provide instructions for the local setup. This is where you provide the right kind of configurations and instructions for the developers to execute so that they can actually get this running in their local environments.
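A sketch of the build-and-publish side of that step, assuming Docker is available and using an illustrative image name and registry:

```python
# Sketch: build and publish a dev container image, then print local usage instructions.
# The image name and registry are illustrative assumptions.
import subprocess

IMAGE = "registry.example.com/platform/bfsi-devcontainer:latest"

def publish_dev_container() -> None:
    subprocess.run(["docker", "build", "-t", IMAGE, "."], check=True)   # build from the local Dockerfile
    subprocess.run(["docker", "push", IMAGE], check=True)               # publish for the team

def print_local_instructions() -> None:
    print("To develop locally, run:")
    print(f"  docker run --rm -it -v $(pwd):/workspace {IMAGE}")

if __name__ == "__main__":
    publish_dev_container()
    print_local_instructions()
```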
The next one is, again, fairly straightforward: think about your end-to-end observability. As you start setting up your observability, you need to think about your logging, your monitoring, your tracing, the metrics activities, and all that. And once you have all of those things, expose those tools to the developers so that they have access to them and can work with them.
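Here is a minimal sketch of the instrumentation side, assuming the prometheus_client library for metrics and the standard library for logging; the metric names and the port are illustrative.

```python
# Sketch: minimal logging + metrics instrumentation exposed to developers.
# Metric names and the port are illustrative; prometheus_client is one possible choice.
import logging
import time

from prometheus_client import Counter, Histogram, start_http_server

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("bfsi-platform")

REQUESTS = Counter("bfsi_api_requests_total", "Total API requests", ["endpoint"])
LATENCY = Histogram("bfsi_api_latency_seconds", "API request latency", ["endpoint"])

def handle_request(endpoint: str) -> None:
    """Wrap a request with logging, a request counter, and a latency histogram."""
    REQUESTS.labels(endpoint=endpoint).inc()
    with LATENCY.labels(endpoint=endpoint).time():
        log.info("handling request for %s", endpoint)
        time.sleep(0.01)   # stand-in for real work

if __name__ == "__main__":
    start_http_server(9100)          # metrics scraped from http://localhost:9100/metrics
    handle_request("/accounts")
```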
At that point, you also set up your CI/CD pipelines. Your pipelines are things that you would set up as part of this step, but your developers would then go ahead and instantiate them every time they run through. Essentially, what happens here is that you create your pipelines, the CI/CD pipelines, then add the appropriate stages in there for security, compliance, automated testing, and things like that. The compliance checks would mostly happen during every step of the pipeline, making sure that you have some kind of compliance at every point that changes.
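A sketch of that pipeline shape, expressed in Python rather than any specific CI system's configuration; the stage names and the compliance gate after each stage follow the description above, and the stage bodies are placeholders.

```python
# Sketch: a CI/CD pipeline modeled as ordered stages with a compliance gate after each one.
# Stage names and check logic are illustrative placeholders.
from typing import Callable

def build() -> None:
    print("building artifacts")

def unit_tests() -> None:
    print("running automated tests")

def security_scan() -> None:
    print("running security scans")

def deploy() -> None:
    print("deploying to the target environment")

def compliance_gate(stage_name: str) -> None:
    """Run the compliance checks required after every stage; raise to stop the pipeline."""
    print(f"compliance check after '{stage_name}': passed")

PIPELINE: list[tuple[str, Callable[[], None]]] = [
    ("build", build),
    ("test", unit_tests),
    ("security", security_scan),
    ("deploy", deploy),
]

def run_pipeline() -> None:
    for stage_name, stage in PIPELINE:
        stage()
        compliance_gate(stage_name)   # compliance enforced at every point that changes

if __name__ == "__main__":
    run_pipeline()
```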
The last step of the process would be your security and compliance. Essentially, this is where you're going to be doing your vulnerability scanning and the compliance checks that are required by the industry and by your organization, then ensuring that these are built into the pipeline and reporting some of these results back to the observability platform, or wherever the developers need them reported, so that the whole ecosystem works together.
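A sketch of that last step: run the scans, evaluate a simple compliance rule, and push the results to the observability side. The run_scanner() function and REPORT_URL are hypothetical placeholders for your real scanner and reporting endpoint.

```python
# Sketch: vulnerability scanning plus a compliance check, with results reported back.
# run_scanner() and REPORT_URL are hypothetical placeholders for your real tooling.
import json
import urllib.request

REPORT_URL = "https://observability.example.com/api/compliance-reports"

def run_scanner(image: str) -> list[str]:
    """Placeholder for a real vulnerability scanner invocation."""
    return []   # e.g. a list of CVE identifiers found in `image`

def run_compliance_checks(image: str) -> dict:
    findings = run_scanner(image)
    return {
        "image": image,
        "vulnerabilities": findings,
        "compliant": len(findings) == 0,   # simplistic rule: no known vulnerabilities
    }

def report(results: dict) -> None:
    """Send the results to the observability platform (or wherever developers look)."""
    request = urllib.request.Request(
        REPORT_URL,
        data=json.dumps(results).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(request)   # would fail against the placeholder URL

if __name__ == "__main__":
    print(run_compliance_checks("registry.example.com/platform/bfsi-service:1.0"))
```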
So as you can see, a lot of these things are very similar. Steps three to eight were exactly the same, irrespective of the kind of domain that you would do this in. And that's the beauty of how you would actually develop this.
Having looked at this case study, one of the things we really have to think about is: how is this really helping, or how should it be helping? Today, if you look at a typical organization without a significant amount of platform engineering improvements, about 70 percent of your developers' time is overhead. It might surprise you, but only about 30 percent of your developers' time is productive. The rest of the time is overhead, something that is duplicated work, something that they could avoid. But by applying some platform engineering principles, what the data shows us is that your value add can increase significantly, from 30 percent to 60 percent.
This is significant, right?
But is that enough?
Probably not, right? Our goal is to increase that as much as possible. One of the things that we see is that by applying Gen AI with RAGs, you can increase that further, at least up to 90 percent. I haven't seen a lot of data yet, though I'm starting to see some very promising data in this space. But if you have some very specific data that shows that improvement, I would love to talk to you and hear a lot more about that.
You also have to think about, as you really look at this, what's the leverage that Gen AI with RAGs is bringing to the table? We know a lot about coding, debugging, testing, monitoring, and observability as the areas where platform engineering can be applied and improved. But there is also significant leverage that we're seeing within solution architecture, requirements analysis, and user research. So you should really be thinking about having that holistic leverage across your overall ecosystem as you go through your path to production, as you really try to make sure that you build and deliver your product.
In conclusion, I just wanted to say that AI is literally changing the way we are doing platform engineering. It's definitely pushing some of the boundaries, but it's been very expensive to train some of these LLMs, so we need to focus on the epochs and make sure that we get the right kind of focus there. And ultimately, the difference is going to be the RAGs. Especially when you're doing the domain-specific work, the real benefit is going to come from having the RAGs, taking it to that next step.
And one thing never to forget is why we are doing this. The metrics really tell you why you're doing it and what you want to achieve out of it; those things don't change. So if you keep one eye on the metrics, you should always be able to ask: is this what I'm really trying to achieve, and am I able to achieve it?
So with that said, I just wanted to thank you for your time. If you want to connect with me, go ahead and connect with me on LinkedIn using that QR code. If you want to learn more about my company, definitely check that out. Thank you so much, and I appreciate you listening to this.