Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, my name is Ajay Shankarwad.
I'm the CTO and Managing Director for Platforms and Products at Brilio.
Brilio is a 10-year-old digital transformation company that has been doing a lot of work within the platform engineering and Gen AI space.
My topic for today is going to be AI Augmented Platform Engineering, which brings together both of those ideas.
So let's just jump right in.
So what I wanted to talk about today was how do you actually apply some of the Gen AI principles to improve the platform engineering experience across the multiple vertical domains that you have within your organization. These vertical domains are typically things like banking, healthcare, or automotive, whatever industry you are in, where you are trying to build products and deliver them with efficiency and reliability.
So what I'm going to be talking about today: number one, I'm going to start with platform engineering as a whole. What does it mean, or how do I define platform engineering? Then we'll talk a little bit about Gen AI and the advances in Gen AI that we see these days. Then we'll jump right into LLMs and why epochs matter within training your models. Then we'll start talking about applying RAGs to LLMs in the context of platform engineering. To understand that better, I'll take a case study on BFSI, which is one of the verticals I was just talking about. And then I'll conclude with some discussion about how do you actually get started on this?
As you really start talking about this, the first thing to think about with platform engineering is why you are actually doing platform engineering, and why you need to improve that performance further. It's fairly simple, right? Think about time to market: any organization wants to have a faster time to market. This is where you need things like code generation and improving your developer productivity, everything that you need to start translating your product ideas into products that actually get deployed to production. As you do that, reducing cost is also a significant part of it, because you need to find a way to reduce the kind of repetitive tasks that you have.
You also need to improve your utilization of your existing resources, whether it is cloud resources or on-prem resources, whatever kind of resources you have. As you start using less of them, you're going to reduce the cost.
All these things lead into that developer productivity space. This is where you get the reduced cognitive load for your developers, making sure that the developers can actually get their work done faster and be happier.
So none of these things have anything to do with Gen AI or LLMs as a whole, right?
What we are really talking about is that if you were to apply Gen AI to some of these principles, we believe that it can actually yield better results in the end.
Having said that, one thing that we should be talking about is: what is Platform Engineering?
I don't think we should be really talking about that in 2024, but
at the same time, we do have this confusion all the time with respect
to what is Platform Engineering?
How does it fit into the larger ecosystem?
Let's just spend a few minutes trying to understand how I define Platform Engineering.
There are a few things that you see on the slide, so think of it like eight or nine different axes. The first thing to think about is what we always think about from a runtime point of view: your application runtime. You have a certain type of applications, architected in certain ways; maybe you have containers, and those containers require orchestration. Generally, think of it like your microservices are containerized. As you try to do that, you need to really look at your workload and figure out the best way to run it. And that process of translating your applications to the runtime of those applications is completely within the scope of platform engineering.
The other part of it is the service mesh. This is where you make sure that your microservices can talk to each other, with tools like Istio and all that, ensuring that there is efficient communication between your microservices and that your runtime is managed well.
Anytime you talk about platform engineering, the first thing that probably comes to everybody's mind is pipelines. Into those pipelines go things like compliance and governance that are built in as part of your pipelines.
Any organization will have its own set of service catalogs, its own set of services that it really wants to expose. Those things are, again, part of platform engineering.
Observability requires no further explanation.
We all understand what that is, but it's not just building the observability platforms and maintaining them. It's sometimes also about instrumenting and having the observability strategy for how you're actually going to observe your systems. Any application requires data, and managing that database and the data infrastructure is an integral part of platform engineering.
So is developer experience reporting. There are a lot of tools these days that provide you reports with respect to how your developer experience is improving, what kind of metrics you're tracking, whether these are the lagging or the leading metrics, and how you actually act on that. All those things are typically part of the platform engineering space.
But if you have all these 8 axes, the question really becomes, how do
you actually bring it all together?
How do you actually make sure that it all works as part of your path to production
and as part of your value chain?
This is where the internal developer platforms and the orchestrators around them come in. These orchestrators make sure that all these different classes of capabilities work together within the platform ecosystem.
So that's how I would define platform engineering.
So now let's also take a step back and really talk about platform engineering versus DevOps, SRE, DevEx, and all that. These are conversations I still have with a lot of people I talk to.
One of the easiest ways to look at this is that platform engineering is the core of all of these terms that we're talking about. If you think of DevOps as a cultural paradigm, a way of doing things that makes sure your development process is efficient enough and your developers can actually get the job done in the most efficient manner, platform engineering enables that.
At the same time, your developer experience requires platform engineering. Anything that we talk about with developer experience, whether it is ensuring that you're reducing the cognitive load of the developers, requires platform engineering. And it supports the whole SRE framework as well, SRE being Site Reliability Engineering; that really requires some of the platform engineering capabilities. Again, a very simplistic way of looking at it, but this ensures that there is no confusion with respect to what platform engineering is doing and what the scope of that is.
We can maybe also spend a couple of minutes on the actual evolution of this. We know that 10 or 15 years back, when public clouds weren't as popular as they are today, we used to have data centers and we used to have this idea of DevOps, where we would really try to ensure that developers were working as efficiently as possible. Over time, we started seeing larger sets of servers, scalability issues, and things like that. At that point, we started doing a lot more configuration management, and the early principles of platform engineering started coming in. As we started moving to cloud, where the whole idea is that cloud is not a location but a way of actually doing your work, a way of working, platform engineering became a lot more prevalent. And that is what has led to the current state, where you typically have a hybrid cloud, or at least a multi-cloud situation, in any organization, and they are really starting to see how you can make your platform activities a lot more efficient.
This is where both AI-led and AI-infused platforms have become very popular, in the larger context of domain-specific ways of looking at platform engineering. When I say domain specific, going back to the previous point, I mean those banking services or healthcare services or automotive services and those kinds of things. If you look at it, think of it like a Venn diagram, where one of the circles is the AI capabilities in platform engineering. These are any kind of AI capabilities, not particularly a predictive or a generative capability, but AI capabilities in general. Then you also have certain industry abstractions of those platforms. For any kind of domain-specific platform, you'll have certain industry-specific data that is available, and you need to extract that out, and applying the LLMs to these platforms will also improve that. But in order for all these things to work together, the thing we need to think about is that the organization should be able to support it. Anything that you're doing, the core of the whole success is going to be the culture that you bring about as part of your organization's DNA.
And as we just spoke about, there's the AI hype, even though that's been going on for a couple of years. We all know that we have been using AI in one form or the other for a while. We have had predictive AI for a while; we have been using forecasting, trend analysis, capacity management, and a lot of those kinds of things. We have been using AI for many years, but with the advent of Gen AI, we are seeing a lot more advances in code generation, test generation, all those kinds of things, which obviously need LLMs, these large language models like OpenAI GPT or Codex, which have really been a boon to the whole Gen AI development that we're seeing now.
Looking at some of the LLMs that are available today, I'm not going to go through each and every one of them, but as you can see, things like GPT-4 have helped a lot with code generation and general assistance with DevOps activities. And Codex has definitely helped with things like your IaC and API integration. So as you really start looking at what kind of models you need to improve your platform engineering experience, you have to first think about what the problem is that you're trying to solve, and how you can actually apply some of these very specific LLMs.
As you really start training your models, the epochs come into play. The epochs are part of the whole process of making sure that you iterate through your training so that you are converging on the right kind of data and the right kind of models, without getting into a situation where you either don't train them enough or overtrain them. This is where that convergence process is extremely important. In order to do that, you need to find the right number of epochs.
Let's look at a very simple example of how something like this could work. As you really start training your LLMs using epochs, the first thing you do is start with some kind of model initialization and tokenization. Once you set that up, you have to start looking at setting your hyperparameters. These hyperparameters are things like the number of epochs, the batch size that you're training with, as well as the learning rate and all that. Once you do that, you go into your datasets and start ensuring that you load your data and that you have the right kind of data coming out of the process. Then you go in and initialize the optimization process, using the model parameters that you set in the previous step, and set your learning rate. At that point, you can start going through an iterative process where you determine what the epochs are. For each epoch, once you actually go through the training process, make sure that after each batch is trained you check whether the optimization maturity has been reached. And once you have reached that, you end up saving the trained model and the tokenizer, making sure that you stop at the point where you have an optimal fit.
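To make that flow concrete, here is a minimal sketch of an epoch-driven fine-tuning loop. It assumes the Hugging Face transformers library, a small GPT-2 model, and a toy in-memory dataset; the hyperparameter values and the convergence threshold are illustrative placeholders, not numbers from the talk.

```python
# Minimal sketch of epoch-based fine-tuning (illustrative values, not from the talk).
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Model initialization and tokenization
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token           # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# 2. Hyperparameters: number of epochs, batch size, learning rate
num_epochs, batch_size, learning_rate = 3, 4, 5e-5

# 3. Load the (toy) domain dataset and tokenize it
corpus = ["pipeline stage: build", "pipeline stage: deploy", "observability: emit metrics"]
encodings = tokenizer(corpus, return_tensors="pt", padding=True, truncation=True)
loader = DataLoader(list(zip(encodings["input_ids"], encodings["attention_mask"])),
                    batch_size=batch_size)

# 4. Initialize the optimizer with the model parameters and the learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# 5. Iterate over epochs; stop once the loss looks "good enough" (the optimal fit)
convergence_threshold = 0.5                         # illustrative early-stopping criterion
model.train()
for epoch in range(num_epochs):
    epoch_loss = 0.0
    for input_ids, attention_mask in loader:
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        epoch_loss += outputs.loss.item()
    avg_loss = epoch_loss / len(loader)
    print(f"epoch {epoch + 1}: average loss {avg_loss:.4f}")
    if avg_loss < convergence_threshold:            # converged: neither under- nor over-trained
        break

# 6. Save the trained model and the tokenizer
model.save_pretrained("trained-platform-model")
tokenizer.save_pretrained("trained-platform-model")
```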
This is all great, right? So we looked at the LLMs and at how you can train them. Once you do that, one thing that you really have to think about, especially when you talk about domain-specific platform engineering, is the application of RAGs.
That is retrieval augmented generation. Essentially, think of RAG as something that bridges your training data gaps. You trained the model in the previous step, and once you have that, you obviously have a significant amount of knowledge. But you might still have some gaps if you haven't trained it to the level you want for your domains. RAG allows the LLMs to be a lot more relevant, a lot more accurate, and essentially achieve what you really want to achieve, which is to reduce the cognitive load of the developers and make sure that they are more effective.
Again, a very simple way of looking at it: on the left side, if you have an LLM without RAG, what you end up doing is that you write the query, you send it to the LLM, and you get the output. But what if you could actually send some domain- and organization-specific data along with it? As you do that, what you find is that you can pull in a lot more data sources, which eventually turn into an augmented query, a more efficient query that will yield better outputs. So that's a very simplistic, high-level view of how RAG can help.
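Here is a minimal sketch of that augmented-query flow. It assumes a hypothetical call_llm() function standing in for whatever model endpoint you use, and a naive keyword-overlap retriever in place of a real embedding index and vector store.

```python
# Minimal RAG sketch: retrieve domain snippets, augment the query, then call the LLM.
# call_llm() is a hypothetical stand-in for your actual model endpoint.

DOMAIN_DOCS = [
    "Account creation API requires KYC verification before activation.",
    "Payment processing must log every transaction for audit compliance.",
    "Loan management services expose amortization schedules via /loans/{id}/schedule.",
]

def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Score documents by naive keyword overlap with the query (a real system
    would use embeddings and a vector database)."""
    query_terms = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(query_terms & set(d.lower().split())), reverse=True)
    return scored[:top_k]

def answer(query: str) -> str:
    context = retrieve(query, DOMAIN_DOCS)
    augmented_query = (
        "Use the following domain context to answer.\n"
        + "\n".join(f"- {snippet}" for snippet in context)
        + f"\n\nQuestion: {query}"
    )
    return call_llm(augmented_query)   # without RAG you would send `query` directly

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: wire this to your LLM of choice.
    return f"[LLM response for prompt of {len(prompt)} characters]"

print(answer("How do I create a new account for a customer?"))
```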
So now let's look at a case study, specifically in the banking and financial services domain, as to how you could actually enhance your developer experience for this particular domain. Essentially, these engineers are building products within this domain, and you're going to integrate some traditional platform engineering activities along with the LLMs as you do this.
The approach I would first recommend is to start with identifying those critical BFSI APIs, and we'll talk about what those are. Then we really start thinking about the domain-specific RAGs that you would need. This is the most critical step, right? You're really going to figure out where you're going to get this data from. Keep all the terminology aside; it's just that you want to identify where you get the data from. Once you do that, steps 3 to 8, as you can see, are very generic to building any kind of platform activity. Whether it is an engineering platform or a business platform, it doesn't matter; those steps should remain the same. And we will actually go through the process to understand how that works.
So the first step is identifying your domain specific APIs.
So what are these domain specific APIs?
Fairly simple.
Again, if you are in the banking and financial services industry, the kinds of things that are important to you are things like account creation, account balance inquiries, transactions, and those kinds of things, as well as payment processing, loan management, and all that. You need to clearly identify what those APIs are, what those functions are. If you can't get that right, then anything that you do later isn't going to be useful. So it's very important for you to do some product management: involve your product management teams to make sure that you clearly understand what the business needs are. And even between two different banks engaged in similar activities, your requirements might be different depending on where your business priorities are.
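In practice, the output of this step can be as simple as a machine-readable catalog of the critical BFSI functions that the product teams have agreed on. The entries below are illustrative examples, not a definitive list.

```python
# Illustrative catalog of critical BFSI APIs identified with the product teams.
from dataclasses import dataclass

@dataclass
class DomainApi:
    name: str          # API identifier
    purpose: str       # business function it serves
    owner: str         # product team accountable for it

BFSI_API_CATALOG = [
    DomainApi("create_account", "Open a new customer account", "retail-banking"),
    DomainApi("get_balance", "Return the current account balance", "retail-banking"),
    DomainApi("list_transactions", "Fetch recent account transactions", "retail-banking"),
    DomainApi("process_payment", "Execute and settle a payment", "payments"),
    DomainApi("manage_loan", "Originate and service loans", "lending"),
]
```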
Once you do that, the question becomes: how do you develop domain-specific RAGs? As you try to do it, there are a few things that you should look at. Think of these as the data sources: your internal documentation, your regulatory guidelines, your ontologies and taxonomies. Pretty much anywhere you can actually get this data from would be extremely useful. There are things like, for example, the open banking APIs; you might want to have your APIs comply with those within the RAGs, so essentially you include that information as you develop the RAGs. Then you also need to look at it from the point of view of what models are available and what knowledge graphs you can incorporate. Again, this is not an exhaustive list; it's the kind of thing that you might look into, the kinds of data sources that you might look into. This might actually turn out to be a good guideline for you to start with, and keep in mind that this is very specific. In this case, it's very specific to the banking and financial services industry, but as you really start looking at healthcare, you can probably find some very similar equivalents for the things that you see here.
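Here is a sketch of how those data sources might be pulled together into a tagged corpus for the RAG step. The directory names and source categories are assumptions for illustration; in a real setup they would map to wherever your documentation, regulatory texts, ontologies, and open banking specs actually live.

```python
# Sketch: assemble domain data sources into a tagged corpus for RAG ingestion.
# The paths and categories below are illustrative assumptions.
from pathlib import Path

RAG_SOURCES = {
    "internal_docs": Path("knowledge/internal"),        # runbooks, design docs
    "regulatory": Path("knowledge/regulatory"),         # e.g. local banking regulations
    "ontologies": Path("knowledge/ontologies"),         # BFSI taxonomies / glossaries
    "open_banking": Path("knowledge/open_banking"),     # open banking API specifications
}

def build_corpus(chunk_size: int = 1000) -> list[dict]:
    """Read every text file under each source, split it into fixed-size chunks,
    and tag each chunk with its source category for later filtering."""
    corpus = []
    for category, root in RAG_SOURCES.items():
        for path in root.glob("**/*.txt"):
            text = path.read_text(encoding="utf-8")
            for start in range(0, len(text), chunk_size):
                corpus.append({
                    "source": category,
                    "file": str(path),
                    "text": text[start:start + chunk_size],
                })
    return corpus

if __name__ == "__main__":
    chunks = build_corpus()
    print(f"Collected {len(chunks)} chunks ready for embedding and indexing.")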
Now, once you do that, the next set of steps is going to be fairly straightforward if you know how platforms are built; I'm going to use some pseudocode to show this. The first step in the process would be to define your APIs for the common tasks. What are these common tasks in this case? All the things that we spoke about in step one. Then, once you have that, you essentially create those integration APIs, essentially define those APIs for your usage. And once you define them, go ahead and start exposing those APIs so that they can be used by your developers.
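As a minimal sketch of what "define and expose" could look like, here are a few of those integration APIs written with FastAPI. The framework choice, route shapes, and payload fields are assumptions for illustration, not the speaker's actual implementation.

```python
# Sketch: define and expose the common BFSI integration APIs.
# FastAPI is just one possible choice; the routes and payloads are illustrative.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="BFSI Integration APIs")

class AccountRequest(BaseModel):
    customer_id: str
    product_type: str   # e.g. "savings" or "checking"

@app.post("/accounts")
def create_account(req: AccountRequest) -> dict:
    """Step 1 API: account creation (would delegate to the core banking system)."""
    return {"status": "created", "customer_id": req.customer_id}

@app.get("/accounts/{account_id}/balance")
def get_balance(account_id: str) -> dict:
    """Step 1 API: account balance inquiry."""
    return {"account_id": account_id, "balance": 0.0}

@app.get("/accounts/{account_id}/transactions")
def list_transactions(account_id: str) -> dict:
    """Step 1 API: recent transactions."""
    return {"account_id": account_id, "transactions": []}
```

Exposing them to developers then becomes a matter of serving this app (for example with uvicorn) behind your gateway and publishing the generated OpenAPI spec in the portal described next.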
The next step in the process would be to set up a self-service portal. The self-service portal is, again, a fairly straightforward idea. Go ahead and initialize the portal framework; this is where we talked about that integration framework that brings together pretty much all the steps within the platform engineering ecosystem. You can go ahead and add the appropriate documentation, and you can see that this is actually pulling in the data that you collected from the various data sources. Then go ahead and add this into your sandbox environment and make sure that you deploy that portal.
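A sketch of that portal bootstrap step is below. PortalFramework and its methods are hypothetical placeholders standing in for whatever internal developer portal you actually use (Backstage, a commercial IDP, or something homegrown).

```python
# Sketch: bootstrap a self-service portal and register docs, APIs, and sandbox targets.
# PortalFramework and its methods are hypothetical placeholders, not a real library.

class PortalFramework:
    def __init__(self, name: str):
        self.name = name
        self.entries: list[dict] = []

    def register(self, kind: str, ref: str) -> None:
        """Record a catalog entry (docs page, API spec, environment, ...)."""
        self.entries.append({"kind": kind, "ref": ref})

    def deploy(self, environment: str) -> None:
        print(f"Deploying portal '{self.name}' with {len(self.entries)} entries to {environment}")

portal = PortalFramework("bfsi-developer-portal")
portal.register("documentation", "knowledge/internal/onboarding.md")   # from the RAG data sources
portal.register("api-spec", "openapi/bfsi-integration.yaml")           # the APIs exposed earlier
portal.register("environment", "sandbox")
portal.deploy("sandbox")
```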
The next step of the process would be to have your developer environments. As a developer, you need to make sure that you have the right kind of environments to continue your development process. You start by creating your dev containers, then publish those dev containers and provide instructions for the local setup. This is where you provide the right kind of configurations and instructions for the developers to execute so that they can actually get this running in their local environments.
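A sketch of the build-and-publish side of that step, assuming Docker is available and using an illustrative image name and registry:

```python
# Sketch: build and publish a dev container image, then print local usage instructions.
# The image name and registry are illustrative assumptions.
import subprocess

IMAGE = "registry.example.com/platform/bfsi-devcontainer:latest"

def publish_dev_container() -> None:
    subprocess.run(["docker", "build", "-t", IMAGE, "."], check=True)   # build from the local Dockerfile
    subprocess.run(["docker", "push", IMAGE], check=True)               # publish for the team

def print_local_instructions() -> None:
    print("To develop locally, run:")
    print(f"  docker run --rm -it -v $(pwd):/workspace {IMAGE}")

if __name__ == "__main__":
    publish_dev_container()
    print_local_instructions()
```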
The next one is, again, fairly straightforward: think about your end-to-end observability. As you start setting up your observability, you need to think about your logging, your monitoring, your tracing, the metrics activities, and all that. And once you have all of those things, expose those tools to the developers so that they have access to them and can work with them.
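Here is a minimal sketch of the instrumentation side, assuming the prometheus_client library for metrics and the standard library for logging; the metric names and the port are illustrative.

```python
# Sketch: minimal logging + metrics instrumentation exposed to developers.
# Metric names and the port are illustrative; prometheus_client is one possible choice.
import logging
import time

from prometheus_client import Counter, Histogram, start_http_server

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("bfsi-platform")

REQUESTS = Counter("bfsi_api_requests_total", "Total API requests", ["endpoint"])
LATENCY = Histogram("bfsi_api_latency_seconds", "API request latency", ["endpoint"])

def handle_request(endpoint: str) -> None:
    """Wrap a request with logging, a request counter, and a latency histogram."""
    REQUESTS.labels(endpoint=endpoint).inc()
    with LATENCY.labels(endpoint=endpoint).time():
        log.info("handling request for %s", endpoint)
        time.sleep(0.01)   # stand-in for real work

if __name__ == "__main__":
    start_http_server(9100)          # metrics scraped from http://localhost:9100/metrics
    handle_request("/accounts")
```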
At that point, you also set up your CI/CD pipelines. Your pipelines are things that you would set up as part of this step, but your developers would then go ahead and instantiate them every time they run through. Essentially, what happens here is that you create your pipelines, the CI/CD pipelines, then add the appropriate stages in there for security, compliance, automated testing, and things like that. The compliance checks would mostly happen during every step of the pipeline, making sure that you have some kind of compliance at every point that changes.
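A sketch of that pipeline shape, expressed in Python rather than any specific CI system's configuration; the stage names and the compliance gate after each stage follow the description above, and the stage bodies are placeholders.

```python
# Sketch: a CI/CD pipeline modeled as ordered stages with a compliance gate after each one.
# Stage names and check logic are illustrative placeholders.
from typing import Callable

def build() -> None:
    print("building artifacts")

def unit_tests() -> None:
    print("running automated tests")

def security_scan() -> None:
    print("running security scans")

def deploy() -> None:
    print("deploying to the target environment")

def compliance_gate(stage_name: str) -> None:
    """Run the compliance checks required after every stage; raise to stop the pipeline."""
    print(f"compliance check after '{stage_name}': passed")

PIPELINE: list[tuple[str, Callable[[], None]]] = [
    ("build", build),
    ("test", unit_tests),
    ("security", security_scan),
    ("deploy", deploy),
]

def run_pipeline() -> None:
    for stage_name, stage in PIPELINE:
        stage()
        compliance_gate(stage_name)   # compliance enforced at every point that changes

if __name__ == "__main__":
    run_pipeline()
```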
The last step of the process would be your security and compliance. Essentially, this is where you're going to be doing your vulnerability scanning and the compliance checks that are required by the industry and by your organization, then ensuring that these are built into the pipeline and reporting some of these results back to the observability platform, or wherever the developers need them reported, so that the whole ecosystem works together.
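A sketch of that last step: run the scans, evaluate a simple compliance rule, and push the results to the observability side. The run_scanner() function and REPORT_URL are hypothetical placeholders for your real scanner and reporting endpoint.

```python
# Sketch: vulnerability scanning plus a compliance check, with results reported back.
# run_scanner() and REPORT_URL are hypothetical placeholders for your real tooling.
import json
import urllib.request

REPORT_URL = "https://observability.example.com/api/compliance-reports"

def run_scanner(image: str) -> list[str]:
    """Placeholder for a real vulnerability scanner invocation."""
    return []   # e.g. a list of CVE identifiers found in `image`

def run_compliance_checks(image: str) -> dict:
    findings = run_scanner(image)
    return {
        "image": image,
        "vulnerabilities": findings,
        "compliant": len(findings) == 0,   # simplistic rule: no known vulnerabilities
    }

def report(results: dict) -> None:
    """Send the results to the observability platform (or wherever developers look)."""
    request = urllib.request.Request(
        REPORT_URL,
        data=json.dumps(results).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(request)   # would fail against the placeholder URL

if __name__ == "__main__":
    print(run_compliance_checks("registry.example.com/platform/bfsi-service:1.0"))
```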
So as you can see, a lot of these things are very similar. Steps three to eight were exactly the same, irrespective of the kind of domain that you would do this in. And that's the beauty of how you would actually develop this.
Having looked at this case study, one of the things we really have to think about is: how is this really helping, or how should it be helping? Today, if you look at a typical organization without a significant amount of platform engineering improvements, about 70 percent of your developers' time is overhead. It might surprise you, but only about 30 percent of your developers' time is productive. The rest of the time is overhead, something that is duplicated work, something that they could avoid. But by applying some platform engineering principles, what the data shows us is that your value add can increase significantly, from 30 percent to 60 percent.
This is significant, right?
But is that enough?
Probably not, right? Our goal is to increase that as much as possible. One of the things that we see is that by applying Gen AI with RAGs, you can increase that further, at least up to 90 percent. I haven't seen a lot of data yet, though I'm starting to see some very promising data in this space. But if you have some very specific data that shows that improvement, I would love to talk to you and hear a lot more about that.
You also have to think about, as you really look at this, what's the leverage that Gen AI with RAGs is bringing to the table? We know a lot about coding, debugging, testing, monitoring, and observability as the areas where platform engineering can be applied and improved. But there is also significant leverage that we're seeing within solution architecture, requirements analysis, and user research. So you should really be thinking about having that holistic leverage across your overall ecosystem as you go through your path to production, as you really try to make sure that you build and deliver your product.
In conclusion, I just wanted to say that AI is literally changing the way we are doing platform engineering. It's definitely pushing some of the boundaries, but it's been very expensive to train some of these LLMs, so we need to focus on the epochs and make sure that we get the right kind of focus there. And ultimately, the difference is going to be the RAGs. Especially when you're doing the domain-specific work, the real benefit is going to come from having the RAGs, taking it to that next step.
And one thing never to forget is why we are doing this. The metrics really tell you why you're doing it and what you want to achieve out of it; those things don't change. So if you keep one eye on the metrics, you should always be able to ask: is this what I'm really trying to achieve, and am I able to achieve it?
So with that said, I just wanted to thank you for your time. If you want to connect with me, go ahead and connect with me on LinkedIn using that QR code. If you want to learn more about my company, definitely check that out. Thank you so much, and I appreciate you listening to this.