Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, everyone.
I'm very glad to be here.
I'm Amir, the CEO and co-founder at Sensor.
Sensor is an AIOps company bridging together advanced observability elements, using eBPF and machine learning stacks, which use data to look for deficiencies and service degradation. What we came to speak about here today is essentially a very interesting concept: bridging the gap, both technically and from an operational point of view, and how to utilize these advances in the form of an internal developer portal. It's essentially about how to use the real-time topology, the dynamics of the environment, as the connecting tissue between the platform engineering part and the DevOps and SREs.
So, going through the journey we will cover today: we will speak about an IDP in general and what's missing from the IDP. We'll see that it's not the first use case to suffer from this exact same phenomenon, and what we can learn from cybersecurity in that respect. We'll take a deep dive into how real-time topology can be used as a shared language to bridge between the IDP operation, meaning what I want to do and what the configured state of the services or system is, and how to correlate and bridge that with the runtime topology. That is an important part of essentially connecting these two worlds. We'll speak about what you can do with a closed-loop mechanism: what happens if you feed data from the observability mechanism, the real-time topology, back into the IDP. And we'll finish this session with a few pointers on how to get started and which are the most important parts to pay attention to. I hope you'll enjoy it.
Let's start. IDPs, or internal developer portals, are incredible tools for bridging between the development part and the DevOps part: essentially, the ability for an organization to create both the schema and the rules of what is going to run in production, what type of artifacts need to be collected and configured about them, and to expose that to the developers as a means to save time for our platform engineering teams and DevOps teams. But pay attention: they're very focused on what I'm trying to do, what I'm trying to achieve, and what configuration state I want applied in the production environment.
And with that, we will see elements. In this case, we're looking at a Kubernetes environment, or Kubernetes topology, in which we have the physical resources like the nodes, and we have the logical resources like the Kubernetes clusters, the namespaces, the workloads, and the pods. We can see the schema elements on top of them, the elements which we're trying to bring into configuration by the IDP and all the way to the developer. On their own, though, they're only one half of the equation, because they're sometimes missing or lacking some critical context.
That context stems from the difference between what's happening at design time and what will happen in the runtime environment. Resources will behave differently. If we're looking at the pod or workload environment, the things I configure, like resource requests and limits, are elements or artifacts which I'm declaring, but the behavior in runtime is going to be different. It'll be different because the node has different resources to give, and so on and so forth. The interactions between the different elements are also sometimes hard to encompass. If you're looking at things like APIs, which are crucial for the interaction between the elements, there are parts which are much harder to model in a static environment versus what they're really doing in production and how they manifest themselves in production.
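The requests-and-limits gap described above can be sketched as a small drift check: compare a pod's configured resources, the design-time view the IDP holds, with its observed runtime usage. All class names, field names, and numbers here are illustrative assumptions, not a real Kubernetes or observability API.

```python
# Sketch: compare a pod's *configured* resources (what the IDP / manifest says)
# with its *observed* runtime usage (what an observability feed reports).
from dataclasses import dataclass

@dataclass
class PodSpec:            # design-time view (what the IDP knows)
    name: str
    cpu_request_m: int    # requested CPU, millicores
    cpu_limit_m: int      # CPU limit, millicores
    mem_limit_mi: int     # memory limit, MiB

@dataclass
class PodRuntime:         # runtime view (what observability reports)
    name: str
    cpu_p95_m: int        # observed 95th-percentile CPU, millicores
    mem_peak_mi: int      # observed peak memory, MiB

def drift_report(spec: PodSpec, runtime: PodRuntime) -> list[str]:
    """Flag gaps between the configured state and the runtime behavior."""
    findings = []
    if runtime.cpu_p95_m > spec.cpu_limit_m:
        findings.append(f"{spec.name}: CPU p95 exceeds configured limit (throttling likely)")
    if runtime.cpu_p95_m < spec.cpu_request_m // 2:
        findings.append(f"{spec.name}: CPU request is >2x actual usage (over-provisioned)")
    if runtime.mem_peak_mi > spec.mem_limit_mi:
        findings.append(f"{spec.name}: memory peak exceeds limit (OOM-kill risk)")
    return findings

spec = PodSpec("checkout", cpu_request_m=500, cpu_limit_m=1000, mem_limit_mi=512)
rt = PodRuntime("checkout", cpu_p95_m=1200, mem_peak_mi=480)
print(drift_report(spec, rt))
```

The point is not this particular policy but the shape of the loop: neither side alone can tell you the service is being throttled; you need both views joined on the same entity.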
Another good example would be the use of third parties. Sometimes it's very hard to conceptualize even the schema layer. I'm using a certain third-party service which is an important part of the service that I'm trying to deploy. How am I learning about that exposure? How am I learning about those dependencies, which are, of course, very important in the complete software life cycle, the deployment life cycle, that we're trying to achieve? And of course, runtime configuration: what eventually happened, what was allocated, what was configured on the different physical and logical layers, and misconfigurations, some of which can happen only at runtime. For example, someone bypassing the IDP and doing a configuration directly on the cloud environment can lead to a lot of interesting gaps between what the IDP is seeing and what the real situation is.
The important aspect to understand here is that we're not the only domain facing this exact same challenge. A very good analogy is the world of cybersecurity, and more precisely, taking platform engineering and SRE as the comparison, the relationship between application security, the AppSecOps, and the DevSecOps: the ones responsible for what I'm doing about application security before production versus the ones who actually serve and maintain it in the runtime environment. And we can see, throughout the short but meaningful history of cybersecurity, that bridging the gap between static analysis and runtime, to create an environment tailored toward the shift-left movement, was an important part of making that a reality.
Moving between the two: in the static catalog, the artifacts would be things like vulnerabilities, OS components, authentication and encryption keys, configurations of different frameworks, and secrets. In an analogy to an IDP, these would be part of the schema which I'm trying to configure. But the runtime environment yields a lot of important dynamics that tie the knot between them. The elements I care about in production would be: is something really deployed? Where is it deployed? Is it internet-exposed in terms of the attack surface? Is it behind some sort of firewall or filtering mechanism? Are there runtime misconfigurations? Do they have any relevancy to something which is in the static catalog? That bridging was very important in order to, one, get an accurate view of what the actual attack surface is, focusing on and prioritizing the real risk, not just vulnerabilities, which would of course collapse the entire team because we'd just try to fix everything. That formalization of what is important and what the priority is, it's crucial for these teams to perform their function efficiently. And of course, the embedding of policies and guardrails earlier in the SDLC, making sure that the things and the elements which I'm putting in at configuration time, in the earlier stages of the software development and deployment pipeline, are really going to appear in the runtime environment, is a very strong analogy to what's actually happening in the IDP world.
So for platform engineering, this context comes from essentially having a very good grasp, a very deep view, of the real-time topology of the services: what is actually going on in production. The benefits of being able to feed this type of element back into the IDP are numerous; I will just mention a few. The ability to get full, closed-loop visibility: what I tried to do, what I meant for things to look like, and what actually came about in reality. More efficient troubleshooting: of course, there is a very blurry line between where one responsibility, like the platform engineer's, is going to end and where another responsibility, like the SRE's, is going to start. There's a lot of back and forth between them, going all the way back to the developers, who are fed by the IDP. Therefore, this type of information is super crucial. And eventually, less IDP maintenance is required, for various reasons: being able to actually be accurate on the status of the production environment, and not being left to chase things which are runtime-related rather than IDP-related, and so on and so forth.
Why is it hard? We understand it's not a trivial move; otherwise people would have done it already. There are some runtime integrations into the IDP, like getting observability streams from your APM or infrastructure monitoring tools, but it's still hard. It's hard because the runtime dynamics are much more complex than the static dynamics. The static dynamics can be somewhat compared to a configuration file or a configuration tree, something that I can traverse easily. That's not the nature of the runtime environment. The runtime environment is a live, moving, changing graph, which has a lot of inputs coming from things that happen on the infrastructure layer, things that happen on the application layer, and just changes coming from the software deployment pipeline. All of these together make these dynamics very active and very time-sensitive, so this type of bridging becomes harder to achieve than you might think at the beginning.
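The "live, changing graph" point can be made concrete with a minimal sketch: runtime topology is a set of timestamped, directed edges whose freshness decays, not a tree you traverse once. The class, event names, and TTL policy below are illustrative assumptions, not any vendor's data model.

```python
# Sketch: runtime topology as a timestamped directed graph whose edges age out,
# in contrast to a static configuration tree that stays valid until edited.
from collections import defaultdict

class RuntimeTopology:
    def __init__(self):
        # edges[src] -> {dst: last_seen_timestamp}
        self.edges = defaultdict(dict)

    def observe_call(self, src: str, dst: str, ts: float):
        """Record that `src` was observed calling `dst` at time `ts`."""
        self.edges[src][dst] = ts

    def dependencies(self, service: str, now: float, ttl: float = 300.0):
        """Edges are only trusted while fresh; stale ones age out."""
        return [dst for dst, ts in self.edges[service].items() if now - ts <= ttl]

topo = RuntimeTopology()
topo.observe_call("checkout", "payments-gw", ts=100.0)
topo.observe_call("checkout", "redis-cache", ts=390.0)
# By t=450, the payments-gw edge (last seen at t=100) has aged past the TTL,
# so only redis-cache remains a trusted dependency.
print(topo.dependencies("checkout", now=450.0))
```

This time-sensitivity is exactly what makes the bridging hard: any snapshot you push into the IDP is already decaying, so the feed has to be continuous rather than one-shot.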
So topology enriches the IDP with runtime concepts on multiple layers. We're speaking here not only about how the production environment looks, but also about the context of the running environment: the running application, the infrastructure, the network layer, and the APIs. I'll give a few examples as we go through the following slides. What you can see here, coming from our system (and I would assume achievable by other systems as well), is, first of all, the ability to automatically get the important information coming from the observability layers: which parts were deployed, what type of APIs are now being run. That, as a beginning, gives a very nice glance at what was deployed in production. Moving forward.
Now, if we look at the IDP hierarchy, or the IDP schema, we'll most likely be looking at specific services: the deployments' run status, the runtime configuration, which will appear as runtime labels, and so on and so forth, and the performance of strategic services. If we look again here, at one of the services, now within that context we can see things like the health behavior, the important metrics which are crucial to understand the health and performance of that specific service, the ability to track the deployment, what's meant to be happening and how it behaves in runtime, and of course the events, the things which accompany or represent the operational status of the service in the production environment.
Moving forward, combining what we spoke about earlier at the next level, this would be the third-party resources and the internal resources with regard to their actual properties, things like availability and latency. The third-party resources could be public APIs or payment gateways, for example. If we look here again, now at the real-time Kubernetes topology, here are the third parties being used, things like the databases or queues, runtime caches, and APIs. And if we open the internal deployment, now we can see the runtime dynamics. Each and every thing here is accompanied by a lot of different layers, but the important part is that you can really see the flow. You can really see how it's all tied into the bigger picture. All this information is essentially very complementary, and very important, when given alongside the IDP data, the "what we're trying to achieve" data.
Okay, so let's try to recap and understand the real-time topology at a high level. It allows us to enrich the service catalog with both what we put in our schema, what we tried to do, and what actually happened, what was represented in the runtime environment, with all the different contexts. This includes the configuration of dependencies, and that's a very important goal that we're trying to achieve. It also provides a common language. The IDP tries to encompass the platform engineering way of thinking about what the right abstraction level of the production environment is. That's something which clearly needs to be explained to and used by the developers. But on the other hand, this implicates the runtime environment. So the common language here cannot go only a single way. It's not only IDP up toward developers; it's also, if you will, IDP down toward the runtime environment, and with that, the production environment personnel who are responsible for maintaining that runtime environment. SREs and DevOps would be a good example. You have to close that loop, and if not, then the IDP as a function will always be a center of contention, because you need to comply with both of these layers in order for it to work in reality. It also furnishes very crucial context for decision making. If I'm looking at something in the IDP that is exposed to my developers, that's not the only knowledge in the world which is crucial for decision making throughout the software development lifecycle and the operational or maintenance lifecycle. There needs to be more data there, exposed to the developers and exposed to decision makers, so that throughout things like issues occurring, maintenance, or analyzing what should be done, the entirety of the needed information is there for right decision making and a clear view of what's going on.
So let's look at three real-world use cases. First, anticipating a reliability impact. One of the holy grails, one of the most wanted things to achieve, is to understand, when I'm going to do something, what the eventual implication will be, and how I can close that loop as fast as possible in order to verify that it's indeed working, so that an anticipated performance hit or implication on the runtime environment can be recognized very early. With that, topology can enrich the tribal knowledge: what people are thinking, what people know from past experience, what they've already encountered in past incidents. With the topology, that becomes not tribal knowledge but an actual set of data points which can be verified: understanding which services are affected, whether there is a runtime implication, whether there is any implication on the service level objectives or the service level agreements which I commit to for my users. And eventually it helps us answer important real-world questions. For example, what does the service really cost from a resource point of view? Where are there changes in the CPU consumption envelope, the memory consumption envelope? What is eventually the spin-up cost of a service? The benefits are, of course, being able to define stronger guardrails within the IDP based on real-world data, based on data-driven checklists, and not based only on what we call tribal knowledge or only on experience, assumptions which are, of course, bound to be broken.
People are making a lot of software changes. People are making a lot of infrastructure changes. New deployments will fail. Even existing deployments with updates will fail. And that is something which has to be as dynamic as the runtime environment. A real-world example could be a service deployed in a high-volume transaction environment. During peak hours, we would expect to see a lot of mechanisms coming into play, things like autoscaling, or the service consuming many more resources, and that often leads to incidents by itself: incidents related to memory overload, or chain events like impacting downstream services, things like dashboards or front ends, or any transactional environment. So being able to foresee, to do a loopback from the runtime environment toward what's being presented in the IDP, can give you the right data in real time, but the process of improvement also allows that to be incorporated within the IDP, so the guardrails really represent what's happening in the production environment and are therefore much closer to what the guardrails should be for the actual deployment.
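One simple way to turn that loopback into a guardrail, sketched below, is to derive the limit from observed runtime data, for example a high percentile of peak-hour usage plus headroom, instead of a guess made at design time. The percentile policy, metric names, and numbers are illustrative assumptions.

```python
# Sketch: a data-driven guardrail derived from observed runtime peaks,
# replacing an intuition-based limit. Percentile-plus-headroom is one
# simple policy among many.
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of numeric samples."""
    s = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

def suggest_memory_limit_mi(observed_peaks_mi, headroom_pct=30):
    """Guardrail = p99 of observed memory peaks, plus percent headroom."""
    p99 = percentile(observed_peaks_mi, 99)
    return p99 + p99 * headroom_pct // 100

# Peak-hour memory peaks (MiB) as they might arrive from a topology feed:
peaks = [410, 395, 430, 455, 420, 600, 415, 440]
print(suggest_memory_limit_mi(peaks))
```

Because the feed is continuous, the suggestion can be recomputed on every window, which is what lets the guardrail stay "as dynamic as the runtime environment" rather than drifting away from it.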
Another element is how we use that to verify deployments in real time. As we said, IDPs often offer developer guardrails, things like limits on memory usage, CPU usage, latency budget, or bandwidth budget, to eventually ensure the efficient and reliable running of a deployment. These are there to make sure that even a faulty deployment won't bring down the entire system, but will be localized so it doesn't affect the rest of the services. But these are, a lot of the time, set during the development phase and, worse, sometimes based on assumptions or intuition, which is essentially the message that we're trying to convey here. This is, a lot of the time, the source of many failures. It cannot be based on intuition; it has to be derived from the runtime environment, which will feed back the values to be pushed into the IDP, or integrated into the IDP, so that the IDP represents what's really happening. It can go in two directions: it could be that I'm assuming something which is too low for the runtime environment, or something which is too high, and that is another aspect of how efficient I want to be in the runtime environment. As the topology can provide this runtime monitoring of performance, of deviations from means or from any other metric, it can accelerate remediation by providing enough data for the developer who is now trying to understand what's going wrong, or for the platform team, which is trying to understand what's going on. And, as we said, the continuous process of improvement also allows these guardrails to converge toward reality. When that becomes a continuous process on the platform engineering part and on the development part, it eventually creates a better methodology to make sure that these things run smoother and smoother.
A real-world example: when a new service is consuming more resources (excessive CPU here, but it could be memory or any other consumable resource), the real-time topology, with its metrics, logs, and other evidence, can flag that immediately, and can flag it with the context of who's speaking to whom and what's going on within the dynamics, which is a valuable and crucial piece of information needed by the developers.
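That verification step can be sketched as a small check that compares live metrics against the guardrails recorded in the IDP and flags deviations in either direction, since both "too low" and "too high" matter. The verdict labels, slack factor, metric names, and values are illustrative assumptions.

```python
# Sketch: real-time deployment verification. After a rollout, compare live
# metrics from the runtime topology against the IDP's guardrails and flag
# deviations in *both* directions.
def verify_deployment(guardrails: dict, live_metrics: dict, slack=0.25):
    """Return per-metric verdicts: 'ok', 'breach', 'over-provisioned', or 'no-data'."""
    verdicts = {}
    for metric, limit in guardrails.items():
        observed = live_metrics.get(metric)
        if observed is None:
            verdicts[metric] = "no-data"
        elif observed > limit:
            verdicts[metric] = "breach"            # guardrail set too low
        elif observed < limit * (1 - slack):
            verdicts[metric] = "over-provisioned"  # guardrail set too high
        else:
            verdicts[metric] = "ok"
    return verdicts

guardrails = {"cpu_millicores": 1000, "memory_mi": 512, "p99_latency_ms": 200}
live = {"cpu_millicores": 1150, "memory_mi": 300, "p99_latency_ms": 180}
print(verify_deployment(guardrails, live))
```

Run continuously, the same check is what drives the convergence described above: each "breach" or "over-provisioned" verdict is a candidate update to the guardrail stored in the IDP.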
Investigating and responding to issues. We understand that IDPs, a lot of the time, support the notion of incidents. They often support integration with observability tools, either on the infrastructure layer or the observability layers. They support integration with third-party or higher-layer incident management platforms. So the data is contextualized; the data is already meant to be there. The service catalog is a very valuable asset for root cause analysis. It's essentially the set of building blocks which I'm looking at, and the root cause, or the source of an issue, is most likely going to be something within the catalog. There are caveats to that, like the example that we gave about third parties, but most likely, if it's an internal issue, it's going to be an element which is part of the catalog. The problem, or the element which is missing, is that the catalog only extends through, again, what I'm trying to configure: what my schema is, what I'm trying to orchestrate into the runtime environment. If we bring in the information from the runtime environment, now we essentially have the matching between the catalog and the runtime catalog, if you will, and those crossings between them are a very important part of the root cause analysis, or issue analysis, that any engineering team is going to run through: essentially trying to rule out elements which are not relevant until they converge on the element which is faulty. That is a very natural and native behavior for teams, and a tool like the IDP has to support it and has to bring in the data in order to enable that.
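The catalog-versus-runtime matching just described is, at its core, a set difference, sketched below. Diffing what the IDP catalog declares against what the runtime topology actually observed immediately narrows an investigation. The service names are illustrative.

```python
# Sketch: cross-reference the IDP service catalog (declared) with the runtime
# topology (observed) to partition services for root cause triage.
def cross_reference(catalog: set[str], runtime: set[str]):
    """Partition services into triage buckets for root cause analysis."""
    return {
        # declared and observed: the normal candidates to rule in or out
        "matched": catalog & runtime,
        # declared but never seen at runtime: failed or missing deployments
        "missing_at_runtime": catalog - runtime,
        # observed but undeclared: shadow dependencies, e.g. third parties
        "unknown_to_catalog": runtime - catalog,
    }

catalog = {"checkout", "payments", "inventory"}
runtime = {"checkout", "payments", "redis-cache", "stripe-api"}
print(cross_reference(catalog, runtime))
```

The "unknown_to_catalog" bucket is exactly the third-party caveat from earlier: dependencies the schema never modeled, which only the runtime view can surface.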
A real-world example: a team needs to identify whether a service which is now faulty, degrading toward the customer or the internal customer, is related to issues with a third-party resource. That is really hard to do based just on the orchestration, or just on the configuration part. We have to have more data in order to understand the health of that service and to understand the dynamics of that service's use of third-party or external dependencies. And that should, again, be presented to the platform engineering team and to the developers, to understand not only where to look but also where to put the focus, to pinpoint the exact location where the investigation can start. That's something which cannot be achieved just by looking at the static schema or the service catalog. It has to come from somewhere else, but it's very important to look at it with regard to, and in the context of, the elements in the system.
Common pitfalls and how to avoid them. Maintenance: this is essentially true for a lot of things, but closing the loop between the service catalog and runtime is only a first step. Don't overlook the resourcing, and don't assume it will be a minor or very small engineering effort to keep it properly maintained, to audit it, to service it. There's a lot of commitment needed here to make these things run and operate smoothly. But it is for sure an investment worth making.
Incomplete runtime context. In order to do what we're speaking about, you need enough context, and that depends on the complexity of the system and on the number of, if you will, free variables in the system, which include the resources, the APIs, and the third-party services. This is sometimes hard to bring in, not only in terms of actually having the data, but from the configuration of the runtime tool set: how much investment is needed, not only from a time perspective but also from a tool-price perspective, to have all the different connectors to the different layers in the runtime environment and bring it all into the IDP. It's worth mentioning here that technologies like eBPF can enable auto-discovery of the runtime environment, which can ease the toil of exposing that data. Avoiding the need to manually configure a million types of dashboards and a million types of connection points can be very helpful here.
Static snapshots. To be useful, the runtime topology needs to be continuously and automatically discovered and updated. Not only is this the way to collect and analyze the data; it also lets you avoid a lot of documentation which is, most of the time, already irrelevant on the day it's written. So being able to get that as a continuous process is super important and saves a lot of time versus trying to do it any other way.
Tips for getting started. Building your topology: of course, observability tools can help. A lot of them have elements like runtime topology, usually based on things like DNS or basic modes of networking, but they often require a lot of configuration around them to explore all the different layers. There are platforms that offer auto-discovery, a real-time or real mapping, if you will, of the environment. This is, as we said earlier, often with the help of technologies like eBPF, which expose the dynamics not only from a configuration or dependency point of view, but stemming from real networking elements, real interactions between the different services.
Next, the integration of the topology with the IDP. Most IDPs already offer integration with observability and incident response platforms, so it would be very natural to integrate the topology part, with the caveat we spoke about earlier: it will only be as good as the topology exposed by the observability tool. But as a first step, even bringing it front and center toward the users of the IDP is very important: put all the needed data in one place, and make this data reachable by the users of the IDP. Then, prioritization of use cases.
Prioritization is naturally important in any engineering venue, but be able to identify the quick wins. Where are the hotspots? Where are the services which are more prone to failure, which cause more problems or more challenges in the production environment? Focus on them. You don't need to solve the entire system; you won't be able to solve an entire system, especially a scaled one. But the ability to narrow down, to look at the important parts, the important building blocks, whether they're internal, whether they're platform or infrastructure parts, or service elements, or third-party elements which are super important, and to prioritize based on them, matters. And eventually, when the relevant team is looking at the IDP, or the relevant team is using the IDP, if we can put this quick win where that specific team is, then in the grand scheme of things it's not only net positive but largely net positive, because from the beginning these are the elements which are most likely to require it and use it.
And the last element would be to scale and expand. This is very common advice when we're speaking about the concept of quick wins, but you need to prove to the users of the IDP that this is a valuable tool. So using this early win, and being able to go step by step, building more and more opportunities to weave the runtime context into the IDP, is not only valuable in itself; as a process of internally conveying, of proving internally, that this is a winning mechanism, it's super important. Using this strategy to build it in steps, prove it in steps, and eventually use it in steps is going to eventually affect the software development life cycle and the deployment life cycle, and all of that will be achieved by improving in ways that people can cope with.
I really hope that you enjoyed that as much as we did. We think there's very large room for collaboration between what's being done in the observability world, the advances in the capabilities of the observability world and how real-time topologies can be auto-discovered, exposed, and maintained by the tools themselves, and integrating this data into the IDP. Eventually, it becomes only logical: when we're speaking about things like continuous integration and continuous deployment, and thinking about the software development lifecycle and the deployment lifecycle, those tools have to work in conjunction with one another; they're complementing one another, and each of them depends on the other. On one hand there is the IDP, which encompasses the organizational knowledge about what should be defined, how it should be orchestrated, where it should be orchestrated, and eventually commands the deployment to the production environment. And on the other hand there is the deployment environment, or the runtime environment, in reality. Those cannot live in a void. Eventually, they're interdependent, not only through the teams which need to work together in order to solve issues or add capabilities to the system, but also from a technical perspective: there is going to be a lot of convergence between the capabilities of those two, and integration here is native.
I really hope that you enjoyed it. We will be happy to answer any questions by email or through the website. I wish you a very happy, very productive day. Thank you.