Transcript
This transcript was autogenerated.
Hi, welcome to my talk about next generation enterprise observability workflows.
I'm Matt Morris. This talk will focus on unlocking observability
at scale, especially in enterprises, by reimagining what's
possible when integrating ITSM tools with observability tools.
If you just rolled your eyes when I said ITSM, two things.
One, I get it. Two, stay with me. You just might be surprised.
So after a brief backstory, I'm going to propose a new framework
for these integrations and what we should expect of them with real examples.
Whether you work in dev, Ops, SRE, ITSM leadership,
or some other functional area, I think you'll find some value in this talk.
So let's jump in. So why are we talking about this?
I think it's important for us to take a step back and think about some
of the things that led us here between the time when ITSM
was really in its heyday and now. First, dev and ops used to exist separately,
with a wall in between, such that once the product was developed, it would be
tossed over the wall for ops to run. And this led to a situation where there was a lot of
disconnect and honestly some tough outcomes
due to the fact that they were separated. This led, as we know, to the
DevOps revolution, which combined these two functions and allowed us to achieve much
better outcomes by having complete ownership end to end.
And SRE emerged as a practical way of achieving DevOps
outcomes with recommended practices and frameworks
for how we could do some of the activities that come with this model of
integration. But this pulled us away from the traditional approaches to
running services, which formed the foundation for a lot of
the ITSM methodology at the time. Next, on-prem workloads were lifted and
shifted to the cloud, then re-architected to run on containers, and are now
shifting ever more to run on services that are platform as a service or
function as a service. And this means that we went from a monolith architecture
to tightly and then more loosely coupled microservices, and maybe next to
multi-runtime microservices. Continuing the trend toward smaller, modular
pieces being managed individually seems to be where we're at. And this
led to a lot of interesting outcomes in terms of what used to
be normal with ITSM versus what we see in this world.
First of all, change management has evolved a lot, because CI/CD pipelines and
DevOps practices mean that change management is often not the toll gate for
deploying to prod that it used to be.
The CMDB, the configuration management database, predicates a lot of
its value on having a consistent inventory with rich attributes about
all of the items that are a part of delivering services to
your end users. Getting an inventory that's up to date for ephemeral
resources, especially when they're hosted in the cloud, is near impossible,
and consuming extra overhead to do it is very hard to justify.
So we need to be looking at ways to leverage the information that we already
have. Next up, service maps have become more challenging than ever,
and modeling them using traditional systems of mapping that are available in these ITSM
tools is suboptimal, and visibility suffers
because black holes develop in processes. Executives and leadership
can't really see the overall picture, and there's a lack of visibility into
performance, user experience, and user happiness overall. These are a lot of
the things that the ITSM tool is supposed to be able to deliver. This might seem like bad
news, but the reality is we have the ability to deliver
a lot of these things with the observability data that we already have. We're just
not leveraging it. So go with me here, and let's
think about this from a visual standpoint. On one side, we have monitoring or
APM or observability tools, and the evolution that has happened there into the
tools that we have today, and they're producing a lot of data right now. On the
other side, we have this ITSM tool like ServiceNow or Salesforce or something
like that, and it's producing a lot of data, but it's really from a process
standpoint, about how we get the things done that we
need to get done. Now, the thing is, in this current environment,
because of the challenges that we just talked about, there's this wide chasm that has
developed between these two sides, and the data passing between the two sides
and the handoffs between them have honestly been really sad.
I've spent years and years working with the integrations that exist between
these tools, making my own, and I can honestly say that these
are consistently some of the toughest integrations to get the outcomes that
I want. Now, there have been some attempts to unify these two
sides and drive some communication
across this chasm. One example is just taking a webhook,
and whenever there's an alert that's happening in an observability tool,
we toss that thing over to the ITSM side. Now the problem is this
just hearkens back to the old problem that we had with dev and ops. We're just
tossing the thing over the wall. There's no richness here, no workflow, no
automation capability.
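To make concrete just how thin that handoff usually is, here is a minimal sketch of the pattern, assuming a hypothetical ServiceNow instance; the URL, credentials, and alert field names are illustrative, not any particular product's API.

```python
# Minimal sketch of the "toss it over the wall" pattern: the ITSM side only
# ever receives a one-line summary. The instance URL, credentials, and the
# shape of the incoming alert are illustrative assumptions.
import os
import requests

def forward_alert(alert: dict) -> None:
    # Only a short description survives the hop: no affected entities, no SLO
    # context, no routing hints, no link back to the tool that fired the alert.
    requests.post(
        "https://example.service-now.com/api/now/table/incident",  # hypothetical instance
        json={"short_description": alert.get("name", "Alert from observability tool")},
        auth=(os.environ["SNOW_USER"], os.environ["SNOW_PASSWORD"]),
        headers={"Accept": "application/json"},
        timeout=10,
    ).raise_for_status()
```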
And some integrations have tried, with varying degrees of consistency, to bring
in host data or info about the entities that are being monitored by the tools.
But so far that has been a very lightweight amount of information. And I think
if I had to summarize the biggest problem that I've
seen with these types of integrations, it is that they're not approached from a
perspective of the outcomes that we want to be able to drive by bringing
observability data, ITSM processes, and automation and workflow capabilities
together. They're not driving it from an outcome standpoint. It tends to be from a perspective of
we need an integration, we have to get this data across and
we kind of check the box, right? I think we
need to ask for more. That's where I'm coming from. And this has led to
several subpar outcomes. First of all, we can't maximize the
value of the tools that our organization is paying for. There's a lot
of value being left on the table on both sides of the equation. Observability data
can do a lot more than just toss an alert over the wall or create
an incident. ITSM tools can do a lot more than just try to assign the
incident and produce some MTTx dashboards. Troubleshooting is another area
where we could do so much better. Observability is about context-rich data.
Don't listen to those people who try to tell you it's about three pillars. The
webhook plus light incident data approach
strips away lots of the most valuable information that is
available here, and users are forced to do manual context switching between
platforms, trying to figure out what the incident even means, where it came from,
how to fix it, never mind who's affected or how badly. And these are
the outcomes that are core to observability. Automation and continuous
optimization are supposed to be core focuses of every discipline
that we're talking about here. But again, the lack of tight integrations
and thoughtful design for the interplay between these two sides
means that many opportunities just fall through the cracks. And we want to
break down silos. This is what drove the DevOps revolution to begin with. And although
walls have been broken down between dev and ops, in many cases the ITSM
team and their processes kind of remain on an island. Painfully.
Those processes that are supposed to be protecting quality of service, the company's
bottom line and user experience, come to be seen more as bottlenecks,
red tape, and low value activities. And meanwhile, the lack of governance and
process visibility with ITSM on the sideline can be a serious risk
to the business. So what do we do about this? I've made it my mission to try
to contribute to a world where the combination of observability tools and ITSM tools
can be more like the second emoji here instead of the first one.
So I'm proposing a new framework that will allow us to get to the outcomes
that we want and maximize what this relationship could be between
observability tools and ITSM tools. I'm calling it the observable ITSM
framework, and this is version one. And I've broken down some components of what
I'm including in this framework across a few different areas. First, in terms
of changes or deploys: we should be able to automatically create change records
for CI/CD activities and display those flags on the observability side too, so
that we have full context of what changes are happening when, and we can use
that as very rich intelligence when we're debugging our applications.
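As a rough sketch of what a pipeline step like that could look like, here is one way to record the deploy on both sides, using ServiceNow's Table API and Honeycomb's markers endpoint; the instance URL, dataset slug, credentials, and field values are all illustrative assumptions rather than anything a specific integration prescribes.

```python
# Sketch of a CI/CD step that records a deploy in both places: a change_request
# record in ServiceNow and a deploy marker in Honeycomb. The instance URL,
# dataset slug, and field values are illustrative assumptions.
import os
import time
import requests

version = os.environ.get("BUILD_VERSION", "unknown")

# 1) Create a change record via the ServiceNow Table API.
requests.post(
    "https://example.service-now.com/api/now/table/change_request",  # hypothetical instance
    json={
        "short_description": f"Deploy cart service {version}",
        "type": "standard",
        "cmdb_ci": "cart",  # the service / CI being changed (illustrative value)
    },
    auth=(os.environ["SNOW_USER"], os.environ["SNOW_PASSWORD"]),
    headers={"Accept": "application/json"},
    timeout=10,
).raise_for_status()

# 2) Drop a deploy marker in Honeycomb so the change shows up as a flag
#    on that dataset's graphs while we're debugging.
requests.post(
    "https://api.honeycomb.io/1/markers/ms-demo",  # hypothetical dataset slug
    json={"message": f"deploy cart {version}", "type": "deploy", "start_time": int(time.time())},
    headers={"X-Honeycomb-Team": os.environ["HONEYCOMB_API_KEY"]},
    timeout=10,
).raise_for_status()
```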
We also need to enable attaching an SLO to a change request: if the SLO is
burning post-change, then back the change out. This aligns well with practices
that we probably already follow in SRE, it brings ITSM into the fold, and it's
something that can take zero effort from the DevOps side to make possible. And
we should be able to open change requests
in the observability tool to see change outcomes as experienced by real users.
A great example of this: on the ITSM side, somebody creates a change and tags a
service that we're observing in the observability tool; we should be able to
open that up and see whether, as that change is deployed, it affects the
performance of our service. Because changes come from a lot of different
sources and in a lot of different packages.
In terms of service components and maps, we need to be able to create records
for all entities and SLOs observed by the observability tool, create those in
the CMDB based on telemetry data, and auto-refresh them to avoid staleness. We
do this so that we can then make really rich maps out of these entities, and we
should map them based on host attributes that are in telemetry data,
parent-child relationships, and traces. And all of this should be something we
can set up in five minutes or less, with minimal steps.
We need to be able to create rich incidents, directly or via event management
processes, that include full context: the services, entities, and SLOs that are
affected, plus the ability to pass fields like the responsible team or severity
into the incident directly.
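As a sketch of the kind of context that could ride along with an alert to make an incident like that possible, the fields below are illustrative assumptions rather than a documented schema.

```python
# Illustrative alert payload: routing and context are declared up front so the
# ITSM side can build a rich incident automatically. All field names here are
# assumptions made for the sake of the example.
alert_payload = {
    "summary": "SLO burn: checkout latency",
    "severity": "minor",                          # mapped to incident severity
    "assignment_group": "Cloud Operator Group",   # the team the incident routes to
    "service": "MS Demo",                         # top-level business service
    "entities": ["cart", "frontend"],             # affected components from telemetry
    "slo": {"name": "checkout-latency", "budget_remaining": 0.12},
    "deep_link": "https://ui.honeycomb.io/...",   # one-click route back for troubleshooting
}
```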
What we want to do is enable the teams that are creating alerts to add some
intelligence into those alert payloads that actually gets processed
automatically on the ITSM side. We need to be able to take an incident detected
by an observability tool and open it back in that tool, in one click, for
troubleshooting. We shouldn't have to
be copying and pasting links or searching around trying
to find a certain alert number or incident number in the
tool that generated the notification to the ITSM side. And we should be able
to use one click to open a user-reported incident in the observability tool as
well. Just because a user happened to be the one who created an incident and
said, "hey, this thing is broken," is no excuse for
us to not have a good route for being able to open that up in
the observability side. If you like the sound of this, then you're going
to like this next part. Let's look at some examples of what this can look like
in practice. I have an integration that I made for Honeycomb to integrate with ServiceNow
and start to achieve some powerful outcomes that aren't possible with any of these other
observability integrations that exist for other tools today.
So first of all, in terms of setup, we clone this repo,
we get this update set that's available here, and we bring it
into the retrieved update sets in ServiceNow. The setup is simple here, and
this isn't even a store app; it would be even simpler for the store app.
So we open up the update set.
So now that we've previewed and committed the update set,
we just connect a new environment by adding an API
key.
We'll give it whatever name we want and submit
it here. We have the choice of whether or not we want to populate the CMDB
from the tool; I'll allow that.
We'll submit. It's that simple.
At that point, it's going to trigger a lot of actions behind the
scenes that will allow me to route alerts directly to ServiceNow. It will also
import all of the entities that are being observed in Honeycomb into the CMDB
and map them into services, and it will allow me to achieve several of the
other outcomes that I talked about a
minute ago in terms of workflows that will
help with troubleshooting, observing changes, so on and so forth.
So now, for example, on the Honeycomb side, let's look at an SLO. We can
configure a new recipient for burn alerts. This recipient was created
automatically and registered in the background by the integration, and we'll
send it in as an event.
We can see that on the ServiceNow side, we now have a new service that's been
created called microservice demo. We can open it up and we can see the service
map. Again, all this is done just by inserting an API key.
And this service map is drawn completely by using
telemetry data that's already available in the observability tool.
So we can see all of the services that are up at higher levels, and we can
even see down to Kubernetes pods and, where we have data about them, the nodes as well.
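As a rough illustration of how a map like this can be derived from telemetry alone, the sketch below walks resource attributes in the OpenTelemetry style; the attribute names, CI types, and relationship labels are assumptions for the example, not the integration's actual schema.

```python
# Sketch: derive configuration items and parent/child relationships for a
# service map purely from resource attributes already present in telemetry.
# Attribute names follow OpenTelemetry-style conventions; the CI types and
# relationship labels are illustrative assumptions.
telemetry_resources = [
    {"service.name": "frontend", "k8s.pod.name": "frontend-7d9f", "k8s.node.name": "node-a"},
    {"service.name": "cart", "k8s.pod.name": "cart-5c2b", "k8s.node.name": "node-a"},
]

config_items, relationships = set(), set()
for res in telemetry_resources:
    svc, pod, node = res["service.name"], res["k8s.pod.name"], res["k8s.node.name"]
    config_items |= {("service", svc), ("pod", pod), ("node", node)}
    relationships |= {(svc, "runs as", pod), (pod, "runs on", node)}

# These records would be upserted into the CMDB and refreshed on a schedule,
# so the map stays current without anyone maintaining it by hand.
for parent, relation, child in sorted(relationships):
    print(f"{parent} {relation} {child}")
```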
So now that we have this service available, we can go ahead and hit it with an
SLO burn alert by triggering this SLO. Something else to note here is that in
this integration, without any touches from the user, we have the ability to
specify things like the severity or assignment group that these alerts or
incidents should route to. So we'll show an example of that. Here we're
specifying the one called cloud operator group.
So what I'm going to do now is change the target for the SLO so that it will
trigger. Okay, you can see that it has triggered. Now we go over
to the service map side. We'll shortly see the service
map lighting up with the impact of the alert that
came in.
And we can see that the severity on the service map did just change; it
received a minor alert against the frontend service.
If we want to see what this alert is about, we can open it up here. We can see
the details of the SLO, we can see that it was assigned to the assignment group
that we wanted it to be, and we can see the full payload down here.
And most importantly, getting back to our conversation
about making things seamless and
allowing troubleshooting to be easy, we have a button here
that says Open Honeycomb. We click launch, and it takes us directly to the SLO
that is affected by the issue.
And this doesn't have to be done through event management. We can also do
the same thing. We'll have a recipient that's automatically created for
incident creation, and we can just as easily map this to that recipient. We
have an incident option as well.
Now let's look at an example of a change request.
So we can see that here we have a change request that is making some changes
to our cart service caching. Tagged to it is our MS demo service, which is
observed by Honeycomb, as well as our cart service as the main configuration
item that's being affected. We've gone through
the process of getting our change ready to go, and the next thing that we're
going to do is put it into implement state.
Now that it's in implement state, we have a button here that says Open in
Honeycomb, so that we can observe our change in real time as it's being
deployed. And this pulls us into a query where we can see, in real time, the
count of transactions and the heatmap of their duration against this service,
from 2 hours before to 2 hours after the change started to implement.
And we can see that it's even scoped down to the service name,
which is cart.
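As a sketch of the kind of query a button like that could construct, the structure below follows the general shape of a Honeycomb query specification; the column names, service value, and window boundaries are assumptions based on what the demo shows.

```python
# Sketch of the query an "Open in Honeycomb" button could build for a change:
# a count of requests plus a duration heatmap, scoped to the affected service,
# over a window from two hours before to two hours after implementation began.
# Column names and the window are assumptions based on the demo.
implement_start = 1_700_000_000  # unix seconds; would come from the change record
window = 2 * 60 * 60             # two hours on either side

query_spec = {
    "calculations": [
        {"op": "COUNT"},
        {"op": "HEATMAP", "column": "duration_ms"},
    ],
    "filters": [
        {"column": "service.name", "op": "=", "value": "cart"},
    ],
    "start_time": implement_start - window,
    "end_time": implement_start + window,
}
```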
One last example: let's pretend that I'm a user who has come in and is
reporting an incident for an issue that I'm seeing. I could do this from the
service catalog or various other record producers on the ServiceNow side, or I
could create it here directly. So I'll just pick a caller, which can be me, and
I'll give it a description: the MS demo application is slow. And we'll pick our
MS demo service. We don't know anything else besides that, right? For the
channel, we'll just say, hey, this is a self-service thing; for the state, it's
new. Okay, fine. And for impact, I don't really know what
this is all about, but I'm going to set it to a one because the
service is really important to me. I'll save this incident
now without knowing anything else. The incident was just created. The only
pieces of information we have are that it's affecting this service, plus this
brief description here. We now have an Open in Honeycomb button that lets us go
in and see what's going on with this particular service, because we have
identified that this is something that's being observed in Honeycomb. It could
be a Kubernetes pod, a Kubernetes node, or a service that underlies the
top-level service. In this case we're looking at the top-level service. So
we'll click Open in Honeycomb and we get back a query that shows
results from just a moment ago until now
about what is going on with this application.
And if we wanted to, we could zoom out even further. We can say,
hey, let me see what this looks like over the last 8 hours. What happened
leading up to this point? Looks like there was a spike in latency a little
earlier in the night. Maybe I want to go back and figure out what's going on
with that. So if I'm responding to this incident,
I have a very easy one click option for me to get into troubleshooting
exactly what's going on here.
And this is just the tip of the iceberg of what's possible with a really
thoughtful integration between observability tools and ITSM tools. So what's
next? I'm planning to come out with a second version of the observable ITSM
framework at the end of Q3 this year.
Version 2.0 is going to be packed with big plans for some very
cool features that I have on the roadmap. If you're interested in this
journey, adding your voice, building out an integration like this, or just
commiserating about the things that we want to be better,
then let's talk. Look me up on LinkedIn and let's connect.
You can DM me: let me know you watched the talk, and two things you liked if
you liked the talk, or two things you hated if you hated the talk. Let's keep
the conversation flowing, challenge the status quo, and demand more
from these integrations.