Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, my name is Matty from Uptime Labs.
Thanks so much for coming to my talk.
I lead generative AI projects in the space
of incident management. Incident management is very important to
me because for the past couple of years my team and I have been
dealing with machine learning models
and large amounts of data. And as
you know, a lot of incidents happen in that space.
So, to start with the incident management role, I
asked ChatGPT to
give me a definition of incident management in general.
Although the short answer is correct,
the main components that we all know, those related
to people management, are missing from this definition.
As we all know, there are more people-management elements to
incident management, such as effective coordination,
effective communication, managing people under stress and pressure,
time management, and understanding incident mechanics. Of course, I provided these
as feedback to ChatGPT, so hopefully
you will get more complete
answers regarding incident management if you ask ChatGPT.
Uptime Labs is a London-based startup,
almost two years old at this point. So when you ask
ChatGPT to give you some information about Uptime Labs, of course
it lacks that information. In this case
the winner was Google Bard, which
gives us some brief information about
Uptime Labs. We are an immersive environment that
enables people to practice people skills
along with technical skills: time
management and, as I said, coordination and communication, in
a safe environment, without worrying about breaking things.
Everything is controlled, and in a gamified environment you
can practice, hone, and improve your skills
through this gamified process.
You can also improve incident response skills
and muscle memory, identify weaknesses using the
reports that you get, and also test incident
response plans, et cetera. So you're
going to see in the demo, in a more specific way,
what we mean by practicing these skills.
Unlike many other solutions out there, our focus at the moment
is people and processes.
A lot of solutions focus on monitoring tools,
monitoring incidents, and providing information on incidents.
But we believe that by practicing and
improving people skills,
navigating complex systems becomes
more manageable over time. So this immersive
environment, we believe, will help to hone
incident response skills over time.
The immersive solution provided
by Uptime Labs is an
environment where you see a full-blown
e-commerce company. This is configurable, of course.
You have several characters with different personalities and different
characteristics that are powered by generative
AI, and you and your team can interact with
them. You also have a
complete stack, with microservices and
Kubernetes, for which realistic
dashboards are provided. In the demo you're going to see these
dashboards and how they work.
The main part of the immersive solution is the virtual characters:
virtual characters that you can interact with and that can improve
your skills. You can provide feedback, and they love your feedback.
They keep improving over time.
And this team is reconfigurable
based on different drills and different games.
They change skill sets and personalities. Of course,
they can be stubborn. They have different personalities.
They can take several minutes to reply, which really
happens in reality. In an
incident, under pressure, people's responses
are different. There are critical incident
management skills that you can practice. The main
ones, the training points, are: identifying the scope
of the problem, effective communication under time pressure,
structured troubleshooting, understanding incident mechanics,
forming effective hypotheses, step-by-step
reasoning, and effective coordination. All of these are
key to incident management. And we believe that, with the
help of this immersive
gamified environment, you can actually practice these one by
one and move forward.
I've prepared a demo for this talk where
you can see how
this gamified environment works. And also
I have a surprise for you toward the end.
So yeah, bear with me.
The first part of the demo
is about the roles in your account. You can play
different roles, such as incident manager or SRE.
You have this virtual online boutique
environment that is completely realistic.
You can go and place an order, and there
is a variety of products being sold
on this platform. You have your team, a full
hierarchy from the CEO to the development and platform
teams and the escalation team. And this
is the hierarchy of the microservices. A lot
of different drills are designed and available for you to practice different skill
sets. These drill sets
are personalized based on requirements
from our customers, and they can be personalized
for the different skills that you want to practice
over time. The next part
of the demo is the actual start
of the drill. One of the drills, the one I've selected
for this talk, is
powered fully by generative AI.
And here you can see the full dashboards.
You can see the orders per minute, all the logs
and crfs, and the changes that are actually
happening. And in the
main monitoring dashboard you can see the logs across different
teams and services. The
main communication platform is
Slack at the moment, and in Slack you
are able to interact
with your team.
Supporting audio and video is planned
for the future, and we're working on it.
But at the moment it's text-based, and the communication happens
on Slack. When you start a drill,
you are given two main channels, automatically
generated for you: an incident bridge
and a business communication channel.
When you start the drill, your characters,
your team, come and introduce themselves and introduce
the company. Bez is the CEO, the person
who sets the tone for the company and
gives an introduction to the training.
As they're explaining,
there's a bit of chat going on to give you some
idea about the different characters. In the meantime,
you're getting to understand the mindset of the people in
your team. And then there's a point
at which the incident
gets started.
At that point, Bob, who is the customer
services manager, comes and reports
that customers are not able to place
orders, and the
incident manager's role is to check the dashboards.
You see that things really are going down,
things are breaking, and user activity
is going down, and you start to
feel the pressure from the team, with a lot of different messages
coming from different characters.
And of course they have different responsibilities.
At this point, declaring who
the incident manager is is very important,
so that the team knows who's taking command.
In this example,
Daniel, from the development team,
is my assistant. I'm interacting
with Daniel to form better hypotheses.
Because Dan has access to the logs and
the patterns from previous incidents, I like to question
Dan about the
changes and get a bit of information about them. So Daniel,
in this case, is a generative AI agent.
And in the meantime,
as you move forward and as you practice,
you're getting rewards and hints as well.
These go directly into
your performance report, the report of how you played the drill.
At this point, I'm asking Daniel about the
latest changes. I would like to know
whether the current incident
has any similarities to, or shares patterns with, previous incidents.
The response I'm getting from Dan covers the previous
incidents that we had in the last week, like database
capacity and ISP issues, or the
front-end cache issues.
And Daniel comes and says there is a similarity:
based on the
logs, it's able to match the current incident with the information there.
It shares the exact logs so
I can see what they are, and I see that the requests
per minute are increasing unexpectedly.
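As an illustration of what an unexpected increase in requests per minute could mean in practice, here is a minimal sketch that compares the latest reading with a recent baseline. It is not how Daniel or the Uptime Labs platform works; the function name, the sample numbers, and the threshold of three standard deviations are all assumptions made for this example.

```python
from statistics import mean, stdev


def is_unexpected_increase(history, current, threshold=3.0):
    """Return True if `current` sits more than `threshold` standard
    deviations above the mean of `history` (hypothetical helper)."""
    if len(history) < 2:
        # Not enough data to estimate a baseline spread.
        return False
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat baseline: any rise counts as unexpected.
        return current > baseline
    return (current - baseline) / spread > threshold


# Made-up requests-per-minute readings, for illustration only.
recent_rpm = [120, 118, 125, 122, 119, 121]
print(is_unexpected_increase(recent_rpm, 123))  # within normal variation
print(is_unexpected_increase(recent_rpm, 400))  # far above the baseline
```

A check like this only flags the symptom. Deciding which change caused it still belongs to the size-up conversation described in the talk.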
Based on the analysis, I gather this in the size-up
stage, which is one of the main steps of incident mechanics.
After the size-up, I move on to triage, action, and review.
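The four stages named here can be sketched as a simple ordered sequence. This is only an illustration of the order described in the talk. The `Stage` enum and the helper names are hypothetical, and treating the cycle as strictly linear is an assumption; a real incident may loop back to an earlier stage.

```python
from enum import Enum


class Stage(Enum):
    """Incident mechanics stages, in the order given in the talk."""
    SIZE_UP = "size-up"
    TRIAGE = "triage"
    ACTION = "action"
    REVIEW = "review"


# Enum members iterate in definition order.
ORDER = list(Stage)


def next_stage(stage):
    """Return the stage after `stage`, or None once review is done."""
    index = ORDER.index(stage)
    return ORDER[index + 1] if index + 1 < len(ORDER) else None


def walk(start=Stage.SIZE_UP):
    """List the stage names visited from `start` to the end of the cycle."""
    path = []
    stage = start
    while stage is not None:
        path.append(stage.value)
        stage = next_stage(stage)
    return path


print(" -> ".join(walk()))  # size-up -> triage -> action -> review
```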
At this point in the size-up stage,
after gathering enough information, I come to the conclusion
that, okay, it's time to revert a certain change.
That change is the front-end change, and
I believe reverting it is going to help.
I give the command to
Dan, and Dan takes care of all of that automatically
for me. At this point,
as the incident manager, I keep an eye on the dashboards,
and I see that at some point
order placement is back up
and running. So at this point,
after refreshing Grafana
to pick up that change,
I can see that orders
are being placed again and things are back online.
So I report back to the team and
let them know that
customers are able to place
orders online again. The
entire cycle of moving from size-up to triage
to action to review is what I tried to show
in this demo: within five minutes, how
generative AI is helping me to better understand the logs,
find the patterns, understand the changes,
revert changes, and so on, within
a drill. I hope you enjoyed this
demo. Generally, generative AI at this point
is helping us with drill design, personalized drill
design as I said, other facets of automation,
and being able to control the platform.
At this point we're also getting help from generative AI
with reporting: pros and cons, performance
reports showing where you could improve, and insights
that are available in the data, which we can
use AI to extract.
Finally, please feel free to scan this
code. The first five to sign up
will receive one month of free access. I hope
you enjoyed my talk. Thank you very much and see
you soon. Thank you.