Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, and welcome to Conf France 42, and
thank you for attending my presentation on unlocking insights through incident
management reports. My name is Andre, I work for Sporting Tech as
an incident team leader, and reporting is part of my
day to day: weekly reports, monthly reports, let's say
internal reports, reports for stakeholders. So this has been, not only
in this current job but previously also, a big part
of my work and actually a personal passion. So thank
you so much for attending this presentation here with
me. We're going to have a light presentation.
We're going to talk about basically six points:
the essence of incident management reports, mining
insights from incident data, fueling platform improvements,
the customer-centric perspective, optimizing resource allocation,
and enhancing incident reaction.
So as I said, a very light presentation. We are
just going to talk about overall topics here and
how incident management reports
help us achieve a better platform and better
customer satisfaction at the end of the day.
So first of all, we need to understand what the incident cycle is.
This is the first step. The incident cycle, we can divide it into
five major steps. First, detection,
where we see what's going on: we identify
an issue that we need to tackle, and we pass to step number
two, investigation: understand the root cause and
find a way to mitigate or permanently fix it.
Step three: we are ready to fix the issue. We are fixing
the issue, analyzing all the fixes,
deploying new code, and doing the necessary communication
with stakeholders and clients to say that we are
fixing the issue. Fourth point, recovery.
Here is the part where we are going to monitor. Basically, we
fixed the issue,
we are recovering, and we are watching all the different
dashboards to understand the platform health.
We go on to take care of all the actions that are left after
an incident, see if there's something
that needs to be revised, and so on.
The fifth point is post-event activity. We
collect data along all the points, so this data
increases from detection to recovery.
And point five is where we're going to analyze it. We're going to actually
start here with our incident management
report, where we have data to see how we can improve
this whole incident cycle, our time for reaction, and the SLA
for that. Incident management reports can
have a role, a purpose, and key
metrics, and these can differ from company to company.
So I just tried to list here the ones that I consider the
most important. The role
of the incident management report is to capture incidents:
what's going on, how many, capture the information
about what happened during the month; to coordinate
between teams for a better reaction in
the future, for the same incidents or for new
incidents; to document all the processes
that we are dealing with and how we are working;
and to prevent further incidents like this,
so we can learn and prevent incidents
and have a more reliable platform. The key
metrics, and again, these can differ company
by company. I'm just listing here the four
that I think are the most important: severity of the
case; MTTR, the mean time to resolve the case; SLOs,
so the agreements that we have with our clients,
internal or external SLOs; and frequency,
how many times these things are happening and how
repetitive they are. So basically, this is what I consider
the most important key metrics for incident
management reports.
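As a minimal sketch of how three of the metrics above (MTTR, SLO compliance, and frequency) could be computed from a list of incident records: the field names, the sample incidents, and the per-severity targets below are all made up for illustration and do not come from the talk.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record layout; a real report may track different fields.
    service: str
    severity: int  # 1 is the most severe
    opened: datetime
    resolved: datetime

    @property
    def time_to_resolve(self) -> timedelta:
        return self.resolved - self.opened


def mttr(incidents):
    """Mean time to resolve across the given (non-empty) list of incidents."""
    total = sum((i.time_to_resolve for i in incidents), timedelta())
    return total / len(incidents)


def slo_compliance(incidents, targets):
    """Fraction of incidents resolved within the target for their severity."""
    met = sum(1 for i in incidents if i.time_to_resolve <= targets[i.severity])
    return met / len(incidents)


def frequency(incidents):
    """Number of incidents per service."""
    return Counter(i.service for i in incidents)


# Assumed resolution targets per severity, purely illustrative.
SLO_TARGETS = {1: timedelta(hours=1), 2: timedelta(hours=4)}

incidents = [
    Incident("payments", 1, datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30)),
    Incident("payments", 2, datetime(2024, 1, 9, 8, 0), datetime(2024, 1, 9, 14, 0)),
    Incident("login", 2, datetime(2024, 1, 15, 12, 0), datetime(2024, 1, 15, 13, 30)),
]

print(mttr(incidents))
print(slo_compliance(incidents, SLO_TARGETS))
print(frequency(incidents))
```

Severity is kept as a plain field here because it is recorded rather than computed; the other three metrics are derived from the open and resolve timestamps.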
And we go on, and we go step
by step, month by month, and we are collecting data. Now,
the data that we collect, again, will depend on the company.
Some companies are going to focus more on the financial part, others on
the reporting part. So this actually depends on the company that you're working for.
So basically, month by month, the idea here is that
you collect the affected services,
the quantity of incidents happening,
how
the SLA was, how long it took to resolve the different issues,
the average time to mitigate an issue, and so on.
So this is going to differ again, as I said, company by company.
But the idea here is that month by month,
you collect information, day by day. Even in one single incident,
as mentioned previously, you can collect data
on different steps of it.
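The month-by-month collection described above could be sketched roughly like this. The row layout (month, affected service, minutes to mitigate) and the sample values are assumptions chosen for illustration; a real report would use whatever fields the company cares about.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (month, affected service, minutes to mitigate).
rows = [
    ("2024-01", "payments", 30),
    ("2024-01", "login", 90),
    ("2024-02", "payments", 45),
]


def monthly_summary(rows):
    """Group incident rows by month and summarize each month."""
    by_month = defaultdict(list)
    for month, service, minutes in rows:
        by_month[month].append((service, minutes))

    summary = {}
    for month, items in sorted(by_month.items()):
        summary[month] = {
            "incidents": len(items),
            "affected_services": sorted({service for service, _ in items}),
            "avg_minutes_to_mitigate": mean(minutes for _, minutes in items),
        }
    return summary


for month, stats in monthly_summary(rows).items():
    print(month, stats)
```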
And now we have data. We collected data, and now
we need to understand what we are going to do with this data.
Data per se doesn't say anything. It's just
random numbers and text, not even
formulas yet. We don't know what's going on
there. So to be able
to work with the data, first we need
to clean the data. We need to review all the data, check
if there are any typos, make sure that all the fields match
the ones that we want, and make sure
that there are no empty values, or if there are empty
values, that they have a reason. So basically, make sure that the data is as
trustworthy as possible.
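The talk does this cleaning step in Excel. Purely as an illustration, the same checks (typos, expected fields, empty values that need a reason) could look like the sketch below; the field names, the list of known services, and the rule that only `resolved` may be empty are assumptions.

```python
# Assumed schema for one incident row; adjust to the real report.
EXPECTED_FIELDS = ("service", "severity", "opened", "resolved")
KNOWN_SERVICES = {"payments", "login"}  # assumed canonical service names
ALLOWED_EMPTY = {"resolved"}  # may be empty for a reason: incident still open


def clean(record):
    """Return (cleaned_record, problems) for one raw incident row."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()  # drop stray whitespace
        cleaned[key] = value
    if isinstance(cleaned.get("service"), str):
        cleaned["service"] = cleaned["service"].lower()

    problems = []
    for field in EXPECTED_FIELDS:
        value = cleaned.get(field)
        if value in (None, "") and field not in ALLOWED_EMPTY:
            problems.append(f"missing {field}")

    # A service name outside the known list is most likely a typo.
    service = cleaned.get("service")
    if service and service not in KNOWN_SERVICES:
        problems.append(f"unknown service {service!r}")
    return cleaned, problems


good = {"service": " Payments ", "severity": 2, "opened": "2024-01-03", "resolved": ""}
bad = {"service": "paymnets", "severity": None, "opened": "2024-01-03", "resolved": "2024-01-03"}
print(clean(good))
print(clean(bad))
```

Rows that come back with a non-empty problem list are the ones to review by hand before they go into the report.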
Second, structure this data. We need to
find a way to relate data,
to relate information. For example, if you work with Excel,
and I love to use Excel not only to clean but also to structure
my information,
sometimes you have different tables and you need to find a way to relate them.
Or in order to use a pivot table, for example, you need to
make sure that the data has a structure
within the same report.
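Outside Excel, relating two tables and building a pivot table can be sketched as below. This is only an illustration of the idea; the two tables, their keys, and the category names are hypothetical.

```python
from collections import Counter

# Two hypothetical tables that share the key "service_id".
incidents = [
    {"id": 1, "service_id": "pay", "quarter": "Q1"},
    {"id": 2, "service_id": "pay", "quarter": "Q2"},
    {"id": 3, "service_id": "log", "quarter": "Q2"},
]
services = {"pay": "transactional", "log": "login access"}  # lookup table


def join(incidents, services):
    """Relate the two tables by adding the service category to each incident."""
    return [{**row, "category": services[row["service_id"]]} for row in incidents]


def pivot(rows, row_key, col_key):
    """Count rows for every (row_key, col_key) pair, like a simple pivot table."""
    counts = Counter((r[row_key], r[col_key]) for r in rows)
    table = {}
    for (row_value, col_value), n in counts.items():
        table.setdefault(row_value, {})[col_value] = n
    return table


table = pivot(join(incidents, services), "category", "quarter")
print(table)
```

The pivot only works because every incident row carries the same keys, which is the structuring requirement described above.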
And then my favorite part: visualize.
Here is the part where we create the graphs, the charts,
where you start seeing things happen in a much
easier way, to be honest. For me, it is the third
part of it. And now let's see an example
of how this works. So after we clean,
after we structure, now we visualize the data, basically.
With this we can identify patterns
and trends. Transactional services were
the most affected ones, let's say. As an example,
the increase of incidents in the transactional services led to a drop of
5% in revenue.
We can also see that there's an increase of incidents quarter
by quarter. So here we are starting to relate three different metrics.
Login access: teams are performing quickly and efficiently
on these incidents, with a low MTTR.
On average, quarter by quarter, they are happening
in the same amount, and we can see that the
MTTR is low. So this
gives us more input on how we cross information. The DDoS attack is another
example, with low or no end user impact. So although
it has been happening more often over the last two quarters,
we can see that there's basically no impact.
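Crossing metrics the way these three examples do (incident trend, MTTR, and end user impact per service) could be sketched as follows. The numbers are invented to mirror the shape of the examples and are not the figures from the talk's charts.

```python
from statistics import mean

# Hypothetical figures per quarter: (incident count, MTTR in minutes, users impacted?).
history = {
    "transactional": [(4, 120, True), (6, 130, True), (9, 150, True)],
    "login access": [(5, 20, True), (5, 25, True), (5, 20, True)],
    "ddos attack": [(1, 30, False), (3, 30, False), (5, 35, False)],
}


def trend(counts):
    """Classify a series of quarterly incident counts."""
    if len(counts) < 2:
        return "unknown"
    pairs = list(zip(counts, counts[1:]))
    if all(later > earlier for earlier, later in pairs):
        return "rising"
    if all(later == earlier for earlier, later in pairs):
        return "flat"
    return "mixed"


def cross(history):
    """Put trend, average MTTR and user impact side by side for each service."""
    report = {}
    for service, quarters in history.items():
        report[service] = {
            "trend": trend([q[0] for q in quarters]),
            "avg_mttr_minutes": mean(q[1] for q in quarters),
            "user_impact": any(q[2] for q in quarters),
        }
    return report


for service, row in cross(history).items():
    print(service, row)
```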
Now we have different topics, different information
that we crossed. How can we prioritize the things that we need to do?
How we're going to do that is by using a
prioritization matrix. So every time that you don't
know how to prioritize, just look at this. It's very
easy. I love this tool. And here, basically,
we're going to cross, again, urgency
with impact. So we're going to understand what is the critical
thing that we need to address the most. And from my experience,
I know that, from the previous graph, the transactional
services will be the first one to address, followed
by the login access and the DDoS attack, due to
the crossed information. But every time, you can have this
prioritization matrix here to help you
understand what you can do. And with
this, after improvements, after you address these things, you can
then compare year by year how
your efforts result in a positive
way of working and positive revenue for the company.
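A prioritization matrix that crosses urgency with impact can be sketched in a few lines. The three-level scale, the multiplication rule, the label thresholds, and the ratings given to each finding are assumptions for illustration; the talk does not specify them.

```python
# Assumed three-level scale for both axes of the matrix.
LEVELS = {"low": 1, "medium": 2, "high": 3}


def score(urgency, impact):
    """Cross urgency with impact into a single number (1 to 9)."""
    return LEVELS[urgency] * LEVELS[impact]


def priority(urgency, impact):
    """Map a matrix cell to a label; the thresholds are arbitrary choices."""
    s = score(urgency, impact)
    if s >= 6:
        return "critical"
    if s >= 3:
        return "high"
    if s == 2:
        return "medium"
    return "low"


def rank(findings):
    """Order findings from most to least pressing."""
    return sorted(findings, key=lambda f: score(f[1], f[2]), reverse=True)


# Hypothetical ratings for the three findings discussed above.
findings = [
    ("DDoS attack", "medium", "low"),
    ("login access", "medium", "medium"),
    ("transactional services", "high", "high"),
]

for name, urgency, impact in rank(findings):
    print(name, priority(urgency, impact))
```

With these assumed ratings, the ranking matches the order given in the talk: transactional services first, then login access, then the DDoS attack.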
You can see that by using the insights of the incident report and understanding
what's going on, you will be able to see
a decrease in your incidents, an improvement in the MTTR
and the SLOs, and a reduced impact for
customers, which translates
into higher revenue for the different companies.
And this is not only internal. These improvements
are like a cycle, like a symbiosis,
if you get my point, because customers also experience
the improvements, and
the service gains popularity among customers.
One friend tells another friend, the family says, come on, this
service is reliable, you guys need
to try it. So basically,
by improving our platform, we can improve customer satisfaction,
which again will improve the company and will help
companies expand and gain popularity across
the markets that they are operating in.
Moving on to how incident
reports can also give you more insights into other areas.
Let's talk about resource allocation.
So there's a new project in the company, or
there's a development, or something that we need to fix
that came from an incident, for example. We're going to
define what we need to do. We're going to plan the
actions, and now comes the part to identify the technical resources needed
and identify the workforce needed. Here is where incident management
reports can give a big insight.
With the incident management report, we can see that we had x
amount of incidents, and that this took us x amount of
developers and this amount of hours.
With this information, and by gathering
more and more information throughout the different months,
we can understand that in order to tackle
certain types of issues, we will need more or fewer
people. So here is where the incident management report
can give a big insight. And after
implementing, the sixth point: follow up.
Follow up, document everything, see what impact this
can have, and use the incident management
report to understand how the
development went and how the
actions that were just taken by the company improved
the company on the incident side of it.
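The workforce estimate described above (how many developers and hours a type of incident has taken in the past) could be sketched like this. The record layout and the numbers are hypothetical, and a plain historical average is only one simple way to estimate effort.

```python
from collections import defaultdict

# Hypothetical past records: (incident type, developers involved, hours per developer).
past = [
    ("payments", 3, 4.0),
    ("payments", 2, 3.0),
    ("login", 1, 2.0),
]


def avg_effort(past):
    """Average developer-hours spent per incident, for each incident type."""
    totals = defaultdict(lambda: [0, 0.0])  # type -> [incident count, developer-hours]
    for kind, developers, hours in past:
        totals[kind][0] += 1
        totals[kind][1] += developers * hours
    return {kind: dev_hours / count for kind, (count, dev_hours) in totals.items()}


def estimate(avg, kind, expected_incidents):
    """Rough developer-hours needed if a given number of such incidents occur."""
    return avg[kind] * expected_incidents


avg = avg_effort(past)
print(avg)
print(estimate(avg, "payments", 4))
```

The more months of reports feed into the averages, the less a single unusual incident skews the estimate.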
And the benefits are clear:
better MTTR,
better SLOs, an increase in
the GGR, and my favorite one,
employee satisfaction with less work.
With better resource allocation, people feel
less stress, they work better, they are happier at
work, and this only brings benefits to
the company. Now, with all
of this said, and to conclude the presentation, let's see
how revolutionizing incident reaction can happen with
incident management reports.
We saw all of these steps until now, and we can collect information at
every single one. By leveraging our past, we can
prevent or predict, to an extent,
the future. This is just a representation of
how we can start improvements,
alerts, and preventive actions before things happen,
because we know, we have learned now, that things might happen this way
or that. So we can prevent them from happening and be more
preventive than reactive.
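One simple way to turn past reports into an early alert is to compare the current incident count with a historical baseline. The sketch below is an assumption about how that could be done, not the method used in the talk: it flags a period whose count exceeds the historical mean by more than `k` sample standard deviations, with `k` and the sample counts chosen arbitrarily.

```python
from statistics import mean, stdev


def should_alert(history, current, k=2.0):
    """True if the current count is unusually high compared with past periods.

    history: incident counts from past periods (hypothetical values below).
    k: number of sample standard deviations above the mean that triggers an alert.
    """
    if len(history) < 2:
        return False  # not enough past data to define a baseline
    return current > mean(history) + k * stdev(history)


past_counts = [4, 5, 6, 5]  # made-up incident counts for one service
print(should_alert(past_counts, 9))
print(should_alert(past_counts, 6))
```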
We can improve our platform and increase the stability of
the platform to reduce the amount of incidents that we have.
And all of this will lead to an increase in revenue that
can be translated into more investment in
the platform. Again, learning from our
past can teach us a lot of ways to tackle
situations and to act in a more preventive
way and not so reactive.
And that was it from me, guys. I hope you enjoyed it.
I hope I was able to awaken
a little bit of curiosity in you about
reporting, and I hope to see you soon.