Transcript
This transcript was autogenerated. To make changes, submit a PR.
All right, thanks for listening to my presentation.
It's nice to be back at another Conf42 event.
This is Managing Vendor Incidents:
things you can do when your vendors have an incident or an outage.
I am Mandi Walls.
I am a DevOps advocate at PagerDuty.
You can get in touch with me.
I'm LNXCHK on most social media things.
Not really on X.com anymore.
You can find me on Bluesky.
If you are totally old school and you want to email me, I'm mwalls at pagerduty.com,
or you can join our community site.
We just relaunched it in September of 2024.
It's brand new, and that is community.pagerduty.com.
That's as much PagerDuty as I'm going to talk about.
If you'd like to talk PagerDuty, I'd love to hear from you.
Go ahead and get in touch.
What I am going to talk about is vendor incidents, because
they're everywhere, right?
And there's a reason for that, right?
One of the first key tenets or mantras of cloud computing was
you own your own availability.
If you search for that on your search engine of choice, you should
find a post by John Allspaw on his Kitchen Soap blog from like 2013, I think.
It's old, right?
It's been out for a while.
But the idea is that your cloud providers are making this
infrastructure and all these great services available to you.
And then you and your organization have to decide what to use and how to use it in
order to meet your organization's goals.
Your cloud provider doesn't have any knowledge of your KPIs; they just
know you've got something going on and you're paying them for it, right?
Over the last 10 to 15-ish years, more organizations have become
increasingly reliant on cloud computing.
Because it's great, right?
You've got the ability to focus on the things that your business does
that set you apart, and not have to worry about all of the other
stuff that goes into your tech stack.
You fire up some stuff on your cloud vendor,
and it just appears. You don't have to wait for provisioning
and tickets and purchase orders.
And maybe you had to do reverse auctions and all that kind
of stuff back in the day.
And you spent hundreds of thousands of dollars on
equipment for your applications.
And you had to wait six to eight weeks before they'd even
arrive at the data center.
And then they had to be racked and stacked and networked and powered
up and all that great stuff.
And it would take forever.
And now it's much less friction, right?
You can add the functionality you need without all of that stress and, to a
certain extent, without a lot of the hassle. If you were around in those
days, 2006 or so, and you wanted a large, reliable traffic shaper and
firewall kind of device, that was 40 grand, right?
Things have moved along, but that means that you are accepting a certain amount of
risk for how your vendors run their stuff.
And most of the time, it's great.
Things just run.
They're just gonna be there, you're gonna click on them, you're
gonna do the job that you need to do, and they're up all the time.
But things do happen, right?
There's risk, right?
Cloud providers have experienced outages, sometimes due to configuration errors.
It does pop up.
Sometimes due to denial-of-service attacks.
Those also come up.
And even the catastrophic problems.
If you recognize this photo, it was the OVHcloud data center that
burned in Strasbourg in 2021.
The data center was a total loss.
Those machine rooms were completely gutted.
The fire started in a battery,
and it just ate the entire data center.
There's a fire truck in the foreground, but if you look closely, the stream
of water in the background is coming from an actual boat, a fire boat
that was fighting the fire with them as well.
So yeah, absolute disaster for folks who didn't have other plans.
It's still annoying for folks that did and could get out,
and we know lots of folks who did.
Yeah, that's it.
Those are rare, but they do happen.
What happens more often are smaller errors, hiccups, other problems,
and we can learn to deal with those a little bit better too.
Your organization has accepted that you own your availability.
The things that matter to your customers, the services that you provide, the
products that you make available, don't really matter to your vendors.
They matter in that we want you to be successful.
It's not that we don't care, it's just that we don't necessarily
know exactly what you're doing.
Your vendors have a lot of customers, and your business is your business, and you
know what's going on there best, right?
Your vendor recognizes that you've got work happening.
They might even ask you to do a case study, right?
That's awesome, you want to participate in that.
But your services are your services.
Your vendors, whether it's your cloud provider or another SaaS provider, they're
just your landlord on the internet, right?
So you need your own insurance and your own way of dealing with things
when your landlord has a boo-boo, right?
So you're answerable to your customers.
As a customer, then, your vendors, the folks that you use, should
be answerable to you as well, right?
You should have that relationship with them.
So we're going to talk about being a good vendor customer and how that's
going to help us work through these issues when
incidents happen at our vendors.
So the first thing we want to do is take stock of our vendors and I'm going
to suggest a sort of organizational strategy to approaching this, but I know
that you have lots and lots of vendors.
You open up your authentication authorization platform and you
have like dozens and dozens of tiles of different applications
that you might have access to.
I open ours and I have no idea what some of this stuff is.
I know it's for other teams, maybe in my department, but I don't have any
role in the things that they're doing; the stuff is there just because we
happen to belong to this department.
There's just so much stuff that you might potentially need to be
thinking about in the course of preparing for a vendor outage.
So we're going to classify them a little bit to make them a little
bit easier to manage, easier to think about, and easier to prioritize,
especially when we see things like
vendor outages, to a greater or lesser degree, depending on the functionality.
I've arbitrarily decided to call them tiers.
You can pick whatever classification makes sense to you.
We use severity and priority for dealing with incidents.
For this, we didn't want to pull that sort of priority part in,
because it would be confusing.
If you are familiar with networking and internetworking, your tier one providers
are going to be your backbone folks.
It's the same idea.
Your tier one folks are the directly customer-facing components,
things that are in support of your production systems.
It's going to be your infrastructure as a service, obviously.
It's maybe going to be shopping cart and payments providers,
if you're outsourcing that.
Hopefully you are.
No one wants to write their own payments library.
It might be search.
We outsource some of that for parts of our site.
And also things that you're going to need if something goes wrong on your side.
Metrics, monitoring, alerting.
Things that if you have an incident or an outage on your systems, you
want those folks to definitely be available at that time.
So they're all sort of production-touching, related to the customer
experience for your customers, right?
In a way, a Tier 1 provider outage or an incident on their
components might actually mean a revenue hit for your organization.
If you are at a point where you can actually quantify how much it would cost
you for, say, a minute or ten minutes of outage in the major components,
you want to make a note of that as well.
Not everyone has that granularity, or has that kind of understanding, or has that
sort of predictability in their revenue stream, but if you do have that and
you know, then you can mark that down.
Tier twos.
These are part of your build and test processes, right?
Things like your QA tools, CI/CD deployments, maybe some
infrastructure-as-code components, where if they're not available, it's
gonna slow you down in your solutions.
It might also be your source code management and maybe even user
documentation and things like that, that the customers use,
but on a lesser basis. Or, if your customers are mostly internal, things
that they would also rely on, but that wouldn't necessarily get in the
way of them finishing their work, right?
So, for example, an incident on a tier two provider
might not impact all of your customers directly, but it could impact your
ability to then ship a fix if you have an incident on those platforms, right?
So think about all the things that impact your SDLC. All the
things that sort of get in the way or are part of that workflow
are important components of that.
So if I have an outage in my production and I want to be able to push a
fix for it, these are the things that would be part of that process.
Tier 3 and below, if you'd like, are additional tools, workflow components.
They're usually outside the workflow to get things into production.
An incident on a Tier 3 provider should not impede work on production.
So these are going to be anything else that might come up, right?
Lots of other tools that you might have, lots of things that
might run in test occasionally, other sort of provisioning things.
The system that builds new images for you, whatever the case might
be, that you don't necessarily need in order to fix things in production.
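The three tiers described above can be written down as a small classification rule. This is only a sketch under assumptions: the `Tier`, `Vendor`, and `classify` names and the sample vendor names are hypothetical, and a real inventory would likely need more than two yes/no questions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Vendor tiers as described in the talk; lower number = higher priority."""
    CUSTOMER_FACING = 1   # production and customer experience, possible revenue hit
    BUILD_AND_SHIP = 2    # build, test, deploy: needed to ship a fix
    OTHER_TOOLING = 3     # outside the path to production


@dataclass(frozen=True)
class Vendor:
    name: str
    touches_production: bool   # directly supports customer-facing systems
    needed_to_ship_fix: bool   # part of the build/test/deploy workflow


def classify(vendor: Vendor) -> Tier:
    """Assign the highest-priority tier whose condition the vendor meets."""
    if vendor.touches_production:
        return Tier.CUSTOMER_FACING
    if vendor.needed_to_ship_fix:
        return Tier.BUILD_AND_SHIP
    return Tier.OTHER_TOOLING


# Hypothetical inventory; the names are placeholders, not real products.
inventory = [
    Vendor("image-builder", touches_production=False, needed_to_ship_fix=False),
    Vendor("ci-service", touches_production=False, needed_to_ship_fix=True),
    Vendor("payments-api", touches_production=True, needed_to_ship_fix=False),
]

# Sort so Tier 1 vendors come first when deciding where to spend prep time.
by_priority = sorted(inventory, key=classify)
for v in by_priority:
    print(f"Tier {classify(v).value}: {v.name}")
```

Sorting by tier puts the revenue-touching vendors at the top of the list, which is where the preparation effort goes first.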
And then there's stuff that might fall under other.
You probably have plenty of vendors and
providers that are other parts of your incident response process.
You have maybe your conference software,
your team chat, whatever that is, internal docs like your wikis. If
you're purchasing that as a SaaS, these are things that you may want to
note and have a backup plan for, right?
So we use PagerDuty, we use Zoom and Slack.
We know if an incident happens when those are down, we have a backup plan.
We go somewhere else, and we make sure that everyone has
accounts in those places and can get to them and knows where they are.
Okay, I'm going to have a backup.
A lot of this stuff is probably owned by IT, whereas for many of the other systems,
the primary owner or decision maker probably happens to be in the technical
organization rather than, say, in IT, which often reports into
your CFO or maybe a CIO.
So you may have different reporting structures for how things are
billed and how they come in.
So if these are owned by IT for you, they may already have a backup plan
or a runbook or things like that, that you can crib off of to figure out
how you want to manage the vendors
you have for your operational components.
Okay.
Note down all of your vendors.
We want to know who they are, the things that our customers rely on,
that rely on our vendors, right?
For infrastructure as code, everything they hit, obviously.
That's where those things go.
But it might be less obvious which parts, which components of those cloud
providers all of your services are using, so make sure that we're noting that down.
Add them to your service diagrams.
Sometimes they just don't appear.
There's like this nebulous thing that kind of goes off the
diagram into the cloud somewhere.
But we want to know exactly what services we're consuming,
so that if we do see there's a problem on this service, we know exactly
where that is and how it impacts our users.
So we can definitively say, yes, we are seeing this degraded
behavior on our application.
We're waiting for the vendor to respond, and we can be very
structured about how that functions.
So we need to have that documentation of knowing exactly where customer
requests would come in and where that's going to be impacted by a potential
third-party vendor that we might have.
So noting where all of those live in our ecosystem is super
important, and it doesn't always happen.
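One way to make "which of our services use which vendor component" answerable during an incident is to keep that mapping as data. A minimal sketch, assuming a hand-maintained dictionary; the service, vendor, and component names are all made up for illustration.

```python
# Hypothetical map from our own services to the vendor components they call.
# Each dependency is a (vendor, component) pair; all names are placeholders.
DEPENDENCIES = {
    "checkout": {("cloud-provider", "object-storage"),
                 ("payments-vendor", "charges-api")},
    "search-page": {("search-vendor", "query-api")},
    "image-upload": {("cloud-provider", "object-storage")},
}


def impacted_services(vendor, component):
    """Return our services that depend on the given vendor component."""
    return sorted(
        service
        for service, deps in DEPENDENCIES.items()
        if (vendor, component) in deps
    )


print(impacted_services("cloud-provider", "object-storage"))
```

With something like this in place, a vendor notice about one component turns directly into a list of affected services on your side.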
We also want to be very conscious of building positive vendor relationships.
It's easy for me to say this as a vendor, right?
But at the same time, like your tier one vendors, you've declared
them as tier one because they're important to your customers.
That's a kind of a big deal.
People are paying you for a service, and you're paying someone else for a service.
You want to have those positive relationships there.
For your Tier 1 vendors, the folks that, if they have an incident,
it might impact your revenue,
consider buying up to a premium support package if you don't already have one.
For a lot of vendors, if you spend X number of dollars or euros or whatever,
you will get extended benefits,
a premium support package.
Not everyone does that for all their vendors.
Totally makes sense.
You have a lot of vendors.
It's a lot of things to think about.
But for your Tier 1 providers, if you don't have a large enough spend
based on the capacity that you're purchasing, consider upping that
a little bit on the support side.
It's not because you expect to need help with the implementation,
which is what a lot of people sort of associate support with.
It's so you can access the VIP relationships that you will
get with that vendor based on that premium support contract.
I'll say it again.
You buy up to support, not because you need help with the implementation,
but because you want a closer relationship with the vendor.
Secrets, right?
Who knew?
Usually when that happens, you'll get like an account manager.
You'll have more of a direct line of contact with the vendor.
Almost none of us are big enough to get this based on our spend with the
cloud providers, but you can buy up
to premium support contracts for a lot of these.
For all of your other vendors, everybody's got something available.
Some of it's pretty affordable.
Sometimes you're like, oh man, we don't know.
You really do need to run the numbers.
If they are down for X number of minutes and it costs us this much
revenue, maybe 10 minutes of outage on their side pays for the entire contract.
You want to run those numbers and figure that out based on
how much you use of their stuff.
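Running the numbers can be as simple as one division. The sketch below is an illustration only: the function name and the figures are placeholders, not from the talk, and it just compares the two amounts. It does not claim that premium support prevents the outage.

```python
def breakeven_outage_minutes(annual_support_cost: float,
                             revenue_loss_per_minute: float) -> float:
    """Minutes of vendor outage whose lost revenue equals the support cost."""
    if revenue_loss_per_minute <= 0:
        raise ValueError("revenue_loss_per_minute must be positive")
    return annual_support_cost / revenue_loss_per_minute


# Placeholder figures: a support contract costing 30,000 per year and
# 3,000 of lost revenue for every minute the vendor is down.
minutes = breakeven_outage_minutes(30_000, 3_000)
print(f"Support cost equals {minutes:.0f} minutes of outage")
```

If the break-even point is a handful of minutes, the closer relationship that comes with the contract is probably cheap relative to the risk.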
You also want to make sure you are maintaining that relationship.
Have the calls that they want to have.
Often you'll get like an every-four-to-six-week check-in, and yeah, it
might feel like a slog, like, oh, you feel like they're just going to
try and sell you something else, but they're also there to listen, right?
To your concerns, to your problems, the things that you want out
of them as a vendor, the things you want out of their products.
It should be a two-way relationship.
If that's not what you're getting from your account manager,
talk to their boss. But you want to have this very reciprocal relationship with
your Tier 1 providers so that you can be first in line if something happens, right?
Have those calls so that they know who you are, they have a personal
relationship with you, and they know, hey, company X, they're a great customer,
and they're having a problem with this particular feature, or they're seeing
this, do we have an incident running?
And you may be the customer that reports an incident to
them, so have that relationship.
It also helps to be helpful, right?
Think about the folks that are your customers, and they say, oh, this thing
isn't working, and that's all they tell you, and you're like, what isn't working?
What are you talking about?
Be helpful with your vendors, right?
Bug reports, feature feedback, all of that stuff.
Anything that gets you more time and attention in a positive way
from that vendor gives you a better relationship with them.
They might invite you to their customer advocacy board, or they might invite
you to talk at their conferences, or talk to their other customers,
or any of those kinds of things.
And that gives you a better relationship, better terms with that
vendor, so that if incidents do happen, they're your friends, right?
They know who you are, you've talked about your kids, all that great stuff.
And you have a more personal relationship with them.
It's super helpful.
Because vendors are going to have incidents.
Not all of them have them all the time, for sure.
And not all of them have very large ones.
But I happen to be recording this on a Tuesday.
And yesterday, one of the big wireless providers in the United
States had a massive outage.
We still don't know why, but it was nationwide, a mess.
My mom's like frantically calling me.
What's going on?
I'm like, it's Verizon.
It's not you.
My niece had to go find a landline because my brother
doesn't have a landline anymore.
It was drama, right?
Like it happens.
So your vendor has an incident.
How does this come into you, right?
Your customers see a problem on your site.
They don't know that you have
linked to some other provider for that particular piece of functionality.
They have no idea what you're doing there.
Your team might be alerted by PagerDuty as well.
Hopefully you've got monitoring and it's saying, Hey, we're getting 500
errors back from this vendor, right?
We don't know if this request is us or them.
So you start your incident response process as you would for anything else,
because right now it looks like it's a vendor having a problem, but we're
not sure, so we're going to investigate.
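The "500 errors back from this vendor" signal can be roughed out by counting server errors per vendor over recent outbound calls. This is a sketch, not a monitoring setup: the input shape, the vendor names, and the 50% threshold are all assumptions, and a high rate only points toward the vendor; it doesn't settle whether the problem is us or them.

```python
from collections import Counter


def vendor_error_rates(calls):
    """Compute the share of 5xx responses per vendor.

    `calls` is an iterable of (vendor_name, http_status) pairs.
    """
    totals = Counter()
    errors = Counter()
    for vendor, status in calls:
        totals[vendor] += 1
        if 500 <= status <= 599:
            errors[vendor] += 1
    return {vendor: errors[vendor] / totals[vendor] for vendor in totals}


# Hypothetical sample of recent outbound calls.
recent = [
    ("payments-vendor", 200), ("payments-vendor", 503),
    ("payments-vendor", 500), ("payments-vendor", 502),
    ("search-vendor", 200), ("search-vendor", 200),
]
rates = vendor_error_rates(recent)

# The 0.5 threshold is arbitrary; pick one that fits your own traffic.
suspects = [vendor for vendor, rate in rates.items() if rate >= 0.5]
print(suspects)
```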
So we start our incident process.
Everybody joins the Zoom call or the Slack channel or wherever you are,
and your engineers are investigating.
And as they peel back all the layers in your distributed system, it's, yeah,
this does actually point to the vendor.
So you pull up the vendor's status page and there's nothing there.
Now what do you do, right?
What's going to happen?
Hopefully, if you have had that relationship with your
vendor, you contact them directly.
You contact your account manager.
Hey, we're seeing this.
What are you guys seeing?
Or you've got a support line that you can call or email or get in touch with.
Somebody that you have built this relationship with over the past 12
months or whatever of your contract, so that they know that you're
seeing a problem with their stuff.
Part of that will come down to, over time, clarifying the
ownership of the relationship.
Because, especially if you are in the enterprise and you have to
go through like a purchase order process or a procurement process,
lots of people are interacting with your vendors and it can become very unclear who
actually should be the one to reach out.
Who are the folks who know the vendor, who have the relationship to the vendor,
who should have that relationship to the vendor, who attends the QBR calls,
attends the monthly call with the account manager, who those people are, right?
Because it can change over the course of the relationship.
At the beginning you might have procurement. Somebody
has picked this product to use, and they ship it over to procurement for
whatever lengthy process they have.
You might have legal and finance
involved in the contract negotiation process. All of that comes in, and that can be pretty
onerous depending on where you are and what kind of process you have.
Once you get to implementation, depending on the size of your
organization, you might have program managers or project managers who are
in charge of that implementation.
They sign teams up to get onboarded, they make sure everybody has training, they get
everybody aligned, and they keep moving through the process across the department.
Depending on the scope of the product,
catch-all teams might own the relationship after it's been implemented, right?
So depending on where you are in all these phases, it might be somebody
different who would reach out to your vendor if there's an issue.
Eventually, hopefully in most places, you'll
get to a place where some kind of application team owns the primary part
of the relationship, because they are the main consumers of that vendor's products.
For things like your infrastructure as a service and larger components
like that, it might fall to platform engineering or SRE, or even, call it, a
DevOps team somewhere, and they own that relationship because they're the ones that
interact with those components most often.
They're the ones making changes, they have the administrative rights,
all of those kinds of things.
You might have a DevTools team for things like your CI/CD and
deployment tools that owns that relationship and knows where the accounts
are and how everything is managed.
So depending on what the thing is and who's consuming it, that's probably
who should own it from the perspective of what do we do if there's a problem.
So when you are relying on a vendor for something in your production
environment, someone or some team, hopefully, that has an on call
responsibility already, should be the internal contact point for the vendor.
Let me give you an example.
Let's say you've got some kind of retail site.
You have an external shopping cart provider that is going to help
you manage your customer
shipping and billing and all those kinds of things.
So you have an external service for that.
Your main website has the content on it and points out to the billing site
when people are ready to buy something.
The people who implemented that integration between your website
and the external biller should be the folks who are responsible if
something is wrong on the billing side.
They can talk to those people, they know how it's implemented,
hopefully, because most likely they wrote most of it, right?
That's kind of the point.
You wrote it, you run it; in this case, you integrated with
it, so you know what's going wrong.
They know where the logs are for the code that consumes the external website.
So they're gonna be best placed to be able to say, hey, there's something
wrong on our external side here, with
the vendor that we're using. They're also the ones that are gonna be able
to say, hey, it looks like everything is good again, and we're all clear.
We don't want folks fumbling around.
This happens from time to time.
There are different components that we have that are publicly facing,
say, our documentation site.
The regular engineering folks who are on call for the rest of our production
environment don't touch the main components of the documentation site,
the things that support the documentation
knowledge base, right?
The tech writers are the ones who maintain that content.
They chose the platform that it runs on.
So if something happens on that platform, like the search isn't running, or it's
not rendering correctly, or something is going on, the rest of the engineering
team can't fix that for us, right?
We have to go to the tech writers and say, hey, look, you guys know more
about this platform than any of us do.
You have the accounts to log into their portal or whatever
and can tell us, how to fix this.
So you need to have The tech writers in that example have an
on call responsibility because it's a user facing site, right?
It faces external to your customers So making sure that all those things
are covered and all those bases are covered as you onboard a new vendor
relationship super important Just making sure that If our customers
see it, it looks like us, but we're using a vendor, someone on our side
can take responsibility for it, right?
And maintain that relationship.
Part of that is going to come down to creating vendor runbooks.
And your IT team may already have something in place for
the things that they own.
If you don't have it as part of your tech team, consider putting these together.
You might have some stuff that looks a little bit like this.
It doesn't have all the stuff in it.
But it's good to have all this as a baseline.
And definitely do this for your Tier 1 and Tier 2 vendors.
The folks that are in your revenue stream, or are going to impact
your ability to ship a bug fix, or fix an incident on your side.
Make them available to everyone.
I know that sounds a little scary, right?
For folks who have a more tied-down procurement system.
But if there's an incident going on at three o'clock in the morning, the
last thing you want to be doing is calling up some procurement person who
doesn't have an on-call rotation, right?
So you're trying to find their phone number, figure out who their boss
is, and all that kind of stuff.
We don't want to be doing any of that.
So some stuff to collect up.
Account numbers and user IDs.
Super important.
That's how your vendor knows who you are, right?
That's the piece of information that's going to be able to tell them, Okay, this
particular customer is having a problem.
It helps them triangulate all the information about
their own incident as well.
If they're looking at, oh, it's just customers on this
particular database segment,
or it's just customers in this geographical region
that are seeing this incident,
that helps them pinpoint all of that stuff, right?
So you want to know exactly what your account numbers and user IDs are, however
they identify you as their customer.
Have that.
That may not necessarily be all the logins, but you should
have those, at least the names of them or the keys as well.
Some of that stuff is probably in your vault, and part of
having the on-call responsibility as the relationship owner is also
knowing where those things are in the vault, if you're not able to put
them in plain text somewhere, right?
So have those pointers that people have access to.
Because a lot of the time, if you're using an API or something like that,
you're consuming it and you're only using some kind of token or API key to do that.
You want to make sure that you know exactly what that is.
Have your contract information.
This seems a little weird, but like it comes up, right?
Know what feature set you purchased, right?
Because the last thing you want to do is like contact the vendor
and say, Hey, this isn't working.
And they're like, you didn't buy that, right?
Make sure you know exactly what you bought and you have it documented somewhere.
Because a lot of the folks that you're dealing with are large vendors, they have
lots of different components, and you have to subscribe to some stuff individually,
or to buy a package of particular features, and all that kind of stuff.
So save yourself some headache and document what you bought, right?
So you know.
Also your level of support.
So you know: I put a ticket in and I haven't heard back in 30 minutes?
It's because you bought the two-hour support, right?
Also have that documented. Know the support email address, support phone number,
support website, however you contact those folks, as well as your dedicated
account manager, if you have a direct line for them, or their email address,
or you share a Slack channel with them, in either direction:
they have a channel in your Slack, or you have a channel in their Slack.
Whatever the case might be, make sure you document that so people can find things
as quickly as possible when an incident indicates the vendor is having an issue.
Include the status of your account and the renewal date.
Again, you don't want to be contacting the vendor because things aren't working and
find out that, oh, your contract expired.
Make sure you're keeping up on those things.
When your renewal date comes around, they're probably telling
you about it in your monthly calls or quarterly calls with them.
Hey, we're up for a renewal.
Are you going to renew with us?
We'd like to see you do that, da da da.
They're trying to upsell you, all that stuff.
So you should know when your renewal date is.
Hopefully, you also know when it's been paid, right?
So that you're not calling them when you haven't paid your bills, right?
Any other guides that they provide you for specific reporting.
Some of them will ask you for specific error codes.
They might want stack traces.
Back in the day they might have had a TFTP site.
I hope no one's still doing that, but who knows, maybe out there they are.
Or a place to upload stuff in a customer portal or things like that.
Make sure you're documenting that for folks as well.
Because your on call person, your responder might be the
newest person on your team.
They don't know all the ins and outs of this.
These are also good to have if you've caused your own problem, right?
Like, we occasionally see this: we've got vendors where we've made too
many requests of the wrong type and they lock us down, right?
That happens.
We've tested something out and it hasn't worked, or something changed
and we didn't catch it and we were making bad requests, so they lock us out.
We want to know: okay, here's the vendor, here's the contact,
so that we can get that fixed. So have this stuff anyway. It's
just important to know so that you're not struggling to fix something and
not delaying your own response times.
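The runbook items listed above fit naturally into one record per vendor. The sketch below is one possible shape, not a prescribed format: the `VendorRunbook` class, its field names, and the sample values are all hypothetical. Note that it stores a pointer to where the credential lives in the vault, not the credential itself.

```python
from dataclasses import dataclass, field, fields
from datetime import date, datetime, timedelta
from typing import Optional


@dataclass
class VendorRunbook:
    vendor: str
    tier: int
    account_id: str = ""                 # how the vendor identifies us
    credential_vault_path: str = ""      # pointer to the secret, never the secret
    purchased_features: list = field(default_factory=list)
    support_response_minutes: int = 0    # response time in the support contract
    support_contacts: list = field(default_factory=list)  # phone, email, chat
    account_manager: str = ""
    renewal_date: Optional[date] = None
    reporting_notes: str = ""            # error codes, stack traces, upload portal

    def missing_fields(self):
        """Names of fields still left at their empty default."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def response_overdue(self, ticket_opened: datetime, now: datetime) -> bool:
        """True once the contracted response window has passed."""
        window = timedelta(minutes=self.support_response_minutes)
        return now - ticket_opened > window


# Hypothetical, partly filled-in runbook for a Tier 1 vendor.
runbook = VendorRunbook(
    vendor="payments-vendor",
    tier=1,
    account_id="ACCT-0000",
    support_response_minutes=120,        # "the two hour support"
    support_contacts=["support@vendor.example"],
)
print(runbook.missing_fields())

opened = datetime(2024, 1, 1, 3, 0)
print(runbook.response_overdue(opened, opened + timedelta(minutes=30)))
```

The `missing_fields` check is the useful part for onboarding: it shows what the newest person on call would not be able to find at three in the morning.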
The biggest question you're going to get then, when you have a vendor
incident or vendor-indicated incident, is: do you contact them or do you wait?
Contact them if you're not sure that it is an incident,
or if they haven't verified it's an incident on their status pages yet.
Definitely do that for your main providers.
Large outages obviously are going to make the news, right?
And they become a train wreck that everyone watches.
So for large incidents, that way you'll know that something is going on.
Small subsystems, though, generally don't make the regular media news.
They don't always even make the status page, depending on what's going on.
If it's something that doesn't have a lot of users, some vendors
get very sloppy about that.
But making contact with your account team is something you should do
if you're not seeing any indicators from the vendor that something is going on.
You definitely do that too if you want to be on the first-notified list.
One of the things that you often get with that premium support package is
front row for remediation on incidents.
Especially if it's, we're only seeing this in a limited number of cases:
it's only in this database segment,
it's only in this geography.
Reaching out to them helps them narrow down
the scope of their own issue.
It also gets you into that list of folks who want to be notified.
If you decide to wait it out, don't all hang out and wait, right?
And I've definitely seen this.
Bridge calls, 80 people, 100 people sitting around waiting for some
big vendor to fix an incident.
It's a huge waste of time, right?
You have other things you could be doing.
Let folks go.
Appoint a contact person who's going to watch the updates and be the contact point
for the responses back from the vendor.
Someone who can then keep your stakeholders up to date.
Hopefully you have a stakeholder communications plan.
That's a whole other talk, but briefly, you should have some other
way that folks who aren't actively responding to an incident can get
updates about that incident, whether that is a special channel in your
team chat, or they can subscribe to other updates and get text messages.
However your methodology and your tools work, you should have
that for the folks who are not responding, so that they know that
someone's still looking at this.
Top people are on it, right?
Have that in place.
Hopefully that's already part of your incident response plan.
Make sure you're also engaging your support team.
For your Tier 1 vendors, obviously, if something is down,
it's going to impact how your services perform.
So your support team needs to know about that, right?
They're going to be on the receiving end of all the sort of reports from
the customers, whether it's social media or emails, or you've got a
phone number, or they reach out to account teams, whatever it is.
They want to be able to collect all that stuff up and make sure they
reach back out to all those customers when things are fixed.
Update your status pages, keep your customer messaging up to date.
If you've not worked in support, the way a lot of those tools will work
is that there'll be one big incident ticket that has some boilerplate in it.
Here's what we're seeing, we know it's an issue, we've contacted the
vendor, we're waiting for remediation.
Really straightforward. They can send that directly
back to those customers, and they'll add that customer to that ticket so
they know this person reported in
about this particular issue that's ongoing, right?
So they can track all of that.
Make sure that you are engaging your support team so that
they can keep that up to date.
So the person who is watching for updates from the vendor can also be
updating the support team to say this is what they're telling us, they're
anticipating another half an hour.
We don't want to prematurely say that, but we can say that, yeah,
the vendor is looking at it.
For things like status pages and external status updates, beware
of disparaging comments, right?
Be nice to your vendors.
You don't want them to kick you off their platform because
they don't like you, right?
It also sucks for your customers when you say, oh yeah, we were out
because vendor X is crap, right?
You decided to use them, right?
So it's still your problem.
Make sure that you're using nice language there.
You've built this relationship, hopefully, with some of these providers.
You're able to say, we recognize that our customers experienced blah, blah, blah.
Our vendor remediated it, blah, blah, blah,
and go from there.
But beware of shit-talking anybody on the internet that way.
Finally, run a post-incident review.
A lot of folks are hesitant to do this about vendor incidents,
but you absolutely want to.
You do really and truly want to do this.
I know it's a lot of work to run another incident review for things that aren't
our fault, but you want to do it, right?
Just like you do for your own incidents, because there's learnings to be had
here for your organization as well.
Create your timelines, put all that stuff together.
As you collect these up over time, you're replacing your vibes and your feelings
about this vendor with actual data.
It feels like this vendor is down all the time, but when you go back through
your history, you can say, Oh, it was five minutes in December and then three
minutes in March, but then they did have that great big long thing, like 15
minutes, half an hour in June, right?
You have that data, right?
It also gives you some leverage when you go to renew that contract, right?
You can say, we're not really seeing the reliability we need from you guys.
What can you do for us?
And you have the data.
Because they're not going to come forward with their reliability data.
Especially if it's still within their contractual SLAs.
But you have it.
Because you recorded it in your post incident reviews.
So you have it for yourself.
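As a rough sketch of what "having the data" can look like, here is one way to turn the outage durations from your post-incident reviews into an availability number. The durations below just echo the example figures from a moment ago and are hypothetical, and `availability_percent` is an illustrative helper name, not part of any tool.

```python
from datetime import timedelta

# Hypothetical outage durations taken from post-incident review timelines.
# These echo the example in the talk; they are not real vendor data.
outages = {
    "December": timedelta(minutes=5),
    "March": timedelta(minutes=3),
    "June": timedelta(minutes=30),
}


def availability_percent(outage_durations, period):
    """Return the percentage of the period during which the vendor was up."""
    downtime = sum(outage_durations, timedelta())
    return 100.0 * (1 - downtime / period)


# Assume a one-year contract period for the comparison.
period = timedelta(days=365)
total_downtime = sum(outages.values(), timedelta())
pct = availability_percent(outages.values(), period)

print(f"Total recorded downtime: {total_downtime}")
print(f"Observed availability: {pct:.4f}%")
```

You can compare that observed number against whatever the contract says, which is a much easier conversation at renewal time than "it feels like you're down a lot."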
Create all that stuff.
Discuss any missing information.
Anything that's new.
Anything that's changed.
This time they asked us for logs in this particular part of the portal.
It was different from the last time.
Update the runbooks.
Any new learnings about the vendor go right here.
You can also invite the vendor to these, right?
Think about that.
You've spent all this time during the duration of your contract building up
this great relationship, and you meet with them every month, and they tell
you all about the new stuff, and you talk about whatever sport ball you
like or whatever on these great calls.
You're going to have a post incident review about their incident,
how it impacted your customers.
Invite them along.
They might send a bigwig, right?
They might send a C level executive to your incident review.
That expands your relationship with them, right?
It helps them also get better at mitigating those incidents.
It's a reciprocal relationship there.
And then after everything has quiesced and everyone has calmed down and
you've left the vibes behind, right?
Then it's time to discuss if this has been enough impact to find a new vendor.
It's not a decision you're going to make in the moment when they're down.
Because you're probably not going to be able to get your
stuff out of there anyhow, right?
When everything is quiet,
then you can discuss finding a new vendor.
Because there's a lot of things to balance out.
If you have that history and you're able to say, this year, since our contract
started, they've only reached 95 percent reliability, that's not good enough
for us, we want to start looking for someone else, then you can discuss
things like your switching costs and the contract duration, if you need to
buy out the contract, how long it would take you to implement on a new vendor,
what are the potential other vendors that meet your needs, will you need to
re architect anything to use a different vendor, and have another discussion there.
Because that's going to be in depth, it's not going to be something that
you can solve in an hour meeting.
It's going to be something that takes several weeks to do all the
discovery and figure everything out.
So it's not something that you want to jump into
while they're having an outage.
If the data center has burned down, that's a different story.
You don't have anywhere else to go.
That might be a mitigating factor there.
But for a lot of the incidents that we do see across SaaS providers and those kinds
of things, rushing is not going to help you in that respect.
You want to be in a place where you can make a rational decision
outside of the scope of the incident to figure things out.
So there's a lot of things here,
a lot of things that we don't necessarily think about all the
time because we figure, Oh, it's their problem, they'll fix it.
But taking ownership of your customer experience for your
customers is super important.
And it does encompass how you have relationships with your vendors.
So in summary, organize your vendors, figure out who's most important to
you, and who maybe you should spend a little bit more time and resources with.
Document your relationship owners.
Who knows that vendor the best?
Who is out there in our team who knows how to use that thing the best?
Create your vendor runbooks.
Know exactly who you are to the vendor, how your relationships
go, how to contact support.
Keep in touch with the important vendors.
I know sometimes it feels boring to have those regular calls, but do it, right?
Do it for yourself.
Keep your customers informed in a positive way.
We're working with the vendor, we're expecting remediation, and so on.
Run your post incident reviews, capture that data, capture those timelines,
know how things went, work with them.
Hey, we would really appreciate it if you could update your status earlier.
Let us know things earlier, those kinds of things.
You're building those relationships, and they can take that feedback.
They may not change, but they can take that feedback.
So there's a lot of things here that you can be doing in a more proactive way,
even though what's going on is outside the scope of something you can fix.
Hopefully that's helpful and gives you something to chew on and think
about, because obviously as we move more and more into SaaS-based
products, these incidents are going to happen, right?
Even if you bring everything internally, you probably have someone else that
you've outsourced a lot of things to somewhere else in your workflows.
A lot of things that hopefully you haven't built yourself, right?
Because that's a lot of work for some of that stuff.
Alright, so you can learn more.
We're at pagerduty.com, you can always find us there.
We have an entire document about incident response that goes through
our plans and how we manage things, and we would love to share it with you.
That's at response.pagerduty.com.
We also have a podcast episode about this with Jeff Martens,
and it was great.
I'd love for everyone to listen to that.
It came out in September.
And if you are already a PagerDuty user, thank you.
And we'd love to see you on our community site at community.pagerduty.com.
Thank you for listening to my talk.
I hope you enjoy the rest of Conf42.