Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi there, and welcome to Sleeping With One Eye Open: Experiences in Production Support. My name is Quintin Balsdon. I'm an Android developer. I've been working in London since 2017, and I've been developing applications since about 2010. I've supported quite a lot of different applications, from startups to long-running apps, across mobile, desktop, and back end. For seven years I even supported an Excel macro that I wrote. And I think it's really important
that as people involved with software, we understand that
we are part of an entire ecosystem,
and it's something that's not really explained to people when they join: that everyone is a part of software
delivery and the entire process that's contained within
that. After a particularly intense three year project where
production support was a major aspect, my company asked me
to just write down the things that I'd learned, and that's how this presentation
actually came about. It's my hope to create an environment of understanding and communication where we can all learn from each other, grow, and become better at doing this. So the first question we need to
ask is, how does software support fit in? And I
think in terms of the software development lifecycle, it is
that last element of maintenance that we're really focusing
in on. I think that everybody should be involved with
production support, and that doesn't mean that you are
doing overnight support and getting called out at 3:00 a.m. Not everybody is at a point in their lives where they are even capable
of doing that. But I do think that everyone needs
to be aware of what's going on. I have found that those who have
invested the time in doing production support have a far more
holistic view of the domain in which they operate.
And they understand the nuances
of all the infrastructure and technologies that are communicating,
and it makes them better developers and better
people who seek to understand how things work. And I think it's
imperative as well that we consider it critical,
because if people aren't involved,
they tend not to understand the nuances of the
infrastructure and architecture that they're working with, and they tend
to be less capable of spotting potential problems earlier
on. And so the more we get involved with production support, the less
we actually have to do it, because we are learning
the way that our particular infrastructure works and are capable
of dealing with it. There were times when we had one particular component, and when that component failed, we knew where to look. Once we started noticing that pattern, we were far more capable of saying we need to do some work in that area.
We need to go and figure out what's going on. Why is that component failing
and how do we make it better so that we don't get called out all
the time? I think that supporting a production
application can sound quite scary. No one wants to get
called out or feel massively responsible. There are a lot of ghosts in the shell that we might not want to have to experience, and it's very important that we see both the positive and negative sides of supporting a production application. One of the best
things that I've found is that you build your
team in such a phenomenal way when
you get called out together. There is a big sense
of camaraderie in walking off of the battlefield tired
and broken, and knowing that you've done your best
to support your customers. A few years ago, I wrote a personal
app as a joke. It really was not intended to
be popular and someone created a Reddit page
for it and the popularity skyrocketed and
I ended up with 30,000 people using my app at one time.
That was particularly scary for me because I
had no mechanism of supporting a user base
on that scale. And I realized that no matter what I put
into the wild, it could get used by a lot of people. And I
think having that awareness is really important.
One thing we can do is look at the example of others in
the news. I've done a lot of learning from
just watching how other companies respond and react
to problems. And so I'd like to introduce to you a few use
cases. The stories I'm going to mention here are ones I've chosen because they're quite recent, at most a couple of years old, and I'd like you to keep them in mind as we go through, because they were publicly reported in the news. You may have experienced them personally, but learning from how other companies respond, whether good or bad, can be really, really useful. So I'd like to just mention
a few of these incidents. One of the biggest ones that stood out to me
was from the 20th of April to the 20th of May 2018, when TSB had a problem where 1.9 million customers couldn't access their accounts. They had no access to their bank accounts for a month as a result of a software rollover to a new system: they were moving their systems from one platform to another, and as a result of not being willing to roll back, they denied their customers access. In December 2018, a third-party certificate renewal issue hit O2, a cellular provider in the UK, and resulted in 30 million customers having no access to the mobile network for a significant period of time: the better part of a working day.
As if 2020 didn't have enough problems, in July we had Virgin Media, with 10,000 customer complaints recorded on Downdetector, and that was their second outage in two weeks. In the same month, the Facebook SDK had a problem, and that caused Spotify, Pinterest, Tinder, and a lot of other apps to fail. In August, Spotify's transport layer security certificate wasn't up to date. Security certificates are a big problem, and they're one of the things you should keep your finger on the pulse of. And one of the biggest ones in
2020, you might have experienced it,
but for an hour on December 14, Google's single sign-on went down, and people had no access to YouTube, Gmail, and other Google-based services. That was really telling.
While that was going on, it was really interesting to see how they responded
and how they were trying to mitigate the problem and what the
public's access to this information was.
One of the biggest learnings I got from that, and one we'd had ourselves, is that you don't necessarily want to blame a particular service. YouTube is down, Gmail is down: when you start seeing a whole bunch of services not working, maybe it's the sign-on or some kind of authentication layer. We'll get to that later.
And then most recently, Signal, which is a messenger
app, suddenly gained popularity because of
WhatsApp's privacy policy changing.
And they were endorsed
by Elon Musk. The influx of new users created a problem at scale that they struggled with for quite a few days, and they were really good at telling people what
the problems were. Monzo has also been quite good at getting back
to customers and saying, we're sorry, we're down, we're working on the problem. Please have
patience. So there's no doubt that even the biggest of giants are capable of falling and slipping up, and how we manage ourselves as the developers of software and as the representatives of these companies can make a huge difference. So you get called out. It's 3:00 a.m., you've just about opened your eyes, and what are you going to do?
What is my advice to you? I would say the first thing
to do is to ensure that you have the right goal
in mind. I call that putting on the right hat. In my day-to-day job, I'm an Android developer. When I go in, I have certain tools in mind: I want to make the app better, I want to improve the infrastructure, I want to do certain things, and I'll have my own particular tools and approaches that I want to use. When going into a production support call,
it is so important that we lose those agendas,
that the goal is to diagnose the problem without
necessarily laying blame on any individual. We want to
delegate in terms of making sure that we've communicated
with the right groups of people, and we want to make sure that
the decision we make is the best one we can
given where we are at that point in time.
It's so relevant to point out that production support
is mostly a communicative and collaborative effort.
What happens during a call out will affect others' perceptions of you, personally, professionally, and externally as a company, and behavior is so important to your reputation. You want to ensure that these elements, the perceptions and your reputation, are always maintained.
When software fails, it doesn't matter who's to
blame. The fact is, there is a problem. Blaming will
only get you so far. Understanding will get you so much further.
And so when it fails, we want to
make sure that we discover what's broken and
we take the time to fix it properly.
It was really telling during that O2 outage how quick they were to point out that their third-party provider didn't renew a certificate. It's really unfortunate, because they laid the blame quite hard on Ericsson. And it's an understandable error: some people just don't have a reminder system in place and weren't thinking about it. You set a certificate to expire in ten years, and you don't write it down or have a system to create a reminder. It's an understandable error.
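As a minimal sketch of the kind of reminder I mean, assuming a JVM environment and a hypothetical host name, you could check how many days a certificate has left and alert long before it lapses:

```kotlin
import java.net.URL
import java.security.cert.X509Certificate
import java.time.Duration
import java.time.Instant
import javax.net.ssl.HttpsURLConnection

// Returns how many days remain before the server's leaf certificate expires.
fun daysUntilCertificateExpiry(host: String): Long {
    val connection = URL("https://$host").openConnection() as HttpsURLConnection
    connection.connect() // the TLS handshake happens here, so the chain becomes available
    val leaf = connection.serverCertificates.first() as X509Certificate
    connection.disconnect()
    return Duration.between(Instant.now(), leaf.notAfter.toInstant()).toDays()
}

fun main() {
    val days = daysUntilCertificateExpiry("example.com") // hypothetical host
    // Alert with plenty of notice rather than discovering the expiry during an outage.
    if (days < 30) println("WARNING: certificate expires in $days days")
}
```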
One thing that happened to me: we had an issue where a hardware security module had failed, and a new component had to be driven out to the data center. While this was going on, we had to put all our load onto one particular server, so our load balancers weren't really operational and we had to keep quite a tight finger on the pulse of our system. We had to baby it and look after it. And thankfully, there were two of us on call at that time.
What ended up happening was we had one
person on monitoring and one person on communication.
And deciding on those roles early was so important
because we were the people who were involved on the front end. And when apps
fail, the front end gets blamed. And so having
someone who was monitoring the front end and ensuring things were still working,
doing diagnosis, checking everything, and not having to worry
about the communication element was really, really important. We decided early on who was doing what, and it was really effective, because when all those requests were coming in, we had so many: from WhatsApp, from email, from an internal communication tool, from Slack, from Teams.
It just seemed to be coming from everywhere and piling on us. But one person
focusing on communication meant that that was your job, and one
person focusing on diagnosis and fixing that was
their job. And defining roles early and quickly was one of the
best things we could have done. In another incident, the sales team called me about an international client, and I had to run out and source a Bluetooth Low Energy printer for a system we were developing. Right at that point in time, my goal was not only to get something out that could work and do the job, but to actually go and find a supplier. That was a really interesting one: having to go and source hardware. The real point is to define your role and know, in
that case, what you're going to do. Sometimes the best way to
do that is to ask questions. I think questions are one of the
most effective ways of guiding a conversation and
taking control of a situation. When we know how to wield our questions in a proper way, we can use them to understand rather than to point the finger at a particular system. And again, this goes back to agendas. Say I believe that one particular component needs to be rewritten, and when I get a call out I go in and say, it's that component, it's doing it again. You just want to take your agenda and drive it home, whereas you might actually be wrong.
And when you're wrong and you're making a hard statement like this
component is the problem, you are developing a reputation
as someone who can't be relied on in a crisis. By asking questions instead, we can get to a point where we're learning and suggesting without developing a negative reputation around ourselves.
One of the best questions that I used to
ask was, could it be this component? And then someone would come along and
say, no, it can't be that one, because we see this issue here.
And so not only do I learn that my stuff isn't working, but that someone else's stuff isn't working too, and now we know to look higher up in the chain, or at other components that might be related, and to tease out this web of problems by using effective questions. Which also speaks to the point: don't cry wolf unless you're absolutely sure, because it's going to distract your team. So often QAs and testers will tell me: to reproduce a problem,
you need to click here and then push back, and then you'll see the issue.
And I found that I really need to ask a lot of questions
around that, just that kind of action, clicking on a
button and then pushing back. When I navigate to the next screen, do I wait
for the next screen to finish loading before I push back, or do I push
back while it's loading? A lot of times QAs, testers, or even developers will assume that you know what they're talking about. But I'm not necessarily seeing what you're seeing.
And questions are such a great way to guide that conversation,
because even in software, there's such a big
disparity between what people mean when they say certain terms.
Certain terms carry with them different aspects for different people.
Some people say crash when they might mean an error,
or they might say frozen when they mean a
crash, or they don't understand what lag really
is. So by defining these terms and asking questions around them (what do you mean? what are you actually seeing? could you show me?), and with an inquisitive nature and an interest in the problem rather than in particular people or systems, we can start teasing out exactly what's going on. One of the best questions to ask
is, where did this problem come from? Who's reporting it?
Is it coming from one user who called, one user
who tweeted? Is it coming from our systems
themselves, the diagnostic tools that we've put in place? I find that questions can be asked for three good reasons.
One is because the answer is important. We need the answer.
Sometimes we ask questions because asking is important
and people might, in their explanation of
something, realize a component that they need to elaborate
on. Sometimes the process of answering the question
is important. Someone taking you through how they came
to the end result that they've determined can yield the best result.
I think we also need to be so careful that we don't use questions
as a mechanism of intimidation, and that we're careful in the way we construct them and the way we communicate. In these times of stress, it can be so important to make sure that everyone's our friend, because when we need to get information out of people, we want to make sure that we're getting the best possible results. A few questions that I've learned to ask are: what actually caused the problem? How are we seeing it?
What are users seeing? How is a user particularly
affected? How do we know that something is wrong? Is it our
system? How will we know when this is fixed?
What is the mechanism by which we
can rightfully say that the problem is actually fixed? Can we determine
how many unique users are affected right now?
Sometimes a problem exists, but it's not affecting anybody.
If someone can't do a particular action
at 3:00 a.m., is it really worth
getting six engineers and 20 managers up at that point in time?
What is the best call or the best reaction to the
problem? How long has our system been down? If it's down,
and how do we know what to do to fix the issue?
And then also reflection, how would we do this differently
afterwards? So what I would say then is,
when you get called out, having tools available to
you other than just the ability to ask questions, can be critical.
Before we get called out, we want to know that the
different mechanisms by which we analyze and
look at our system are ready. So when identifying our source, we want to look at: was it social media? Was it call center complaints? Sometimes the customer experience differs by device.
Is it just Android? Is it just iOS? These kinds
of things. Quite often, when a user tells me there's a problem with Android, that it doesn't work, my first question is: did you try it on iPhone? Because if they haven't tried it on a
different system with a completely different code base, we can't
be sure whether it was the back end or the front end that's failing.
That is one of the easiest ways to distinguish that there's a problem
on a particular platform. Did you try it on web? Do we use
the same back end for web? That kind of thing. Sometimes we can look at our historic baselines. When we compare our baseline this month to last month, we can see, oh, this is just an anomaly, or, oh, this occurs every time payday hits; Christmas and New Year quite often result in spikes because people are bored, or something like that. So we want to be careful that our tools are capable, that our tools are correct, and that we look at the problem from a number of angles.
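To make that comparison concrete, here's a minimal sketch, assuming you can export a daily error count from whatever monitoring tool you use; the numbers and threshold are purely illustrative:

```kotlin
import kotlin.math.sqrt

// Flags today's error count as anomalous if it sits well outside the recent baseline.
// Comparing against a baseline helps separate a genuine anomaly from a recurring
// spike such as payday or the holidays.
fun isAnomalous(todayCount: Int, baselineDailyCounts: List<Int>, sigmas: Double = 3.0): Boolean {
    val mean = baselineDailyCounts.average()
    val variance = baselineDailyCounts.map { (it - mean) * (it - mean) }.average()
    return todayCount > mean + sigmas * sqrt(variance)
}

fun main() {
    val lastMonth = listOf(12, 9, 15, 11, 10, 14, 13) // illustrative daily error counts
    println(isAnomalous(todayCount = 48, baselineDailyCounts = lastMonth)) // true
}
```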
So ensure that you have access and
that you know how to get access if you need it.
We had an issue in the past where one person was the only one who had the password to gain access to a production feature, and so every time there was a problem that may have involved that feature, whether it actually did or not, we needed to call them out so that they could log in and check. We resolved that very quickly, because we can't rely on one person. We also need to make sure that access is maintained. You don't want a situation where passwords or user accounts automatically dissolve after three months just because that's your security policy. What you want is that, a few days before access is revoked, the person gets an email, their line manager gets an email, and the team gets an email, so that people are aware that access is changing and know how to regain it.
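A minimal sketch of that kind of reminder, assuming a hypothetical export of accounts and expiry dates from whatever directory or access tooling you use:

```kotlin
import java.time.LocalDate

// Hypothetical shape of an access record exported from your access-management tooling.
data class Access(
    val user: String,
    val lineManagerEmail: String,
    val teamEmail: String,
    val expires: LocalDate
)

// Lists who needs a warning in the next few days, before access silently dissolves.
fun expiryWarnings(accounts: List<Access>, today: LocalDate, warnDays: Long = 5): List<String> =
    accounts
        .filter { it.expires.isAfter(today) && !it.expires.isAfter(today.plusDays(warnDays)) }
        .map { "Access for ${it.user} expires on ${it.expires}: email ${it.user}, ${it.lineManagerEmail} and ${it.teamEmail}" }

fun main() {
    val accounts = listOf(
        Access("quintin", "manager@example.com", "android-team@example.com", LocalDate.now().plusDays(3))
    )
    expiryWarnings(accounts, LocalDate.now()).forEach(::println)
}
```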
We had a particularly complex tool at one of my previous clients; it was so complex, and the terminology around getting access was so confusing, that we actually lost track of what we needed to know to get access to that component, and we had to come up with runbooks just for getting access. Runbooks are an essential part of this; you'll never escape
runbooks if you want to do production support
successfully. I cannot express how important runbooks
are. We used to order our runbooks by
feature and ensure that all of our runbooks contained
the core team responsible for that delivery.
So we knew not who to blame, but who we could ask.
Who can we ask our effective questions to in
order to gain a proper understanding of that particular feature? We also had
emergency contact information, so that when something fell over in that area, there was a mechanism, maybe not a particular person, but a mechanism by which we could reach someone who could give us the information we needed. We also included a status report link in our runbooks, so we could click a link and go straight into a reporting tool that would give us as much as possible to try and understand what that feature was doing.
We also included an architecture diagram, and architecture diagrams
were really useful in identifying dependencies and
how dependencies relate within that system or feature, so that if
there were multiple features failing and they all had an element
in common in the architecture, we were capable of communicating
with people. The Google incident is such a good example of that, because that's not the first time I've seen a single sign-on fail. I've seen other cases where people couldn't access internal systems, and you keep thinking, oh, what's wrong with YouTube and Gmail and Google Docs? What is wrong with all these systems? Why are they failing? And then it turns out it's something in your security layer: your security keys aren't up to date, or that element is failing.
We also included a repository link in our runbooks because
having access to the code could help.
I don't recommend trying to learn a code base at 3:00 a.m.; it's not fun. But what you can do is go and look at what tests have been written. Is that particular feature tested, and if not, why? Or maybe, in the wash-up (which I'll recommend later), you could make a suggestion that teams implement tests so that these problems don't arise. Having that kind of insight into your project beyond just the code is very effective, or at least we found it very effective.
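Pulling those pieces together, a runbook entry ends up looking something like this minimal sketch; the field names and URLs are illustrative rather than any standard format:

```kotlin
// Illustrative shape for a per-feature runbook entry.
data class Runbook(
    val feature: String,
    val owningTeam: String,             // who we can ask, not who we blame
    val emergencyContact: String,       // a mechanism (rota or shared inbox), not one person
    val statusReportUrl: String,        // a link straight into the reporting tool
    val architectureDiagramUrl: String, // shows dependencies shared with other features
    val repositoryUrl: String           // where the code and its tests live
)

val paymentsRunbook = Runbook(
    feature = "Payments",
    owningTeam = "Payments squad",
    emergencyContact = "payments-oncall@example.com",
    statusReportUrl = "https://status.example.com/payments",
    architectureDiagramUrl = "https://wiki.example.com/payments/architecture",
    repositoryUrl = "https://git.example.com/payments-app"
)
```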
While reports can be very useful, I have found that they can also give you a very skewed perspective if you only measure certain elements. You want to be careful with reports and know how to read between the lines; when you look at a report, you can't always determine the exact situation. Our biggest request from management when reporting on production incidents was how many unique users were affected by the problem. If your report is just a blanket crash report, showing how many crashes happened between this time and that time, you cannot assume that that count is the number of unique users impacted. A lot of the time, if someone is trying a primary feature, they might try several times: one person might have five tries, whereas another person might have tried ten times, so the number of unique users impacted cannot necessarily be measured by the raw count. I would strongly recommend against any kind of one-dimensional reporting; just having one report is not good enough. Knowing how many sessions were alive during that time helps, though not necessarily in a way that could uniquely identify your customers, because that might not be possible in your environment.
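As a minimal sketch of what I mean, assuming you can export raw crash events with user and session identifiers from your crash-reporting tool (the CrashEvent shape here is hypothetical):

```kotlin
// Hypothetical crash event exported from a crash-reporting tool.
data class CrashEvent(val userId: String, val sessionId: String, val timestampMillis: Long)

// Reports raw crashes alongside unique users and sessions for a time window,
// because one user retrying a broken feature can generate many crash events.
fun summarize(events: List<CrashEvent>, windowStart: Long, windowEnd: Long): String {
    val inWindow = events.filter { it.timestampMillis in windowStart..windowEnd }
    val crashes = inWindow.size
    val uniqueUsers = inWindow.map { it.userId }.toSet().size
    val uniqueSessions = inWindow.map { it.sessionId }.toSet().size
    return "crashes=$crashes uniqueUsers=$uniqueUsers uniqueSessions=$uniqueSessions"
}
```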
If you're in a financial institution, you want to be very careful that your reports cannot uniquely identify people and accounts; you want to keep that separate from your development team. Never let the development team have access to your production financial server: the problems there are just unending. You want to be able to identify a problem for the right reasons. And for all of this, I would strongly recommend having one communication tool. Communication, like I've said before, is going to be your biggest asset, but it can also be your biggest detractor in a support incident.
I already spoke about the time when a colleague and I decided to take different roles, where one was working on the system itself and one was just managing communication, because we had all these different mechanisms. Especially now, in the world we live in, there are so many things that can ping you, and I think a lot of us are just so tired of things pinging at us: Skype, WhatsApp, Teams, emails, texts. Deciding on one tool that you're going to communicate with will really help you focus on the problem, and effectively communicating that need to management is a skill in and of itself.
I remember there was one problem that we had with internationalization
in Android, and I was just trying to fix a
problem with the way that the particular internationalization
worked. And I just kept getting pinged: hey, how's it going? What's going on? Are you nearly done? Every two seconds I got this ping, and I had to tell them, I'm busy working on a solution, I cannot be disturbed, but I also don't want to ignore you, so I'm not sure what to do; can I tell you when I'm done? Eventually I set my status to say, working on this issue, please don't ping me, and I put a little warning light on it. And that really helped. Another time, we had a really big issue where people were saying there was a problem with Android only, and it turned out that it was an IPv6 issue.
Something to do with the way that one of the networks, a very popular UK network, was handling IPv6 packets caused massive packet loss. It was hard to determine there was a problem at all, because we couldn't see it on the production apps we were running, and even some of our QAs and testers on different networks couldn't see it. Eventually someone who was themselves a customer of that network realized that the problem was with the service provider that our customers were using as well. Realizing that kind of problem was a really great collaborative and communication effort, and once we understood that that was the issue, we managed to create a tool and a report that showed us which networks were being used. That then became another part of our ecosystem.
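On Android, the piece of that tool that tags which network a customer is on can be as small as this sketch; the DiagnosticsReporter interface is a stand-in for whatever crash or analytics reporter you already use:

```kotlin
import android.content.Context
import android.telephony.TelephonyManager

// Stand-in for your crash/analytics reporter's custom-key API.
interface DiagnosticsReporter {
    fun setKey(name: String, value: String)
}

// Attaches the current mobile network operator to diagnostics, so that
// network-specific issues (like our IPv6 packet loss) show up in reports.
fun tagNetworkOperator(context: Context, reporter: DiagnosticsReporter) {
    val telephony = context.getSystemService(Context.TELEPHONY_SERVICE) as TelephonyManager
    val operator = telephony.networkOperatorName.ifEmpty { "unknown" } // empty on Wi-Fi only or no SIM
    reporter.setKey("network_operator", operator)
}
```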
And that leads me to my next point: trusting the team and our day-to-day development processes.
For me, the way that I work and the way that a lot of people
work is we're given a feature to work on. We write tests,
we write the code, it goes through a code review process, it then gets
merged into a branch, then hopefully
it gets reviewed by a QA, then it merges into the main branch,
then it gets regression tested, then it gets released,
and these processes are quite long-winded. This is not something you
could do easily on a production support call out.
And there's a reason why we have these processes in place.
We take it slow because we want to be careful. The reason why we want
to be careful is, again, it boils down to reputation.
I think that especially maybe in a smaller team environment
or one where there's less control, the temptation
is to bypass code review and bypass
testing in the name of an immediate fix. With this kind of reckless cowboy behavior, while you might get away with it once or twice, you always run a massive risk. And the risk you take on is not just on your behalf personally: when you take a risk like that, you're taking it on behalf of the entire organization. While it might be cool to try and be a superhero, you need to ensure that you end up being a superhero every time, and that is a compounding level of risk, because once you lose that battle, your reputation is gone, and you also undermine the entire point of having those processes there in the first place. I remember, years ago in one of my first jobs, I was called out for a production
support incident. I had to drive into the office and sit there,
and I had a number of people kind of standing
over me telling me to release this app. And eventually I
had a fix in place, and I said, but wait, we need to do a
code review. And they were like, no, just push it out.
And it's really difficult when you've got
that pressure on you, even from your team,
to say no. And I'm so grateful that I did. I think it was just pure luck; it wasn't that I'm fantastic, I'm definitely not. But I was really grateful
that I called another developer, woke them up, and I said,
look, please, just do a code review. I'm getting a lot of pressure.
And thankfully, they were willing, and they spotted an error that could have
caused a massive crash. And we managed to fix the
problem without a massive incident, but it was because we did it as a team and trusted the process that was put in place. In that particular incident, yes, we came out with a result where the error was fixed. But there are also times when you just need to leave the problem there, when a problem has to remain unresolved until the team can wake up. If it's not worth waking up 20 developers, six QAs,
and five managers at that particular point in time, you might
have to leave it. This happened to me a
while back where someone wanted to turn a feature off,
and I said, look, in order to do that, we have to take the whole
app down. That means production goes down for everyone,
whereas this particular feature was just a contact feature or something like that; it wasn't primary functionality, it was some esoteric part of the system. And I said to them, if we do take the app down, which is a possibility, we affect everyone: granted, this problem won't happen any more, but then we're also not going to make money. I had to put it in those terms. Unfortunately, that particular person decided to go over my head, which is perfectly reasonable; I was speaking from a logical perspective as a developer. They went over my head to another manager, who unfortunately wasn't too receptive to being woken up at 2:00 a.m., and they said, yeah, I'd rather leave it there and deal with it later.
But at least as a team, I made a decision. I stuck to my
guns and thankfully I was corroborated.
And if they had said, no, we need to fix it, we need to wake
everybody up, I would have been happy to do that too, but at least it was a discussion that happened. And also, again, it's about not blaming anyone; people don't want errors. I remember there was a particular problem in an app, and this was one of the first apps I wrote. There was a spelling mistake, and in my mind the client was taking too long to fix the spelling mistake, and so I went ahead and released a new version with the fix in it myself. And I felt absolutely
terrible because I realized that I had taken that step of
trying to be a hero, not trusting the team, not trusting management.
And thankfully, my company was very gracious with me and
they were very kind on that behalf. And I'm
glad I had that small experience where it was a brush with failure
rather than an actual failure. But it's something important to keep in mind
that you don't want to be in that situation. And that
leads me to my next point, which is to take downtime. I think it's so important that we rest, especially as people who are willing to get up at all hours of the morning in order to satisfy clients and keep the company going. I think it shows that we value what we do, not just to the point of writing beautiful code or producing something worthwhile, but of supporting the people who use our application. And we need to be really sure
that we take time to look after ourselves. No one's
going to offer to look after you. And I think it's so good
to be able to take a step back and say, at the end of a call, I'm going to come in later because I need to rest, or to say, I'm going to take a day off now because of this, and to discuss this with people: create a discussion around it and agree on what will happen. You don't want to be the only person that people can rely
what will happen. You don't want to be the only person that people can rely
on. You want to create very, very distinct boundaries of what you're prepared
to do and what you're not prepared to do. Without communicating,
we're not going to get anywhere. Those boundaries need to be effectively
communicated, especially when it comes to taking time off, getting remuneration for extra work done, or being allowed to take time in lieu. These things are all part of communication, and I strongly recommend that that communication is done in writing before an incident happens. And know the sacrifice that you're going to make.
Some people will get an offer for just extra money,
and they think that that's a fantastic outcome of doing
production support. And it can be, but know that you are preparing yourself to be on edge while sleeping. I was sometimes so scared of being on production support. There were times when a new feature had just come out and people were going to use it a lot, and I would lie in bed, literally with one eye open, not able to sleep. Nothing would happen, and then I couldn't take time off, because I was just on call; I never actually got called out. So understanding that we're making a sacrifice of sleep and of time, and asking whether the money is worth it, is something we really have to do. And so for some
final thoughts, I think one of the things I
would really encourage is doing wash-ups. What I mean by a wash-up is taking a scientific view of what happened. After something's happened and you've taken your rest, you come in the next day, you talk to your team and you explain what happened: who you were speaking to, what you think the cause was, who was affected, how you resolved it or how you delegated it, what decision ended up being made, and who made that decision, whether it was you or someone else. Quite often I would get a call out just from our internal systems, and I would look at it and I'd be like, oh, this was the garbage collector going crazy. We know about this problem. It's an existing issue. It's a blip. It's at 3:00 in the morning, so not a lot of users are affected. I'd put that on our Slack board and then I'd go back to bed because
the system had already corrected itself. But having those
wash-ups is so important; it allows you to communicate
that you understood the problem. It shows an interest in
your system. It is a way to teach other people to do production support themselves, and to show them what systems they have access to. And this is what encourages learning and correction for the future. So yeah, please, for the sake of your own sanity, do a wash-up afterwards. I think it's also
really important to check your merging and release strategy. Some people just
merge straight into their main branch without thinking,
or they don't create proper release notes. And this can be
particularly dangerous when code is just thrust
into the main branch, and then your releases are branched
off of your main branch. Quite often you can end up releasing features
you weren't intending to, even if they're not turned on and users wouldn't
see them. You want to be careful of what's going out there and how it
might impact other systems. We've seen this in particular with cross-platform systems, where a feature gets released for one platform but not the other, and then there's some kind of mismatch between the two. Having release notes can also be very telling. If you've just released an app and now there's a problem, those release notes are gold at those really weird times when you have to be looking at why a particular system has failed, and knowing what branch was merged into that particular release can be vital in ascertaining what the problem was. And the last element that
I'd like to just quickly mention is schedule management.
Having the buddy system employed on your support team and your support roster is, I think, the best decision you can make. We used to have a primary and a secondary who would alternate, and that was really useful because we would know who was primary, taking the main role and telling people what to do, and who was secondary. If the primary doesn't get called out for whatever reason, they're on the tube (not that that happens much lately), or they're incapacitated, or just unavailable, maybe the network's down in their area, there's a secondary who can come in and help. Or if the primary is feeling super overwhelmed, they can call the secondary and say, hey, I need my buddy, can you jump in?
Also make sure that people have access to the numbers they need to call and to the right teams, and that they know what to do in the case where you need to escalate beyond the secondary. If you've been called out, having all those vital numbers and contacts in your calendar can be so useful, and putting a tool in place that is not only accessible to everyone but also manages who's doing what and when can be the best idea you come up with. You also want to make sure that you don't overwhelm any developer or support engineer: make sure that everybody takes it in turns, so you don't end up with people being on call for three weeks in a row or being primary all the time. Having those distinctions, and enabling people to know who's primary and who's secondary (I'm secondary today, so who's the primary?), and who to call in each instance, can be super useful. So schedule management is not just about having a calendar in place; it's about connecting, collaborating, and communicating.
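A minimal sketch of the kind of rota I mean, with an illustrative list of engineers and start date, where the secondary is simply the next person in line so nobody ends up primary week after week:

```kotlin
import java.time.LocalDate
import java.time.temporal.ChronoUnit

data class OnCallPair(val primary: String, val secondary: String)

// Illustrative roster and rota start date.
val engineers = listOf("Alice", "Bob", "Chen", "Dana")
val rotaStart: LocalDate = LocalDate.of(2021, 1, 4)

// Rotates the primary weekly; the secondary is the next engineer in the list,
// so everyone knows who their buddy is and nobody is primary all the time.
fun onCallFor(date: LocalDate): OnCallPair {
    val week = Math.floorMod(ChronoUnit.WEEKS.between(rotaStart, date), engineers.size.toLong()).toInt()
    return OnCallPair(engineers[week], engineers[(week + 1) % engineers.size])
}

fun main() {
    println(onCallFor(LocalDate.of(2021, 1, 18))) // OnCallPair(primary=Chen, secondary=Dana)
}
```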
So I'd like to thank you for attending. If you have any questions, please feel free to look me up on GitHub. Thank you so much, and please enjoy the rest of the conference.