This transcript was autogenerated. To make changes, submit a PR.
Engineer oncall is a common practice.
Each company takes it in a different direction, but the idea is
similar across all time and space. We want
to empower engineers to become fearless guardians of
their domain, and with great power comes
great responsibility. And we do want
them to be accountable and take full responsibility.
The oncall period might be daunting for many, but today we'll
unravel how a blend of strategies, oncall practices,
and a nurturing culture can mold oncall
heroes out of engineers. Hi,
I'm Maynath Mahat, an R&D group leader at Payoneer,
embracing roles as a problem solver for over two decades
and a community leader at boat. I'm here to provide
you with a panoramic view of the oncall landscape
with a rich history as an oncall hero since 2012 and
a mentor to emerging champions
in the field since 2016. Leading by
example is a principle I stand by as an engineering leader,
always ensuring I'm present and supportive for my on call
heroes during and after incidents.
Reflecting on my own experiences as a once novice
on call hero, I strive to smoothen the journey for new
engineers embarking on their own heroic path.
Should you choose to accept your mission, you will walk
the path of engineers who safeguard the system.
Our exploration begins with routine and expectations
that form the foundation of effective on call duties.
Moving into the oncall odyssey, we navigate through challenges
and experiences from incident alert
to resolution, offering a real-world view
into the lives of oncall engineers.
Lastly, we'll delve into leadership's role
and the nurturing of an effective culture,
uncovering how these elements fortify
an efficient oncall environment. Let's go.
In the challenging journey of oncall duty,
sidekicks are indispensable in
navigating through the technological chaos.
Key players like incident management systems,
namely ServiceNow and PagerDuty, along with
insightful dashboards and APM tools such as
Datadog and New Relic,
become our vigilant allies, alerting us
at the slightest murmur of trouble.
But throughout the battles, our most valuable sidekicks
are our teammates, always standing beside
us, ensuring that no challenge is faced
alone. The cornerstone of the oncall routine is the rotation,
usually tailored by team,
whether on a sprint, weekly, or other basis,
with orchestration often led by a team leader or
designated individual. Tools range from
simple documents to sophisticated systems like ServiceNow,
PagerDuty, Zenduty, or others,
ensuring smooth sailing and enabling meticulous tracking,
all aimed at fostering more serene on
call shifts. The oncall motto is clear:
guard the system's health. Being oncall means
standing vigilant, ready to respond to the summons.
Whether from a teammate or an automated alert,
you should respond and acknowledge. I want
to state the obvious. If it's a person calling you in the middle of
the night, be nice. Trust that he's not doing
that for fun. Embarking on the pivotal
oncall odyssey, our hero seamlessly
transitions between proactive vigilance and
reactive action. The cycle commences with deliberate prevention,
where our engineer monitors dashboards and alerts while
examining recent deployments to detect potential incidents.
When an issue does emerge, she springs into action,
responding promptly. During the resolution phase,
our oncall hero navigates through triage, mitigation and
resolution, enlisting the aid of necessary teams
to reestablish system stability.
Even when the immediate crisis is mitigated,
she rallies the troops to permanently resolve lingering
issues. In the aftermath,
reflection and learning are crucial. Our hero
spearheads discussions and learning moments that shape
and enhance future incident management and proactive
prevention. This helps make the system
stronger and gets the team ready for future problems.
Heroes respond to the call, but oncall superheroes
go a step further. They are proactive, scouting for
potential pitfalls and resolving them before they escalate.
Regular check-ins on system health and a keen eye
on recent updates can prevent trouble. When
an alert beckons, quick action keeps minor
glitches from becoming major crises at unpleasant times.
Whether you catch a whisper of trouble, which is ideal,
or see a bat signal blazing in the sky,
what's next on the agenda? It's all about triage,
mitigation, and holding the investigation of the root cause for
later. Oncall isn't a lone vigilante
game. It's about highlighting issues and ensuring
they are managed promptly. Help is around the corner. Remember,
our heroes aim to restore peace, but first
they need to show up. Triage tactics:
Sometimes the issue might be outside your power to fix.
Your mission is to gather the basic facts. If the problem
is in your wheelhouse, jump into action, triaging and
either mitigating or resolving it. Work on the
problem till a solution is in
sight or until you've handed it over to the right specialist.
In this case, make sure to perform a warm handover.
Remember, you own the scene, stay on scene and lend your
expertise until the incident concludes.
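The triage tactics above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Incident` fields, the `triage` function, and the team and person names are hypothetical and do not come from the talk or from any incident management tool.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    summary: str
    owning_team: str  # hypothetical field: the team whose system is affected
    facts: list = field(default_factory=list)
    responders: list = field(default_factory=list)


def triage(incident: Incident, my_team: str, me: str) -> str:
    """Decide the next step for the oncall engineer who picked up the page."""
    # Showing up comes first: the oncall engineer joins the incident.
    if me not in incident.responders:
        incident.responders.append(me)
    # Gather the basic facts regardless of who ends up fixing the problem.
    incident.facts.append(f"acknowledged by {me}")
    if incident.owning_team == my_team:
        return "mitigate or resolve"
    # Outside our power to fix: perform a warm handover, meaning the specialist
    # joins with the facts gathered so far while the original responder stays.
    incident.responders.append(f"{incident.owning_team} oncall")
    return "warm handover"


incident = Incident("database latency", owning_team="database")
print(triage(incident, my_team="payments", me="dana"))  # warm handover
print(incident.responders)  # ['dana', 'database oncall']
```

Note that the original responder is never removed from `responders`, which mirrors the advice to stay on scene until the incident concludes.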
The primary goal when facing an incident is
service restoration. Gather the required intel,
form a strategy to mitigate or resolve the issue,
and rally everyone who can help.
Prioritize minimizing the impact on
customers before diving into investigation mode. If mitigation
isn't in the cards, aim for resolution.
Time is of the essence. After
the tempest of an incident calms, whether through a
workaround, a rollback, or a direct fix,
our journey continues into a meticulous exploration
of its depths. It's time to dissect the event and
discover its root cause. This is
also not a lone venture. It calls for the assembly
of your allies. Together in unity,
delve into the enigma, investigate,
and piece together the story behind the chaos.
Post-incident reflection is crucial.
After the dust has settled, it's important to
learn from what happened. Conduct a lessons-learned session, also
known as a postmortem or incident review, that
sketches a timeline of events,
analyzes the root cause, employing strategies like the five whys, and
outlines action items with designated owners
and ETAs. Draw up tasks for
future handling to ensure every lesson morphs
into a prevention measure.
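One way to keep such a review honest is to record it as structured data. The Python sketch below is a minimal illustration, not a prescribed template: the class names `IncidentReview` and `ActionItem`, their fields, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date, datetime


@dataclass
class ActionItem:
    description: str
    owner: str  # every action item has a designated owner
    eta: date   # and an ETA
    done: bool = False


@dataclass
class IncidentReview:
    title: str
    timeline: list = field(default_factory=list)      # (timestamp, event) pairs
    five_whys: list = field(default_factory=list)     # each answer prompts the next "why?"
    action_items: list = field(default_factory=list)

    def add_event(self, when: datetime, what: str) -> None:
        """Add an event and keep the timeline in chronological order."""
        self.timeline.append((when, what))
        self.timeline.sort(key=lambda event: event[0])

    def root_cause(self):
        """The last answer in the chain of whys is treated as the root cause."""
        return self.five_whys[-1] if self.five_whys else None

    def overdue(self, today: date) -> list:
        """Open action items whose ETA has already passed."""
        return [a for a in self.action_items if not a.done and a.eta < today]


# Hypothetical example data, for illustration only.
review = IncidentReview("Checkout outage")
review.add_event(datetime(2024, 1, 1, 10, 30), "rollback deployed")
review.add_event(datetime(2024, 1, 1, 10, 0), "alert fired")
review.five_whys = ["requests timed out", "connection pool exhausted"]
review.action_items.append(ActionItem("alert on pool usage", "dana", date(2024, 1, 10)))
print(review.root_cause())                     # connection pool exhausted
print(len(review.overdue(date(2024, 1, 15))))  # 1
```

The `overdue` helper reflects the point that a lesson only becomes a prevention measure once its action item is actually completed.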
Post incident, arm yourself better with
prevention protocols. Craft runbooks and
documentation based on the newfound knowledge,
leading to quicker triage in the future.
Where possible, develop tests and
automations replicating the
incident scenario, and refine metrics
and alerts for visibility. Every incident thus
morphs into a shield against future troubles.
The roles of leadership in this realm are diverse.
Initially, they must ensure that engineering managers schedule
the oncall rotations. Following that, it's imperative
to provide guidelines and establish clear expectations around
the process. Occasionally, evaluating and adjusting
the process is necessary to ensure it meets its objectives
and aligns with the company's context. Moreover,
determining the severity levels and SLAs of incidents
is a crucial task. They also formulate an escalation
policy that aligns with the severity of incidents.
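To make the link between severity, SLAs, and escalation concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the severity names, the SLA durations, the roles in each chain, and the `who_to_page` function are hypothetical, and real values depend on the company's context.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SeverityPolicy:
    ack_sla: timedelta        # time allowed to acknowledge the page
    escalation_chain: tuple   # roles to page, in order
    page_off_hours: bool      # whether this severity justifies a late-night call


# Hypothetical values; each organization defines its own.
POLICIES = {
    "sev1": SeverityPolicy(timedelta(minutes=5), ("primary", "secondary", "engineering_manager"), True),
    "sev2": SeverityPolicy(timedelta(minutes=30), ("primary", "secondary"), True),
    "sev3": SeverityPolicy(timedelta(hours=8), ("primary",), False),
}


def who_to_page(severity: str, unacknowledged_for: timedelta) -> str:
    """Move one step up the chain for every acknowledgement SLA that was missed."""
    policy = POLICIES[severity]
    missed_slas = int(unacknowledged_for / policy.ack_sla)
    step = min(missed_slas, len(policy.escalation_chain) - 1)
    return policy.escalation_chain[step]


print(who_to_page("sev1", timedelta(minutes=0)))  # primary
print(who_to_page("sev1", timedelta(minutes=7)))  # secondary
print(who_to_page("sev3", timedelta(days=2)))     # primary
```

The `page_off_hours` flag is one possible way to encode which severities are worth a late-night call and which can wait until morning.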
As an engineering leader, stepping into the role of a
superhero, your main mission is to protect your team
from oncall burnout. Make sure to shield them
from too much work and to always
respect their time. Keep late-night shifts strictly
for real emergencies, knowing that many issues can afford to
wait. Balance is essential. Put your team's
well being above constant hard work.
Some of my proudest leadership moments come
from identifying a situation as noncritical
and telling my heroes to stand down. When
done right, this not only avoids burnout
but also builds trust and results in a team that's more engaged,
knowing you'll only call them when it's truly necessary.
The key to unlocking the full potential of our heroes
lies in cultivating a positive and supportive oncall
culture. There are several do's and don'ts, but we'll focus
on the crucial ones.
Avoid a blame game. It's not about pointing fingers
at other teams or passing incidents around like a hot potato.
In the heat of the moment, it doesn't matter
who started it. A blame culture can deter individuals
from taking risks or owning up to mistakes.
Stay proactive. Don't just wait for solutions to
come to you. Engage actively. Exhibit a sense of
urgency. Understand the essence of your call to
action and take charge of it.
Communicate effectively. Ensure clear
communication with your teammates handling the incident,
your manager, and your stakeholders.
A well informed team is a well coordinated team.
Embrace responsibility. Tackle challenges in
a respectful and kind manner, embodying the true
hero within you. Navigate through the storm
by cooperating with others, working together to
restore calm. In our journey today,
we reviewed the oncall routine and expectations,
delving into the hero's journey: from vigilant monitoring
to prioritizing service restoration during
incidents and deep post-incident analysis
to shield against future issues.
Leadership emerged as vital, curating oncall schedules,
clear expectations, and thoughtful policies,
always with a keen eye on team well-being
and balanced workloads. Our exploration also
illuminated the vital oncall culture, championing solution-focused,
blame-free environments.
Together, let's embody our inner hero,
fostering an empowering oncall culture.
Now it's our collective mission to forge
our own league of Guardians.
Visualize your team as a robust league of heroes,
each engineer possessing unique skills standing
resilient against the chaos of incidents.
Your mission: invest, ally, and share.
Cultivate practices that uplift your engineers.
Forge alliances within your ranks and
share your war stories of trials and triumphs.
Embark on this quest: build your league,
share your sagas, and redefine the future of oncall
duty. The adventure to craft a league
of extraordinary oncall heroes begins
now. Thank you.