Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. Welcome to Conf 42.
And don't forget the humans. So last year
I was in Spain at a technology conference,
enjoying this beautiful beach with friends from the conference.
This was an actual story, by the way,
and one of my friends saw the ocean and said,
I really want to go out for a swim. It's beautiful. Oh, I want to
decompress. Would you mind keeping an eye on me?
Of course. I said yes. I was going to take a look
every five to ten minutes to make sure that I see them.
They were just afraid of getting eaten by a shark. Mostly, I take a look, I see them, I continue with my day. There's a lot going on at the beach. There's beach balls, there's frisbees, there's margaritas, there's popsicles. A lot of fun was being had. Next thing I do is
that I look and I no longer see the bobbing head
that I was taking care of. Wait, Ana, you literally forgot an actual human? I did. I forgot a human. I forgot a friend. Well, on that note, why don't we tell you who we are so you don't forget us.
I'm Julie Gunderson, Senior Developer Advocate at AWS. You can find me on socials at Julie_Gund. And my name is Ana Margarita Medina. I'm a Staff Developer Advocate at Lightstep. You can also find me on all socials at Ana_M_Medina. Thank you for that, Ana. So let's kick
off with talking about reliability.
The AWS Well-Architected Framework defines reliability as the ability of a workload to perform its intended function correctly and consistently when it's expected to.
And look, the increased reliance on technology in various industries, in everyday life, and by just regular people, and the expectations of people, have really increased this push for reliability in our systems, because technology plays a larger role in critical infrastructure, healthcare, finance, transportation, and all other areas. And so we have a need for reliable systems.
And it's not just this extra need for reliable systems, it's also understanding that these death star diagrams that we see up here, maybe you've seen them before, show that our systems continue to get more complex. And the interconnectedness and complexity of these systems means that one failure is very likely to actually cause a lot more cascading effects and cause more failures, more downtime, which highlights the need for us to really focus on reliable systems. It is known that in any complex system, failure is going to happen. But reliability requires that we make sure we become aware of these issues, that we avoid customer impact, and that we maintain the availability of our infrastructure. And if our systems can actually bring themselves back up automatically, we want to work towards that. And if we're not there, let's start doing the work to make sure that we get there.
And to add just a little bit more stress to those death star diagrams that you just saw of how complex our systems are: the reliability of our systems is important for a variety of reasons. I mean, you can look at safety. Technical systems are critical for the safety and well-being of individuals and communities. For example, you can look at power grids, transportation systems, medical equipment, airplanes. All of these play a crucial role in ensuring that people are safe and healthy. And if these systems aren't reliable, the consequences can be severe. There's potential loss of life and injury. Not just Julie not being able to watch The Witcher on Netflix, which, by the way, is very good, but we won't get into that right now.
There's also the economic impact. If we look at how
our systems are critical for the functioning of many industries.
And I mean, you can look: system outages make major news now, and they cause significant economic loss. Look at the stock market after a company experiences a major outage. And then we also have quality of life, and that kind of goes a little bit back to The Witcher. But really, technical systems do play a critical role in determining the overall quality of life. Look back at the last three years
and how our lives were able to really continue on
in a sustainable way due to some of these technical
systems. We also have environmental protection and the systems
that rely on these, such as
waste management or water treatment.
The reliability of these systems is crucial to minimizing
impact. And then you also have reputation.
Look, anybody can go to the socials and say
all the things, and this happens when we have incidents
that impact our sites. So it's a lot of pressure, and I know that. So let's talk about ways to kind of reduce some of this pressure. So, first, I want to start just by talking about some of the principles of system reliability. And this is what we look at from the AWS Well-Architected Framework perspective.
So, first, we want to automatically recover from failure. And so we do
this by monitoring a workload for key performance indicators,
or KPIs. And those KPIs, they should be a measurement
of business value, not the technical aspects of the operation
of the service. And by doing this, you can trigger
automation when a threshold is breached, which allows for
automatic notification and tracking of failures and automated
recovery processes that work around or repair that
failure. And with more sophisticated automation,
it's possible to anticipate and remediate failures before
they occur. And then you can also test recovery
procedures. So testing is often conducted to prove that the
workload works in a particular scenario. In the
cloud, you can test how your workload fails, and you can validate your
recovery procedures. You can use automation to
simulate different failures, or to recreate scenarios that led to failures
before. And this is something that Ana and I are actually both very
passionate about. We have actually for years worked on teaching
people about chaos engineering, which is actually intentionally injecting failure into your systems to identify weaknesses and remediate them before your customers notice. This approach exposes failure pathways that you can test and fix before a real failure scenario occurs,
thus reducing risk. You also want to scale
to increase aggregate workload availability. You replace one
large resource with multiple small resources to reduce the
impact of a single failure on the overall workload.
Distribute requests across multiple smaller resources to
ensure that they don't share a common point of failure, unlike having just one person monitoring that individual in the ocean to make sure that they're still there. Now, a common cause of failure in
workloads is resource saturation, and you can monitor
demand and workload utilization and automate the addition or
removal of resources to maintain the optimal level to
satisfy demand without over or under provisioning,
and then finally manage change through automation.
Changes to your infrastructure should be made using automation.
The changes that need to be managed include changes to
the automation, which then can be tracked and reviewed.
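The threshold-triggered automation described above can be sketched roughly like this. To be clear, the KPI name, threshold, and actions are all invented for illustration; in practice this would be a managed alarm service (e.g. a CloudWatch alarm driving an automated runbook) rather than a hand-rolled check:

```python
# Hypothetical sketch: watch a business KPI and, when it breaches a
# threshold, notify humans and kick off an automated recovery action.

def check_kpi(orders_per_minute: float, threshold: float = 100.0) -> bool:
    """Return True when the business KPI has breached its threshold."""
    return orders_per_minute < threshold

def handle_breach(notify, recover):
    """On a breach: notify humans and start automated recovery."""
    notify("KPI breach: orders/min below threshold")
    recover()

alerts = []
if check_kpi(orders_per_minute=42.0):
    handle_breach(notify=alerts.append,
                  recover=lambda: alerts.append("recovery started"))

print(alerts)
```

Note the KPI is a business measurement (orders per minute), not a technical one like CPU, which is exactly the point made above.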
Thanks, Julie, for talking about the principles of reliability. I do really think that the AWS Well-Architected Framework gives us a good overview of what we need to aspire to in order to be reliable. I love all of them, and I do think that the capacity one is funny, especially when we think about it: if we had more people taking a look at this person that is out swimming in the ocean, maybe we wouldn't have lost them.
But we are talking about humans today,
and I want to make sure that we also remember that humans are
very complex. There's multiple things that are
built into us, and these multiple
systems that we have, they each themselves have
their own complexities. We can go on and on talking
about all of them, but we can narrow it down to your
circulatory system, your respiratory system,
your muscular, your digestive, your nervous.
There are so many of them that one,
they have to be interconnected in order for you
to function as a human. But there's other parts that
when you start having issues with one, you might start
seeing the cascading failure happen in another part of the system.
And when something actually goes wrong, you start wondering
what part of my human system is actually having these
issues. And that's when we go and look to doctors and ask them, hey, can you help me figure out what's going on? Yeah, thanks for that, Ana. Because the whole point
as to why we wanted to give this talk anyway was to
understand that our human systems are very similar to
our technical systems, and we can apply some
of the learnings from our technical systems to our human
systems. Werner Vogels said,
failures are a given, and everything will eventually
fail over time. Now, how we handle those
failures can make a huge impact, not only
on the meantime to restoration of our services, not only on
the bottom line of our organization,
but also on the impact to our humans.
So let's talk a little bit about incident response.
First of all, when we talk about an incident, what we mean by that is an unplanned disruption of service.
So when we look at it, incident response is the organized
approach to handling those incidents. And just to kind of
take a step back and look at how incident response came
around. It's not something that we techies invented. It's actually heavily based on the Incident Command System, or ICS for short. The Incident Command System was actually developed after devastating wildfires in Southern California in the 1970s.
What happened is thousands of firefighters responded
to these devastating wildfires, but they found it difficult to
work together. They actually knew how to fight fires individually or
fight smaller fires as a small group, but they actually
lacked a common framework to work as a larger group.
So when we talk about what incident response looks like in
our technical systems, as I mentioned, an incident is an unplanned
disruption of service. But even though it's unplanned, we still
prepare for incidents. And incident response has multiple
phases. So there is the preparation. That's the practice piece.
That's something that Ana and I would talk about when we talk about chaos engineering,
because it's actually a great way to practice your incident
response. The best teams out
there, they practice. So first, we've got to detect an incident, right? Our systems generally do this for us. We're hoping that our customers are not telling us via email or Twitter.
We have monitoring in place to understand if something is going awry.
Once we've detected that problem, then we want to assess it.
We want to understand what is the severity of this. Is this something that I
can wait until the morning to deal with, or is this a Sev 1 where we've got to jump on the phone at 2:30 in the morning and alert the right people? So we mobilize folks so
that they can all help and respond to this incident,
and they work together to resolve the incident.
And sometimes during that response process, you've got to call in other folks because
as Ana talked about with those death star diagrams,
there's a lot of dependencies, and you might need somebody
from another team as you respond to the event.
Eventually you resolve the incident, and then
you focus on recovery of systems. So let's bring all the systems back up
to that nominal state or that steady state, and then it's
really important to take a step back and learn from that. And we generally
call that a post mortem. And we're going to talk more about that
now. When we talk about incident response in humans, well,
just like incidents, life happens.
There are unexpected unplanned disruptions.
I think we've all been through some recently. So whether it's an illness, a family matter, burnout, or maybe your power went out and you couldn't give a talk at a conference that you needed to give, we can apply the
principles of incident response to these situations. And it just starts
with accepting that stuff's going to happen, and we
can apply those principles in personal settings.
So first we prepare, just like we would do with incidents.
We develop a personal incident response plan, and that includes
identifying potential incidents that could occur. And you can look back at incidents that have occurred in your life, you know. Is this something that's likely to occur again?
Is it even a natural disaster? Do you know how you're going to
get a hold of all your friends in case of a zombie apocalypse? Where are
you going to meet them? A lot of people are coming to Idaho to meet
me, but there could be personal health emergencies or financial
crisis. You want to develop protocols and procedures for responding to
these. So once you've done that, now can you detect an incident? Can you identify a personal incident as early as possible?
And you can do this by monitoring yourself for signals.
And you can look at those different signals, which Ana is going to talk about,
and then you need to move to the assessment phase. So determine the severity
of that incident and the potential impact on your life. If my power goes
out and I can't give a talk at a conference, is that a Sev 1? Maybe for those two hours, but that incident will probably be short-lived, versus maybe a major
health crisis or something that's longer. Once you know
the severity, you can mobilize the right people. So, for example, I could call Ana and say, hey, can you help step in and give this talk for me? If it's a talk that she knew, I could mobilize her. For other incidents, I might mobilize others. This might be family members, friends, medical professionals. Then you respond.
So you're going to take the necessary steps to respond to that, which might be
stepping away from work, going for a walk, calling the
power company, whatever that incident requires,
and then we move on to recovery, so restoring normal operations. As we talked about, without moving on to that recovery phase, truly recovering from that incident, oftentimes you can fall back into it. So it's important to take that recovery step, and to take time to focus on it and make sure you fully recover so that you don't fall back into failure, as we would talk about with chaos engineering. And then take time to reflect. So conduct
a post mortem on that incident. Take time to look and
see how did this work? Did the right people get notified?
All of those steps you can apply to personal incident
response, because when we look at the goal of incident response,
it's to handle the situation in a way that limits damage and
reduces recovery time and costs. And this is exactly what we
want to apply to our personal lives as well, because when
we do this, it can reduce burnout, which, Ana,
I think you're going to talk to us about. Yeah, definitely.
I think you covered a lot in incident response. So folks might be like, whoa, I never really considered that I could apply what I do on my job in my personal life. But let's take, for example, a common incident that can happen to humans and that can happen in tech: that's burnout. Maybe you're still kind of like, I hear that burnout in the technology space is on the rise. But what is burnout? Well, when we ask
a psychologist by the name of Herbert Freudenberger, who is actually one of the first known to be talking about burnout, he wrote a book in 1980 about burnout. His definition is that burnout is the state of mental and physical exhaustion caused by one's professional life. So imagine that
your work just took over your entire life, and your brain
can't work anymore and your body can't work anymore.
That's what they define as burnout in this book.
I personally believe that burnout does not only affect your
professional life, unless your professional life is
all the things that you love doing, such as engineering work,
diversity work, parenting, and things like it.
But I believe that burnout can also be very complex.
And you can be burnt out by doing multiple things, such as doing a lot at work, having a social life, your personal life, keeping all those things going. And at the same time, when you don't have certain aspects, so when you don't have a social life, but parenting and work take up your entire time, you can still end up in a spot of burnout.
I also like saying that I am not a professional,
but I have burnt out of my job a few times, so I can speak
with that expertise. Burnout is not depression,
but it can definitely lead you there. So in my life,
I use burnout as a signal of things getting rough,
of I should really start talking to professionals.
And I looked at a study around burnout by
this company called Yerbo. They do surveys around how
employees are doing in different spaces. And two
in five tech workers are showing a high risk of burnout
due to a lot of stress, being exhausted due to their hours,
or just the lack of work-life balance.
When we think of the world of DevOps, incident response, site reliability engineers, they have to be on the computer or getting paged at random times a lot more, so I'm sure that if we were to just look at a DevOps study of burnout, those numbers actually might be way higher.
And burnout is really hard to deal with.
It's hard to identify that you're feeling burnt out, and it's hard to bounce back from burnout. And some folks actually might leave technology, might leave their jobs. I know a lot of folks that got burnt out and are still burnt out.
So you might be wondering, Ana, what the hell does
burnout actually look like? Well, it looks very different to
a lot of folks. It could be exhaustion where you're tired
all the time. It could be that you're lacking motivation,
lacking concentration, and it's hard for you to go
out and see friends, see your family members, do your
work, which can lead to maybe you not wanting to
be around people that you know, people that care about you, and you alienate
yourself, which then might mean that you start getting frustrated.
For a lot of people, myself included, some of these symptoms, some of these burnout thoughts, become physical symptoms, and your body starts feeling them. It can also get to where your self-esteem actually drops to zero or just gets really low, and that could trigger a really bad depressive episode, or it can also lead you down the route of substance abuse.
So how is it that we can be more careful? How can we be more in tune with our systems to prevent these incidents from happening? Well, that is where the practice of observability comes into play. Observability in
our technical systems is the practice of understanding
our systems. This leverages telemetry data and techniques, and it allows for us to
set up new practices based on the insights that we've uncovered.
That telemetry data can be traces, can be logs,
can be metrics, it can be events. Just information that lets
us be aware of how a technology system is doing.
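As a rough sketch of what one of those telemetry signals looks like in code (the event and field names here are invented for illustration, not any particular vendor's API), a structured event ties a metric value to a timestamp and context attributes so you can correlate it with other signals later:

```python
import json
import time

def emit_event(name: str, value: float, **attributes) -> str:
    """Serialize one telemetry event: a metric value with a timestamp
    and arbitrary context attributes for later correlation."""
    event = {"name": name, "value": value, "timestamp": time.time(), **attributes}
    return json.dumps(event)

# Hypothetical example event: checkout latency tagged with deploy context.
line = emit_event("checkout.latency_ms", 182.0, region="us-east-1", release="v42")
parsed = json.loads(line)
print(parsed["name"], parsed["value"])
```

The attributes are what let you draw the correlations the speakers describe: the same idea works whether the "system" is a service or, loosely, a human logging mood next to sleep and exercise.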
So what does observability look like in humans?
Well, after a human is deployed to production
out of the box, they actually have observability.
Our bodies have senses to observe the world around us.
Certain parts of our systems have specific practices that help us define what we feel and how we should act, such as our fight-or-flight response. But sometimes these things are biased due to our perception, or biased due to the lack of personal insight.
So that can mean that you might actually have
an okay day, but you are still going through a panic
attack. That perception makes you feel like you're
in crisis, but in reality, you're not really in
danger. So the question becomes,
what are the traces, the logs, the metrics
that humans have around us, or that we can create,
that we can draw these correlations from the events
that we have in real time?
We have to work on these insights, and everyone's going
to work on them differently, because those signals and
insights are going to be different for what matters to you,
what you want to work on, what complex system
in your human system matters to you the most.
It could be making sure that you have habits set up,
making sure that you have your heart rate monitor always on,
or that you have a certain heart rate throughout the workday,
that you're exercising every single day,
and that you're doing 10 miles of exercise
on the weekends. It could also mean monitoring
your mood intensity, or making sure you're taking your medication, your blood
levels. We are at a point that we
can actually continue to increase our insights on our human
system. We have technologies such as
wearables, that allow for us to understand what's going on,
such as the Apple Watch, Fitbits, the Oura rings.
They track our heart rate, our sleeping patterns.
They can also track things like glucose.
And there's a whole biohacking field with hardware that is fun to geek out in.
But as you can tell, we are
responding to things going wrong.
There is nothing very preventative about this.
So my other question becomes, how can we enable
logging for our human system via food tracking, mood tracking, habit or connection tracking? Because we want to add as many signals and notes to our day-to-day to actually understand how every single change applies to your life. Like, did eating a donut at 11:00 p.m. make me extremely cranky this morning? Should I have not done that? Maybe. I wish I could
know. I mean, maybe or maybe not.
Maybe it made you happy this morning.
That is a very fair point. So we
are creating these insights, but where do we go from
there? What do we do? Well, when it comes
to our technical systems, we have capacity
planning. Capacity planning is the practice
of planning to set up your technical systems to be
successful in the future. You want to make sure that
you have enough infrastructure provisioned. You want to make sure
that your AWS account has all the limits raised
for your high traffic event. You also want to make sure
that you have an on call rotation for the
services that are going to be used a lot.
Or you want to make sure that you've trained the folks on
call to actually take the right actions when an incident
does occur. And that's where you do things like chaos engineering
and load testing. And hopefully you're doing them together because
that is what sets you up for success.
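The capacity-planning arithmetic behind "enough infrastructure provisioned" can be sketched in a few lines. The numbers are made up for illustration: given an expected peak load and per-instance capacity, provision enough instances plus headroom so a spike or a single failure doesn't saturate the rest:

```python
import math

def instances_needed(peak_rps: float, per_instance_rps: float,
                     headroom: float = 0.3) -> int:
    """Instances required to serve peak traffic while keeping
    `headroom` (e.g. 0.3 = 30%) of total capacity in reserve."""
    required = peak_rps / (per_instance_rps * (1 - headroom))
    return math.ceil(required)

# e.g. 700 requests/sec peak, 100 req/sec per instance, 30% reserve:
print(instances_needed(peak_rps=700, per_instance_rps=100))
```

Load testing is what validates the `per_instance_rps` assumption, and chaos engineering is what validates that losing an instance really does stay inside that headroom.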
Agreed. When it comes to
humans, you got to do capacity planning,
too. Humans have a maximum capacity.
This can be in the form of physical, emotional, financial, or educational awareness. People have to be in tune with their limits. How can you do that? Well, you ask
yourself some questions. What do you value?
Do you even know what your values are? Maybe think about doing
some values exercises to make sure that your day to day actions are
aligned with the version of you that you want
to be. You also might want to ask yourselves,
how many spoons do I have? How much bandwidth do I have?
How much energy do I have to either handle a
coworker currently giving me all this PR feedback in all caps that I'm not really happy with, or my toddler throwing their Gerber in my face.
How much am I able to handle? You want to be in tune with yourself.
And with that comes that question, like, check in with yourself.
Have I prioritized myself? Because you can't
be there for others. You can't be there in situations if you haven't
taken care of your own needs. Kind of like what they tell us on planes: make sure to put on your own oxygen mask first before you go and help someone else. So how
do you apply it? Well, maybe remember that you're
not a computer. You are a human. Maybe that
means that you have to turn off the news, you have to turn
off technology, you have to request time off and go on
vacation, you have to see your loved ones.
Or maybe you have to make sure that you have a big group of subject
matter experts in your life and that you're spending time with them. Other things that you can really do: focus on a blameless culture. So when we talk about focusing on a blameless culture
with everything that Ana just told us, you know, I want to thank her for the time that she took, and point out that when we talk about Ana losing somebody in the ocean, we're not necessarily blaming her for that, although it sounds like it. Sorry, Ana. In complex systems of
software development, there are a variety of conditions that
interact that lead to failure, right? So as Ana was talking about the ocean, there were frisbees, margaritas, sunscreen, and people, a variety of conditions.
So we want to perform a post mortem at the end of an incident.
Some folks may call that a retrospective, an RCA,
whatever it is that your team calls it. The goal is to understand
what were those systematic factors that led to the incident
and identify actions that can prevent this kind of failure from reoccurring
in the future. The thing with a blameless post mortem is it stays
focused on how a mistake was made instead of who made
the mistake, how it was made instead of why it
was made. This is a crucial mindset, and it's really important to understand.
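As a tiny illustration of that mindset (the field names and prompts here are invented, not a standard template), a blameless post mortem record captures how and what, and deliberately has no "who caused it" field:

```python
# Hypothetical blameless post mortem template: only how/what prompts,
# no field for assigning individual blame.
BLAMELESS_PROMPTS = [
    "How was the change made?",
    "What signals did we see, and when?",
    "What made this look like the right action at the time?",
    "What systemic factors contributed?",
]

def new_postmortem(title: str) -> dict:
    """Start a post mortem record seeded with blameless prompts."""
    return {"title": title, "questions": list(BLAMELESS_PROMPTS), "action_items": []}

pm = new_postmortem("Lost swimmer incident")
print(len(pm["questions"]))
```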
We want to leverage that mindset. A lot of leading organizations
do. We have the benefit of hindsight when we're in a post mortem. We want to be careful not to have that turn into a bias. We want to ask how questions and what questions instead of why questions. Why, Ana, did you lose track of that? Versus: how, Ana, what was going on at the time? What led you to make those decisions? Boy, I really feel bad for picking on you in this, Ana, just for everybody to know.
Ana and I have known each other for years.
It's called a single point of failure. That's what went wrong to begin with, and there actually probably shouldn't have been a single point of failure. There should have been other people that were asked to keep an eye on this person. But let's talk about a blameless culture
in humans. Look, humans are humans.
They are going to fail. We will fail. It is unavoidable.
So practice blamelessness with yourself.
When we talk about that blameless culture and not having that hindsight bias when we look at a post mortem, that also happens in our lives. I can tell you I've been guilty of lying in bed and going over my day with, why did you do that? Why did you do that?
Why did you do that? Be kind to yourself.
Practice blamelessness. Use how and
what questions. Give yourself a break and realize that you're
not alone as well. Other people
feel stress and struggles. Folks just might not talk
about it. And it's okay to give yourself a break.
It's very rare that one of our actions is going to
cause the next zombie apocalypse. So remember that
and try to keep things in context.
And this is really important because we want to move in
our sociotechnical systems from that reactive approach where we're just all
scrambling because the phone rang, to a more proactive approach
where we are calm, we understand our systems,
and ideally we're preventing
situations from occurring. We look at that with disasters, unexpected problems that result in slowdowns and interruptions, or network outages.
We can prepare for these by practicing,
by working with our teams, getting to understand our
systems better and preparing. And then we can move
that to the humans. Ana, talk to us about that. I love
talking about humans because we do have to
move from always being reactive and heading
into the emergency room because something went wrong or having
a meltdown and having to call a friend because things are not
working out. Fair point. There's nothing wrong with calling your friends
when you're having a meltdown. Number one, that's what you should be doing.
They're your friends to begin with. But what are some things
that we could be doing today that allow for us to live
a life worth living? Live a life with your loved one.
Live a life that you're happy with. And that
brings me to staying connected to people around you.
Human connection is so crucial. I'm sure all
of us in this pandemic world that we're still in feel
it. It's not the same to be with your loved ones in virtual settings like this versus actually spending in-person time with them. And it also depends on your human love language. Are you someone that is happier when you're actually
spending physical time with someone in the same room?
Or do you need something like acts of service, where someone helps you or does activities with you? We all want to feel
connected. And the same way that we want to continue moving from reactive to proactive, another way to do it is to set up some goals, set up some north stars of where you want your life to be. We're at the beginning of the year; it's the perfect time to think about New Year's resolutions, or think about where you want to be at the beginning of next year. What is
it that you're doing today that allows for you to know that
you're working towards them? And the same way that
in incidents we are preparing people around us,
as subject matter experts, we could use the same concept of subject matter experts in our personal life and set up a board of directors, which includes things like your past managers, your peers, leaders that you look up to, just folks that you can ask, hey, is this the right move for my career? How do I deal with the coworker that's doing X, Y, and Z?
And lastly, another way that we can continue
to be proactive is to plan for the future.
In psychology, in therapy, there's a specific type of therapy called future-directed therapy that is really cool. It's all focused on making humans think about the future and be content with the future. And it does it by doing prospection, the action of looking forward into the future. And that is where you put that north star and you work towards it. That's where you put that reliability goal and you work towards it.
And of course, I can't do a talk about tech and humans and not throw some YAML at you all. Don't forget to kubectl apply -f humans.yaml.
And don't forget to set some resource limits and make sure that you
have everything set up properly.
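To run with the joke, a humans.yaml might look something like this. It is entirely made up, of course (there is no Human kind), but it carries the resource limits the speakers recommend:

```yaml
# humans.yaml — a tongue-in-cheek sketch, not a real Kubernetes resource.
apiVersion: v1
kind: Human
metadata:
  name: you
spec:
  resources:
    limits:
      workHoursPerDay: 8      # hard cap: stop when you hit it
      meetingsPerDay: 4
    requests:
      sleepHoursPerNight: 8   # minimum guaranteed allocation
      vacationDaysPerYear: 20
  livenessProbe:
    checkIn: weekly           # check in with yourself often
  restartPolicy: RestAndRecover
```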
So some self care tips to leave you all with.
Make sure to check in with yourself often, whatever often means
for you, whether it's weekly or monthly. I recommend at least quarterly taking the burnout survey that you can find at burnoutindex.yerbo.co. Check in on your teammates
and loved ones just to make sure that they're also hanging in there.
I love treating myself, but you should also do something
nice for yourself every once in a while. That might be getting a massage,
getting your nails done, booking a vacation with your loved ones, eating a
lot of ice cream, playing video games, you name it.
Make sure to unplug, turn off social media,
turn off the Internet, turn off the news, turn off technology,
be with yourself. What are you like when you don't
have all this stimulation that we have with electronics?
And as you work towards all this, be kind to yourself,
but set some deadlines and start small as you work
through these things. Remember, you can be flexible with yourself,
but you can work towards having a reliable human system, too.
And we want to make sure that you have some resources.
So I'm going to pause here to let folks take a screenshot or
check out some of these resources that we have for you.
Don't forget, there's apps, there's open source mental health resources, there's lifelines that might pertain to you
and your friends and family members. Ultimately,
it's okay to ask for help. And we're
all humans. Be kind. Now that
we've stated that we're all humans and that you should be kind, a quick point to note is that not all images in this presentation were created by humans. So thank you, DALL·E. Any image that had this little graphic on it was created by DALL·E. And then
to kind of finish up Ana's story, I was the person that she lost in the ocean. There were a couple of things that led to this. I had shared my location with Ana, but my phone was with Ana, not with me in the ocean. I did end up about a quarter of a mile away, and it was a fun walk back. Ultimately, I'm alive, did not get eaten by a shark, and still very much appreciate my friend for making sure that I'm alive. She's never going to let me live it down that I forgot her at the ocean. I mean, there is that, but we've learned. We have learned from this. And other people will monitor me when I'm in the ocean, or maybe I'll pay attention to myself as well. So thank you for joining us. All right. We're great.