Transcript
Hi everyone, we are going to talk today with you about how we brought chaos engineering into our organization in a fun and continuous way. My name is Christina, I am a product manager at Electrolux, and I'm here in the Stockholm office together with my colleague Long. He is a senior SRE engineer with a PhD, in chaos engineering in particular.
So let's start. Here are the three sections we are going to discuss today. For the first one, I would like to jump into the context: at Electrolux we work with a highly complex system, and by this I mean IoT.
To explain what we are doing, I'd like to say that we work in the digital experience organization, which is where connectivity at Electrolux starts and is developed. Our appliances are divided into three categories: taste, wellbeing and care. These are connected appliances.
So just to give you a simple use case: here is a lady, and let's imagine that she bought her oven and connected it to the Internet via our mobile app. Then she places some food in it, and the camera recognizes the food and suggests the cooking mode, the temperature and the duration.
Then she starts cooking and goes
out for a run. The oven keeps monitoring the cooking status
and turns off the heating when it's done.
And also she receives a notification from the app so
she knows she can go home and enjoy the meal.
To build all possible and impossible use cases, we have a firmware
team that makes the brain of appliances smarter. We have a team
of brilliant backend engineers that develops a connectivity cloud to
send data from the appliance to a mobile phone. And of course we have a
team of mobile developers together with designers
building a meaningful digital experience for our consumers. And then there is also us, the SRE team.
To provide more context, I would also like to say that, apart from the different types of appliances like fridges, vacuum cleaners and dishwashers, we have many third-party company integrations, we currently launch in over 60 countries and we support over 20 languages. We also run our services from several different regions, on top of the general complexity of IoT. If you join the Electrolux IoT domain as a developer, on your first day you can be overwhelmed by the range of different tools and cloud vendors we use.
We have three main connectivity cloud platforms.
The majority of services use AWS, but we
have some services running on Azure, IBM and Google Cloud.
And it means that all of our developer
teams use different tools for monitoring.
When it comes to an incident, our SRE team needs to look into all the observability tools that are used and identify the root cause, which can take a while. There are some historical reasons for this, but before we move to the next section, it's important to note that when we started our journey, we didn't have a plan to solve all this complexity via chaos engineering. That just came later.
So as a first step, we decided to
address the obvious challenge first and
bring everyone and every application we
have into one observability platform, so that we would have end-to-end tracing from the appliance, via our connectivity platform, to the app, and we would be able to trace and troubleshoot much better. As
I said, there were some historical reasons why every team had their own preferences and their own observability platform, but we decided to consolidate everyone onto one platform. We selected Datadog and started to onboard team by team.
Here are a few dashboards from the different teams. I remember how one day engineering leads reached out to our team and said that they had an ongoing incident and needed our help. It was a big change, because usually it's us who notice the incident, and now we sometimes weren't even involved at all, though they still sometimes needed our help to identify the root cause and fully resolve the issue.
So I would say that the first step
towards gamification was when we created
this page just to understand what
is the difference among our developer teams, because some teams still use only basic functions, like only checking logs. I just wanted to highlight with this example that if you make some changes in your organization, you cannot expect that everyone will adapt quickly.
I also remember there were product people who reached out to me and asked how to get all the stars, which was quite fun, because I would say that was the first moment when they wanted to play this game with us and start learning about the new observability tool. But in general
we started to brainstorm what we could do about this big difference among the teams and how to help those teams learn more about the observability tool we provide. So we got the idea to run an internal chaos engineering game day to promote our toolset and improve developers' knowledge about monitoring. We told our developers that we were going to attack their services with many scenarios that could happen in real life and would have an impact on our end users. We also tried to cover different topics of troubleshooting beyond log searching, like metrics analysis for performance and traffic, and trace analysis for latency. But let's call in the expert in chaos engineering, so you can hear a bit more about it. Long? Perfect.
Hello everyone. Thanks Christina for the nice introduction. This is Long, and I'm super proud to be here for the third time to talk about chaos engineering at Conf42. This time we are going to share more about how we adopted chaos engineering in a fun and continuous way. As you already know, we have this one observability platform and we are going to onboard developers and enable them to learn more about how to use these observability tools. We also think it's good to extend this purpose using a chaos game day, so that they can learn more about the infrastructure and also improve their troubleshooting knowledge. So as step two: how do we make it fun, using chaos engineering, for our developers to learn and to improve their troubleshooting capabilities? We came up with this idea, the Chaos game day. If you have a similar need and you want to adopt chaos engineering in your organization, we are super happy to share more about this journey.
First of all, I would like to share how we prepared for
the Chaos game day. Of course it's very important to
communicate and decide the target environment beforehand, because, as we know, the ultimate goal of chaos engineering is to improve resilience and to conduct experiments directly in production. However, we are not there yet, and we think it's good to begin with our staging environment, because the first Chaos game day was focused on education and on the onboarding experience for developers. So we needed a lot of communication with different teams, and we finally decided to use the staging environment. Second, we wanted to make it a fun way to learn and to improve troubleshooting capabilities, and here we think it's good to use the CTF form of Chaos game day, capture the flag. This format is more often used by
security developers I would say, but it's also a
perfect option for Chaos game days. In the context of chaos engineering experiments, we think everything can be a flag: for example, a piece of logs, the name of a metric, or the method that raises an exception, et cetera. We can predefine a set of flags, then trigger or inject some failures on purpose, and invite developers to figure out what flag we injected into the target environment.
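To make the idea of a flag concrete, here is a minimal sketch of what a flag registry could look like; the experiment names, flag values and helper function are hypothetical, made up for illustration rather than taken from our actual setup:

```python
# Hypothetical flag registry for a CTF-style chaos game day.
from dataclasses import dataclass

@dataclass(frozen=True)
class Flag:
    experiment: str   # which chaos experiment this flag belongs to
    kind: str         # "log", "metric" or "exception"
    value: str        # the exact string players must submit

FLAGS = [
    Flag("exp-1-error-logs", "log", "FridgeCommandRejected FLAG{noisy-neighbour}"),
    Flag("exp-2-latency", "metric", "connectivity.api.request_latency"),
    Flag("exp-3-exception", "exception", "TemperatureController.set_target_temp"),
]

def check_submission(experiment: str, submitted: str) -> bool:
    """Return True if the submitted string matches the flag placed for that experiment."""
    return any(f.experiment == experiment and f.value == submitted.strip() for f in FLAGS)
```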
Based on this form, we decided to conduct seven experiments in total, which are all related to troubleshooting and to different aspects of observability platform knowledge. I will share more about this in the next couple of slides. Finally, we also needed some preparation for the logistics, the player registration and access control. So this is an example of the flow for one experiment execution.
Before conducting any chaos engineering experiment, we need to prepare the experiment and also place the flag. For example, if I want to invite developers to figure out the abnormal behavior that outputs lots of error logs, we can pre-inject some errors and trigger the service to output some specific error logs. Then we mark these error logs as a flag and ask developers to dig into their infrastructure and services to figure out which service outputs the errors. Of course, we can also conduct experiments on the Chaos game day itself and then report the error and invite developers to do the troubleshooting.
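As an illustration of how a flag could be placed, here is a minimal sketch of a service handler with a chaos toggle that fails on purpose and writes a distinctive error log for players to find; the service name, environment variable and log text are assumptions for illustration only:

```python
# Sketch: when the chaos toggle is on, the handler fails and emits the flagged error log.
import logging
import os

logger = logging.getLogger("fridge-command-service")  # hypothetical service name

CHAOS_FLAG_ENABLED = os.getenv("CHAOS_GAMEDAY_FLAG", "false").lower() == "true"

def set_target_temperature(appliance_id: str, celsius: float) -> dict:
    if CHAOS_FLAG_ENABLED:
        # This exact message is the predefined flag players have to locate.
        logger.error("FridgeCommandRejected appliance=%s FLAG{noisy-neighbour}", appliance_id)
        return {"status": "error", "code": 500}
    # ... normal command handling would go here ...
    return {"status": "ok", "target": celsius}
```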
So on the Chaos game day for each experiment, we will release
a set of instructions for developers and then they will know
what is the abnormal behavior from the end user's perspective.
For example, if I'm using the mobile app to control my fridge,
I will report as an end user, I cannot change
the temperature for my fridge, and I got this error in my mobile
app, then developers will use everything they can on the observability platform to do the troubleshooting. Of course, considering this is an educational process, we also prepare a set of hints for each experiment. The only difference is that a team gets a higher score if it succeeds in finding the flag before the hint is released. After that, we close the experiment.
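A minimal sketch of that scoring rule could look like the following; the point values and function shape are invented for illustration and are not our actual game day scoring:

```python
# Sketch: a correct flag earns more points if it is submitted before the hint is released.
from datetime import datetime
from typing import Optional

FULL_SCORE = 100        # flag captured before the hint was released
SCORE_AFTER_HINT = 60   # flag captured after the hint was released

def score_submission(correct: bool,
                     submitted_at: datetime,
                     hint_released_at: Optional[datetime]) -> int:
    """Return the points one team earns for one flag submission."""
    if not correct:
        return 0
    if hint_released_at is None or submitted_at < hint_released_at:
        return FULL_SCORE
    return SCORE_AFTER_HINT
```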
And after the Chaos game day, we will do follow
up sessions with all the participants.
So here we would like to share a bit more about the tricks of experiment
design. Because for chaos engineering experiments,
we should always start from the hypothesis design,
or we should always consider the goal of this set of chaos
engineering experiments. And for us, for this specific Chaos game
day, the goal is to bring infrastructure knowledge and also
to enable developers to do troubleshooting using
this one observability platform. So we always
consider what kind of observability information
or metrics we can use for chaos game day experiments.
For example, we can inject error logs, we can inject
some abnormal behavior from metrics or from
some traces, et cetera. And then we will
enable developers to do the troubleshooting and at the same
time to improve their knowledge.
Secondly, it's totally fine to trigger a failure at different levels, because we would like to report the error or the abnormal behavior from the end user's perspective. No matter where the error is injected, we can always report it from the consumer's perspective: I cannot control something from my mobile app. Then it feels more natural for developers, as if they had received a ticket from the support team and need to figure out what is happening in their backend services. So it's okay to, for example, trigger a latency fault at the database or infrastructure level, or to inject some exceptions in the microservice, and then report the error from the end user's perspective.
Finally, it's a good idea to take advantage of existing frameworks. You don't need to write or implement everything on your own; there are lots of good chaos engineering frameworks out there, for example AWS Fault Injection Simulator, or LitmusChaos as an open-source option.
Both of them will be very helpful for you to conduct
experiments. So as a summary of this Chaos game day: 41 developers from twelve different teams and four countries participated, and we conducted seven experiments.
In total we received lots of submissions, 181, which was a lot and
which caused some issues for us. I will share more
later on, but the good thing is we received lots of
good and positive feedback. We even found something extra, as a surprise, in our infrastructure: we didn't inject any failures there, but with the help of the Chaos game day and the participants from different teams, we managed to find some more resilience issues in our infrastructure.
So regarding feedback, we got lots of positive feedback from
players. And here I would like to share one example that I
liked the most. There was one developer who was actually a bit upset because the team she belonged to didn't win the match: she couldn't see any logs or metrics from the mobile app, and she was kind of angry at herself, because it was their team that hadn't prioritized having them. After the
Chaos game day, the team actually started the real user monitoring
integration on Datadog, and all the other participants
also started to set up their monitors and alerts.
So Chaos game day became a nice motivation for teams to improve
their services' observability. And from the SRE team's perspective, we think it's a very good approach to shift operations responsibilities to developers, because with the help of chaos engineering experiments they gain more knowledge about their infrastructure and improve their troubleshooting capability using our observability platforms. We also checked the number of incidents before and after we started having chaos engineering game days: it's 33 percent lower. Of course, that's not only because of the effort we put into conducting chaos engineering experiments; we also improved the incident management process with the help of the Chaos game days. Now developers and also
different team leads requested to continuously conduct
chaos engineering experiments, but what is the price for that?
So we considered the feedback again, and also the effort we put into conducting the first Chaos game day, and we think there are many things that can be further improved. For example, many operations or experiments can be automated, and some of the review or checking of the submitted flags can be automated as well. We can also provide and promote a platform to conduct chaos engineering experiments, instead of organizing game days with a lot of effort spent on logistics. So how do we do it in a continuous way? We came up with the idea that chaos engineering operations can actually be integrated with platform engineering practices.
We actually developed our internal developer platform, the IDP, around two years ago, and we think chaos engineering is just one more good feature for it. So to give you a bit more
context of our IDP, this is the overall design of
the IDP we have. As the SRE team, we define a set of templates for our infrastructure, and these templates are currently implemented using Terraform. We define a set of standardized options for different resources like EKS, databases, et cetera. Then we also provide one single entry point for our developers to create and manage all the infrastructure, using Backstage.
So imagine there is a new joiner to Electrolux's IoT team and she needs to create some infrastructure for her daily tasks, for example an EKS cluster. Instead of exploring the AWS console or asking around about the configurations for different infrastructure, she can simply visit the Backstage IDP plugin, select EKS, and she will get all the recommendations and options ready for creating the EKS cluster. This is a screenshot from
our IDP and the first picture shows the
Electrolux catalog for all of our infrastructure resources.
Here you can see there is a list of resources like databases, EKS clusters, MSK clusters, et cetera. As long as developers have access to these resources, they are able to check all the details on this single platform, like the details of the infrastructure configuration. And if it is an EKS cluster, developers are also able to check, for example, the deployments in that cluster.
So in order to integrate chaos engineering operations with our IDP, the first version we implemented is a chaos engineering experiment shadowing plugin. There is a button for developers on different resources. For example, for a microservice which is deployed in a Kubernetes cluster, we provide this button, and there is a dialog for configuring chaos engineering experiments. Developers can choose the fault model and some configurations for this specific fault model, like the latency value or the type of errors, et cetera. Then developers are able to trigger these experiments in a specific environment. They are also required to document the hypotheses, because the experiment is executed by the IDP and most of the information is analyzed on the observability platform, so developers need to cross-compare the findings with the predefined hypothesis.
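To give an idea of what such a plugin dialog could submit, here is a hedged sketch of an experiment request carrying the fault model, its parameters, the target environment and the documented hypothesis; all field names and values are assumptions for illustration, not the plugin's real schema:

```python
# Sketch: the kind of payload a chaos experiment dialog might produce.
from dataclasses import dataclass, field

@dataclass
class ChaosExperimentRequest:
    target_service: str
    environment: str                   # e.g. "staging"
    fault_model: str                   # e.g. "latency" or "http-error"
    parameters: dict = field(default_factory=dict)
    hypothesis: str = ""               # expected steady state and expected impact

request = ChaosExperimentRequest(
    target_service="fridge-command-service",
    environment="staging",
    fault_model="latency",
    parameters={"latency_ms": 2000, "duration_s": 300},
    hypothesis="p99 latency stays below 3s and no 5xx responses reach the mobile app",
)
```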
This is a nice approach, but to make it really more scalable and more flexible, we think it's even better to build on existing chaos engineering frameworks and Backstage plugins, so that we can provide a tab on different resources and also richer fault models for chaos engineering experiments. This is the current plugin
we have: we use the LitmusChaos Backstage plugin together with the LitmusChaos framework version 3. In this way, we provide a multi-tenancy setup for our chaos engineering experiments, and we can also adopt fault models from different layers, like cloud provider fault models and Kubernetes fault models, et cetera.
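As a hedged sketch of how a LitmusChaos fault can be triggered programmatically, the snippet below applies a ChaosEngine custom resource with the official Kubernetes Python client; the namespace, labels and service account are placeholders, and the exact fields can differ between LitmusChaos versions:

```python
# Sketch: apply a LitmusChaos ChaosEngine (pod-delete fault) via the Kubernetes API.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

chaos_engine = {
    "apiVersion": "litmuschaos.io/v1alpha1",
    "kind": "ChaosEngine",
    "metadata": {"name": "fridge-service-pod-delete", "namespace": "staging"},
    "spec": {
        "appinfo": {"appns": "staging", "applabel": "app=fridge-command-service", "appkind": "deployment"},
        "chaosServiceAccount": "litmus-admin",
        "engineState": "active",
        "experiments": [
            {"name": "pod-delete",
             "spec": {"components": {"env": [{"name": "TOTAL_CHAOS_DURATION", "value": "30"}]}}},
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="litmuschaos.io", version="v1alpha1",
    namespace="staging", plural="chaosengines", body=chaos_engine,
)
```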
So considering the future plans, we think it's better to provide a multi-level
and automated experiments platform for
developers. For example, we can provide more fault models
based on the type of infrastructure or the type of
services. We can also improve the full feedback loop with
the help of IDP, because currently we have one single entry for
infrastructure and service management, we have
another platform for observability management,
and we can somehow automate the loop of these
chaos engineering experiments. Okay, that is the talk for today. Considering the complexity of IoT systems, chaos engineering is definitely a good approach for resiliency improvements. However, we don't want to overwhelm our developers, and we don't want to add an extra task for them to do chaos engineering. So we came up with the idea of chaos engineering gamification, and also its integration with platform engineering approaches. If you would like to go deeper into platform engineering, feel free to give us a thumbs up and we will give another talk, maybe in the near future. Thank you, and enjoy Conf42.
Bye.