Transcript
Hi
everyone, welcome to this talk on engineering reliable
mobile applications. My name is Pranjal. I'm speaking to
you from San Francisco, California, and I'm
an SRE program manager at Google. So I've been in Google
for about five years, and I've worked on Android SRE, client infrastructure SRE, Firebase SRE, and also
Gmail spam and abuse SRE. In my previous years
I've co-authored two external publications on blameless postmortem culture. I've given many talks on
the topic, and this talk is going to be about an article
I co-authored called Engineering Reliable Mobile Applications.
Before joining Google, I was in sort of
a jack-of-all-trades role at a company
called Bright Idea, where I was fortunate to
juggle a lot of different roles, and that got me
a lot of experience in different arenas. Besides work,
I also love traveling, especially solo, and I also love
exploring cuisines and different sort of alcoholic beverages.
So here's our agenda for today. I'll briefly cover the basics
of traditional SRE and compare and contrast it
with what it means to be SRE for mobile. I'll go deeper
and talk about several distinct challenges such as
scale, monitoring, control, and change management.
Then I'll go into high level strategies for developing resilient native mobile
apps. This will be followed by three case studies,
actual outages that happened with our apps, and then what we
learned. In the end, I'll summarize some key takeaways,
some things to keep in mind to evaluate the reliability
of your current and future mobile applications.
Now, I assume most of you already know this,
but to set some context, if you are new to the industry,
if you don't know what SRE is, think of
it as a specialized job function that focuses on the reliability
and maintainability of large systems. It's also
a mindset. The approach is that of constructive pessimism,
where you hope for the best but you plan for the worst,
because hope is not a strategy. In terms of people,
while SRE is a distinct job role, SREs partner closely
and collaboratively with their partner product dev teams throughout
the product lifecycle. So now, when I talk about availability,
it's commonly assumed that we're talking about server side availability.
Most SLIs have also traditionally been bound by server-side availability;
so, for example, two to three nines of uptime.
But the problem with relying on server-side
availability as an indicator of service health is
that the users perceive the reliability of a service through the devices
in their hands. We have seen a number of production incidents
in which server side instrumentation taken by itself
would have shown no trouble, but a view inclusive of the client
side reflected user problems. So, for example,
your service stack is successfully returning what it thinks
are perfectly valid responses. But users of your
app see blank screens, or you're in a new city and
you open the maps app and the app just
keeps crashing. Or maybe your app receives an update.
And although nothing is visibly changed in the app itself,
you experience significantly worse battery life than
before. So these are all issues that cannot be
detected by just monitoring the servers and data centers.
Now we can compare a mobile app to a distributed system that
has billions of machines. So this scale is
just one of the unique challenges of mobile. Things we
take for granted in the server world today have become
very complicated to accomplish for mobile, if not impossible for native
mobile apps. And so here are just a few of the challenges
in terms of scale. So think of this: there are billions
of devices with thousands of device models,
hundreds of apps running on them, and each app
has its own multiple versions. It becomes more
difficult to accurately attribute
degrading UX to unreliable network connections,
service reliability, or external factors.
All of this combined can make for a very complex ecosystem.
Now, challenge number two is monitoring. We need to tolerate
potential inconsistency in the mobile world because we're relying
on a piece of hardware that's beyond our control.
There's very little we can do when an app is in a state in
which it cannot send information back to you. So in
this diverse and complex ecosystem, the task of monitoring every
single metric has many possible dimensions with
many possible values. So it's infeasible to monitor
every combination independently. And then we must also
consider the effects of logging and monitoring on the end user,
given that they pay the price of resource usage.
So let's say battery and network, for example. Now, on
the server side we can change binaries and update configs on demand.
In the mobile world, this power lies with the users.
In the case of native apps, after an update is available to
users, we cannot force them to download a binary or config.
Users may sometimes consider upgrades as an indication of poor
software quality and assume that all upgrades are just bug fixes.
It's also important to remember that upgrades can have a very tangible cost.
So, for example, metered network usage
to the end user; on-device storage might be constrained,
and data connections may be either sparse or completely nonexistent.
Now, if there's a bad change, one's immediate response
is to roll it back. We can quickly roll back on the server side,
and we know that we will no longer be on the bad version
after the rollback is complete. On the other hand,
it's quite difficult, if not impossible, to roll
back a binary for a native mobile app for Android and
iOS. Instead, the current standard is to roll forward
and hope that affected users will upgrade to the newest version.
Now, considering the scale and the lack of
control in the mobile environment, managing changes in
a safe and reliable manner is one of the most critical
pieces of managing a reliable mobile app.
So in this next section, I'll talk about some concepts that are not
foreign to us as SREs, but I'll put them in the context
of mobile and then discuss some strategies to tackle them.
Now availability is one of the most important measures of reliability,
so think of a time when this happened to you. You tapped an
app icon, the app was about to load and then it immediately
vanished. Or a message displayed saying your application
has stopped. Or you tapped a button, waited for
something to load, and eventually abandoned it by clicking the back button.
Now, these are all examples of your app being effectively unavailable
to you. You, the user, interacted
with the app and it did not perform the way you expected.
So one way to think about mobile app reliability is its
ability to be available, servicing interactions
consistently well relative to the user's expectations. Now, users are constantly
interacting with their apps, and to understand how available these
apps are, we need on-device, client-side telemetry
to measure and gain visibility. Privacy and
security are topmost concerns, so we need
to keep those in mind while developing a monitoring strategy.
Now, one example of gathering availability data is gathering
crash reports. When an app is crashing,
it's a clear sign of unavailability. A user's
experience might be interrupted by a crash dialog,
the application may close unexpectedly, and you
may be prompted to notify the developers via a bug report.
Crashes can occur for a number of reasons, but what's important
is that we log them and we monitor them.
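To make that idea concrete, here is a minimal Kotlin sketch of wrapping the default uncaught-exception handler so every crash is persisted as an unavailability event before the process dies. The `CrashLog` sink and the log file are assumptions standing in for whatever crash-reporting pipeline an app actually uses (a real Android app would write under its private data directory and upload on the next start).

```kotlin
import java.io.File
import java.time.Instant

// Hypothetical sink standing in for a real crash-reporting pipeline.
object CrashLog {
    private val file = File("crash_events.log")

    fun record(thread: Thread, error: Throwable) {
        // Append a minimal, privacy-conscious record: when, where, and what.
        file.appendText(
            "${Instant.now()} thread=${thread.name} " +
                "type=${error::class.java.simpleName} msg=${error.message}\n"
        )
    }
}

fun installCrashLogger() {
    val previous = Thread.getDefaultUncaughtExceptionHandler()
    Thread.setDefaultUncaughtExceptionHandler { thread, error ->
        CrashLog.record(thread, error)             // log the crash as an unavailability event
        previous?.uncaughtException(thread, error) // then let the platform handle it as usual
    }
}
```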
Now, who doesn't love real-time metrics? The faster one is able to alert
and recognize a problem, the faster one can begin investigating.
Real-time metrics also help us see the results of our fixes and
get feedback on product changes. Most incidents
have a minimum resolution time that is driven more
by humans organizing themselves around problems and then determining
how to fix them. For mobile, this resolution
is also affected by how fast we can push
the fix to devices. So most mobile
experimentation and config at Google is polling oriented,
with devices updating themselves during opportune times of
battery and bandwidth conservation.
This means that after submitting a fix, it still
may take several hours before client side metrics can
be expected to normalize. Now telemetry for widely
deployed apps is constantly arriving, so despite this
latency, some devices may have picked up the fix
and metrics from them will arrive shortly. One approach
to designing metrics derived from device telemetry is to
include config state as a dimension. You can then
constrain your view of the error metrics to only the
devices that are using your fix. This becomes easier
if client changes are being rolled out through launch experiments,
and you can look at the experiment population and experiment ID
to monitor the effect of your changes.
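As a sketch of what "config state as a dimension" could look like in practice (the `MetricEvent` type and its field names are invented for illustration, not an actual telemetry schema), each event the client reports carries the active config version and experiment ID, so dashboards can be constrained to devices that have already picked up the fix.

```kotlin
// Hypothetical telemetry event: every metric the client reports carries the
// config/experiment state it was produced under, so the backend can slice on it.
data class MetricEvent(
    val name: String,          // e.g. "doodle_render_error"
    val count: Long,
    val appVersion: String,
    val configVersion: String, // dimension: which config the device had applied
    val experimentId: String?  // dimension: which experiment arm, if any
)

fun reportError(name: String, configVersion: String, experimentId: String?) {
    val event = MetricEvent(name, count = 1, appVersion = "12.3.4",
        configVersion = configVersion, experimentId = experimentId)
    // A real app would queue this and upload it in a battery-friendly batch.
    println("queueing $event")
}

fun main() {
    // A dashboard query could then constrain to configVersion >= "2024.02" only.
    reportError("doodle_render_error", configVersion = "2024.02", experimentId = "exp_4711")
}
```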
When developing an app, one does have the luxury of opening a
debug console and then looking at fine-grained log statements to
inspect app execution and state. However, when it's
deployed to an end user's device, we have visibility into
only what we've chosen to measure and then transport back to us.
So measuring counts of attempts, errors, states or
timers in the code, particularly around key entry or exit points
or business logic, provides indications of the app's usage
and correct functioning. Another idea could be to
run UI test probes to monitor critical user journeys.
This may entail launching the current version of a binary
on a simulated device or emulated device,
inputting actions as a user would, and then asserting that certain
properties hold true throughout the test.
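Here is a hedged sketch of that measurement idea, with a made-up `Telemetry` helper: wrap a critical user journey so its attempts, errors, and latency all show up as client-side metrics. Availability for the journey can then be read as the success count over the attempt count.

```kotlin
// Hypothetical in-memory counters; a real app would flush these to its telemetry pipeline.
object Telemetry {
    private val counters = mutableMapOf<String, Long>()

    fun increment(name: String) {
        counters.merge(name, 1L, Long::plus)
    }

    fun recordLatencyMs(name: String, millis: Long) {
        println("$name took ${millis}ms")
    }

    fun snapshot(): Map<String, Long> = counters.toMap()
}

// Wrap a critical user journey (say, issuing a search) with attempt/error counts and a timer.
fun <T> measuredJourney(name: String, block: () -> T): T {
    Telemetry.increment("$name.attempt")
    val start = System.currentTimeMillis()
    try {
        val result = block()
        Telemetry.recordLatencyMs("$name.latency", System.currentTimeMillis() - start)
        Telemetry.increment("$name.success")
        return result
    } catch (e: Exception) {
        Telemetry.increment("$name.error")
        throw e
    }
}

fun main() {
    measuredJourney("search") { "results for: sre" }   // success path
    println(Telemetry.snapshot())                      // e.g. {search.attempt=1, search.success=1}
}
```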
Now, what enables your phone to be mobile in the first place?
It's the battery, which is one
of the most valuable resources in a mobile device.
Then there are other precious shared resources like CPU,
memory, storage, and network, and no
application wants to misuse any of these resources and get negative
reviews. Efficiency is
particularly important on lower end devices and
in markets where Internet or data usage is high cost.
The platform as a whole suffers when shared resources are misused,
and increasingly the platform needs to place restrictions
to ensure common standards. One such example is
a program called Android Vitals, which within the
company helps attribute problems to the app itself.
So our apps have sometimes been blocked from releasing for battery regressions
as low as 0.1% because of negative effect
on user happiness. Any expected regressions
need to be released only after careful considerations of
the trade offs, let's say, or the benefits a feature would provide.
Now, change management is really important, so I'm going to spend a little bit
longer on this slide because there's just lots to talk about.
So many of these practices for releasing client
apps are based on general SRE best practices.
Change management is extremely important in
the mobile ecosystem because a single new faulty
version can erode users' trust, and they may decide
to never install or upgrade your app again.
Rollbacks in the mobile world are still sort of near impossible, and
problems found in production are irrecoverable,
which makes proper change management very
important. So the first strategy here is a
staged rollout. This simply means gradually increasing the
number of users the app is being released to.
This also allows you to gather feedback on production changes
gradually instead of hurting all of your users at once.
Now, one thing to keep in mind here is that internal testing and testing
on developer devices do not provide a representative
sample of the device diversity that's out there.
If you have the option to release to a selected few beta users,
it will help widen the pool of devices and users also.
The second way to do it is experimentation. Now, one issue
with staged rollouts is the bias towards the newest
and most upgraded devices. So metrics such as latency
can look great on newer devices, but then the trend becomes
worse over time and then you're just wondering what happened.
Folks with better networks may also upgrade
earlier, which is another example of this bias. Now these
factors make it difficult to accurately measure the effects of a new release.
It's not a simple before and after comparison and may
require some manual intervention and interpretation.
A better way to release changes
is to release via experiments and conduct A/B analysis as
shown in the figure. Whenever possible,
randomize the control and treatment populations to ensure
the same group of people are not repeatedly upgrading their apps.
Another way to do this is feature flags,
which are another option for managing production changes.
The general best practice is to release a
new binary with the new code behind a feature flag that is
off by default. This gives more control over determining the
release radius and timelines. Rolling back
can be as simple as ramping down the feature flag instead of
building an entire binary and then releasing it again. Now, this
is with the assumption that rolling back your
flag is not breaking your app.
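Here is a minimal sketch of that pattern, assuming a hypothetical `RemoteFlags` store that the app refreshes from the server during opportune times; nothing about it is specific to any real flag system. The key properties are that the flag defaults to off and the old code path stays intact, so ramping the flag back down is the rollback.

```kotlin
// Hypothetical flag store, refreshed periodically from the server,
// with safe defaults when nothing has been fetched yet.
object RemoteFlags {
    private val flags = mutableMapOf<String, Boolean>()

    fun isEnabled(name: String, default: Boolean = false): Boolean =
        flags.getOrDefault(name, default)

    fun update(fetched: Map<String, Boolean>) {
        flags.putAll(fetched)
    }
}

fun renderHomeScreen() {
    // New code ships in the binary but stays dark until the flag is ramped up.
    if (RemoteFlags.isEnabled("new_home_screen")) {
        renderNewHomeScreen()
    } else {
        renderLegacyHomeScreen() // rollback = ramping the flag back down
    }
}

fun renderNewHomeScreen() = println("new home screen")
fun renderLegacyHomeScreen() = println("legacy home screen")
```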
One quick word of caution in terms of accuracy of metrics: take into
account upgrade side effects and noise
between the control and experiment populations. The simple
act of upgrading an app itself can introduce noise as
the app restarts. One way to address this
is to release a placebo binary to a control group that doesn't do
anything new but will also upgrade or restart,
thereby giving you a more apples-to-apples comparison of metrics. This
is, however, not ideal and has to be done with a lot of caution
to avoid negative effects such as battery consumption or
network usage. Now, even after doing all
of the above and putting all of this governance in place,
the power to update their apps still lies with the user.
This means that as new versions keep being released, older versions
will continue to coexist. Now, there can be many reasons that
people don't upgrade their apps. Some users disable auto
updates, sometimes there's just not enough storage, et cetera.
On the other hand, privacy and security fixes are strong motivators
for upgrades. Sometimes users are shown a dialog box
and forced to update. Now, as client apps grow and
more and more features and functionalities are added,
some things may not be doable in older versions.
So, for example, if you add a new critical metric to your
application in version 20, that metric will only
be available in version 20 and up.
It may not be sustainable for SRE to support
all possible versions of a client, especially when only
a small fraction of the user population is on the older versions.
So a possible tradeoff is that developer teams can continue to
support older versions and make business or product decisions
around the support timeline in general.
Now, we've focused a lot on the client side so far.
This slide is a quick reminder that making client changes to an
app can have server side consequences as well.
Now, your servers, let's say, are already
scaled to accommodate daily requests. But problems
will arise when unintended consequences occur.
Some examples could be time-synchronization requests or
a large global event causing a large number of requests to your server side.
Sometimes bugs in new releases can suddenly increase the
number of requests and overload your servers.
We'll see an example of one such outage in the coming section,
which brings me to case studies.
Now in this section, we address topics covered in previous sections
through several concrete examples of client side issues that
have happened in real life and that our teams have
encountered. The slides also note practical
takeaways for each case study, which you can then apply to your
own apps. This first case study is
from 2016, when AGSA,
which stands for the Android Google Search App, tried to
fetch the config of a new doodle from the server,
but the doodle did not have its color field set.
The code expected the color field to be set and this in turn
resulted in the app crashing. Now the
font is pretty small, but the graph shows crash rate along
a time series. The reason the graph looks like that
is because doodle launches happen at midnight in every
time zone, resulting in crash-rate spikes at regular intervals.
So the higher the traffic in the region, the higher was the spike.
This graph actually made it easy for the team to correlate issues to
a live doodle launch versus an infrastructure change.
Now, in terms of fixes, it took a series of steps to implement
an all-round concrete solution.
So, in order to fix the outage, first the config was
fixed and released immediately, but the errors did not go down right
away. This was because, to avoid
calling the server backend too often, the client code
cached the doodle config for a set period of time before
calling the server again for a new config. For this particular
incident, this meant that the bad config was still on devices
until the cache expired. Then a second,
client-side fix was also submitted to prevent the app from crashing
in this situation. But unfortunately, a similar outage
happened again after a few months, and this time only
older versions were affected,
which did not have the fix I just described. After this
outage reoccurrence, a third solution in the form of server-side
guardrails was also put in place to prevent bad
config from being released in the first place.
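The client-side lesson reduces to not trusting a config blindly. As a hypothetical sketch (the `DoodleConfig` fields are invented, not AGSA's real schema), a missing optional field falls back to a default, and an unusable config is rejected so the caller keeps the previous good one instead of crashing.

```kotlin
// Invented doodle config for illustration; the real schema differs.
data class DoodleConfig(val imageUrl: String, val color: String)

const val DEFAULT_COLOR = "#FFFFFF"

// Parse defensively: a missing color falls back to a default, and a config that is
// unusable overall is rejected (so the caller keeps the previous good one).
fun parseDoodleConfig(raw: Map<String, String?>): DoodleConfig? {
    val imageUrl = raw["image_url"] ?: return null   // unusable without an image
    val color = raw["color"] ?: DEFAULT_COLOR        // optional field: degrade gracefully
    return DoodleConfig(imageUrl, color)
}

fun main() {
    // The 2016 outage case: the server sends a doodle config with no color set.
    val broken = mapOf("image_url" to "https://example.com/doodle.png", "color" to null)
    println(parseDoodleConfig(broken)) // renders with the default color instead of crashing
}
```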
So quick key takeaways here: client-side
fixes don't fix everything. In particular, there are always
older versions out there in the wild that never get updated.
So whenever possible, it's also important to add a server-side
fix to increase coverage and prevent a repeat incident.
And it's also good to know the app's dependencies. It can
help to understand what changes external to the app may be causing
failures. It's also good to have clear documentation
on your client's dependencies like server backends and
config mechanisms.
So here's another case study, again from AGSA,
that happened while my team was on call for AGSA
in 2017. It was one of our largest outages,
but a near miss of a truly disastrous outage,
as it didn't affect the latest production version.
Now, this outage was triggered by a simple four character change
to a legacy config that was unfortunately
not canaried. It was just tested on a local device,
run through automated testing, committed to production, and then rolled
out. So two unexpected issues occurred due to
a build error. The change was picked up by an old app
version that could not support it, or rather multiple older app
versions that could not support it. And secondly, the change was
rolled out to a wider population than originally intended.
Now, when the change was picked up by older versions, they started failing
on startup, and once they failed, they would continue
to fail by reading the cached config. Even after
the config was fixed and rolled out, the app would keep crashing
before it could fetch it. So this graph that you see on
your screen shows daily active users of the AGSA
app on just the affected version range over
a two week period leading up to and then after the outage.
After scrambling for days, the only recovery
was manual user action, where we had to request
our users to upgrade via sending them a notification, which prolonged
the outage and then also resulted in really bad user experience.
So lots of key takeaways on this one.
Anything that causes users to not upgrade is similar to
accumulating tech debt. Older apps are highly
susceptible to issues, so the less successful app updates
are, the larger the probability that a bigger population
will get impacted by a backward incompatible change.
Then, always validate before committing, that is, caching, a new config.
Configs can be downloaded and successfully parsed,
but an app should interpret and exercise a new version before
it becomes the active one. Cached config, especially when
read at startup, can make recovery difficult if it's not
handled correctly. If an app caches a config,
it must be capable of expiring, refreshing, or even discarding
the cache without needing the user to step in. And don't
rely on recovery outside the app. Sending notifications,
which we did in this case, has limited utility;
instead, regularly refreshing configs and having a crash
recovery mechanism is better. But at the
same time, similar to backups, a crash recovery mechanism
is valid only if it's tested.
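As a minimal sketch of validate-before-commit, with an invented `ConfigCache` type: a candidate config is parsed and exercised first, and only then does it replace the cached config that startup reads, so a bad config never becomes the active one and the cache can always be discarded without user action.

```kotlin
// Hypothetical cached-config flow: the cache that startup reads is only ever
// overwritten by a config that has already been parsed and exercised.
class ConfigCache {
    private var active: String? = null           // last known-good config
    fun activeConfig(): String? = active

    fun tryCommit(candidate: String, validate: (String) -> Boolean): Boolean {
        return if (validate(candidate)) {
            active = candidate                   // commit only after validation passes
            true
        } else {
            false                                // keep the previous good config
        }
    }

    fun discard() { active = null }              // escape hatch: recover without user action
}

fun main() {
    val cache = ConfigCache()
    val looksUsable = { cfg: String -> cfg.contains("color=") } // stand-in for real checks
    cache.tryCommit("doodle: color=blue", looksUsable)          // accepted
    cache.tryCommit("doodle: <garbled>", looksUsable)           // rejected, old config stays
    println(cache.activeConfig())                               // -> doodle: color=blue
}
```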
So when apps exercise crash recovery, it's still a
warning sign. Crash recovery can conceal real problems.
If your app would have failed if it was not for the crash
recovery, you're again in an unsafe condition, because the crash recovery
mechanism is the only thing that's preventing failure.
So it's a good idea to also monitor your crash recovery rates
and treat high rates of recoveries as problems in their own right,
requiring investigation and correlation.
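One illustrative way to do that (the names here are invented): count every use of the recovery path like any other error signal, and treat an unusually high recovery rate as something to alert on rather than a silent save.

```kotlin
// Illustrative recovery-rate check: crash recovery events are counted like errors,
// and a high recovery rate is itself treated as an incident signal.
class RecoveryMonitor(private val alertThreshold: Double = 0.01) {
    private var appStarts = 0L
    private var recoveries = 0L

    fun onAppStart() { appStarts++ }

    fun onCrashRecoveryUsed() {
        recoveries++
        if (appStarts > 0 && recoveries.toDouble() / appStarts > alertThreshold) {
            // In production this would feed an alerting pipeline, not a println.
            println("WARNING: crash recovery rate $recoveries/$appStarts exceeds threshold")
        }
    }
}
```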
So this last case study demonstrates how apps
sometimes do things in response to inorganic phenomena,
and capacity planning can go for a toss if
proper throttling mechanisms are not in place.
So this graph is a picture of the rate at which certain mobile apps
were registering to receive messages through firebase cloud messaging.
This registration is usually done in the background, that is,
when users install apps for the first time or when tokens
need to be refreshed. This is usually a gentle curve
which organically follows the world's population as they wake
up. So what we noticed one day were two spikes
followed by plateaus. The first plateau went back
to the normal trend, but the second spike and plateau, which
looked like the teeth of a comb, represented the app
repeatedly exhausting its quota and then trying again.
The root cause was identified as a rollout of a new version
of Google Play services, and the new version kept making
FCM registration calls. So the teeth of the
comb that you see on the graph were from the app repeatedly exhausting
its quota and then repeatedly being topped off.
FCM was working fine for other apps, thankfully,
because of the per-app throttling mechanism in place,
and service availability wasn't at risk. But this is an
example of the consequences of scale in mobile: our
devices have limited computing power, but there are billions
of them, and when they accidentally do things together,
it can go downhill really fast. So the key takeaway
here was: avoid upgrade-proportional load.
In terms of release management and uptake rates for installations,
we created a new principle: mobile apps must not make
service calls during upgrade time, and releases must prove,
via release regression metrics, that they haven't introduced new service calls
purely as a result of the upgrade. If new service calls are
being deliberately introduced, they must be enabled via
an independent ramp-up of an experiment and not a
binary release.
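As a sketch of that principle with invented names (a real Android app would typically hook the package-replaced broadcast): an upgrade by itself triggers no service call, a registration call happens only when the cached token is genuinely missing, and even then it is spread out with random jitter so billions of devices don't hit the backend together.

```kotlin
import kotlin.random.Random

// Invented sketch of avoiding upgrade-proportional load.
class RegistrationManager(private val register: () -> String) {
    private var cachedToken: String? = null

    fun onAppUpgraded() {
        // Deliberately nothing here: upgrades must not generate traffic by themselves.
    }

    fun tokenForMessaging(): String {
        cachedToken?.let { return it }               // steady state: reuse what we have
        Thread.sleep(Random.nextLong(0, 5_000))      // jitter so devices don't sync up
        return register().also { cachedToken = it }  // one genuine call, only when needed
    }
}

fun main() {
    val manager = RegistrationManager(register = { "token-${Random.nextInt()}" })
    manager.onAppUpgraded()                 // no traffic
    println(manager.tokenForMessaging())    // registers once, then caches
    println(manager.tokenForMessaging())    // cached, no second call
}
```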
In terms of scale, for a developer writing and testing their code,
a one-time service call feels like nothing.
However, two things can change that: when the app
has a very large user base, and when the service is
scaled only for the steady-state demand of your app.
If a service exists primarily to support a specific app,
over time, its capacity management optimizes for a footprint that
is close to the app's steady state needs.
So avoiding upgrade proportional load is valuable at large as
well as smaller scales. So this brings us
to the last slide of the deck. To quickly summarize
this talk, a modern product stack is only reliable and
supportable if it's engineered for reliable
operation, all the way from server backends to the app's
user interface. Mobile environments are very different from
server environments and also from browser-based clients, presenting
a unique set of behaviors, failure modes, and management challenges.
Engineering reliability into our mobile apps is as
crucial as building reliable server-side systems.
Users perceive the reliability of our products based
on the totality of the system, and their mobile device is
their most tangible experience of the service.
So I have five key takeaways, more than my usual three,
but I'll speak through all five of them. So number one is
design mobile apps to be resilient to unexpected inputs,
to recover from management errors, and to roll
out changes in a controlled, metric driven way.
Second thing is monitor the app in production by measuring
critical user interactions and other key health metrics,
for example, responsiveness, data freshness, and, of course,
crashes. Design your success criteria to relate
directly to the expectations of your users as they move through your
app's critical journeys. Then, number three,
release changes carefully behind feature flags so that they
can be evaluated using experiments and rolled back independently
of binary releases. Number four,
understand and prepare for the app's impact on servers,
including preventing known bad behaviors.
Establish development and release practices that avoid
problematic feedback patterns between apps and services.
And then lastly, it's also ideal
to proactively create playbooks and incident management processes
in anticipation of a client side outage and the impact
it might have on your servers.
Thank you.