Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
              Are you an SRE?
            
            
            
              A developer? A quality
            
            
            
              engineer who wants to tackle the challenge of improving reliability
            
            
            
              in your DevOps? You can enable your DevOps for reliability
            
            
            
              with ChaosNative. Create your free account at ChaosNative



              Litmus Cloud. Hi
            
            
            
              everyone, welcome to this talk on engineering reliable
            
            
            
              mobile applications. My name is Pranjal. I'm speaking to
            
            
            
              you from San Francisco, California, and I'm
            
            
            
              an SRE program manager at Google. So I've been in Google
            
            
            
              for about five years and I've worked on Android
            
            
            
              SRE, client infrastructure SRE, Firebase SRE, and also



              Gmail spam and abuse SRE. In my previous years
            
            
            
              I've co-authored two external publications,



              blameless postmortem culture. I've given many talks on
            
            
            
              the topic and this talk is going to be about an article
            
            
            
              I co-authored called Engineering Reliable Mobile Applications.
            
            
            
              Before joining Google, I was in sort of



              a jack-of-all-trades role at a company
            
            
            
              called Bright Idea, where I was fortunate to
            
            
            
              juggle a lot of different roles and that got me
            
            
            
              a lot of experience in different arenas. Besides work,
            
            
            
              I also love traveling, especially solo, and I also love
            
            
            
              exploring cuisines and different sorts of alcoholic beverages.
            
            
            
              So here's our agenda for today. I'll briefly cover the basics
            
            
            
              of traditional SRE and compare and contrast it
            
            
            
              with what it means to be SRE for mobile. I'll go deeper
            
            
            
              and talk about several distinct challenges such as
            
            
            
              scale, monitoring, control and change management.
            
            
            
              Then I'll go into high level strategies for developing resilient native mobile
            
            
            
              apps. This will be followed by three case studies,
            
            
            
              actual outages that happened with our apps, and then what we
            
            
            
              learned. In the end, I'll summarize some key takeaways,
            
            
            
              some things to keep in mind to evaluate the reliability
            
            
            
              of your current and future mobile applications.
            
            
            
              Now, I assume most of you already know this,
            
            
            
              but to set some context, if you are new to the industry,
            
            
            
              if you don't know what SRE is, think of
            
            
            
              it as a specialized job function that focuses on reliability
            
            
            
              and maintainability of large systems. It's also
            
            
            
              a mindset. The approach is that of constructive pessimism.
            
            
            
              So where you hope for the best, but you plan for the worst,
            
            
            
              because hope is not a strategy. In terms of people,



              while SRE is a distinct job role, they partner closely
            
            
            
              and collaboratively with our partner product dev teams throughout
            
            
            
              the product lifecycle. So now when I talk about availability,
            
            
            
              it's commonly assumed that we're talking about server side availability.
            
            
            
              Most SLIs have also traditionally been bound by server side availability.
            
            
            
              So, for example, two to three nines of uptime.



              But the problem with relying on server side
            
            
            
              availability as an indicator of service health is
            
            
            
              that the users perceive the reliability of a service through the devices
            
            
            
              in their hands. We have seen a number of production incidents
            
            
            
              in which server side instrumentation taken by itself
            
            
            
              would have shown no trouble. But a view inclusive of client
            
            
            
              side reflected user problems. So, for example,
            
            
            
              your serving stack is successfully returning what it thinks
            
            
            
              are perfectly valid responses. But users of your
            
            
            
              app see blank screens, or you're in a new city and
            
            
            
              you open the maps app and the app just
            
            
            
              keeps crashing. Or maybe your app receives an update.
            
            
            
              And although nothing is visibly changed in the app itself,
            
            
            
              you experience significantly worse battery life than
            
            
            
              before. So these are all issues that cannot be
            
            
            
              detected by just monitoring the servers and data centers.
            
            
            
              Now we can compare a mobile app to a distributed system that
            
            
            
              has billions of machines. So this scale is
            
            
            
              just one of the unique challenges of mobile. Things we
            
            
            
              take for granted in the server world today have become
            
            
            
              very complicated to accomplish for mobile, if not impossible for native
            
            
            
              mobile apps. And so here are just some of the challenges
            
            
            
              in terms of scale. So think of this. There are billions
            
            
            
              of devices with thousands of device models,
            
            
            
              hundreds of apps running on them, and each app
            
            
            
              has its own multiple versions. It becomes more
            
            
            
              difficult to accurately attribute



              degrading UX to unreliable network connections,
            
            
            
              service reliability or external factors.
            
            
            
              All of this combined can make for a very complex ecosystem.
            
            
            
              Now, challenge number two is monitoring. We need to tolerate
            
            
            
              potential inconsistency in the mobile world because we're relying
            
            
            
              on a piece of hardware that's beyond our control.
            
            
            
              There's very little we can do when an app is in a state in
            
            
            
              which it cannot send information back to you. So in
            
            
            
              this diverse and complex ecosystem, the task of monitoring every
            
            
            
              single metric has many possible dimensions with
            
            
            
              many possible values. So it's infeasible to monitor
            
            
            
              every combination independently. And then we must also
            
            
            
              consider the effects of logging and monitoring on the end user,
            
            
            
              given that they pay the price of resource usage.
            
            
            
              So let's say battery and network, for example. Now on
            
            
            
              the server side, we can change binaries and update configs on demand.
            
            
            
              In the mobile world, this power lies with the users.
            
            
            
              In the case of native apps, after an update is available to
            
            
            
              users, we cannot force them to download a binary or config.
            
            
            
              Users may sometimes consider upgrades as indication of poor
            
            
            
              software quality and assume that all upgrades are just bug fixes.
            
            
            
              It's also important to remember that upgrades can have a very tangible cost.
            
            
            
              So, for example, metered network usage



              to the end user. On-device storage might be constrained,
            
            
            
              and data connection may be either sparse or completely nonexistent.
            
            
            
              Now, if there's a bad change, one's immediate response
            
            
            
              is to roll it back. We can quickly roll back on the server side,
            
            
            
              and we know that we will no longer be on the bad version
            
            
            
              after the rollback is complete. On the other hand,
            
            
            
              it's quite difficult, if not impossible, to roll
            
            
            
              back a binary for a native mobile app for Android and
            
            
            
              iOS. Instead, the current standard is to roll forward
            
            
            
              and hope that affected users will upgrade to the newest version.
            
            
            
              Now, considering the scale and lack of
            
            
            
              control in the mobile environment, managing changes in
            
            
            
              a safe and reliable manner is one of the most critical
            
            
            
              pieces of managing a reliable mobile app.
            
            
            
              So in this next section, I'll talk about some concepts that are not
            
            
            
              foreign to us as SREs, but I'll put them in the context
            
            
            
              of mobile and then discuss some strategies to tackle them.
            
            
            
              Now availability is one of the most important measures of reliability,
            
            
            
              so think of a time when this happened to you. You tapped an
            
            
            
              app icon, the app was about to load and then it immediately
            
            
            
              vanished. Or a message displayed saying your application
            
            
            
              has stopped. Or you tapped a button, waited for
            
            
            
              something to load, and eventually abandoned it by clicking the back button.
            
            
            
              Now, these are all examples of your app being effectively unavailable
            
            
            
              to you. You the user interacted
            
            
            
              with the app and it did not perform the way you expected.
            
            
            
              So one way to think about mobile app reliability is its
            
            
            
              ability to be available servicing interactions
            
            
            
              consistently well relative to your expectations
            
            
            
              or the user's expectations. Now users are constantly
            
            
            
              interacting with their apps and to understand how available these
            
            
            
              apps are, we need on device client side telemetry
            
            
            
              to measure and gain visibility. Privacy and



              security are topmost concerns, so we need
            
            
            
              to keep those in mind while developing a monitoring strategy.
            
            
            
              Now, one example of gathering availability data is gathering
            
            
            
              crash reports, so when an app is crashing,
            
            
            
              it's a clear sign of unavailability. A user's
            
            
            
              experience might be interrupted by a crash dialogue.
            
            
            
              The application may close unexpectedly and you
            
            
            
              may be prompted to notify via a bug.
            
            
            
              Crashes can occur for a number of reasons, but what's important
            
            
            
              is that we log them and we monitor them. Now, who doesn't
            
            
            
              love real time metrics? The faster one is able to alert
            
            
            
              and recognize a problem, the faster one can begin investigating.
            
            
            
              Real time metrics also help us see the results of our fixes and
            
            
            
              get feedback on product changes. Most incidents
            
            
            
              have a minimum resolution time that is driven more
            
            
            
              by humans organizing themselves around problems and then determining
            
            
            
              how to fix them. For mobile, this resolution
            
            
            
              is also affected by how fast we can push
            
            
            
              the fix to devices. So most mobile
            
            
            
              experimentation and config at Google is polling oriented
            
            
            
              with devices updating themselves during opportune times of
            
            
            
              battery and bandwidth conservation.
            
            
            
              This means that after submitting a fix, it still
            
            
            
              may take several hours before client side metrics can
            
            
            
              be expected to normalize. Now telemetry for widely
            
            
            
              deployed apps is constantly arriving, so despite this
            
            
            
              latency, some devices may have picked up the fix
            
            
            
              and metrics from them will arrive shortly. One approach
            
            
            
              to design metrics derived from device telemetry is to
            
            
            
              include config state as a dimension. You can then
            
            
            
              constrain your view of the error metrics to only the
            
            
            
              devices that are using your fix. This becomes easier
            
            
            
              if client changes are being rolled out through launch experiments,
            
            
            
              and you can look at the experiment population and experiment id
            
            
            
              to monitor the effect of your changes. When developing an
            
            
            
              app, one does have the luxury of opening a
            
            
            
              debug console and then looking at fine-grained log statements to
            
            
            
              inspect app execution and state. However, when it's
            
            
            
              deployed to an end user's device, we have visibility into
            
            
            
              only what we've chosen to measure and then transport back to us.
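As an illustration of measuring only what you choose to record, and of treating config state as a dimension, here is a minimal in-memory sketch. The class, the event names ("attempt", "error"), and the config labels are all hypothetical and not taken from the talk; a real app would batch and upload these counts rather than keep them in memory.

```python
from collections import Counter


class Telemetry:
    """Hypothetical in-memory counters keyed by (event, config_version)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, event, config_version):
        # Every event is tagged with the config the device was running.
        self.counts[(event, config_version)] += 1

    def error_rate(self, config_version):
        # Error rate restricted to devices on one config version.
        attempts = self.counts[("attempt", config_version)]
        errors = self.counts[("error", config_version)]
        return errors / attempts if attempts else None


telemetry = Telemetry()

# Devices still on the bad config keep reporting errors.
for _ in range(100):
    telemetry.record("attempt", "config-v1")
for _ in range(40):
    telemetry.record("error", "config-v1")

# Devices that already picked up the fix mostly succeed.
for _ in range(50):
    telemetry.record("attempt", "config-v2")
telemetry.record("error", "config-v2")

print(telemetry.error_rate("config-v1"))
print(telemetry.error_rate("config-v2"))
```

Slicing by config version lets you see whether the fix is working on the devices that have it, even while the overall error rate is still dominated by devices that have not polled yet.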
            
            
            
              So measuring counts of attempts, errors, states or
            
            
            
              timers in the code, particularly around key entry or exit points
            
            
            
              or business logic, provides indications of the app's usage
            
            
            
              and correct functioning. Another idea could be to
            
            
            
              run UI test probes to monitor critical user journeys.
            
            
            
              This may entail launching the current version of a binary
            
            
            
              on a simulated device or emulated device,
            
            
            
              inputting actions as a user would, and then asserting that certain
            
            
            
              properties hold true throughout the test. Now, what enables
            
            
            
              your phone to be mobile in the first place?
            
            
            
              So it's the battery, the resource, which is one
            
            
            
              of the most valuable resources in a mobile device.
            
            
            
              Then there are other precious shared resources like CPU,
            
            
            
              memory, storage, network, and no
            
            
            
              application wants to misuse any of these resources and get negative
            
            
            
              reviews. Efficiency is
            
            
            
              particularly important on lower end devices and
            
            
            
              in markets where Internet or data usage is high cost.
            
            
            
              The platform as a whole suffers when shared resources are misused,
            
            
            
              and increasingly the platform needs to place restrictions
            
            
            
              to ensure common standards. So one such example is
            
            
            
              a program called Android Vitals, within the



              company, that helps attribute problems to the app itself.
            
            
            
              So our apps have sometimes been blocked from releasing for battery regressions
            
            
            
              as low as 0.1% because of negative effect
            
            
            
              on user happiness. Any expected regressions
            
            
            
              need to be released only after careful considerations of
            
            
            
              the trade offs, let's say, or the benefits a feature would provide.
            
            
            
              Now, change management is really important, so I'm going to spend a little bit
            
            
            
              longer on this slide because there's just lots to talk about.
            
            
            
              So many of these practices while releasing client
            
            
            
              apps are based on general SRE best practices.
            
            
            
              Change management is extremely important in
            
            
            
              the mobile ecosystem because a single faulty new



              version can erode users' trust and they may decide
            
            
            
              to never install or upgrade your app again.
            
            
            
              Rollbacks in the mobile world are still sort of near impossible, and



              problems found in production are irrecoverable,



              which makes proper change management very
            
            
            
              important. So the first strategy here is a
            
            
            
              staged rollout. This simply means gradually increasing the
            
            
            
              number of users the app is being released to.
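One way to picture gradually increasing the exposed population is stable hash bucketing. This is a sketch under assumptions: the function names and user IDs are hypothetical, and in practice the app store's rollout tooling usually makes this decision rather than your own code.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_rollout(user_id: str, percent: int) -> bool:
    """A user is exposed once the rollout percentage passes their bucket."""
    return rollout_bucket(user_id) < percent


users = [f"user-{i}" for i in range(10_000)]

# Widen the rollout in stages and count how many users are exposed.
for percent in (1, 10, 50, 100):
    exposed = sum(in_rollout(u, percent) for u in users)
    print(percent, exposed)
```

Because the bucket is derived from the user ID, anyone exposed at 10% stays exposed at 50%, so each stage only adds users and never flips someone back and forth between versions.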
            
            
            
              This also allows you to gather feedback on production changes
            
            
            
              gradually instead of hurting all of your users at once.
            
            
            
              Now one thing to keep in mind here is internal testing, and testing
            
            
            
              on developer devices does not provide a representative
            
            
            
              sample of the device diversity that's out there.
            
            
            
              If you have the option to release to a selected few beta users,
            
            
            
              it will help widen the pool of devices and users also.
            
            
            
              Second way to do it is experimentation. Now one issue
            
            
            
              with staged rollouts is the bias towards newest
            
            
            
              and most upgraded devices. So metrics such as latency
            
            
            
              can look great on newer devices, but then the trend becomes
            
            
            
              worse over time and then you're just wondering what happened.
            
            
            
              Folks with better networks may also upgrade
            
            
            
              earlier, which is another example of this bias. Now these
            
            
            
              factors make it difficult to accurately measure the effects of a new release.
            
            
            
              It's not a simple before and after comparison and may
            
            
            
              require some manual intervention and interpretation.
            
            
            
              A better way to release metrics or release changes, sorry,
            
            
            
              is to release via experiments and conduct A/B analysis as
            
            
            
              shown in the figure. Whenever possible,
            
            
            
              randomize the control and treatment population to ensure
            
            
            
              the same group of people are not repeatedly upgrading their apps.
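A rough sketch of randomizing control and treatment per experiment follows. It assumes a hypothetical `assign_arm` helper that salts the hash with the experiment ID, so that landing in treatment for one experiment says nothing about the next one; the experiment IDs and user IDs are made up for illustration.

```python
import hashlib


def assign_arm(user_id: str, experiment_id: str, treatment_percent: int = 50) -> str:
    """Deterministically assign a user to an arm, salted by experiment ID."""
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "treatment" if bucket < treatment_percent else "control"


users = [f"user-{i}" for i in range(10_000)]

# Treatment groups for two independent (hypothetical) experiments.
exp_a = {u for u in users if assign_arm(u, "exp-a") == "treatment"}
exp_b = {u for u in users if assign_arm(u, "exp-b") == "treatment"}

# Fraction of exp-a's treatment group that is also in exp-b's treatment group.
overlap = len(exp_a & exp_b) / len(exp_a)
print(round(overlap, 2))
```

With independent assignment the overlap should sit near one half, rather than near one as it would if the same early adopters received every new release.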
            
            
            
              Another way to do this is feature flags.



              Feature flags are another option to manage production changes.
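A flag-guarded code path might look roughly like this. The flag name, the `remote_flags` dictionary (standing in for whatever config the device last fetched), and the checkout example are all hypothetical.

```python
# Compiled-in defaults: new code ships in the binary but stays off.
DEFAULT_FLAGS = {"new_checkout_flow": False}  # hypothetical flag name


def is_enabled(flag: str, remote_flags: dict) -> bool:
    """A remotely fetched value wins; otherwise use the shipped default."""
    if flag in remote_flags:
        return bool(remote_flags[flag])
    return DEFAULT_FLAGS.get(flag, False)


def render_checkout(remote_flags: dict) -> str:
    # Both code paths live in the same binary; the flag picks one at runtime.
    if is_enabled("new_checkout_flow", remote_flags):
        return "new flow"
    return "old flow"


print(render_checkout({}))                            # no remote config yet
print(render_checkout({"new_checkout_flow": True}))   # flag ramped up
print(render_checkout({"new_checkout_flow": False}))  # flag ramped back down
```

The old path has to keep working for as long as the flag exists, since ramping the flag down is only a safe rollback if the app behaves correctly with it off.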
            
            
            
              General best practice is to release a
            
            
            
              new binary with new code and then turn the feature flag
            
            
            
              off by default. This gives more control over determining
            
            
            
              release radius and timelines. Rolling back
            
            
            
              can be as simple as ramping down the feature flag instead of
            
            
            
              building an entire binary and then releasing it again. Now this
            
            
            
              is with the assumption that rolling back your
            
            
            
              flag is not breaking your app. One quick word
            
            
            
              of caution in terms of accuracy of metrics is to
            
            
            
              take into account the effect of upgrade side effects and noise
            
            
            
              between the control and experiment population. The simple
            
            
            
              act of upgrading an app itself can introduce noise as
            
            
            
              the app restarts. One way to address this
            
            
            
              is to release a placebo binary to a control group that doesn't do
            
            
            
              anything but will also upgrade or restart,
            
            
            
              thereby giving you a more apples-to-apples comparison of metrics. This
            
            
            
              is, however, not ideal and has to be done with a lot of caution
            
            
            
              to avoid negative effects such as battery consumption or
            
            
            
              network usage. Now, even after doing all
            
            
            
              of the above and putting all of this governance in place,
            
            
            
              the power to update their apps still lies with the user.
            
            
            
              This means that as new versions keep releasing, older versions
            
            
            
              will continue to coexist. Now, there can be many reasons that
            
            
            
              people don't upgrade their apps. Some users disable auto
            
            
            
              updates, sometimes there's just not enough storage, et cetera.
            
            
            
              On the other hand, privacy and security fixes are strong motivators
            
            
            
              for upgrades. Sometimes users are shown a dialog box
            
            
            
              and forced to update. Now, as client apps grow and
            
            
            
              more and more features and functionalities are added,
            
            
            
              some things may not be doable in older versions.
            
            
            
              So, for example, if you add a new critical metric to your
            
            
            
              application in version 20, that metric will only
            
            
            
              be available in version 20 and up.
            
            
            
              It may not be sustainable for SRE to support
            
            
            
              all possible versions of a client, especially when only
            
            
            
              a small fraction of the user population is on the older versions.
            
            
            
              So a possible tradeoff is that developer teams can continue to
            
            
            
              support older versions and make business or product decisions
            
            
            
              around support timelines in general.
            
            
            
              Now, we've focused a lot on the client side so far.
            
            
            
              This slide is a quick reminder that making client changes to an
            
            
            
              app can have server side consequences as well.
            
            
            
              Now your servers, let's say, are already ready



              and are scaled to accommodate daily requests. But problems
            
            
            
              will arise when unintended consequences occur.
            
            
            
              Some examples could be time synchronization requests or
            
            
            
              a large global event causing a number of requests to your server side.
            
            
            
              Sometimes bugs in new releases can suddenly increase the
            
            
            
              number of requests and overload your servers.
            
            
            
              We'll see an example of one such outage in the coming section,
            
            
            
              which brings me to case studies.
            
            
            
              Now in this section, we address topics covered in previous sections
            
            
            
              through several concrete examples of client side issues that
            
            
            
              have happened in real life and that our teams have
            
            
            
              encountered. The slides also note practical
            
            
            
              takeaways for each case study, which you can then apply to your
            
            
            
              own apps. This first case study is
            
            
            
              from 2016, when AGSA,
            
            
            
              which means the Android Google Search app, tried to
            
            
            
              fetch the config of a new doodle from the server,
            
            
            
              but the doodle did not have its color field set.
            
            
            
              The code expected the color field to be set and this in turn
            
            
            
              resulted in the app crashing. Now the
            
            
            
              font is pretty small, but the graph shows crash rate along
            
            
            
              a time series. The reason the graph looks like that
            
            
            
              is that doodle launches happen at midnight in every
            
            
            
              time zone, resulting in crash rate spikes at regular intervals.
            
            
            
              So the higher the traffic in the region, the higher the spike.
            
            
            
              This graph actually made it easy for the team to correlate issues to
            
            
            
              a live doodle rather than an infrastructure change.
            
            
            
              Now, in terms of fixes, it took a series of steps to implement
            
            
            
              an all round concrete solution.
            
            
            
              So in order to fix the outage, first the config was
            
            
            
              fixed and released immediately, but the error did not go down right
            
            
            
              away. This was because, to avoid
            
            
            
              calling the server backend, the client code
            
            
            
              cached the doodle config for a set period of time before
            
            
            
              calling the server again for a new config. For this particular
            
            
            
              incident, this meant that the bad config was still on devices
            
            
            
              until the cache expired. Then a second
            
            
            
              client side fix was also submitted to prevent the app from crashing
            
            
            
              in this situation. But unfortunately, a similar outage
            
            
            
              happened again after a few months, and this time only
            
            
            
              older versions were affected,
            
            
            
              which did not have the fix I just described. After this
            
            
            
              outage recurrence, a third solution in the form of server
            
            
            
              side guardrails was also put in place to prevent bad
            
            
            
              config from being released in the first place.
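
The two layers of defense in this case study, a server side guardrail and a tolerant client, can be sketched as follows. This is only an illustration: the field names, the default color, and both function names are assumptions, not the actual doodle config schema or AGSA code.

```python
# Hypothetical schema: the real doodle config fields are not given in the talk.
REQUIRED_FIELDS = ("id", "image_url", "color")
DEFAULT_COLOR = "#FFFFFF"  # hypothetical fallback value


def validate_doodle_config(config):
    """Server side guardrail: return a list of problems.

    An empty list means the config is safe to release.
    """
    problems = []
    for field in REQUIRED_FIELDS:
        value = config.get(field)
        if value is None or value == "":
            problems.append(f"missing required field: {field}")
    return problems


def doodle_color(config):
    """Client side defense: fall back to a default instead of crashing."""
    color = config.get("color")
    if isinstance(color, str) and color:
        return color
    return DEFAULT_COLOR
```

The server check protects every app version, including old ones that never receive the client fix; the client fallback protects against configs that slip past the check.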
            
            
            
              So the quick key takeaways here: client side
            
            
            
              fixes don't fix everything. In particular, there are always
            
            
            
              older versions out there in the wild that never get updated.
            
            
            
              So whenever possible, it's also important to add a server side
            
            
            
              fix to increase coverage and prevent a repeat incident.
            
            
            
              It's also good to know the app's dependencies. It can
            
            
            
              help to understand what changes external to the app may be causing
            
            
            
              failures. It's also good to have clear documentation
            
            
            
              on your client's dependencies like server backends and
            
            
            
              config mechanisms.
            
            
            
              So here's another case study again from AGSA.
            
            
            
              My team was on call for AGSA when this happened
            
            
            
              in 2017. It was one of our largest outages,
            
            
            
              but a near miss of a truly disastrous outage,
            
            
            
              as it didn't affect the latest production version.
            
            
            
              Now, this outage was triggered by a simple four character change
            
            
            
              to a legacy config that was unfortunately
            
            
            
              not canaried. It was just tested on a local device,
            
            
            
              run through automated testing, committed to production, and then rolled
            
            
            
              out. Two unexpected issues occurred. First, due to
            
            
            
              a build error, the change was picked up by an old app
            
            
            
              version that could not support it, or rather multiple older app
            
            
            
              versions that could not support it. And secondly, the change was
            
            
            
              rolled out to a wider population than originally intended.
            
            
            
              Now, when the change was picked up by older versions, they started failing
            
            
            
              on startup, and once they failed, they would continue
            
            
            
              to fail by reading the cached config. Even after
            
            
            
              the config was fixed and rolled out, the app would keep crashing
            
            
            
              before it could fetch it. So this graph that you see on
            
            
            
              your screen shows daily active users of the AGSA
            
            
            
              app on just the affected version range over
            
            
            
              a two week period leading up to and then after the outage.
            
            
            
              After scrambling for days, the only recovery
            
            
            
              was manual user action where we had to request
            
            
            
              our users to upgrade by sending them a notification, which prolonged
            
            
            
              the outage and then also resulted in a really bad user experience.
            
            
            
              So lots of key takeaways on this one.
            
            
            
              Anything that causes users to not upgrade is similar to
            
            
            
              accumulating tech debt. Older apps are highly
            
            
            
              susceptible to issues, so the less successful app updates
            
            
            
              are, the larger the probability that a bigger population
            
            
            
              will get impacted by a backward incompatible change.
            
            
            
              Then, always validate before committing, that is, caching, a new config.
            
            
            
              Configs can be downloaded and successfully parsed,
            
            
            
              but an app should interpret and exercise a new version before
            
            
            
              it becomes the active one. Cached config, especially when
            
            
            
              read at startup, can make recovery difficult if it's not
            
            
            
              handled correctly. If an app caches a config,
            
            
            
              it must be capable of expiring, refreshing or even discarding
            
            
            
              the cache without needing the user to step in. And don't
            
            
            
              rely on recovery outside the app. Sending notifications,
            
            
            
              which we did in this case, has limited utility;
            
            
            
              regularly refreshing configs and having a crash
            
            
            
              recovery mechanism is better. But at the
            
            
            
              same time, similar to backups, a crash recovery mechanism
            
            
            
              is valid only if it's tested.
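
The "validate before committing" and "expire, refresh or discard" takeaways can be sketched as a small cache. This is a minimal illustration under assumptions: the class name, the `exercise` callback, and the built-in defaults are hypothetical and are not taken from AGSA.

```python
import time

DEFAULT_CONFIG = {"greeting": "hello"}  # hypothetical built-in defaults


class ConfigCache:
    """Keeps the last config that was proven usable, with an expiry."""

    def __init__(self, ttl_seconds, exercise, clock=time.time):
        self._ttl = ttl_seconds
        self._exercise = exercise  # callable that raises if the config is unusable
        self._clock = clock
        self._active = None
        self._fetched_at = None

    def offer(self, candidate):
        """Exercise a freshly downloaded config; commit it only on success."""
        try:
            self._exercise(candidate)
        except Exception:
            return False  # keep whatever was active before
        self._active = candidate
        self._fetched_at = self._clock()
        return True

    def is_stale(self):
        """True when there is nothing cached or the cached config has expired."""
        if self._fetched_at is None:
            return True
        return self._clock() - self._fetched_at >= self._ttl

    def discard(self):
        """Drop the cached config without any user action."""
        self._active = None
        self._fetched_at = None

    def get(self):
        """Return the active config, or built-in defaults if none is cached."""
        return self._active if self._active is not None else DEFAULT_CONFIG
```

The key design choice is that parsing alone never makes a config active: `offer` runs the same code path the app will use, and a failure leaves the previous config in place.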
            
            
            
              So when apps exercise crash recovery, it's still a
            
            
            
              warning sign. Crash recovery can conceal real problems.
            
            
            
              If your app would have failed were it not for the crash
            
            
            
              recovery, you're again in an unsafe condition because the crash recovery
            
            
            
              mechanism is the only thing that's preventing failure.
            
            
            
              So it's a good idea to monitor your crash recovery rates as well
            
            
            
              and treat high rates of recoveries as problems in their own right,
            
            
            
              requiring investigation and correction.
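
A crash recovery mechanism of the kind described here, together with a recovery rate check, might look like the sketch below. Everything in it is an assumption for illustration: the class and function names, the dict standing in for persistent storage, and the thresholds.

```python
class CrashLoopGuard:
    """Counts startups that never reached a healthy state."""

    def __init__(self, state, max_failed_starts=3):
        self._state = state  # dict standing in for persistent storage
        self._max = max_failed_starts

    def on_startup(self, discard_cached_config):
        """Call first thing at startup. Returns True if recovery was triggered."""
        failed = self._state.get("failed_starts", 0)
        recovered = False
        if failed >= self._max:
            # Too many startups in a row never became healthy: drop the cache.
            discard_cached_config()
            self._state["recoveries"] = self._state.get("recoveries", 0) + 1
            failed = 0
            recovered = True
        # Assume this startup fails until mark_healthy() says otherwise.
        self._state["failed_starts"] = failed + 1
        return recovered

    def mark_healthy(self):
        """Call once the app has started and used its config successfully."""
        self._state["failed_starts"] = 0


def recovery_rate_too_high(recoveries, startups, threshold=0.01):
    """Treat a high share of recovered startups as a problem in its own right."""
    return startups > 0 and recoveries / startups > threshold
```

The counter is incremented before any config is read, so a crash during config handling is still counted on the next launch.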
            
            
            
              So this last case study demonstrates how apps
            
            
            
              sometimes do things in response to inorganic phenomena,
            
            
            
              and capacity planning can go for a toss if
            
            
            
              proper throttling mechanisms are not in place.
            
            
            
              So this graph is a picture of the rate at which certain mobile apps
            
            
            
              were registering to receive messages through Firebase Cloud Messaging.
            
            
            
              This registration is usually done in the background, that is,
            
            
            
              when users install apps for the first time or when tokens
            
            
            
              need to be refreshed. This is usually a gentle curve
            
            
            
              which organically follows the world's population as they wake
            
            
            
              up. So what we noticed one day were two spikes
            
            
            
              followed by plateaus. The first plateau went back
            
            
            
              to the normal trend, but the second spike and plateau, which
            
            
            
              looked like the teeth of a comb, represented the app
            
            
            
              repeatedly exhausting its quota and then trying again.
            
            
            
              The root cause was identified as a rollout of a new version
            
            
            
              of Google Play services, and the new version kept making
            
            
            
              FCM registration calls. So the teeth of the
            
            
            
              comb that you see on the graph were from the app repeatedly exhausting
            
            
            
              its quota and then repeatedly being topped off.
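
The comb pattern is what a refilling quota looks like from the outside. As a minimal sketch, assume the quota is a token bucket kept separately for each app; the talk does not say how FCM implements its quota, so the class below and its parameters are hypothetical.

```python
class PerAppQuota:
    """Token bucket per app: one noisy app cannot drain another app's quota."""

    def __init__(self, capacity, refill_per_second):
        self._capacity = capacity
        self._rate = refill_per_second
        self._buckets = {}  # app_id -> (tokens, last_timestamp)

    def allow(self, app_id, now):
        """Return True if the request is within the app's quota at time `now`."""
        tokens, last = self._buckets.get(app_id, (self._capacity, now))
        # Top the bucket off according to the time elapsed, up to capacity.
        tokens = min(self._capacity, tokens + (now - last) * self._rate)
        if tokens >= 1:
            self._buckets[app_id] = (tokens - 1, now)
            return True
        self._buckets[app_id] = (tokens, now)
        return False
```

An app that retries as fast as it can burns each refill immediately, which produces the repeated exhaust and top-off cycle, while other apps keep their own full buckets.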
            
            
            
              FCM was working fine for other apps, thankfully,
            
            
            
              because of the per-app throttling mechanism in place,
            
            
            
              and service availability wasn't at risk. But this is an
            
            
            
              example of the consequences of scale in mobile. Our
            
            
            
              devices have limited computing power, but there are billions
            
            
            
              of them, and when they accidentally do things together,
            
            
            
              it can go downhill really fast. So the key takeaway
            
            
            
              here was: avoid upgrade-proportional load.
            
            
            
              In terms of release management and uptake rates for installation,
            
            
            
              we created a new principle: mobile apps must not make
            
            
            
              service calls during upgrade time, and releases must prove,
            
            
            
              via release regression metrics, that they haven't introduced new service calls
            
            
            
              purely as a result of the upgrade. If new service calls are
            
            
            
              being deliberately introduced, they must be enabled via
            
            
            
              an independent ramp up of an experiment and not a
            
            
            
              binary release. In terms of scale,
            
            
            
              for a developer writing and testing their code,
            
            
            
              a one-time service call feels like nothing.
            
            
            
              However, two things can change that: when the app
            
            
            
              has a very large user base, and when the service is
            
            
            
              scaled only for steady state demand of your app.
            
            
            
              If a service exists primarily to support a specific app,
            
            
            
              over time, its capacity management optimizes for a footprint that
            
            
            
              is close to the app's steady state needs.
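
Enabling a new service call through an experiment ramp-up, rather than for everyone who installs the new binary, can be sketched with a deterministic percentage gate. This is an illustration under assumptions: the function name, the flag name in the usage note, and the hashing scheme are hypothetical and not how any specific experiment framework works.

```python
import hashlib


def in_rollout(user_id, flag_name, percent):
    """Return True if this user falls inside the first `percent` of buckets.

    The bucket depends only on the flag and the user, so raising the
    percentage adds users without removing anyone already enabled.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000  # 0..9999
    return bucket < percent * 100
```

For example, the app would make the new call only when `in_rollout(user_id, "new_sync_call", percent)` is true, with `percent` served from the server and raised gradually while server load is watched.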
            
            
            
              So avoiding upgrade proportional load is valuable at large as
            
            
            
              well as smaller scales. So this brings us
            
            
            
              to the last slide of the deck. To quickly summarize
            
            
            
              this talk, a modern product stack is only reliable and
            
            
            
              supportable if it's engineered for reliable
            
            
            
              operation, all the way from server backends to the app's
            
            
            
              user interface. Mobile environments are very different from
            
            
            
              server environments and also from browser-based clients, presenting
            
            
            
              a unique set of behaviors, failure modes, and management challenges.
            
            
            
              Engineering reliability into our mobile apps is as
            
            
            
              crucial as building reliable servers.
            
            
            
              Users perceive the reliability of our products based
            
            
            
              on the totality of the system, and their mobile device is
            
            
            
              their most tangible experience of the service.
            
            
            
              So I have five key takeaways, more than my usual three,
            
            
            
              but I'll speak through all five of them. So number one is
            
            
            
              design mobile apps to be resilient to unexpected inputs,
            
            
            
              to recover from management errors, and to roll
            
            
            
              out changes in a controlled, metric driven way.
            
            
            
              Second thing is monitor the app in production by measuring
            
            
            
              critical user interactions and other key health metrics,
            
            
            
              for example, responsiveness, data freshness, and, of course,
            
            
            
              crashes. Design your success criteria to relate
            
            
            
              directly to the expectations of your users as they move through your
            
            
            
              app's critical journeys. Then, number three,
            
            
            
              release changes carefully via feature flags so that they
            
            
            
              can be evaluated using experiments and rolled back independently
            
            
            
              of binary releases. Number four,
            
            
            
              understand and prepare for the app's impact on servers,
            
            
            
              including preventing known bad behaviors.
            
            
            
              Establish development and release practices that avoid
            
            
            
              problematic feedback patterns between apps and services.
            
            
            
              And then lastly, it's also ideal
            
            
            
              to proactively create playbooks and incident management processes
            
            
            
              in anticipation of a client side outage and the impact
            
            
            
              it might have on your servers.
            
            
            
              Thank you.