Conf42 Site Reliability Engineering 2021 - Online

Engineering Reliable Mobile Applications

Abstract

Why Mobile and SRE (Site Reliability Engineering)?

  • Mobile is nonuniform and uncontrollable
  • SRE responsibilities differ in critical ways from infrastructure or server-side application reliability engineering
  • Focus on where your end users are (i.e. today many users access services through mobile applications)
  • Specific challenges of mobile: monitoring, release management, incident management

Case studies

  • Doodle outage, etc.

Future of SRE for Mobile

  • Visibility into mobile application performance
  • Find and react to issues before the user does
  • Measure SLIs throughout the product stack (from client to service)

Summary

  • Pranjal is an SRE program manager at Google. This talk is about an article he co-authored called Engineering Reliable Mobile Applications. Besides work, Pranjal also loves traveling, especially solo.
  • I'll cover the basics of traditional SRE and compare and contrast it with what it means to be SRE for mobile. Then I'll go into high level strategies for developing resilient native mobile apps. In the end I'll summarize some key takeaways to evaluate the reliability of your current and future mobile applications.
  • A mobile app is a distributed system that has billions of machines. Things we take for granted in the server world are difficult to accomplish for mobile, if not impossible for native mobile apps. Managing changes in a safe and reliable manner is one of the most critical pieces of managing a reliable mobile app.
  • Availability is one of the most important measures of reliability. To understand how available these apps are, we need on-device, client-side telemetry. Real-time metrics also help us see the results of our fixes and get feedback on product changes.
  • There are other precious shared resources like CPU, memory, storage, and network. No application wants to misuse any of these resources and get negative reviews. Efficiency is particularly important on lower-end devices and in markets where Internet or data usage is high cost. The platform needs to place restrictions to ensure common standards.
  • Change management is extremely important in the mobile ecosystem. A single faulty new version can erode users' trust, and they may decide to never install or upgrade your app again. The first strategy here is a staged rollout; the second is experimentation.
  • The power to update apps still lies with the user. This means that as new versions keep releasing, older versions will continue to coexist. Making client changes to an app can have server side consequences as well.
  • In this section, we address topics covered in previous sections through several concrete examples of client side issues. The slides also note practical takeaways for each case study, which you can then apply to your own apps.
  • Mobile apps must not make service calls at upgrade time, and releases must prove that they haven't introduced new service calls purely as a result of the upgrade. Avoid upgrade-proportional load in terms of release management and uptake rates for installation.
  • A modern product stack is only reliable and supportable if it's engineered for reliable operation, all the way from server backends to the app's user interface. Engineering reliability into our mobile apps is as crucial as building reliable server-side systems.

Transcript

Are you an SRE? A developer? A quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud. Hi everyone, welcome to this talk on engineering reliable mobile applications. My name is Pranjal. I'm speaking to you from San Francisco, California, and I'm an SRE program manager at Google. I've been at Google for about five years, and I've worked on Android SRE, client infrastructure SRE, Firebase SRE, and also Gmail spam and abuse SRE. In previous years I've co-authored two external publications. One is on blameless postmortem culture, and I've given many talks on that topic. This talk is going to be about the other article I co-authored, called Engineering Reliable Mobile Applications. Before joining Google, I had a jack-of-all-trades sort of role at a company called Bright Idea, where I was fortunate to juggle a lot of different roles, and that gave me experience in a lot of different arenas. Besides work, I also love traveling, especially solo, and I love exploring cuisines and different sorts of alcoholic beverages. So here's our agenda for today. I'll briefly cover the basics of traditional SRE and compare and contrast it with what it means to be SRE for mobile. I'll go deeper and talk about several distinct challenges, such as scale, monitoring, control, and change management. Then I'll go into high-level strategies for developing resilient native mobile apps. This will be followed by three case studies, actual outages that happened with our apps and what we learned from them. At the end, I'll summarize some key takeaways, some things to keep in mind to evaluate the reliability of your current and future mobile applications. Now, I assume most of you already know this, but to set some context, if you're new to the industry and you don't know what SRE is, think of it as a specialized job function that focuses on the reliability and maintainability of large systems. It's also a mindset. The approach is one of constructive pessimism: you hope for the best, but you plan for the worst, because hope is not a strategy. In terms of people, while SRE is a distinct job role, SREs partner closely and collaboratively with their product development teams throughout the product lifecycle. Now, when I talk about availability, it's commonly assumed that we're talking about server-side availability. Most SLIs have also traditionally been bound to server-side availability, for example two to three nines of uptime. But the problem with relying on server-side availability as an indicator of service health is that users perceive the reliability of a service through the devices in their hands. We have seen a number of production incidents in which server-side instrumentation, taken by itself, would have shown no trouble, but a view inclusive of the client side reflected user problems. For example, your service stack is successfully returning what it thinks are perfectly valid responses, but users of your app see blank screens. Or you're in a new city and you open the maps app and it just keeps crashing. Or maybe your app receives an update, and although nothing has visibly changed in the app itself, you experience significantly worse battery life than before. These are all issues that cannot be detected by just monitoring the servers and data centers. Now, we can compare a mobile app to a distributed system that has billions of machines.
So this scale is just one of the unique challenges of mobile. Things we take for granted in the server world have become very complicated to accomplish for mobile, if not impossible for native mobile apps. Here are just a few of the challenges in terms of scale. Think of this: there are billions of devices with thousands of device models, hundreds of apps running on them, and each app has multiple versions of its own. It becomes more difficult to accurately attribute degraded UX to unreliable network connections, service reliability, or external factors. All of this combined makes for a very complex ecosystem. Challenge number two is monitoring. We need to tolerate potential inconsistency in the mobile world because we're relying on a piece of hardware that's beyond our control. There's very little we can do when an app is in a state in which it cannot send information back to you. In this diverse and complex ecosystem, every single metric has many possible dimensions with many possible values, so it's infeasible to monitor every combination independently. And then we must also consider the effects of logging and monitoring on the end user, given that they pay the price of resource usage, battery and network, for example. Now, on the server side we can change binaries and update configs on demand. In the mobile world, this power lies with the users. In the case of native apps, after an update is available to users, we cannot force them to download a binary or config. Users may sometimes consider upgrades an indication of poor software quality and assume that all upgrades are just bug fixes. It's also important to remember that upgrades can have a very tangible cost. For example, metered network usage costs the end user, on-device storage might be constrained, and the data connection may be either sparse or completely nonexistent. Now, if there's a bad change, one's immediate response is to roll it back. We can quickly roll back server side, and we know that we will no longer be on the bad version after the rollback is complete. On the other hand, it's quite difficult, if not impossible, to roll back a binary for a native mobile app on Android and iOS. Instead, the current standard is to roll forward and hope that affected users will upgrade to the newest version. Now, considering the scale and lack of control in the mobile environment, managing changes in a safe and reliable manner is one of the most critical pieces of managing a reliable mobile app. So in this next section, I'll talk about some concepts that are not foreign to us as SREs, but I'll put them in the context of mobile and then discuss some strategies to tackle them. Availability is one of the most important measures of reliability, so think of a time when this happened to you. You tapped an app icon, the app was about to load, and then it immediately vanished. Or a message displayed saying your application has stopped. Or you tapped a button, waited for something to load, and eventually abandoned it by clicking the back button. These are all examples of your app being effectively unavailable to you. You, the user, interacted with the app and it did not perform the way you expected. So one way to think about mobile app reliability is the app's ability to be available, servicing interactions consistently well relative to the user's expectations.
Now, users are constantly interacting with their apps, and to understand how available these apps are, we need on-device, client-side telemetry to measure and gain visibility. Privacy and security are topmost concerns, so we need to keep those in mind while developing a monitoring strategy. One example of gathering availability data is gathering crash reports. When an app is crashing, it's a clear sign of unavailability. A user's experience might be interrupted by a crash dialog, the application may close unexpectedly, and you may be prompted to report it via a bug. Crashes can occur for a number of reasons, but what's important is that we log them and we monitor them. Now, who doesn't love real-time metrics? The faster one is able to alert on and recognize a problem, the faster one can begin investigating. Real-time metrics also help us see the results of our fixes and get feedback on product changes. Most incidents have a minimum resolution time that is driven more by humans organizing themselves around problems and then determining how to fix them. For mobile, this resolution is also affected by how fast we can push the fix to devices. Most mobile experimentation and config at Google is polling oriented, with devices updating themselves during opportune times for battery and bandwidth conservation. This means that after submitting a fix, it may still take several hours before client-side metrics can be expected to normalize. Telemetry for widely deployed apps is constantly arriving, so despite this latency, some devices may have picked up the fix, and metrics from them will arrive shortly. One approach when designing metrics derived from client telemetry is to include config state as a dimension. You can then constrain your view of the error metrics to only the devices that are using your fix. This becomes easier if client changes are being rolled out through launch experiments, and you can look at the experiment population and experiment ID to monitor the effect of your changes. When developing an app, one has the luxury of opening a debug console and looking at fine-grained log statements to inspect app execution and state. However, when it's deployed to an end user's device, we have visibility into only what we've chosen to measure and transport back to us. So measuring counts of attempts, errors, states, or timers in the code, particularly around key entry or exit points or business logic, provides indications of the app's usage and correct functioning. Another idea is to run UI test probes to monitor critical user journeys. This may entail launching the current version of a binary on a simulated or emulated device, inputting actions as a user would, and then asserting that certain properties hold true throughout the test.
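To make the instrumentation idea concrete, here is a minimal Kotlin sketch of counting attempts, errors, and latency around a critical user journey, tagged with config state and experiment ID as dimensions so you can look at only the devices that picked up a fix. The `Telemetry` object, metric names, and dimension keys are hypothetical placeholders for whatever client-side metrics library you actually use, not a real Google or Android API.

```kotlin
// Hypothetical telemetry facade; substitute your own client-side metrics library.
object Telemetry {
    fun count(name: String, dims: Map<String, String>) { /* enqueue for batched, privacy-reviewed upload */ }
    fun timing(name: String, millis: Long, dims: Map<String, String>) { /* same idea */ }
}

// Dimensions that let you slice metrics by app version and by active config/experiment,
// so error rates can be constrained to devices already running a given fix.
fun baseDims(appVersion: String, configVersion: String, experimentId: String?): Map<String, String> =
    buildMap {
        put("app_version", appVersion)
        put("config_version", configVersion)
        if (experimentId != null) put("experiment_id", experimentId)
    }

// Wrap a critical user journey (e.g. "open_inbox") with attempt, error, and latency signals.
inline fun <T> measureInteraction(name: String, dims: Map<String, String>, block: () -> T): T {
    Telemetry.count("$name.attempt", dims)
    val start = System.nanoTime()
    return try {
        block()
    } catch (t: Throwable) {
        Telemetry.count("$name.error", dims + ("error_type" to (t::class.simpleName ?: "unknown")))
        throw t
    } finally {
        Telemetry.timing("$name.latency_ms", (System.nanoTime() - start) / 1_000_000, dims)
    }
}
```

A call such as `measureInteraction("open_inbox", dims) { inbox.load() }` then yields attempt counts, error counts by exception type, and latency, all sliceable by app version and experiment population.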
Now, what enables your phone to be mobile in the first place? It's the battery, which is one of the most valuable resources in a mobile device. Then there are other precious shared resources like CPU, memory, storage, and network, and no application wants to misuse any of these resources and get negative reviews. Efficiency is particularly important on lower-end devices and in markets where Internet or data usage is high cost. The platform as a whole suffers when shared resources are misused, and increasingly the platform needs to place restrictions to ensure common standards. One such example is a program called Android Vitals, which within the company helps attribute problems to the app itself. Our apps have sometimes been blocked from releasing for battery regressions as low as 0.1% because of the negative effect on user happiness. Any expected regressions need to be released only after careful consideration of the trade-offs against the benefits a feature would provide. Now, change management is really important, so I'm going to spend a little longer on these slides because there's just a lot to talk about. Many of these practices for releasing client apps are based on general SRE best practices. Change management is extremely important in the mobile ecosystem because a single faulty new version can erode users' trust, and they may decide to never install or upgrade your app again. Rollbacks in the mobile world are still near impossible, and problems found in production are irrecoverable, which makes proper change management very important. So the first strategy here is a staged rollout. This simply means gradually increasing the number of users the app is being released to. It also allows you to gather feedback on production changes gradually instead of hurting all of your users at once. One thing to keep in mind here is that internal testing and testing on developer devices do not provide a representative sample of the device diversity that's out there. If you have the option to release to a select few beta users, it will help widen the pool of devices and users. The second strategy is experimentation. One issue with staged rollouts is the bias towards the newest and most upgraded devices. Metrics such as latency can look great on newer devices, but then the trend becomes worse over time, and you're left wondering what happened. Folks with better networks may also upgrade earlier, which is another example of this bias. These factors make it difficult to accurately measure the effects of a new release. It's not a simple before-and-after comparison and may require some manual intervention and interpretation. A better way to release changes is via experiments, conducting A/B analysis as shown in the figure. Whenever possible, randomize the control and treatment populations to ensure the same group of people is not repeatedly upgrading their apps. Feature flags are another option to manage production changes. The general best practice is to release a new binary with the new code behind a feature flag that is off by default. This gives more control over the release radius and timelines. Rolling back can be as simple as ramping down the feature flag instead of building an entire binary and releasing it again. This is, of course, with the assumption that rolling back your flag does not break your app. One quick word of caution in terms of the accuracy of metrics: take into account upgrade side effects and noise between the control and experiment populations. The simple act of upgrading an app can itself introduce noise as the app restarts. One way to address this is to release a placebo binary to a control group; it doesn't do anything, but it will also upgrade and restart, thereby giving you a more apples-to-apples comparison of metrics. This is, however, not ideal and has to be done with a lot of caution to avoid negative effects such as battery consumption or network usage.
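As a rough illustration of the feature-flag strategy described above, here is a minimal Kotlin sketch. `FlagProvider`, `CheckoutScreen`, and the flag name are illustrative assumptions standing in for whatever remote config or experiment framework you use; the point is that new code ships dark in the binary, is ramped up server side, and "rollback" is just turning the flag back down rather than shipping another binary.

```kotlin
// Hypothetical flag provider backed by remote config or an experiment framework.
// Flags are fetched and cached off the critical path; reads are local and cheap.
interface FlagProvider {
    fun isEnabled(flag: String, default: Boolean = false): Boolean
}

class CheckoutScreen(private val flags: FlagProvider) {

    fun render() {
        // The new code ships in the binary but stays dark until the flag is ramped up.
        if (flags.isEnabled("new_checkout_flow")) {
            renderNewCheckout()
        } else {
            renderLegacyCheckout()   // rollback path: ramp the flag back down, no new binary needed
        }
    }

    private fun renderNewCheckout() { /* ... */ }
    private fun renderLegacyCheckout() { /* ... */ }
}
```

Because the flag state can also be logged as a metric dimension, the same mechanism supports A/B analysis of the new code path against the old one.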
Now, even after doing all of the above and putting all of this governance in place, the power to update their apps still lies with the user. This means that as new versions keep releasing, older versions will continue to coexist. There can be many reasons that people don't upgrade their apps: some users disable auto-updates, sometimes there's just not enough storage, et cetera. On the other hand, privacy and security fixes are strong motivators for upgrades. Sometimes users are shown a dialog box and forced to update. Now, as client apps grow and more and more features and functionality are added, some things may not be doable in older versions. For example, if you add a new critical metric to your application in version 20, that metric will only be available in version 20 and up. It may not be sustainable for SRE to support all possible versions of a client, especially when only a small fraction of the user population is on the older versions. A possible tradeoff is that developer teams continue to support older versions and make business or product decisions around the support timeline in general. Now, we've focused a lot on the client side so far. This slide is a quick reminder that making client changes to an app can have server-side consequences as well. Your servers, let's say, are already scaled to accommodate daily requests, but problems will arise when unintended consequences occur. Some examples could be time-synchronized requests, or a large global event causing a spike of requests to your server side. Sometimes bugs in new releases can suddenly increase the number of requests and overload your servers. We'll see an example of one such outage in the coming section, which brings me to case studies. In this section, we address topics covered in previous sections through several concrete examples of client-side issues that have happened in real life and that our teams have encountered. The slides also note practical takeaways for each case study, which you can then apply to your own apps. This first case study is from 2016, when AGSA, the Android Google Search app, tried to fetch the config of a new doodle from the server, but the doodle did not have its color field set. The code expected the color field to be set, and this in turn resulted in the app crashing. Now, the font is pretty small, but the graph shows crash rate along a time series. The reason the graph looks like that is because doodle launches happen at midnight in every time zone, resulting in crash-rate spikes at regular intervals. The higher the traffic in a region, the higher the spike. This graph actually made it easy for the team to correlate the issue to a live doodle launch versus an infrastructure change. In terms of fixes, it took a series of steps to implement an all-round concrete solution. To fix the outage, first the config was fixed and released immediately, but the error rate did not go down right away. This was because, to avoid calling the server backend too often, the client code cached the doodle config for a set period of time before calling the server again for a new config. For this particular incident, this meant that the bad config was still on devices until the cache expired. Then a second, client-side fix was submitted to prevent the app from crashing in this situation. Unfortunately, a similar outage happened again after a few months, and this time only older versions were affected, the ones that did not have the fix I just described.
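The talk doesn't show the actual client-side fix, but the general defensive pattern it implies could look something like the Kotlin sketch below. `DoodleConfig` and the field names are illustrative assumptions: treat optional server-supplied fields as untrusted, fall back to safe defaults, and fail closed (skip the doodle) rather than crash.

```kotlin
import android.graphics.Color
import org.json.JSONObject

// Hypothetical shape of the doodle config; field names are illustrative only.
data class DoodleConfig(val imageUrl: String, val backgroundColor: Int)

private const val DEFAULT_BACKGROUND: Int = Color.WHITE

// Parse defensively: a missing or malformed color falls back to a default,
// and a config we can't use at all means "no doodle today", not a crash.
fun parseDoodleConfig(json: JSONObject): DoodleConfig? {
    val url = json.optString("image_url", "")
    if (url.isEmpty()) return null                    // fail closed: just skip the doodle

    val color = try {
        val raw = json.optString("color", "")
        if (raw.isEmpty()) DEFAULT_BACKGROUND else Color.parseColor(raw)
    } catch (e: IllegalArgumentException) {
        DEFAULT_BACKGROUND                            // unparsable color -> safe default
    }
    return DoodleConfig(imageUrl = url, backgroundColor = color)
}
```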
After this outage recurred, a third solution in the form of server-side guardrails was also put in place to prevent a bad config from being released in the first place. The quick key takeaways here were: client-side fixes don't fix everything. In particular, there are always older versions out there in the wild that never get updated, so whenever possible it's also important to add a server-side fix to increase coverage and prevent a repeat incident. It's also good to know the app's dependencies. That can help you understand what changes external to the app may be causing failures, and it's good to have clear documentation on your client's dependencies, like server backends and config mechanisms. Here's another case study, again from AGSA; my team was on call for AGSA when it happened in 2017. It was one of our largest outages, but a near miss of a truly disastrous outage, as it didn't affect the latest production version. This outage was triggered by a simple four-character change to a legacy config that was unfortunately not canaried. It was just tested on a local device, run through automated testing, committed to production, and then rolled out. Two unexpected issues occurred due to a build error. First, the change was picked up by an old app version that could not support it, or rather multiple older app versions that could not support it. And second, the change was rolled out to a wider population than originally intended. When the change was picked up by older versions, they started failing on startup, and once they failed, they would continue to fail by reading the cached config. Even after the config was fixed and rolled out, the app would keep crashing before it could fetch it. The graph that you see on your screen shows daily active users of the AGSA app on just the affected version range over a two-week period leading up to and then after the outage. After scrambling for days, the only recovery was manual user action: we had to ask our users to upgrade by sending them a notification, which prolonged the outage and also resulted in a really bad user experience. So, lots of key takeaways on this one. Anything that causes users to not upgrade is similar to accumulating tech debt. Older apps are highly susceptible to issues, so the less successful app updates are, the larger the probability that a bigger population will get impacted by a backward-incompatible change. Then, always validate before committing, that is, caching, a new config. Configs can be downloaded and successfully parsed, but an app should interpret and exercise a new version before it becomes the active one. Cached config, especially when read at startup, can make recovery difficult if it's not handled correctly. If an app caches a config, it must be capable of expiring, refreshing, or even discarding the cache without needing the user to step in. And don't rely on recovery outside the app. Sending notifications, which we did in this case, has limited utility; regularly refreshing configs and having a crash recovery mechanism is better. But at the same time, similar to backups, a crash recovery mechanism is valid only if it's tested. And when apps exercise crash recovery, it's still a warning sign: crash recovery can conceal real problems. If your app would have failed were it not for crash recovery, you're again in an unsafe condition, because crash recovery is the only thing preventing failure. So it's a good idea to also monitor your crash recovery rates and treat high rates of recoveries as problems in their own right, requiring investigation and correlation.
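Here is a minimal sketch of the "validate before committing" and crash-recovery takeaways, under the assumption of a hypothetical `ConfigStore` persistence interface (the talk does not describe AGSA's actual implementation): a freshly fetched config only becomes the active cached one after it has been exercised, and repeated startup crashes cause the cached config to be discarded instead of being read again.

```kotlin
// Hypothetical persistent store for the active config and a startup-crash counter.
interface ConfigStore {
    fun readActive(): String?              // cached config blob, if any
    fun writeActive(config: String)
    fun crashStreak(): Int                 // consecutive startups that died before reaching "healthy"
    fun setCrashStreak(n: Int)
    fun clearActive()
}

const val MAX_STARTUP_CRASHES = 2

// Only commit a freshly fetched config after it parses AND the code paths that
// depend on it have been exercised successfully.
fun commitIfValid(store: ConfigStore, fetched: String, validate: (String) -> Boolean) {
    if (validate(fetched)) {
        store.writeActive(fetched)
    }
    // else: keep the previous known-good config and report the bad payload
}

// At startup: if the last few launches crashed before becoming healthy,
// assume the cached config may be the culprit and discard it.
fun loadConfigForStartup(store: ConfigStore, builtInDefault: String): String {
    val streak = store.crashStreak()
    store.setCrashStreak(streak + 1)       // reset once this startup is known to be good
    if (streak >= MAX_STARTUP_CRASHES) {
        store.clearActive()                // crash recovery: fall back to built-in defaults
    }
    return store.readActive() ?: builtInDefault
}

// Call once the app has demonstrably started correctly (e.g. first frame rendered).
fun markStartupHealthy(store: ConfigStore) = store.setCrashStreak(0)
```

You could also emit a telemetry counter whenever `clearActive` runs, since, as noted above, a high crash-recovery rate is itself a warning sign worth investigating.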
This last case study demonstrates how apps sometimes do things in response to inorganic phenomena, and capacity planning can go for a toss if proper throttling mechanisms are not in place. This graph is a picture of the rate at which certain mobile apps were registering to receive messages through Firebase Cloud Messaging. This registration is usually done in the background, that is, when users install apps for the first time or when tokens need to be refreshed. It's usually a gentle curve that organically follows the world's population as they wake up. What we noticed one day were two spikes, each followed by a plateau. The first plateau went back to the normal trend, but the second spike and plateau, which looked like the teeth of a comb, represented the app repeatedly exhausting its quota and then trying again. The root cause was identified as a rollout of a new version of Google Play services, and the new version kept making FCM registration calls. So the teeth of the comb that you see on the graph were from the app repeatedly exhausting its quota and then repeatedly being topped off. FCM was working fine for other apps, thankfully, because of the per-app throttling mechanism in place, and service availability wasn't at risk. But this is an example of the consequences of scale in mobile: our devices have limited computing power, but there are billions of them, and when they accidentally do things together, it can go downhill really fast. The key takeaway here was to avoid upgrade-proportional load in terms of release management and uptake rates for installation. We created a new principle: mobile apps must not make service calls at upgrade time, and releases must prove, via release regression metrics, that they haven't introduced new service calls purely as a result of the upgrade. If new service calls are being deliberately introduced, they must be enabled via an independent ramp-up of an experiment and not a binary release. In terms of scale, for a developer writing and testing their code, a one-time service call feels like nothing. However, two things can change that: when the app has a very large user base, and when the service is scaled only for the steady-state demand of your app. If a service exists primarily to support a specific app, over time its capacity management optimizes for a footprint that is close to the app's steady-state needs. So avoiding upgrade-proportional load is valuable at large as well as smaller scales.
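One way to honor the "no service calls purely because of an upgrade" principle is to schedule any post-upgrade work as deferred, jittered background work instead of firing it inline the first time the new version runs. Below is a rough Kotlin sketch using Android's WorkManager; `RegistrationWorker`, the jitter window, and the backoff values are illustrative assumptions, not the throttling FCM or Google Play services actually apply.

```kotlin
import android.content.Context
import androidx.work.*
import java.util.concurrent.TimeUnit
import kotlin.random.Random

// Illustrative worker that performs the (re)registration call in the background.
class RegistrationWorker(ctx: Context, params: WorkerParameters) : CoroutineWorker(ctx, params) {
    override suspend fun doWork(): Result = try {
        registerWithBackend()          // placeholder for your actual service call
        Result.success()
    } catch (e: Exception) {
        Result.retry()                 // WorkManager applies the backoff policy configured below
    }
    private suspend fun registerWithBackend() { /* ... */ }
}

// Instead of calling the service inline when the upgraded app first launches,
// enqueue deferred work with random jitter so a fleet-wide rollout doesn't
// become a synchronized thundering herd against the backend.
fun scheduleRegistrationAfterUpgrade(context: Context) {
    val jitterMinutes = Random.nextLong(0L, 360L)   // spread the load over ~6 hours
    val request = OneTimeWorkRequestBuilder<RegistrationWorker>()
        .setInitialDelay(jitterMinutes, TimeUnit.MINUTES)
        .setConstraints(
            Constraints.Builder()
                .setRequiredNetworkType(NetworkType.UNMETERED)   // be kind to metered data
                .setRequiresBatteryNotLow(true)
                .build()
        )
        .setBackoffCriteria(BackoffPolicy.EXPONENTIAL, 30, TimeUnit.MINUTES)
        .build()

    WorkManager.getInstance(context)
        .enqueueUniqueWork("post_upgrade_registration", ExistingWorkPolicy.KEEP, request)
}
```

The unique-work name and `KEEP` policy ensure a restart during the upgrade window doesn't enqueue the call twice, which is the same class of accidental load the principle is trying to prevent.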
This brings us to the last slide of the deck. To quickly summarize this talk: a modern product stack is only reliable and supportable if it's engineered for reliable operation all the way from server backends to the app's user interface. Mobile environments are very different from server environments, and also from browser-based clients, presenting a unique set of behaviors, failure modes, and management challenges. Engineering reliability into our mobile apps is as crucial as building reliable server-side systems. Users perceive the reliability of our products based on the totality of the system, and their mobile device is their most tangible experience of the service. So I have five key takeaways, more than my usual three, but I'll speak through all five of them. Number one: design mobile apps to be resilient to unexpected inputs, to recover from management errors, and to roll out changes in a controlled, metric-driven way. Number two: monitor the app in production by measuring critical user interactions and other key health metrics, for example responsiveness, data freshness, and, of course, crashes. Design your success criteria to relate directly to the expectations of your users as they move through your app's critical journeys. Number three: release changes carefully behind feature flags so that they can be evaluated using experiments and rolled back independently of binary releases. Number four: understand and prepare for the app's impact on servers, including preventing known bad behaviors. Establish development and release practices that avoid problematic feedback patterns between apps and services. And lastly, number five: it's also ideal to proactively create playbooks and incident management processes in anticipation of a client-side outage and the impact it might have on your servers. Thank you.
...

Pranjal Deo

Engineering Program Manager @ Google



