Building Observability into Your iOS App Platform: Tools, Techniques, and Best Practices

Abstract

Dive into the essential world of Observability and discover the pivotal differences from Monitoring. From development analytics to the art of log management, learn how to collect, store, and analyze data like a pro.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hi, my name is Siarhei Misko, and welcome to Building Observability in iOS Applications. Today I will briefly tell you about observability in mobile applications, and then I'll tell you what you can do in your iOS applications to make implementing observability easier. I am an iOS team lead with more than eight years of experience in development, especially of iOS applications. I have worked on different projects, from local to worldwide user bases, so I would say that observability is now a good friend that helps me improve the quality of the projects I'm working on.
How do we ensure that our application is stable and functioning properly? We test it. Most often, this is done by the QA team using various test cases, test environments, and specific devices. But such checks are inherently synthetic and don't always reflect real-world scenarios. In reality, things are much more complex, and observability can help organize real-time testing.
Observability is a term that originated in the fields of engineering and system analysis. In the context of software development, especially mobile applications, observability can be described as the ability of a system to provide information about its internal state and operation. Simply put, it's a property of a system, like stability or security. We need to design the system in such a way that it benefits from observability, meaning the system itself reports its state and provides details in case of incidents.
It's important to understand that traces, metrics, and logs by themselves are not observability. These are just tools that can help achieve this property. True observability is built upon aggregating all collected information in one place where it can be analyzed, visualized through graphs and dashboards, and where alerts can be set up to trigger when the number of incidents increases. Good observability allows a team not only to respond quickly to emerging issues but also to anticipate them.
This means the system itself hints at what is wrong and helps identify the cause immediately. Imagine not having to figure out where exactly an error occurred and what caused it: the system directly informs you about that. Observability is a key tool for creating reliable and resilient mobile applications. It minimizes the time needed to detect and fix issues, thereby improving the overall user experience.
You can learn about a problem from users when they start complaining, but observability helps with two crucial aspects. First, you learn about the problem before it becomes critical. Second, you have complete information and can immediately start resolving the issue without needing to gather data from various sources. To summarize, with good observability in your project, you don't just find out that a problem occurred; you immediately get all the information necessary to solve it. The system fully reports its state and the causes of any issues, allowing you to take action promptly. However, it is important to remember that observability does not replace classic testing. Synthetic checks are still necessary.
What can observability look like in a mobile application? The topic of observability is widely known among backend developers and DevOps engineers. In mobile development, however, it is often overlooked, even though many teams and projects have already taken steps toward observability, perhaps without even realizing it.
Let's imagine a hypothetical project, an iOS application. The team that developed it didn't integrate analytics, logging, Crashlytics, or other monitoring tools, so it's just a project with a set of features. After releasing a new version, users began to report that "the app isn't working," or "it crashes by itself," or "when I open the second screen, the app crashes."
Such reports indicate that something went wrong, but they don't provide clarity on what exactly and why. The team tried to reproduce the problem but found nothing unusual. The app seemed to work fine in their environment, yet users continued to complain, suggesting a possible crash. Since the team couldn't reproduce it, they were left to work blindly: release a new version and hope it helps.
This situation could have been easily solved if the app had been set up with Crashlytics to catch crashes and send information about them. In the world of iOS development, Apple already provides such functionality out of the box, but with certain limitations. For greater flexibility, developers often turn to third-party services like Firebase Crashlytics. Today, collecting crash information has become a standard practice that allows teams to learn about issues before users start complaining in large numbers, while also providing data on the causes of these issues, such as stack traces and technical information about the device. By working with Crashlytics, teams are already taking a step toward observability.
However, crashes are far from the only problems that can lead to user complaints or, worse, cause a drop in key metrics and revenue. The more complex a system becomes, the more interdependent parameters it includes, and the more diverse the devices and OS versions it supports, the less transparent the app's behavior in real-world conditions becomes. For example, if a project uses feature flags to manage the availability of features and the configuration system doesn't validate changes, a product manager or business analyst could accidentally make an error, such as using the wrong data format or skipping an important field. This could result in users not receiving certain functionality, even though the server and Crashlytics don't register any issues.
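To make the feature-flag scenario concrete, here is a minimal sketch of client-side validation that turns a broken configuration into something reportable. The `PaywallConfig` type, its fields, and the reason strings are assumptions made up for illustration; the talk does not describe a specific configuration format, and sending the resulting reason as a metric is left out.

```swift
import Foundation

// Hypothetical remote configuration payload; field names are illustrative.
struct PaywallConfig: Decodable {
    let isEnabled: Bool
    let trialDays: Int
}

// Outcome of validating a payload before the app applies it.
enum ConfigValidation: Equatable {
    case valid
    case invalid(reason: String)
}

// Decodes the payload and converts any failure into a short reason string
// that could be sent as a metric, instead of silently using defaults.
func validate(_ data: Data) -> (config: PaywallConfig?, result: ConfigValidation) {
    do {
        let config = try JSONDecoder().decode(PaywallConfig.self, from: data)
        guard config.trialDays >= 0 else {
            return (nil, .invalid(reason: "trialDays is negative"))
        }
        return (config, .valid)
    } catch DecodingError.keyNotFound(let key, _) {
        // An important field was skipped in the configuration.
        return (nil, .invalid(reason: "missing field: \(key.stringValue)"))
    } catch DecodingError.typeMismatch(_, let context) {
        // A value was entered in the wrong data format.
        let path = context.codingPath.map { $0.stringValue }.joined(separator: ".")
        return (nil, .invalid(reason: "wrong type at: \(path)"))
    } catch {
        return (nil, .invalid(reason: "undecodable payload"))
    }
}
```

With something like this in place, a skipped field shows up as a countable event on a dashboard rather than as a feature that quietly disappears for users.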
Moreover, given the complexity of modern systems, it's impossible to test and simulate every possible usage scenario. For example, you might release several new features simultaneously. Everything works fine in the test environment, so you release the update. However, on real devices, under certain conditions, such as a specific phone model, iOS version, or internal app settings, many users might experience issues. For instance, the app may take a long time to load, or critical flows like registration may start to fail. And you will see nothing in Crashlytics, because there are no crashes. So, just as with crashes, you need to build metrics for other aspects of the app's performance and stability.
To determine where to start building metrics, I suggest focusing on three key aspects.
First, important user stories, such as registration, login, payment processing, sending messages, generating resources, search results, and so on. It's important to consider not only whether these flows are functioning but also how well they're performing: for example, how quickly payments are processed or messages are sent.
Second, paths, flows, or features that depend heavily on frequently changing system parameters. It's crucial to account for dynamic changes that can affect app behavior.
Third, flows, features, and stories that depend on external systems. These could be third-party APIs or services, as well as internal systems developed by another team within your company. It's not guaranteed that these teams consider potential issues that might affect your specific app, as opposed to the entire system.
To achieve deeper data integration, additional systems should be used to build charts based on the data and provide breakdowns by key parameters. This part of the talk provides a general idea of what observability can look like in a mobile application. In the next sections, we'll explore architectural solutions on the iOS side that can help you implement observability in your app.
I recommend exploring articles, talks, and other materials without limiting yourself to mobile development. The experiences and approaches of DevOps and backend developers can be incredibly valuable.
Achieving good observability means not only knowing that a problem has occurred in your system but also having details that explain what exactly went wrong. One way to gather such information is by using logs. While metrics provide a general overview of the system state, logs collect detailed information about what happened on the user's device, what actions the user took, and what might have caused the problem. It's crucial to remember that any tool for achieving observability adds overhead to your application, and logs are likely one of the heaviest tools. So let's see what strategies we have for collecting and delivering logs.
For log delivery, the first option is live, or online, mode. Logs are sent in real time, right after they are recorded in the application. This is often used in backend development but may not be the best option for mobile applications, as you'll receive logs even when everything is working fine for the user. This creates a large amount of unnecessary data, which can be difficult to process and store. However, it's possible to configure the system so that logs are only sent when a specific flag is activated in the app or when a trigger event occurs. It's essential to test this logic to avoid errors that could lead to either losing all logs or being overwhelmed with unnecessary data.
Next is on-demand delivery, which has two sub-approaches. It can be automatic, either through a logging flag that is received when the user opens the app and indicates that logs should be sent, or via a push notification that prompts the app to send logs. Or it can be manual: the user selects an option in the menu to send logs at your request.
The last option is triggered delivery.
With triggered delivery, logs are automatically sent when a specific event occurs in the application, such as a crash or a critical error.
Now for log collection. First, there is continuous collection, where logs are recorded continuously. Here you should avoid recording too many logs at once, to prevent overloading the disk with read and write operations. If you still want to write a lot of logs per second to a file, for example for very detailed debug or trace output, you should implement buffered writing. Instead of opening, writing to, and closing the file for every log line, it's better to write an accumulated buffer of lines, which significantly improves performance. You should also implement log file rotation to prevent files from becoming too large. For example, you can set a file size limit of 20 megabytes, after which old data is either cleared or overwritten. And you should carefully manage logging levels: debug logs, which are useful for testing and beta versions, should not be included in continuous logging for release versions.
Second, there is on-demand collection. Log collection can be activated manually by the user through settings, or automatically via a logging flag or push notification.
Third, there is triggered collection. Just like with delivery, a trigger event can start the log recording process.
The last two approaches are quite convenient, especially when dealing with specific issues encountered by individual users. However, these methods have drawbacks. For example, you might miss the initial problem, because the logs only start recording after you have identified the issue and asked the user to enable log collection. In such cases, the root cause of the problem may be lost.
But you can implement a combination of logging approaches. For example, logging at the error, warning, and critical levels can happen continuously, while more detailed logs, like info and notice, can be enabled when a trigger occurs, upon user request, or via a logging flag.
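The collection advice above (buffered writes, size-based rotation, and a minimum level that can be lowered on demand) could be sketched roughly as follows. `BufferedFileLogger` and all of its parameters are hypothetical names for illustration; the sketch is not thread-safe, keeps only one rotated file, and uses the throwing `FileHandle` methods that require iOS 13.4 or later.

```swift
import Foundation

// Log levels ordered by severity, following the levels named in the talk.
enum LogLevel: Int, Comparable {
    case debug, info, notice, warning, error, critical
    static func < (lhs: LogLevel, rhs: LogLevel) -> Bool { lhs.rawValue < rhs.rawValue }
}

// A minimal buffered file logger (hypothetical type, not thread-safe).
final class BufferedFileLogger {
    private let fileURL: URL
    private let maxFileSize: Int      // rotation threshold in bytes
    private let flushThreshold: Int   // buffered lines before one disk write
    private var buffer: [String] = []
    var minimumLevel: LogLevel        // lower it on a trigger, flag, or user request

    init(fileURL: URL, maxFileSize: Int = 20 * 1024 * 1024,
         flushThreshold: Int = 50, minimumLevel: LogLevel = .warning) {
        self.fileURL = fileURL
        self.maxFileSize = maxFileSize
        self.flushThreshold = flushThreshold
        self.minimumLevel = minimumLevel
    }

    func log(_ level: LogLevel, _ message: String) {
        guard level >= minimumLevel else { return }
        buffer.append("[\(level)] \(message)")
        if buffer.count >= flushThreshold { flush() }
    }

    // Writes all buffered lines with a single append, then rotates if needed.
    func flush() {
        guard !buffer.isEmpty else { return }
        let chunk = Data((buffer.joined(separator: "\n") + "\n").utf8)
        buffer.removeAll()
        let fm = FileManager.default
        if !fm.fileExists(atPath: fileURL.path) {
            fm.createFile(atPath: fileURL.path, contents: nil)
        }
        do {
            let handle = try FileHandle(forWritingTo: fileURL)
            defer { try? handle.close() }
            try handle.seekToEnd()
            try handle.write(contentsOf: chunk)
        } catch {
            return // in this sketch a failed write simply drops the chunk
        }
        rotateIfNeeded()
    }

    // Simplest rotation policy: keep one previous file and start a fresh one.
    private func rotateIfNeeded() {
        let fm = FileManager.default
        guard let attrs = try? fm.attributesOfItem(atPath: fileURL.path),
              let size = (attrs[.size] as? NSNumber)?.intValue,
              size > maxFileSize else { return }
        let old = fileURL.appendingPathExtension("old")
        try? fm.removeItem(at: old)
        try? fm.moveItem(at: fileURL, to: old)
    }
}
```

In an app you would probably also call `flush()` when the app moves to the background, so that buffered lines are not lost if the process is terminated.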
Combining approaches in this way helps maintain a balance between having detailed information and optimizing the cost of storing and processing logs.
But how can you write logs on iOS devices? First, there is Apple's OSLog. Apple offers OSLog, a powerful tool that includes many optimizations for fast performance, with deep integration into the operating system and support for user privacy tools. However, for our purposes, it may be inconvenient, because these logs can be difficult to access. There are three ways to do it on iOS. First, connect the phone to a computer and view the console. Second, collect a sysdiagnose of the entire device, which takes up to 15 minutes and results in an archive of more than 200 megabytes. And last, programmatically from the app. But there is a significant drawback: you can only get logs from the current process. This means you can't retrieve logs from extensions or from previous launches of the application.
Next, we can write logs to a text file. You can start by using an existing framework, such as SwiftyBeaver, which writes logs to a file. Later, when you become more familiar with logging and understand your needs, you can modify it, since it's open source, or implement your own solution, as it's quite simple: it's just writing to a file.
You can also write logs to Firebase Crashlytics. By default, Firebase Crashlytics is used to catch crashes and send stack traces and device information, but you can also log additional data, such as information about subscriptions or user roles. This allows you to more accurately identify the context in which the error occurred and potentially find a solution more quickly. Logs written to Firebase Crashlytics are sent along with the crash report. You can also send a non-fatal event that includes all this information. However, since Firebase is a free service, it has limitations, such as data sampling (not all events are saved) and a limited number of log lines kept before an error occurs.
Additionally, you cannot run SQL queries to retrieve the specific data you need. Therefore, it's not advisable to rely on free services for too long. You should consider more specialized solutions or build your own. However, Firebase Crashlytics can be sufficient for a start.
If you choose the second method, writing logs to a text file, you should consider the following tips.
Define a logging policy and use logging levels appropriately. It's important to consider how these levels will be used and to monitor the amount of logging during code review. This helps prevent situations where too many logs are recorded at once, which can negatively impact the system and the disk. This task is as crucial as controlling the number of network requests or database queries as the application's functionality grows.
To prevent logs from being included in release builds and causing unnecessary overhead, you can use conditional compilation directives like #if DEBUG. Instead of adding the directive to every logging statement, you can optimize the logging function itself: if the condition is true, the logging code is executed; if not, the function body remains empty. For example, this can be done for the debug level. This allows the Swift compiler to exclude unnecessary calls in the release build, ensuring high performance.
In your logging functions, declare the argument that passes the log text as an autoclosure. This allows you to delay the evaluation of the argument until it's actually needed. If debugging is not enabled, the debug descriptions will not be evaluated, reducing the system load.
While logs do provide a lot of useful information, they can be cumbersome due to their volume. Therefore, it's important to consider various optimizations. One such approach is to transmit a detailed state of the application instead of full logs. For example, you could send information about which screens the user navigated through and which API requests were made.
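Going back to the two function-level logging tips above, here is a minimal sketch of how they might look in Swift. `AppLogger` and its members are hypothetical names, and lines are collected in memory only to keep the example self-contained. The `DEBUG` condition is the one Xcode defines by default for the Debug configuration.

```swift
import Foundation

// A tiny logger that shows compile-time gating and lazy message evaluation.
final class AppLogger {
    // Runtime switch, for example driven by a logging flag from the server.
    var isVerboseEnabled = false
    private(set) var lines: [String] = []

    // The message is an @autoclosure: the string is only built when
    // message() is called, so disabled logging costs almost nothing.
    func info(_ message: @autoclosure () -> String) {
        guard isVerboseEnabled else { return }
        lines.append("[info] \(message())")
    }

    // In builds without the DEBUG condition the body is empty, so the
    // message is never evaluated and the optimizer can drop the call.
    func debug(_ message: @autoclosure () -> String) {
        #if DEBUG
        lines.append("[debug] \(message())")
        #endif
    }
}
```

A call such as `logger.info(expensiveDescription())` then reads like a normal function call, but `expensiveDescription()` only runs when verbose logging is enabled.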
That state can also include what settings were configured and which key business logic flags were set.
Let's consider the example of a notification service extension in iOS. This extension is triggered when a push notification with a specific flag is received. The system then allows us to modify the content of this notification, or even generate multiple notifications instead of one. For our business logic, we make an HTTP request to our API, query a database, save changes, and much more. However, there is a catch: iOS strictly enforces a 24 MB RAM limit for the extension. If you exceed this limit, the system will immediately terminate your process without any notification. This means the user won't even know that something went wrong, and no events will be logged in your metrics or crash reporting tools.
Knowing that the notification service extension is not fully executing is not enough. It's crucial to know at which stage it fails. So we need to track every step of our process, send some kind of result, and ensure that we don't introduce so much overhead that we ourselves become the cause of exceeding the RAM limit.
A simple solution is to use an option set, where each bit represents a specific step in the notification service extension's process, and one of these bits indicates that the extension has reached its logical conclusion. When we complete a step, we set the corresponding bit to 1. Each time a change occurs, we save the option set value to disk. Then, on the next launch, we can check the last saved value to determine whether all necessary steps were completed and whether the process reached its end. If we detect any deviation, we can send a metric. This way you can see whether your extension frequently encounters errors or is being terminated by the system for exceeding the RAM limit.
Thus, instead of a large volume of logs, you can focus on capturing specific key events and states, which will not only save resources but also simplify data analysis.
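The option set approach described above might be sketched like this. The step names, `StepTracker`, and `firstMissingStep` are illustrative assumptions rather than the speaker's actual code, and in a real extension the file would presumably live in a container shared with the main app so the app can read it on the next launch.

```swift
import Foundation

// Each bit marks one completed step of the extension's work (names are illustrative).
struct ExtensionSteps: OptionSet {
    let rawValue: UInt8
    static let started        = ExtensionSteps(rawValue: 1 << 0)
    static let apiRequestDone = ExtensionSteps(rawValue: 1 << 1)
    static let databaseSaved  = ExtensionSteps(rawValue: 1 << 2)
    static let finished       = ExtensionSteps(rawValue: 1 << 3)
}

// Persists the current value as a single byte after every change.
final class StepTracker {
    private let fileURL: URL
    private(set) var steps: ExtensionSteps = []

    init(fileURL: URL) { self.fileURL = fileURL }

    func complete(_ step: ExtensionSteps) {
        steps.insert(step)
        try? Data([steps.rawValue]).write(to: fileURL, options: .atomic)
    }

    // Reads the value left behind by the previous run, if any.
    static func lastRun(at fileURL: URL) -> ExtensionSteps? {
        guard let byte = (try? Data(contentsOf: fileURL))?.first else { return nil }
        return ExtensionSteps(rawValue: byte)
    }
}

// Returns the first step the previous run did not reach, or nil if it finished.
func firstMissingStep(in steps: ExtensionSteps) -> String? {
    let ordered: [(ExtensionSteps, String)] = [
        (.started, "started"),
        (.apiRequestDone, "apiRequestDone"),
        (.databaseSaved, "databaseSaved"),
        (.finished, "finished"),
    ]
    return ordered.first(where: { !steps.contains($0.0) })?.1
}
```

On the next launch, a non-nil result from `firstMissingStep` is the deviation worth sending as a metric, and the whole record costs one byte on disk.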
With a well-structured architecture in your application, implementing various metrics becomes much easier. Honestly, there is nothing fundamentally new here. It's important to follow established patterns, with an emphasis on adhering to the SOLID principles, especially the single responsibility principle. This will help you better monitor the areas in your code that require attention.
For example, let's consider how the coordinator pattern can help us. Using the coordinator pattern for navigation allows you to clearly track the user's path through the scenes and their current state. Each coordinator calls another, and thus you can easily compile all the information about the user's journey into a compact string, such as "registration, enter code, main screen, more screen, settings," or even minimize it to a sequence of short codes, where each designation represents a specific screen or transition. This data can be easily decoded for analysis.
You can also use the coordinator pattern to determine the last screen the user was on before closing the app. You can send such a metric based on the hypothesis that if many users close the app while on a specific screen, there is likely something wrong with it, or users are unable to exit the screen, prompting them to restart the app. To achieve this, you simply need to identify the active coordinator in the applicationWillTerminate method of the AppDelegate, ask it for the current screen, log this information, and then send it to analytics on the next app launch. With the coordinator pattern, this can be done quite easily while keeping your code clean.
We can also build an app start time metric based on how quickly the user gains access to the app after launching it. Again, using the coordinator, you can measure the time from app startup to the moment the app reaches a specific state according to your business logic.
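The coordinator-based ideas above (a compact journey string, the last visible screen, and a start time metric) could be collected in one small object that coordinators report to. `Screen`, its one-letter codes, and `JourneyTracker` are hypothetical, and treating the first appearance of the main screen as "ready" is just one possible business rule.

```swift
import Foundation

// Screens with short codes for a compact path string (names and codes are illustrative).
enum Screen: String {
    case registration = "R"
    case enterCode = "C"
    case main = "M"
    case more = "O"
    case settings = "S"
}

// Collects what coordinators report: visited screens and time to a target state.
final class JourneyTracker {
    private(set) var path: [Screen] = []
    private(set) var timeToReady: TimeInterval?
    private let launchDate: Date

    init(launchDate: Date = Date()) { self.launchDate = launchDate }

    // Called by a coordinator each time it presents a screen.
    func didShow(_ screen: Screen, at date: Date = Date()) {
        path.append(screen)
        // Assumption: the first appearance of the main screen means "app is ready".
        if screen == .main, timeToReady == nil {
            timeToReady = date.timeIntervalSince(launchDate)
        }
    }

    // The screen to record when the app is about to terminate.
    var currentScreen: Screen? { path.last }

    // Compact, easily decodable representation of the whole journey.
    var compactPath: String { path.map { $0.rawValue }.joined(separator: ">") }
}
```

For a journey through registration, code entry, the main screen, and settings, `compactPath` would be `R>C>M>S`, which is cheap to store and send with the next analytics batch.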
Similarly, you can monitor how quickly users complete important flows in your application, such as payments. If users suddenly start taking significantly longer to complete a payment flow, you will notice it right away.
Additionally, you can work with the API layer by recording which requests were made during app startup or during its operation. This data can be useful for understanding the sequence of user actions and identifying potential issues, as well as spotting bottlenecks during app startup. This will be easy to implement if you have clearly defined your API and network entities.
Another interesting and useful metric I recommend is the annoyed tap. This metric tracks situations where the user repeatedly taps on the same UI element, which may indicate frustration due to something not working properly. Such a metric will be easy to implement if you have a clearly defined entity that handles the business logic of the screen, such as an interactor, controller, view model, etc.
That's it for today. Thank you for your time and interest in the topic. I hope you now have a better understanding of how to implement observability in your iOS applications. Best of luck with your future development, and thank you once again for being here today. Bye-bye.

Conf42 Platform Engineering 2024 - Online

September 05 2024 - premiere 5PM GMT

Siarhei Misko

iOS Team Lead @ Rakuten Viber
