Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, my name is Roger Misko and welcome to building
observability in iOS applications.
Today I will briefly tell you about observability in mobile applications,
and then I'll tell you what you can do in your iOS applications to make
implementing observability easier.
I am an iOS team lead with more than 8 years of experience in development,
especially of iOS applications.
I have worked on different projects, with user bases ranging from local to worldwide.
So I would say that observability is now a good friend of mine that helps me improve
the quality of the projects I'm working on.
How do we ensure that our application is stable and functioning properly?
We test it.
Most often, this is done by the QA team using various test cases, test
environments, and specific devices.
But such checks are inherently synthetic and don't always
reflect real-world scenarios.
In reality, things are much more complex, and observability can
help organize real time testing.
Observability is a term that originated from the fields of
engineering and system analysis.
In the context of software development, especially mobile applications,
observability can be described as the ability of a system to provide
information about its general or its internal state and operation.
Simply put, it's a property of a system, like stability or security.
We need to design the system in such a way that it benefits from
observability, meaning the system itself reports its state and provides
details in case of incidents.
It's important to understand that traces, metrics, and logs by
themselves are not observability.
These are just
tools that can help achieve this property.
True observability is built upon aggregating all collected information
in one place where it can be analyzed, visualized through graphs and dashboards,
and where alerts can be set up to trigger when the number of incidents increases.
Good observability allows a team not only to quickly respond to emerging issues,
but also to anticipate them.
This means the system itself hints at what is wrong and helps
identify the cause immediately.
Imagine not having to figure out where exactly an error
occurred and what caused it.
The system directly informs you about that.
Observability is a key tool for creating reliable and
resilient mobile applications.
It minimizes the time needed to detect and fix issues,
thereby improving the overall user experience.
You can learn about a problem from users when they start
complaining, but observability helps with two crucial aspects.
First, you learn about the problem before it becomes critical.
Second, you have complete information and can immediately start
resolving the issue without needing to gather data from various sources.
To summarize, with good observability in your project, you don't just find
out that a problem occurred, but you immediately get all the
information necessary to solve it.
The system fully reports its state and the causes of any issues, allowing
you to take action promptly.
However, it is important to remember that observability does
not replace classic testing.
Synthetic checks are still necessary.
What can observability look like in a mobile application?
The topic of observability is widely known among backend
developers and DevOps engineers.
However, in mobile development, this topic is often overlooked,
even though many teams and projects have already made steps towards observability,
perhaps without even realizing it.
Let's imagine a hypothetical project, an iOS application.
The team that developed it didn't integrate analytics, logging,
Crashlytics, or other monitoring tools.
So it's just a project with a set of features.
After releasing a new version, users began to report that "the app isn't
working", "it crashes by itself", or "when I open the second screen, the app crashes".
Such reports indicate that something went wrong, but they don't provide
clarity on what exactly and why.
The team tried to reproduce the problem but found nothing unusual.
The app seemed to work fine in their environment. However, users continued
to complain, suggesting a possible crash, and since the team couldn't reproduce it,
they were left to work blindly: release a new version and hope it helps.
For example, this situation could have been easily solved if the app had
been set up with Crashlytics to catch crashes and send information about them.
In the world of iOS development, Apple already provides such functionality
out of the box, but with certain limitations. For greater flexibility,
developers often turn to third-party services like Firebase Crashlytics.
Today, collecting crash information has become a standard practice that allows
teams to learn about issues before users start complaining in large numbers, while
also providing data on the causes of these issues, such as stack traces and
technical information about the device.
By working with Crashlytics, teams are already taking that
step toward observability.
However, crashes are far from the only problems that can lead to
user complaints, or worse, cause a drop in key metrics and revenue.
The more complex a system becomes, the more dependent parameters it includes,
and the more diverse the devices and OS versions it supports, the less transparent
it becomes how the app behaves in real-world conditions.
For example, if a project uses feature flags to manage the availability
of features and the configuration system doesn't validate changes,
a product manager or business analyst could accidentally make
an error, such as using the wrong data format or skipping an important field.
This could result in users not receiving certain functionality,
even though the server and Crashlytics don't register any issues.
Moreover, given the complexity of modern systems, it's impossible to test and
simulate every possible usage scenario.
For example, you might release several new features simultaneously.
Everything works fine in the test environment, so you release the update.
However, on real devices, under certain conditions, such as a
specific phone model, iOS version, or internal app settings, many
users might experience issues.
For instance, the app may take a long time to load, or critical flows
like registration may start to fail.
And you will see nothing in Crashlytics, because there are no crashes.
So, just as with crashes, you need to build metrics for other aspects of
the app's performance and stability.
To determine where to start building metrics, I suggest
focusing on three key aspects.
The first is important user stories,
for example registration, login, payment processing, sending messages, generating
resources, search results, and so on.
It's important to consider not only whether these flows are functioning,
but also how well they're performing,
for example how quickly payments are processed or messages are sent.
The second is paths, flows, or features that depend heavily on frequently
changing system parameters.
It's crucial to account for dynamic changes that can affect app behavior.
And the third is flows, features, and stories that depend on external systems.
These could be third-party APIs or services, as well as
internal systems developed by another team within your company.
It's not guaranteed that these teams consider potential issues
that might affect your specific app, as opposed to the entire system.
To achieve deeper data integration, additional systems should be used to
build charts based on the data and provide breakdowns by key parameters.
This part of the talk provides a general idea of what observability
can look like in a mobile application.
In the next sections, we'll explore architectural solutions
on the iOS side that can help you implement observability in your app.
I recommend exploring articles, talks, and other materials, not limiting
yourself to mobile development.
The experiences and approaches of DevOps and backend developers
can be incredibly valuable.
Achieving good observability means not only knowing that a problem
has occurred in your system,
but also providing details that explain what exactly went wrong.
One way to gather such information is by using logs.
While metrics provide a general overview of the system state, logs collect detailed
information about what happened on the user device, what actions they took,
and what might have caused the problem.
It's crucial to remember that any tool for achieving observability
adds overhead to your application,
and logs are likely one of the heaviest tools.
So let's look at the strategies we have for collecting and
delivering logs.
For log delivery, the first option is live, or online, mode:
logs are sent in real time,
right after they are recorded in the application.
This is often used in backend development, but it may not be the
best option for mobile applications,
as you'll receive logs even when everything is working fine for the user.
This creates a large amount of unnecessary data which can be
difficult to process and store.
However, it's possible to configure the system so that logs are only sent
when a specific flag is activated in the app or when a trigger event occurs.
It's essential to test this logic to avoid errors that could lead
to either losing all logs or being overwhelmed with unnecessary data.
Next is on-demand delivery,
and there are two sub-approaches. Automatic: through a logging flag, which
the app receives when the user enters it,
indicating that logs should be sent, or via a push notification that prompts
the app to send logs. Manual:
the user can select an option in the menu to send logs upon
your request. And finally, triggered delivery:
logs are automatically sent when a specific event occurs in the application,
such as a crash or a critical error.
Now, log collection.
First is continuous collection:
logs are recorded continuously. Here it's important
to avoid recording too many logs at once, to prevent overloading
the disk with read and write operations.
But if you still want to write a lot of logs per second to a file,
for example for very detailed debugging or tracing, you should
implement buffered writing to the file.
So instead of opening, writing to, and closing the file for every log line, it's better
to write an accumulated buffer of lines.
In this case, performance improves significantly.
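As a minimal sketch of the buffering idea, assuming a hypothetical `BufferedLogWriter` type and a buffer limit of 50 lines (neither comes from the talk), lines can be accumulated in memory and appended to the file in one open, write, close cycle:

```swift
import Foundation

/// Accumulates log lines in memory and appends them to a file in batches.
/// Hypothetical type for illustration; not thread-safe as written.
final class BufferedLogWriter {
    private let fileURL: URL
    private let maxBufferedLines: Int
    private var buffer: [String] = []
    /// Number of times the file was actually opened and written.
    private(set) var flushCount = 0

    init(fileURL: URL, maxBufferedLines: Int = 50) {
        self.fileURL = fileURL
        self.maxBufferedLines = maxBufferedLines
    }

    /// Stores the line in memory; the file is touched only when the buffer is full.
    func log(_ line: String) {
        buffer.append(line)
        if buffer.count >= maxBufferedLines { flush() }
    }

    /// Writes all buffered lines with a single open/write/close cycle.
    /// If the file cannot be opened, the lines stay in the buffer for the next attempt.
    func flush() {
        guard !buffer.isEmpty else { return }
        let data = Data((buffer.joined(separator: "\n") + "\n").utf8)
        if !FileManager.default.fileExists(atPath: fileURL.path) {
            FileManager.default.createFile(atPath: fileURL.path, contents: nil)
        }
        guard let handle = try? FileHandle(forWritingTo: fileURL) else { return }
        handle.seekToEndOfFile()
        handle.write(data)
        handle.closeFile()
        buffer.removeAll(keepingCapacity: true)
        flushCount += 1
    }
}
```

In a real app you would probably also call `flush()` when the app moves to the background, so that buffered lines are not lost if the process is terminated.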
Also, you should implement log file rotation to prevent
them from becoming too large.
For example, you can set a file size limit of 20 megabytes, after which old
data is either cleared or overwritten.
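A rough sketch of size-based rotation follows. The function name `appendWithRotation` and the choice to keep exactly one previous file with an `.old` extension are assumptions made for the example; simply clearing the file, as mentioned above, would work just as well.

```swift
import Foundation

/// Appends `line` to the log file. If the file would grow beyond `maxBytes`,
/// the current file is moved to "<name>.old" (replacing any previous archive)
/// and a fresh file is started. Hypothetical helper for illustration.
func appendWithRotation(_ line: String, to fileURL: URL, maxBytes: Int) throws {
    let fm = FileManager.default
    let data = Data((line + "\n").utf8)
    let attributes = try? fm.attributesOfItem(atPath: fileURL.path)
    let currentSize = (attributes?[.size] as? NSNumber)?.intValue ?? 0

    // Rotate before writing so the active file never exceeds the limit.
    if currentSize > 0, currentSize + data.count > maxBytes {
        let archiveURL = fileURL.appendingPathExtension("old")
        try? fm.removeItem(at: archiveURL)
        try fm.moveItem(at: fileURL, to: archiveURL)
    }
    if !fm.fileExists(atPath: fileURL.path) {
        fm.createFile(atPath: fileURL.path, contents: nil)
    }
    let handle = try FileHandle(forWritingTo: fileURL)
    defer { handle.closeFile() }
    handle.seekToEndOfFile()
    handle.write(data)
}
```

With the 20 megabyte limit from the example above, the call would pass `maxBytes: 20 * 1024 * 1024`.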
And you should carefully manage logging levels.
Debug logs, which are useful for testing and beta versions, should
not be included in continuous logging for release versions.
Next is on-demand collection:
log collection can be activated manually by the user through
settings or automatically via a logging flag or push notification.
And finally, triggered collection:
just like with delivery, a trigger event can start the log recording process.
The last two approaches are quite convenient, especially
when dealing with specific issues encountered by individual users.
However, these methods have drawbacks.
For example, you might miss the initial problem because the logs
will only start recording after you have identified the issue and asked
the user to enable log collection.
In such cases, the root cause of the problem may be lost.
However, you can implement a combination of logging approaches.
For example, logging at error, warning, and critical levels can
happen continuously, while more detailed logs, like info and notice,
can be enabled when a trigger occurs,
upon user request, or via a logging flag.
This approach helps maintain a balance between having detailed
information and optimizing the cost of storing and processing logs.
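The combined approach could be sketched like this. The `LogLevel` and `LoggingPolicy` names, and the exact set of levels, are assumptions for illustration rather than the API of any particular framework:

```swift
import Foundation

/// Log levels ordered from least to most severe (an assumed set for this sketch).
enum LogLevel: Int, Comparable {
    case debug, info, notice, warning, error, critical

    static func < (lhs: LogLevel, rhs: LogLevel) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}

/// Decides which levels are recorded. Warning and above are always recorded;
/// lower levels are recorded only while detailed logging is switched on.
struct LoggingPolicy {
    /// Set by a trigger event, a user request, or a remote logging flag.
    var detailedLoggingEnabled = false
    /// Intended to be true only in test or beta builds.
    var allowsDebug = false

    func shouldRecord(_ level: LogLevel) -> Bool {
        if level >= .warning { return true }
        if level == .debug { return allowsDebug && detailedLoggingEnabled }
        return detailedLoggingEnabled
    }
}
```

Keeping the decision in one small type makes it easy to test the logic that switches detailed logging on and off.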
But how can you write logs on iOS devices?
First, there is Apple's OSLog.
Apple offers OSLog, a powerful tool that includes many optimizations
for fast performance, with deep integration into the operating system
and support for user privacy tools.
However, for our purposes, it may be inconvenient.
These logs can be difficult to access.
There are three ways to do it on iOS.
First, connect the phone to a computer and view the console.
Second, collect a sysdiagnose of the entire device, which takes
up to 15 minutes and results in an archive of more than 200 megabytes.
And the last,
programmatically from the app.
But there is a significant drawback.
You can only get logs from the current process.
This means you can't retrieve logs from extensions or from previous
launches of the application.
Next, we can write logs to a text file.
You can start by using an existing framework, such as SwiftyBeaver,
which writes logs to a file.
Later, when you become more familiar with logging and understand your needs, you
can modify it, since it's open source.
Or you can implement your own solution, as it's quite simple: it's just writing to a file.
And you can also write logs to Firebase Crashlytics. By default,
Firebase Crashlytics is used to catch crashes and send stack traces and
device information, but you can also log additional data, such as information
about subscriptions or user roles.
This allows you to more accurately identify the context in which the
error occurred and potentially find a solution more quickly.
You can also write logs to Firebase Crashlytics, which will be sent
along with the crash report.
You can also send a non fatal event that includes all this information.
However, since Firebase is a free service, it has limitations, such
as data sampling (not all events are saved) and a limited number of
log lines kept before an error occurs.
Additionally, you cannot run SQL queries to retrieve the specific data you need.
Therefore, it's not advisable to rely on free services for too long.
You should consider more specialized solutions or build your own.
However, Firebase Crashlytics can be sufficient for a start.
If you choose the second method, writing logs to a text file, you
should consider the following tips.
Define a logging policy and use logging levels appropriately.
It's important to consider how these levels will be used and to monitor the
amount of logging during code review.
This helps prevent situations where too many logs are recorded at once,
which can negatively impact the system and the disk.
This task is as crucial as controlling the number of network
requests or database queries as the application functionality grows.
To prevent logs from being included in release builds and causing
unnecessary overhead, you can use conditional compilation directives like #if DEBUG
(the Swift counterpart of ifdef).
Instead of adding #if DEBUG to every logging statement, you can optimize
the logging function itself.
If the compilation condition is set, the logging code is compiled in.
If not, the function body remains empty.
For example, this can be done for the debug level.
This allows the Swift compiler to exclude unnecessary calls in the release
build, ensuring high performance.
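A minimal sketch of this idea, assuming a hypothetical `AppLogger` class and the standard `DEBUG` compilation condition that Xcode sets for debug configurations:

```swift
import Foundation

/// Hypothetical logger used for illustration. It keeps lines in memory;
/// a real logger would write them to a file instead.
final class AppLogger {
    private(set) var lines: [String] = []

    /// Debug-level logging. In builds without the DEBUG condition the body
    /// is empty, which gives the optimizer the chance to drop the call.
    @inline(__always)
    func debug(_ message: String) {
        #if DEBUG
        lines.append("[DEBUG] \(message)")
        #endif
    }

    /// Error-level logging is kept in every build.
    func error(_ message: String) {
        lines.append("[ERROR] \(message)")
    }
}
```

Call sites stay clean (`logger.debug("cache warmed up")`), and the decision about what goes into a release build lives in one place.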
In your logging functions, declare the argument that passes the
log text as an autoclosure (@autoclosure).
This allows you to delay the evaluation of the argument until it's actually needed.
If debugging is not enabled, the evaluation of debug
descriptions will not occur,
reducing the system load.
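Here is a small sketch of the autoclosure technique. `LazyLogger` and `ExpensiveModel` are hypothetical names; the call counter exists only to show that the message expression is not evaluated while debug logging is off:

```swift
import Foundation

/// Hypothetical logger whose message argument is evaluated lazily.
final class LazyLogger {
    var isDebugEnabled = false
    private(set) var lines: [String] = []

    /// The compiler wraps the argument expression in a closure, so it is
    /// evaluated only if `message()` is actually called.
    func debug(_ message: @autoclosure () -> String) {
        guard isDebugEnabled else { return }
        lines.append("[DEBUG] \(message())")
    }
}

/// Stands in for an object whose description is expensive to build.
final class ExpensiveModel {
    private(set) var describeCalls = 0

    func describe() -> String {
        describeCalls += 1
        return "state dump"
    }
}
```

The call site looks like an ordinary function call, `logger.debug(model.describe())`, but `describe()` runs only when debug logging is enabled.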
While logs indeed provide a lot of useful information, they can be
cumbersome due to their large volume.
Therefore, it's important to consider various optimizations.
One such approach is to transmit a detailed state of the
application instead of full logs.
For example, you could send information about which screens
the user navigated through,
which API requests were made,
what settings were configured, and which key business logic flags were set.
Let's consider the example of a notification service extension in iOS.
This extension is triggered when a push notification with
a specific flag is received.
The system then allows us to modify the content of this notification,
or even generate multiple notifications instead of one.
For our business logic, we make an HTTP request to our API, query a
database, save changes, and much more.
However, there is a catch.
iOS strictly enforces a 24 MB RAM
limit for your extension.
If you exceed this limit, the system will immediately terminate
your process without any notification.
This means the user won't even know that something went wrong, and
no events will be logged in your metrics or crash reporting tools.
Knowing that the notification service extension is not
fully executing is not enough.
It's crucial to know at which stage it fails.
So we need to track every step of our process, send some kind of result
and ensure that we don't introduce so much overhead that we ourselves become
the cause of exceeding the RAM limit.
A simple solution is to use an option set, where each bit represents a
specific step in the notification service extension process, and one of
these bits indicates that the extension has reached its logical conclusion.
If we complete a step, we set a corresponding bit to 1.
Each time a change occurs, we should save the option set value to disk.
Then, on the next launch, we can check the value of the last option set to determine
whether all necessary steps were completed and whether the process reached its end.
If we detect any deviation, we can send a metric.
This way you can see if your extension frequently encounters errors or
is being terminated by the system due to exceeding the RAM limit.
Thus, instead of a large volume of logs, you can focus on capturing
specific key events and states,
which will not only save resources, but also simplify data analysis.
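The option set approach could look roughly like the sketch below. The step names, the `StepTracker` type, and storing the value as a single byte in a file are assumptions for illustration; in a real extension the file would likely live in a shared App Group container so the main app can read it.

```swift
import Foundation

/// Steps of a hypothetical notification service extension run, one bit each.
struct ExtensionSteps: OptionSet {
    let rawValue: UInt8

    static let started         = ExtensionSteps(rawValue: 1 << 0)
    static let apiRequestDone  = ExtensionSteps(rawValue: 1 << 1)
    static let databaseUpdated = ExtensionSteps(rawValue: 1 << 2)
    static let finished        = ExtensionSteps(rawValue: 1 << 3)

    static let all: ExtensionSteps = [.started, .apiRequestDone, .databaseUpdated, .finished]
}

/// Persists progress as a single byte so the tracking itself stays cheap.
final class StepTracker {
    private let fileURL: URL
    private(set) var steps: ExtensionSteps = []

    init(fileURL: URL) { self.fileURL = fileURL }

    /// Marks a step as completed and saves the new value to disk right away.
    func complete(_ step: ExtensionSteps) {
        steps.insert(step)
        try? Data([steps.rawValue]).write(to: fileURL, options: .atomic)
    }

    /// Returns the steps saved by the previous run, or nil if nothing was saved.
    static func lastRun(at fileURL: URL) -> ExtensionSteps? {
        guard let byte = (try? Data(contentsOf: fileURL))?.first else { return nil }
        return ExtensionSteps(rawValue: byte)
    }
}
```

On the next launch, `ExtensionSteps.all.subtracting(lastRun)` gives the steps that never completed, which is a compact value to send as a metric.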
With a well structured architecture in your application, implementing
various metrics becomes much easier.
Honestly, there is nothing fundamentally new here.
It's important to follow established patterns, with an emphasis on
adhering to the SOLID principles,
especially the single responsibility principle.
This will help you better monitor the areas in your code that require attention.
For example, let's consider how the coordinator pattern can help us.
Using the coordinator pattern for navigation allows you to clearly
track the user's path through the scenes and their current state.
Each coordinator
calls another, and thus you can easily compile all the information about the
user's journey into an optimized string, such as registration, enter code, main
screen, more screen, settings, or even minimize it to something like this,
where each designation represents a specific screen or transition.
This data can be easily decoded for analysis.
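One possible encoding is sketched below. The screen list and the one-letter codes are invented for this example and are not the designations from the talk:

```swift
import Foundation

/// Screens of a hypothetical app; raw values are short codes chosen for this sketch.
enum Screen: String, CaseIterable {
    case registration = "R"
    case enterCode = "C"
    case main = "M"
    case more = "O"
    case settings = "S"
}

/// Records screens in the order in which coordinators present them.
struct JourneyRecorder {
    private(set) var screens: [Screen] = []

    mutating func didShow(_ screen: Screen) {
        screens.append(screen)
    }

    /// Compact form with one character per screen.
    var encoded: String {
        screens.map(\.rawValue).joined()
    }

    /// Restores the list of screens from the compact form; unknown codes are skipped.
    static func decode(_ string: String) -> [Screen] {
        string.compactMap { Screen(rawValue: String($0)) }
    }
}
```

Each coordinator would call `didShow(_:)` when it presents its screen, and the encoded string can be attached to a metric or a crash report.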
You can also use the coordinator pattern to determine the last screen
the user was on before closing the app.
You can send such a metric based on the hypothesis that if many users close the
app while on a specific screen, there is likely something wrong with it, or
users are unable to exit the screen, prompting them to restart the app.
To achieve this, you simply need to identify the active coordinator
in the applicationWillTerminate method of the AppDelegate, ask it
for the current screen, log this information, and then send it to
analytics on the next app launch.
With the coordinator pattern, this can be done quite easily
while keeping your code clean.
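A sketch of the storage side of this metric follows. The `KeyValueStore` protocol and `LastScreenTracker` type are hypothetical; in an app, the store could be backed by UserDefaults, and the screen name would come from the active coordinator.

```swift
import Foundation

/// Minimal key-value storage abstraction so the sketch runs without UIKit.
protocol KeyValueStore: AnyObject {
    func set(_ value: String?, forKey key: String)
    func string(forKey key: String) -> String?
}

/// In-memory implementation; an app would use persistent storage instead.
final class InMemoryStore: KeyValueStore {
    private var values: [String: String] = [:]
    func set(_ value: String?, forKey key: String) { values[key] = value }
    func string(forKey key: String) -> String? { values[key] }
}

final class LastScreenTracker {
    private let store: KeyValueStore
    private let key = "lastScreenBeforeTermination"

    init(store: KeyValueStore) { self.store = store }

    /// Intended to be called from applicationWillTerminate(_:) with the
    /// current screen reported by the active coordinator.
    func appWillTerminate(currentScreen: String) {
        store.set(currentScreen, forKey: key)
    }

    /// Intended to be called on the next launch: returns the saved screen
    /// once and clears it, so the metric is sent only one time.
    func consumeLastScreen() -> String? {
        defer { store.set(nil, forKey: key) }
        return store.string(forKey: key)
    }
}
```

The value returned by `consumeLastScreen()` is what you would send to analytics after launch.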
We can also build an app start time metric based on how quickly the user gains
access to the app after launching it.
Again, using the coordinator, you can measure the time difference
from app startup to the moment the app reaches a specific state
according to your business logic.
Similarly, you can monitor how quickly users complete important flows in
your application, such as payments.
If users suddenly start taking significantly longer to complete a payment
flow, you will notice it right away.
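Both the app start time and flow duration metrics come down to measuring the time between two events. A minimal sketch, assuming a hypothetical `FlowTimer` type with an injectable clock:

```swift
import Foundation

/// Measures how long named flows take, for example "appStart" or "payment".
/// Hypothetical type for illustration; not thread-safe as written.
final class FlowTimer {
    private var startTimes: [String: Date] = [:]
    private let now: () -> Date

    /// `now` is injectable so the timer can be tested with a fake clock.
    init(now: @escaping () -> Date = { Date() }) {
        self.now = now
    }

    /// Remembers the moment the flow started.
    func start(_ flow: String) {
        startTimes[flow] = now()
    }

    /// Returns the elapsed seconds, or nil if the flow was never started.
    func finish(_ flow: String) -> TimeInterval? {
        guard let start = startTimes.removeValue(forKey: flow) else { return nil }
        return now().timeIntervalSince(start)
    }
}
```

A coordinator could call `start("payment")` when the flow begins and `finish("payment")` when it reaches the success screen, then send the result as a metric. Wall-clock time can jump if the device clock changes, so a monotonic time source may be a better fit in production.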
Additionally, you can work with the API layer by recording
which requests were made during app startup or during its operation.
This data can be useful for understanding the sequence of user actions and
identifying potential issues, as well as spotting bottlenecks during app startup.
This will be easy to implement if you have clearly defined
your API and network entities.
Another interesting and useful metric I recommend is Annoyed Tap.
This metric tracks situations where the user repeatedly taps on the same UI
element, which may indicate frustration due to something not working properly.
Such a metric will be easy to implement if you have a clearly defined entity that
handles the business logic of the screen,
such as an interactor, controller, view model, et cetera.
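As a rough sketch of how such a metric might be detected, assume a hypothetical `AnnoyedTapDetector` owned by the screen's view model or interactor. The threshold of four taps within two seconds is an arbitrary assumption and should be tuned on real data:

```swift
import Foundation

/// Flags an "annoyed tap": `threshold` taps on the same element within `window` seconds.
struct AnnoyedTapDetector {
    let threshold: Int
    let window: TimeInterval
    private var taps: [String: [TimeInterval]] = [:]

    init(threshold: Int = 4, window: TimeInterval = 2.0) {
        self.threshold = threshold
        self.window = window
    }

    /// Registers a tap and returns true when the element should be reported.
    mutating func registerTap(on elementID: String, at timestamp: TimeInterval) -> Bool {
        // Keep only the taps that are still inside the time window.
        var recent = (taps[elementID] ?? []).filter { timestamp - $0 <= window }
        recent.append(timestamp)
        if recent.count >= threshold {
            taps[elementID] = []   // reset so one burst yields one report
            return true
        }
        taps[elementID] = recent
        return false
    }
}
```

When `registerTap` returns true, the owning entity can send a metric with the element identifier and the current screen.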
That's it for today.
Thank you for your time and interest in the topic.
I hope you now have a better understanding of how to implement observability
in your iOS applications.
Best of luck with your huge developments and thank you once
again for being here today.
Bye-Bye.