Abstract
There are two challenges in observability: uncertainty in prognosis decisions (false positives and false negatives in failure predictions) and discovering causal connections in diagnosis. We address these by modeling spatio-temporal uncertainty for prognosis and by using knowledge representation and graph databases for causal diagnosis.
Distributed systems are complex and prone to failures. With more and more enterprises migrating their on-prem systems to the cloud, there is also an increased risk of failures. These failures often happen owing to the unpredictability of production system workload and usage patterns, and the consequent emergent response of distributed systems, which cannot be easily envisaged during design and implementation. Principles of observability, built on three pillars, namely logging, monitoring, and metrics, attempt to observe the internal state of the system in order to perform prognosis of failure modes and post-event failure diagnosis. Observability and chaos engineering techniques are usually combined by conducting planned and thoughtful experiments (injecting chaos) and uncovering weaknesses in the system by analyzing its runtime data (observability).
There are multiple challenges in proactively discovering failure modes during these experiments. First, the logging and monitoring data and their visualization from observability tools are often too overwhelming and voluminous to be 'actionable'; there is a data deluge leading to 'data fatigue'. Secondly, there is significant uncertainty in decisions to classify an observed response behavior as either normal or anomalous. Both false positives and false negatives have impact. Over the course of a series of experiments, this uncertainty manifests itself in both temporal and spatial dimensions. The earlier the decision point in time (before the actual expected failure), the greater the possibility of false alarms and possible 'alert fatigue'; the later the decision point (i.e., closer to the expected failure in time), the less useful the decision, since it is most likely too late to take action. Moreover, the longer the causal chain (the spatial separation of the original cause and its later effect), the greater the uncertainty. We propose to use spatio-temporal models to address this uncertainty in prognosis.
Another challenge in prognosis and diagnosis is determining causality and the connections between events. Often, with huge amounts of observability data, the causal connections between various connected events are not very clear. We propose to use knowledge representation and graph databases to automate the discovery of such causal connections.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
The title of my talk is Taming the Spatio-Temporal Uncertainty in Observability. Distributed systems are complex and are prone to failures. These failures often occur due to the unpredictability of production system workload and usage patterns, and the consequent emergent runtime response of distributed systems, which cannot be easily predicted during design. Observability and chaos engineering techniques are usually combined by conducting planned and thoughtful experiments to uncover weaknesses in the system by analyzing its runtime data.
Outcomes of observability: observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. In practice, however, observability must enable us to accomplish two objectives. First is failure diagnosis, which is essentially inferring the cause from effects, and second, failure
prognosis, which means enabling early warnings
of any impending failure based on current observed
behavior and projected failure pathways. So the objectives
of observability are prognosis and diagnosis.
So data must be interpretable and actionable.
Every tool promises more observability,
but often we see that more data and interactive dashboards lead to data deluge and decision dilemma. The challenge is to digest massive quantities of data to find patterns and correlate seemingly unrelated events for performing prognosis and diagnosis. There are multiple challenges in proactively discovering failure modes and performing diagnosis during these experiments. Logging and monitoring data and their visualization from observability tools are often too overwhelming and voluminous to be actionable. There is a very low signal-to-noise
ratio since many systems interact.
Semantic reconciliation and correlation of all that data is
very difficult. Large volumes of events, nondeterminism, and the reuse of third-party components aggravate the challenge. While many may argue that the more data the merrier, in reality the more data you have, the fewer insights you can discover, due to noise and nondeterminism. This diagram gives
an idea of how one can take an integrated view
of prognosis and diagnosis. As you can see,
there is a fault or an anomaly at a particular instant.
Initially, the subsequent event pathways after this event could take three possible forms. First, it could lead to a failure. Second, it could be a transient fault and may not lead to any serious failure. Third, it could actually be a false alarm and not really a fault as originally believed. As you can see, there is significant uncertainty
here. There is significant uncertainty in deciding to classify
an observed response behavior as either normal or anomalous. Both false positives and false negatives have impact. Over the course of a series of experiments, this uncertainty manifests itself in both temporal and spatial dimensions. The earlier the decision point in time, before the actual expected failure, the greater the possibility of false alarms and possible alert fatigue; the later the decision point, that is, the closer to the expected failure in time, the less useful the decision, since it is most likely too late to take action. Moreover, the longer the causal chain, that is, the spatial separation of the original cause and its later effect, the greater the uncertainty.
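To make this temporal trade-off concrete, here is a small, hypothetical Python sketch (my own illustration, not part of the talk): it simulates a noisy metric drifting toward a failure threshold and scores alert rules of varying sensitivity, showing that rules that fire earlier buy more lead time at the cost of more false alarms on healthy runs. All thresholds and noise levels are invented.

```python
"""Toy illustration (not from the talk): earlier alerts buy lead time but
raise the false-alarm rate; later alerts are precise but arrive too close
to the failure to be actionable. All numbers are made up."""
import numpy as np

rng = np.random.default_rng(42)
T, FAILURE_LEVEL = 200, 1.0          # time steps per run, failure threshold

def failing_run():
    # Metric drifts linearly toward the failure level, plus noise.
    return np.linspace(0.0, FAILURE_LEVEL, T) + rng.normal(0, 0.15, T)

def healthy_run():
    # Metric stays near its baseline, with the same noise.
    return rng.normal(0.2, 0.15, T)

def evaluate(alert_level, n_runs=500):
    false_alarms, lead_times = 0, []
    for _ in range(n_runs):
        if healthy_run().max() >= alert_level:       # alert on a healthy run
            false_alarms += 1
        run = failing_run()
        alert_t = np.argmax(run >= alert_level)      # first time the alert fires
        fail_t = np.argmax(run >= FAILURE_LEVEL)     # actual failure time
        lead_times.append(max(fail_t - alert_t, 0))
    return false_alarms / n_runs, float(np.mean(lead_times))

for level in (0.4, 0.6, 0.8, 0.95):
    fp_rate, lead = evaluate(level)
    print(f"alert level {level:.2f}: false-alarm rate {fp_rate:.2f}, "
          f"mean lead time {lead:.0f} steps")
```

The exact numbers are irrelevant; the point is only that lead time and false-alarm rate pull in opposite directions.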
In causal analysis, there is always confusion and uncertainty between what is a cause, what is a correlation, what is a consequence, what is a confounder, what is a coincidence, and what is an association. There is always confusion between these various aspects and concepts. Often, with huge amounts of observability data, the causal connections between various connected events are not very clear.
What chain of events led to the failure event? Often a
failure is never the result of a single chain of events,
it's a network. Multiple conditions lead to a failure event.
The causality challenge has been a classical one from Aristotelian times to now. It cannot be easily solved, and often this problem is underestimated. There is an assumption that data analysis can determine causality. This is far from true. In any prognosis and diagnosis process there is uncertainty. As we saw earlier, in prognosis the uncertainty is: given the current state of the system, what pathways will the
system state traverse in time? Will it lead to a failure or to normal behavior? If it is going to fail, where will it fail, which layer will fail, which component, and when is it likely to fail? So there is uncertainty in space, the space here being the architecture layer space, and there is uncertainty in time as to when the failure will occur. For example, if there is a spike in CPU and memory
usage accompanied by other events, will that likely
cause a failure of a service through a long causal chain
of connected events? If so, when and where will that occur?
Note that this problem can be modeled as a spatio-temporal causal problem. We can draw inspiration from other disciplines like city traffic modeling, weather modeling, epidemiology, cancer treatment, and social networks. For example, in cancer prognosis, one could do prognosis as to where the source of the cancer is and how the metastatic pathways take place in the patient; that is spatio-temporal causal problem modeling. In other disciplines like social networks, the problem is very similar to the one in observability.
Spatio-temporal causal data analysis is an emerging research area, owing to the development and application of novel computational techniques that allow for the analysis of large spatio-temporal databases. Spatio-temporal models arise when data are collected across time as well as space and have at least one spatial and one temporal property. This is true for observability data: every data point has a spatial property and a temporal property.
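As a minimal illustration of what "every data point has a spatial and a temporal property" could look like for observability data, here is a hypothetical Python sketch; the field names are my own, not from the talk. Each event carries a temporal coordinate (timestamp) and spatial coordinates (architecture layer, service, node).

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ObservabilityEvent:
    """One observability data point with explicit spatial and temporal coordinates."""
    timestamp: datetime      # temporal property
    layer: str               # spatial property: architecture layer (e.g. "application")
    service: str             # spatial property: which service emitted it
    node: str                # spatial property: which host or pod
    metric: str              # what was measured (e.g. "cpu_usage")
    value: float             # the observed value

# Example: a CPU spike on a payment service in the application layer.
event = ObservabilityEvent(
    timestamp=datetime.now(timezone.utc),
    layer="application",
    service="payment-service",
    node="node-17",
    metric="cpu_usage",
    value=0.97,
)
print(event)
```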
Here is a very high-level approach we suggest to taming the spatio-temporal causal uncertainty. Data from the system under normal conditions is ingested into a spatio-temporal causal database. The data under fault-injection conditions is also ingested into this database. The spatio-temporal causal model consists of multiple techniques: statistical techniques, time series analysis, association rules, data-driven prediction techniques, Bayesian networks, pattern recognition, Markov models, cluster analysis, and so on. All this analysis is done in both time and space.
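As one hedged example of the kind of technique listed above, here is a small Python sketch of a discrete Markov model over coarse system health states. The states and the labeled sequences are invented for illustration; the idea is only that transitions estimated from ingested normal and fault-injection runs can yield the probability of reaching a failed state within the next few steps.

```python
"""Sketch of a Markov model over coarse health states (illustrative only).
Transitions would be estimated from ingested normal and fault-injection data;
the sequences below are made up."""
import numpy as np

STATES = ["healthy", "degraded", "failed"]
IDX = {s: i for i, s in enumerate(STATES)}

# Hypothetical labeled state sequences from past experiment runs.
sequences = [
    ["healthy", "healthy", "degraded", "healthy"],
    ["healthy", "degraded", "degraded", "failed"],
    ["healthy", "degraded", "failed"],
    ["healthy", "healthy", "healthy", "degraded", "degraded", "failed"],
]

# Estimate the transition matrix by counting observed transitions.
counts = np.ones((3, 3))                      # Laplace smoothing
for seq in sequences:
    for a, b in zip(seq, seq[1:]):
        counts[IDX[a], IDX[b]] += 1
P = counts / counts.sum(axis=1, keepdims=True)

def prob_failure_within(start: str, steps: int) -> float:
    """Probability of having visited 'failed' within `steps` transitions."""
    Q = P.copy()
    Q[IDX["failed"]] = 0.0                    # make 'failed' absorbing
    Q[IDX["failed"], IDX["failed"]] = 1.0
    dist = np.zeros(3)
    dist[IDX[start]] = 1.0
    dist = dist @ np.linalg.matrix_power(Q, steps)
    return float(dist[IDX["failed"]])

print("P(failure within 3 steps | degraded) =",
      round(prob_failure_within("degraded", 3), 3))
```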
In addition, there is also a knowledge graph built by discovering semantic connections, what we call qualitative reasoning, derived from text outputs like logs, which are very important for certain kinds of reasoning and causal chain analysis.
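To make the knowledge-graph idea more tangible, here is a small hypothetical sketch using networkx; the events and their "causes" edges are invented, not from the talk. Semantic connections extracted from logs become directed edges, and causal chains between an early symptom and a failure become path queries over the graph.

```python
"""Sketch of a causal knowledge graph over events extracted from logs.
The events and edges are invented; in practice they would be derived
from semantic analysis of log text and ingested telemetry."""
import networkx as nx

G = nx.DiGraph()

# Each edge asserts a directed "causes" relation between two observed events.
edges = [
    ("config_change", "connection_pool_exhausted"),
    ("cpu_spike", "request_latency_up"),
    ("connection_pool_exhausted", "request_latency_up"),
    ("request_latency_up", "timeout_errors"),
    ("timeout_errors", "checkout_service_failure"),
]
G.add_edges_from(edges, relation="causes")

# Diagnosis query: what causal chains lead from an early event to the failure?
for path in nx.all_simple_paths(G, "config_change", "checkout_service_failure"):
    print(" -> ".join(path))

# Prognosis query: given an observed event, what downstream events are reachable?
print("reachable from cpu_spike:", nx.descendants(G, "cpu_spike"))
```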
Also, it is to be emphasized that it is important not to aim at 100% automation of prognosis and diagnosis, but to treat the prognosis and diagnosis engine as complementary to human expertise. The engine provides various recommendations to the human user, such as most likely fault pathways, anomalous behavior alerts, imminent likely failure events, time to failure, hierarchical failure alarms, failure recovery recommendations, and event correlation.
As you can see, all these are recommendations that the
human expert could interact with and dig
deeper and explore further to enable better
prognosis or better diagnosis.
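As a hedged sketch of how such an engine might surface its output to a human, here is a hypothetical Python structure (the field names and sample values are mine, not from the talk) in which every finding is a recommendation with a confidence score and supporting evidence, rather than an automatic action.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Recommendation:
    """A single suggestion from the prognosis/diagnosis engine for a human to review."""
    kind: str                 # e.g. "likely_fault_pathway", "time_to_failure", "event_correlation"
    summary: str              # human-readable description
    confidence: float         # 0.0 - 1.0; never presented as certainty
    evidence: List[str] = field(default_factory=list)   # pointers to the events and metrics behind it

recommendations = [
    Recommendation(
        kind="likely_fault_pathway",
        summary="cpu_spike -> request_latency_up -> timeout_errors -> checkout_service_failure",
        confidence=0.62,
        evidence=["trace:abc123", "log:payment-service:14:02:11"],
    ),
    Recommendation(
        kind="time_to_failure",
        summary="checkout_service_failure likely within ~15 minutes if the latency trend continues",
        confidence=0.40,
        evidence=["metric:request_latency_p99"],
    ),
]

# The human expert reviews, digs deeper into, or dismisses each recommendation.
for r in sorted(recommendations, key=lambda r: r.confidence, reverse=True):
    print(f"[{r.confidence:.0%}] {r.kind}: {r.summary}")
```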
In summary, data deluge is a huge challenge in
observability. Integrated prognosis and diagnosis
should be the outcome of observability. The problem here is modeled as spatio-temporal causal uncertainty, both in prediction and in causal analysis. And last, machines should complement human expertise; they should not and cannot replace human expertise in conducting intelligent prognosis and diagnosis.