Transcript
Hi, and welcome to this talk today on
observability and some of the data science that you may need to consider when
you're looking at an observability solution. My name is Dave
McAllister and I'd like to thank Conf42 for letting me come and share some
of my observations about living with observability,
especially in production systems.
I currently work for Nginx, the open source web server
and reverse proxy. I focus on
a lot of outreach. I've been an open source geek from way back, actually before
it was even named open source, and I've continued in
that space for quite a while
now. So with that, let's get started on talking about
some of the data side. Well, first of all,
let's point out that quite often you hear a number of
rules or laws that fall into place.
The 22 Immutable Laws of Marketing by
Trout and Ries, the twelve immutable rules
for observability, or the 70 Maxims of
Maximally Effective Mercenaries, which is really quite hard to
say. And of course, some of these laws are enforced a little harder
than others; the law of gravity, for instance, is pretty strictly enforced.
But we're going to talk about just a couple of rules when
it comes to observability. So let's get started.
Well, first we need to sort of talk about
why we're even interested in this space. And the short
answer is that we're all on this cloud native journey, and we're
doing it to be able to increase velocity, to keep up with the way things
are transforming and with customer
demands, and to outpace our competition. We need to be agile and
we need to be responsive. This has rapidly
accelerated, especially over the last couple of years, where we've had even
more of an online presence. And more
and more, we are engaging our users and
our customers through these digital channels. So every customer is
on this cloud native journey, whether their goal is migration,
modernization, or new app development.
This is what companies are doing to increase that velocity,
and they're finding that cloud native technologies help them do so,
but cloud increases complexity.
There are more things to monitor, if you will.
Cloud is an enabler of transformation, but how do you maximize
this investment you're making in cloud, and make sure you have visibility
into everything that's happening in that environment? The
farther along you are in adopting Kubernetes, using
microservices, or looking at service meshes or other cloud native technologies
that don't even fit on this slide, and there are thousands of possibilities,
the harder it is to have visibility into everything that's happening.
You can't just monitor a monolithic stack. You can't just look at the flashing
lights on the server. You've got to look at the hybrid,
the ephemeral, the abstracted infrastructure. And this
becomes very hard to manage and understand.
And so that visibility, what we call observability, is table stakes. It is the basis
we need to make sure that we are modernizing successfully
and efficiently. And because of
that, while we've been monitoring for quite some time,
we have to make sure that monitoring evolves into
observability. The reason is that we need
to be able to look at these new complexities.
Okay, you've heard of observability, obviously, but what
is it? Well, it's the unique data that allows
us to understand just what the heck's going on in these apps and infrastructures.
And it's a proxy for our customer experience,
which we've learned is more important than ever.
So monitoring kind of keeps an eye on things that we
know could fail, or the known knowns. And observable
systems allow teams to look at issues that can
occur, including unknown failures and
failures with root causes that are buried somewhere in a giant maze
of passageways that all look identical.
It lets us ask questions that we didn't think of before
something goes wrong. So observability goes
beyond monitoring to detect these unexpected
failure conditions, as well as making sure that we have the
data behind the scenes to be able to
figure out what went wrong. And getting
these insights and taking those actions are something traditional
monitoring tools really weren't designed
to handle. So keep in mind that observability and monitoring
work hand in hand. Observability is the data. It allows us
to find things that are unexpected. Monitoring keeps an eye
on things that we know can go wrong. They both work closely
together. But observability is
built for certain challenges. Over here,
you're looking at what's called the Cynefin framework, and hat tip
to Kevin Brockhoff for this drawing.
When we look at a monolith, we're in a simple environment. You look at it,
you see what's going on, and this is the best-practice domain:
sense, categorize, and respond. It's broken, fix it,
if you will. But we've added two new dimensions.
One, we've made things more complicated. We've added
microservices. We've added loosely coupled communications
pathways. We're building these things so that there are
lots of building blocks, and those building blocks may interact
in very different and complicated ways.
Likewise, at the same time, we've added another totally
different environment, which is the ephemeral or the
elastic approach here. Now things scale
as needed and go away. So when
you go to look for something, it may no longer be there, it's ephemeral.
Think serverless, for instance. A serverless function does its job and
gets out of the way here. And so putting those two together, we end up
in a complex environment. Now, complex environments are not
just sense-and-respond; they're not passive. We want to actually
probe into what's going on inside of this environment,
then sense what's going on and respond to it.
Our cloud native world lives in this complex,
emergent world of functionality. And so
when we look through all of these pieces, it's very clear that we
need some very unique capabilities.
Traditional monitoring cannot save us alone here.
Multitenancy can be really painful inside of here.
And honestly, failures don't exactly
repeat. In fact, it's quite unusual for a failure to repeat exactly the
same way. But of course, observability gives us,
and you've heard this before, better visibility,
precise alerting, and end-to-end causality. So we know
exactly what happened, where it happened, why, what the system believed
was happening, and when it happened, which gives us a reduced
mean time to clue and mean time to resolution.
And when something goes wrong, we can get answers, and we can be proactive in adding
that to the monitoring stack so that we can watch for it in the
future. But this is all
data intensive, and the amount of data that's flowing into our
systems now has massively increased. There's just so
much more data coming in between metrics,
distributed traces, and logs, which make up those
three categories. So rule one: use all of
your data to avoid blind spots. And I kind of mentioned
this, but observability really is a data problem. We have a
ton of data coming in here. Generally speaking,
we look at three classes of data in observability: metrics,
traces, and logs. Or: do I have a problem? Where is
the problem? Why is the problem happening? And each of
these pieces becomes equally important.
They all actually overlap.
So logs can yield metrics,
traces can yield metrics, and metrics can point you at the right
trace or the right log. Each of these pieces gives us additional information for
monitoring and assists in our recognition of issues, helping
find those underlying root causes. Today,
we generate tons of data. I've got hundreds of services
calling each other here. Every transaction generates on the order of
kilobytes of metadata about that transaction.
Multiply that by even a small number of concurrent requests,
and you can suddenly have megabytes per second, or 300 GB
per day, coming in for a single concise
application space. You need all this data to
inform your decision making. In fact,
it can be terabytes of data, but you
need to be aware of where your data limitations
are, and where your data decisions can cause problems.
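As a rough illustration of how those three signals overlap, here's a minimal Python sketch. The log format, field names, and numbers are all invented for the example; the point is simply that metrics such as request rate and error rate can be derived from structured logs, the same way span durations can be rolled up into latency metrics.

```python
from collections import Counter

# Hypothetical structured log records; real logs would come from files or a pipeline.
logs = [
    {"ts": 1, "service": "checkout", "status": 200, "duration_ms": 42},
    {"ts": 1, "service": "checkout", "status": 500, "duration_ms": 310},
    {"ts": 2, "service": "checkout", "status": 200, "duration_ms": 38},
    {"ts": 2, "service": "cart",     "status": 200, "duration_ms": 12},
]

# Metrics derived from logs: request count and error count per (second, service).
requests = Counter((r["ts"], r["service"]) for r in logs)
errors = Counter((r["ts"], r["service"]) for r in logs if r["status"] >= 500)

for key in sorted(requests):
    print(key, "requests:", requests[key], "errors:", errors.get(key, 0))
```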
So going forward here,
data is this driving factor, and it drives things like our
new artificial intelligence and machine learning directed troubleshooting.
It also has this thing called cardinality. So cardinality
is the ability to slice and dice.
So it's not just that I've got a metric that says my CPUs
are running at 78%. It can be broken down:
this specific CPU is running at this level,
or even this specific core, or this core running in this
virtual machine on this particular infrastructure environment.
We also are now living with streaming data.
Batch just doesn't cut it when we're
analyzing this data. And we want
as much data as we can get: full-fidelity metrics and as much information
from our traces as possible. At the same point in time,
we also want it to be standards-based and available as open source
so that we're never locked into those various pieces.
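To make the cardinality point concrete, here's a back-of-the-envelope sketch in Python. The counts are made up, but it shows why each extra dimension you allow yourself to slice by multiplies the number of distinct time series you have to store and query.

```python
# Rough cardinality math: each label you can slice by multiplies the series count.
hosts = 200          # virtual machines (assumed)
cores_per_host = 8   # per-core CPU metrics
containers = 50      # containers per host (assumed)
regions = 3

cpu_series = hosts * cores_per_host                  # per-core utilization series
container_series = hosts * containers                # per-container CPU series
total = (cpu_series + container_series) * regions    # sliced again by region label

print(f"per-core series:      {cpu_series}")
print(f"per-container series: {container_series}")
print(f"with region label:    {total}")
```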
But what happens if you miss ephemeral data? What happens
if the data that you need is no longer available,
or your data shows something that you can't drill into?
Well, there are some interesting things.
First of all, let's take a look at this massive amount
of data coming in. Quite often you'll hear people say that
data in mass is noise.
Well, there are ways of dealing with that noise, and in the
observability space they really come down to either filtering
the signals, which can be linear smoothing, which smooths the data
but destroys the sharp edges, band-pass filtering,
which removes the outliers,
or smart aggregations.
We'll go a little bit more into sampling; sampling can be random
grab-one, head-based, tail-based, or post-predictive dimensionality
reduction. But the real answer is improving the
visualization. Use the data, but improve the visualization so
that you're not lost in the complexity of the data and so that you can drill
into the data you need.
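Here's a minimal sketch, with made-up latency numbers, of why linear filtering is a double-edged sword: a simple moving average quiets the noise, but it also blurs away exactly the one-sample spike you might need to troubleshoot.

```python
def moving_average(xs, window=3):
    # Simple linear filter: each point becomes the mean of its trailing window.
    out = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

latency_ms = [20, 22, 19, 21, 250, 20, 23, 21]   # one genuine spike at index 4 (made up)
smoothed = moving_average(latency_ms)

print("raw max:     ", max(latency_ms))            # 250 ms -- the edge case you care about
print("smoothed max:", round(max(smoothed), 1))    # much smaller; the spike is blurred away
```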
So I mentioned sampling. So let's talk a
little bit about sampling. So trace sampling
is a common approach to doing this. Traces are noisy.
They produce a tremendous amount of data because the trace is actually
looking at your request as it moves through your system. Especially if
you move from a front end environment, your client's laptop,
for example, all the way through the back end, through the various infrastructure
pieces, there can be a huge amount of data here. And so quite
often you see people do trace sampling, and this
routinely misses troubleshooting edge cases. It can
also miss intermittent failures because it's not
being sampled correctly. If you don't understand the service dependencies
in a microservices environment in particular, you can get alert storms,
because several things may fail because of
one point. You also need to be able to look at not
being simplistic on your triage. So you're seeing an
error happen. You need the data to be able to determine where the error is
actually occurring. And so if you have too
little data, then you end up with a simple approach to this and you
may find yourself spending a lot of time actually looking for where the underlying
cause lives. And then if you're just looking
at observability as sampling, you're separating your application
environment and your infrastructure environment, and as we all know, those
two pieces work pretty closely together and can have impacts
beyond the traditional oh, I'm out of space,
I'm out of cpus. But it can even have things such as noisy
neighbor problems or communications bandwidth problems.
So all these things become a problem with observability sampling.
So here we're looking at typical trace sampling. And there
are two types here: head-based and tail-based.
No one in their right mind really goes after some
of the others; it's easier not to do this than to attempt dimensionality
collapse. Tail-based sampling looks at the
whole trace and then decides whether to keep it or not.
And so a tail-based sample catches things that are outliers
and can catch things that you know are errors.
However, the problem is that a trace may not lead
to an obvious known error. Remember, we're looking at unknown unknowns.
And so all of a sudden that
sample, while reducing the data flow, may remove
the single trace that would answer your question
when you're looking into the underlying causes.
Head-based sampling is more random: the decision is made up front,
before you know how the trace turns out. Okay, I've got 100
traces, I'm going to save ten of these. It's a little
worse at this. Think about it: 100 traces, I'm going to save
five. That means I have a 95% chance of
missing that outlier. So tail-based sampling catches my
outliers, catches my known errors. Head-based sampling may
catch them, but probably misses them, depending on how much you sample.
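A minimal sketch of that difference, with entirely made-up traces: a head-style decision is essentially a coin flip made before the outcome is known, while a tail-style decision can look at the finished trace and keep the error and the slow outliers. The thresholds and rates here are arbitrary.

```python
import random

random.seed(7)

# Hypothetical traces: one in a thousand carries the error we need to explain.
traces = [{"id": i, "error": (i == 500), "duration_ms": random.randint(5, 200)}
          for i in range(1000)]

# Head-based: decide up front, before the outcome is known -- essentially random.
head_kept = [t for t in traces if random.random() < 0.05]

# Tail-based: decide after the trace completes, so errors and slow traces can be kept.
tail_kept = [t for t in traces
             if t["error"] or t["duration_ms"] > 190 or random.random() < 0.01]

print("head-based kept the error trace:", any(t["error"] for t in head_kept))  # usually False
print("tail-based kept the error trace:", any(t["error"] for t in tail_kept))  # always True here
```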
But when we start pulling all the data into place, we've got
the ability to go back and reconstruct, if you will,
the scene of the crime. So another example.
So this is sampling. In the first instance here,
data set one is sampled; in data set two there is no sampling.
They are running the same environment, just at different times.
So in sample one, I'm actually pulling
a sample rate that's only looking at pieces of
the data, and this sampling is giving me a selection bias. It's not
necessarily showing me things that are out of range.
So in this particular case, sampling came back and said that the duration for
my traces is in the one to two second range, and that
even at the high end of that range, I've got one trace that fell outside of
it. I'm sure you can simply look at the
data there and start getting some interesting indications
about how much is actually not being looked at. Take the
exact same thing where we're not sampling, showing the
same application and impact, and all of a sudden you can start
seeing some of those errors creeping in. Now my latency distribution
is showing that I have a trace that's in the 29 to 42 second
range. In other words, sampling is causing
you to have a blind spot. And in an ecommerce world,
29 seconds is known as a lost sale. I've given
up purchases for less time than that while waiting for them. But nonetheless,
no sampling is giving me better results and a better
understanding of my user experience than the sampling environments.
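Here's a small, deterministic sketch of the same blind spot with invented numbers: 2,000 traces mostly in the one to two second band, three multi-second outliers, and a 5% "keep every 20th trace" sample. The full data set sees the 40-second trace; the sample never does, and its percentiles look perfectly healthy.

```python
def percentile(values, p):
    xs = sorted(values)
    idx = min(len(xs) - 1, round(p / 100 * (len(xs) - 1)))
    return xs[idx]

# Made-up durations: 2,000 requests mostly in the 1-2 s band...
durations = [1.0 + (i % 100) / 100 for i in range(2000)]
# ...plus a handful of 29-42 s outliers at arbitrary positions (not multiples of 20).
for i, slow in [(333, 29.4), (777, 35.2), (1501, 41.8)]:
    durations[i] = slow

sampled = durations[::20]   # keep every 20th trace: a 5% sample

print("full    max: %.1f s  p99: %.2f s" % (max(durations), percentile(durations, 99)))
print("sampled max: %.1f s  p99: %.2f s" % (max(sampled), percentile(sampled, 99)))
# The sampled view never sees the 29-42 s traces at all.
```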
But I can hear you right now: wait, my metrics tell me everything.
And that can be true. Your metrics are not usually
sampled, particularly for your infrastructure.
They may still end up being sampled for your tracing data,
because tracing data is, in its own right,
ephemeral, stopping when the request stops, as well
as massive. And it comes in not necessarily
as metrics; it comes in a form that you can extract metrics
from, usually rate, errors, and duration, or RED monitoring.
The problem this leads to is that you start
missing those duration results, and you can even miss
some of the alert structures, less so if you're looking at a metrics space
and you're capturing all the metrics data that's inside your traces.
But your duration data may be impacted
by this. After all, when does a trace end
is one of those always interesting questions. So one
whole third of this observability space resides on tracing
data, and duration is probably the single most important
element of the user experience to be looking at
inside that trace data.
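As a hedged sketch of what "extracting RED from traces" can look like, here's a tiny Python example over hypothetical root-span records. The field names and the ten-second window are assumptions, and the median calculation is deliberately crude; real tooling would use proper histograms.

```python
# Hypothetical root-span records for one service over a 10-second window.
spans = [
    {"start_s": 0, "duration_ms": 120,  "error": False},
    {"start_s": 1, "duration_ms": 95,   "error": False},
    {"start_s": 1, "duration_ms": 2400, "error": True},
    {"start_s": 3, "duration_ms": 110,  "error": False},
    {"start_s": 7, "duration_ms": 130,  "error": False},
]

window_s = 10
rate = len(spans) / window_s                              # R: requests per second
error_rate = sum(s["error"] for s in spans) / len(spans)  # E: fraction of failed requests
durations = sorted(s["duration_ms"] for s in spans)
p50 = durations[len(durations) // 2]                      # D: a crude median duration

print(f"rate: {rate:.1f} req/s, errors: {error_rate:.0%}, p50 duration: {p50} ms")
```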
So, TL;DR: your ability
to use observability is dependent on your data source, and your
data source needs to be as complete as possible. But don't
let the chosen data bias your results.
Don't force yourself into selection bias before you have a
chance to understand what selection bias means for you.
Keep it all, understand it all; otherwise you
can't track customer happiness as a proxy. And finally,
getting the data in real time matters, because we need to know when things
go wrong as soon as they possibly can go wrong.
And we'll find out more about that when
we hit rule two. So rule two states,
very simply, operate at the speed and resolution of your app
and infrastructure. And again, this starts falling
into a data problem. But we're going to start by talking about
this thing. Some of you are probably familiar with what's known as the von Neumann
bottleneck. The von Neumann bottleneck, very simply, is the idea
that a computer system's throughput is limited by the speed of
the processors relative to the
rate of data transfer. Our CPUs have
gotten a whole lot faster; our memory is not that much faster,
so the processor can sit idle while it's waiting
for memory to be accessed. And yeah, there are ways around
that, but this is a fairly standard model
for computer systems today: the
memory and the CPU are basically
competing across a transfer bottleneck.
When we look at that from observability, there's a similar
issue here, a little different,
but the resolution of our data, the precision and
accuracy coming in, is massive. The speed
of our data is the deterministic response: how fast
we can get things in. It is usually
less than the resolution of our data, and that impacts the way we
aggregate, analyze, and visualize this data.
And honestly, by the way, data is pretty worthless unless you can do that
aggregation, analysis and visualization.
And that resolution and speed together impact
the insights you get from the data that's coming
in. Pretty straightforward, makes perfect
sense. But when you start looking at this at volume, that's where
life gets interesting. So I've mentioned
precision and accuracy, and we need to talk a little bit about this.
We discuss these things as if they were the same,
but quite often we need to understand that
they're not. Accuracy states that the measure
is correct. Precision says it's
consistent with the other measurements. So each time I measure the
CPU, I want the in-use percentage
to be precise... I mean accurate. I want
it to tell me what's actually happening. Is it running
at 17%? Is it running at 78%?
I also want it, when the CPU
is in the same state, to give me that number again.
So if it looks at this and says it's 17, now it's
78, now it's back to 17, I want to trust that that
is actually consistent with the measurements we're seeing.
Aggregation and analysis can actually
skew our precision and accuracy categories.
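A tiny numeric illustration of that distinction, with made-up gauge readings against a pretend "true" value: one gauge is very consistent but wrong (precise, not accurate), the other is roughly right on average but all over the place (accurate-ish, not precise).

```python
from statistics import mean, pstdev

true_cpu = 17.0  # pretend we know the real utilization

gauge_a = [78.1, 78.0, 78.2, 77.9]   # precise (consistent) but wildly inaccurate
gauge_b = [10.0, 24.0, 12.0, 22.0]   # roughly accurate on average but imprecise

for name, readings in [("A", gauge_a), ("B", gauge_b)]:
    print(f"gauge {name}: mean={mean(readings):.1f} "
          f"(accuracy error {abs(mean(readings) - true_cpu):.1f}), "
          f"spread={pstdev(readings):.1f}")
```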
So here's another example.
And what this is, is looking at the number of requests per second going
through a system. So this one is not duration based. This is literally the number
of requests going through here. And as you can see over the first
ten second viewpoint, my aggregations are
going to happen. If I look at a ten second
aggregation, my average is 13.9 and my 95th
percentile is 27.5. If I look at
just the first 5 seconds, I have a 16.4, and my second
5 seconds has an 11.4.
Both of these miss the fact that one of my
measurements has gone over 30 requests per
second, which happens to be where I had set the
alert. Now again, easy enough to say bring
in the data, show me things when it crosses 30, alert me. But your aggregation
may never show that. The aggregation may show you at under 20,
a nice safe number in all these categories. And so when
you start looking at that functionality, you can see how aggregation
can change what you see.
It's a simple example, but now multiply it by the
hundreds of thousands of requests that can be rolling through your system at any given
moment.
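Here's that same example as a quick sketch. The per-second values are invented so that the averages roughly match the numbers on the slide; the point is that every aggregation sits comfortably under the alert line while one raw measurement is well over it.

```python
# Ten one-second request-rate measurements, invented to roughly match the example.
rps = [12, 14, 9, 16, 31, 11, 13, 10, 12, 11]   # one second crosses the alert line
ALERT = 30

avg_10s = sum(rps) / len(rps)
avg_first_5s = sum(rps[:5]) / 5
avg_last_5s = sum(rps[5:]) / 5

print(f"10 s average:   {avg_10s:.1f}")                        # 13.9, comfortably under 30
print(f"first/last 5 s: {avg_first_5s:.1f} / {avg_last_5s:.1f}")  # 16.4 / 11.4
print(f"any second >= {ALERT}? {any(x >= ALERT for x in rps)}")   # the raw data still knows
```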
So, data resolution, the speed at which we pull data in, and reporting resolution
are never, well, not never,
but seldom ever going to be the same thing. They both can
be problematic. So throwing away data points
means that you can't go back
and reconstruct. So always deliver those data points regardless of
what your reporting structure looks like. And the finer your granularity
is, the more potential precision you have, and the more likely you are
to get the same number for the same results. So this is a
simple little kind of drawing. When you measure something,
when you measure it, it's actually in the middle of your granularity
point somewhere, but you're not quite sure where.
So in this case, if I'm measuring on
a second-by-second basis, where a measurement falls
relative to that second boundary varies by the size
of that boundary. Milliseconds give me finer granularity,
and picoseconds finer still. That granularity
may become more and more important, especially as we scale
things and especially as things get complex,
but at the same point in time, it creates more
data. So there's a trade-off between granularity,
data resolution, and reporting resolution that you need to
consider. And when we bring this in,
we have native resolution, which is how often we collect data,
and chart resolution, which is the aggregation points that
we use. So in this particular case, I can be
showing you that I'm bringing in requests per second, but I'm actually aggregating
these on a ten second basis. And so
we want to be able to look at this in a way that makes sense. We get
to watch things move, we can look for patterns.
Humans are incredibly good at pattern matching, and so we need to be
able to break it down and take a look at what's going on here.
So, native resolution is the data collection, and chart resolution is
the aggregation that we're using for our charts and graphs;
we want speedy data collection and sufficient chart
resolution so that we can understand what's going on.
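A minimal sketch of that native-versus-chart-resolution trade-off, with made-up numbers: data collected at one-second native resolution, charted in ten-second buckets. How you roll the bucket up matters; a mean rollup nearly hides a one-second burst that a max rollup preserves, which is one more reason to keep the raw points and choose the aggregation deliberately.

```python
# One minute of 1-second native-resolution measurements (made up), with a brief spike.
native = [14] * 60
native[37] = 45   # a one-second burst

def rollup(points, bucket, fn):
    # Aggregate consecutive buckets of `bucket` points with the chosen function.
    return [fn(points[i:i + bucket]) for i in range(0, len(points), bucket)]

mean_chart = rollup(native, 10, lambda xs: sum(xs) / len(xs))
max_chart = rollup(native, 10, max)

print("10 s chart (mean):", [round(x, 1) for x in mean_chart])  # spike mostly washed out
print("10 s chart (max): ", max_chart)                          # spike preserved
```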
Add to that complexity. We talked a little about
this complexity: we now have
cloud compute elasticity. Spin up more
Kubernetes, spin up more, spin down more; serverless
functionality pulls in the functions when they're needed
and lets those pieces happen. That's the ephemeral side of this.
But we can also have drift and skew. We are now running on
infrastructure that is not all in one place. These are virtual
machines, in general, in a cloud environment, and we're running wherever;
we actually don't quite know where we're running. So when
we start bringing data in to aggregate it,
we can have drift and skew. And so we need to measure
how far ahead or behind the incoming data
source is, relative to where we currently are.
Drift is continual: something keeps getting further ahead or behind, faster and faster.
Skew is something that's just out of line with everything else.
And keep in mind, this is not just the compute infrastructure; it can
also be the networking infrastructure. So it's worthwhile
looking for ways to manage drift and skew so that we understand
the functionality. The easiest way, once again, is to keep all the data
so that you can keep track of where your drift and skew are happening.
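Here's one way to picture that measurement, as a hedged sketch with invented timestamp pairs: compare each source's reported time against the collector's receive time. An offset that keeps growing over the window looks like drift; a source sitting at a constant large offset from everyone else looks like skew.

```python
from statistics import median

# (collector_receive_time, source_reported_time) pairs per source, in seconds. Made up.
observations = {
    "host-a": [(100, 100.0), (110, 110.1), (120, 120.2), (130, 130.3)],  # offset growing: drift
    "host-b": [(100, 100.0), (110, 110.0), (120, 119.9), (130, 130.0)],  # stable, in line
    "host-c": [(100, 94.0),  (110, 104.1), (120, 113.9), (130, 124.0)],  # ~6 s off: skew
}

for host, pairs in observations.items():
    offsets = [reported - received for received, reported in pairs]
    trend = offsets[-1] - offsets[0]   # how the offset changes across the window
    print(f"{host}: median offset {median(offsets):+.1f} s, "
          f"change over window {trend:+.1f} s")
```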
And when we get into this data, we really want to
understand one basic construct.
Most of what we do in the observability and monitoring space in particular is
to be predictive on alerts,
or to have an alert based on something.
But predictions are data intensive.
Again, if something's stationary, a straight line,
oh yeah, piece of cake: we know what the distribution curve is going to look
like, and you can set a static threshold. If not,
we look for sudden change. If we have a linear trend,
things are always going to be sloping up, and if they don't,
then all of a sudden we want to know; and we can start looking
for things like resources running out, approaching the end of our
block of capacity. If it's seasonal,
we need to look at historical anomalies, and historical anomalies
can have some interesting impacts. Here's one where I'm going
to say: don't use the mean, use the median. All
these pieces give us the ability to do some
level of predicting and alerting.
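As a hedged sketch of the linear-trend case, here's a least-squares fit over a week of made-up disk-usage numbers, extrapolated to estimate when the volume fills up. Real systems would use more robust fitting and confidence bounds; this just shows the shape of the prediction.

```python
# Daily disk usage in GB over the last week (made up) and the volume's capacity.
usage = [410, 422, 431, 445, 452, 466, 477]
capacity_gb = 600

# Least-squares slope/intercept for a straight-line trend over days 0..6.
n = len(usage)
xs = range(n)
x_mean = sum(xs) / n
y_mean = sum(usage) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, usage)) / \
        sum((x - x_mean) ** 2 for x in xs)
intercept = y_mean - slope * x_mean

# Extrapolate: when does the trend line cross capacity, counted from today (day n-1)?
days_until_full = (capacity_gb - intercept) / slope - (n - 1)
print(f"growing ~{slope:.1f} GB/day; roughly {days_until_full:.0f} days until full")
```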
So what we really want to know is the predictive
behavior. We want to know what's coming. We don't necessarily
care as much about what's happened; we can usually
track that, but we want to know what's coming as well. Prediction is
only as good as that precision and accuracy. Can you trust the data?
And is the data telling you the truth?
When we look at this, at the historical change environments,
we want to make sure that we're using the right thing.
So a sudden change basically says, I was running along fine and all of
a sudden I've got 70% more demand coming in,
let me know. Historical says on Tuesdays
at noon, my workload drops down
so I can shut things down. But at 01:00 it comes back
up. So let me plan to bring things back up.
Oh, look, this Tuesday at 12:00 the workload didn't
fall off, so let me know. And of course, there are also the trend-line and stationary cases.
In any case, with predictive behavior, you can expect to see some
level of false positives as well as false negatives inside of here.
Does it make sense to compare current signals to the observed value
last week? Or could values from the preceding hour
make a better baseline? Sort of that historic change
versus historic anomaly versus
sudden change structures. However, whenever you're
looking at this, particularly at historic data,
use the median. The mean will float up and down should you
have outliers; the median pretty much stays in
a much more comfortable range.
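A quick sketch of that mean-versus-median point, with made-up history: the same hour over the previous eight weeks, one of which had an unrelated traffic spike. The mean baseline gets dragged up by that one week and waves the current value through; the median baseline stays near typical behavior and flags it. The 1.3x tolerance is an arbitrary assumption for the example.

```python
from statistics import mean, median

# Requests/s at Tuesday noon over the previous eight weeks (made up);
# one week had an unrelated traffic spike.
history = [120, 118, 123, 119, 121, 560, 117, 122]
current = 160

baseline_mean = mean(history)      # dragged up by the one-off spike
baseline_median = median(history)  # stays near typical behaviour

verdict_mean = "fine" if current < 1.3 * baseline_mean else "anomalous"
verdict_median = "fine" if current < 1.3 * baseline_median else "anomalous"

print(f"mean baseline:   {baseline_mean:.0f}  -> current looks {verdict_mean}")
print(f"median baseline: {baseline_median:.0f} -> current looks {verdict_median}")
```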
So again, summing up: observability is only as
useful as your data's precision and accuracy. If you can't trust it, and
can't trust it each time, it's worthless.
And you need to consider the elastic, ephemeral,
and skew aspects of these complex infrastructure
environments. And while we look at prediction as a target,
we need to keep in mind that there's a difference between extrapolation and interpolation,
and that in any case we may end
up with false positives or false negatives.
And finally, a closing thought. "The most effective debugging tool
is still careful thought, coupled with judiciously placed print
statements." Brian Kernighan said this back in 1979
in Unix for Beginners. Observability is
the new print statement. And with
that, thanks for listening to me today, thanks for letting me come to the
Conf42 conference, and enjoy the rest of your conference.