Conf42 Chaos Engineering 2022 - Online

Glitches in the Matrix, or Taming Agent Chaos


Abstract

This talk is accessible to all levels, and will include specific examples from Fleet’s codebase that will engage even advanced practitioners. We will also touch on the capabilities in osquery that help to improve reliability.

Questions posed and answered from Fleet’s experience include:

  • Where to focus efforts when considering performance and chaos?
  • How does Fleet engineer for resiliency when our users self-host the software and have hundreds of thousands of (potentially misbehaving) agents checking in for coordination?
  • What Go tooling can we use to improve monitoring of the Fleet server both during internal performance testing and while running in production?
  • How do we collect information about the infrastructure dependencies (MySQL & Redis) to understand the entire system beyond the code we control?
  • How can we ensure osquery agents don’t bring down the endpoints they are deployed to protect?

Summary

  • Zach Wasserman: I'm the CTO of Fleet, which is the open-source osquery management server. Organizations are using Fleet to manage osquery across hundreds of thousands of devices. How do we engineer this whole agent-server system for resilience? Wasserman: Top priorities are availability and the integrity of the data collected.
  • Osquery uses its watchdog and cgroups to mitigate risks to production availability. Fleet also uses common server scalability practices, which help to mitigate risks to monitoring availability, latency, integrity, and compute cost.
  • A really big challenge for us at Fleet is the self-managed infrastructure and self-managed deployment of the Fleet server. The more heterogeneous the deployments, the more edge cases that users and operators will run into. Testing is, of course, also really important.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Getting real-time feedback into the behavior of your distributed systems, and observing changes, exceptions, and errors in real time, allows you to not only experiment with confidence, but respond instantly to get things working again.

This is Glitches in the Matrix, or Taming Agent Chaos, for Conf42 Chaos Engineering 2022. I'm Zach Wasserman. I'm the CTO of Fleet and a member of the osquery technical steering committee, and I'm excited to be presenting this to you today.

For a bit of background, I'll start by explaining what osquery is. Osquery is an agent that endeavors to bring visibility to all of our endpoints. We're talking macOS, Linux, and Windows. And the way it does this is by exposing the operating system internals and the state of the system as though they were a relational database. I helped co-create osquery at Facebook. It has since been transitioned to a project with the Linux Foundation. It's all open source, so go check it out on GitHub.

Like I said, osquery is cross-platform. It's also read-only by design, which is helpful in constraining the possible problem space when we're thinking about the possible consequences of running this agent. And we deploy it across quite a heterogeneous infrastructure, lots of different devices. We're talking production servers, corporate servers, and workstations across all of these major platforms. So we're seeing lots of different deployments for lots of use cases across security, IT, and operations.
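To make the "operating system as a relational database" idea concrete, here is a minimal Go sketch, not from Fleet's codebase, that shells out to osqueryi (osquery's interactive shell) and reads host facts with plain SQL. It assumes osqueryi is installed and on the PATH.

```go
// Illustrative only: querying OS state like a database via osqueryi.
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

func main() {
	// Ask osquery for basic host facts with SQL; --json returns rows as a
	// JSON array of objects (values are rendered as strings).
	out, err := exec.Command("osqueryi", "--json",
		"SELECT hostname, cpu_brand, physical_memory FROM system_info;").Output()
	if err != nil {
		panic(err)
	}

	var rows []map[string]string
	if err := json.Unmarshal(out, &rows); err != nil {
		panic(err)
	}
	for _, row := range rows {
		fmt.Printf("host=%s cpu=%s memory=%s bytes\n",
			row["hostname"], row["cpu_brand"], row["physical_memory"])
	}
}
```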
And then we have Fleet, which is the open-source osquery management server. Organizations are using Fleet to manage osquery across hundreds of thousands of devices. Fleet is open core, so part of it is MIT licensed, but the source code is all available there on GitHub, so feel free to also check that out if you're interested in the things I'm talking about today. Fleet is primarily a Linux server, and then it has a CLI tool that's cross-platform and used typically on macOS or Linux, but also sometimes on Windows.

I think it's useful to understand that, when we're thinking about Fleet as a server, Fleet has two different major classes of clients. One is the osquery agent client: these are agents checking in, looking for the configurations they should run, sending data they've collected, and updating their status with the Fleet server. This is where we're talking about hundreds of thousands of devices checking in. And then we have the API clients, which tend to be a human clicking around in the UI, or maybe a script that's modifying the configurations sent to the agents or looking at the data that's been collected. And I want to highlight that there's an order of magnitude difference, or perhaps multiple orders of magnitude difference, in the traffic generated by each of these clients. So we're going to treat them very differently as we look at the reliability characteristics of our system.

How do we engineer this whole agent-server system for resilience? Well, I think it's really important to focus. To do this, we need to identify our key areas of risk, decide which of those we want to prioritize, and then apply mitigations that are focused on those identified areas. And when looking at the system I just described, this Fleet and osquery server-agent system, I see the following priorities. I think availability of the production workloads almost always has to be the top priority of our system. We need to be really careful that the production workloads in our environments, and the workstations that folks are using to get their jobs done, are not impacted negatively by the agent. Our next priority is the integrity of the data that we collect, because our security programs and our operations teams rely on the integrity of this data in order to do their jobs effectively. And then as a third priority, I'll put cost. Of course cost is very important, and it's going to depend on the organization and the particular scenario, but typically it's the availability and the integrity that are the higher priorities, where cost is only going to vary by some more limited amount.

When thinking about all of these things and the focus that we want to take on risk, I think it's useful to look at the Pareto principle or Amdahl's law: this idea that a small proportion of the clients are going to generate an outsized amount of the traffic, or a small part of the system is going to generate an outsized amount of the load. If we want to optimize our system, whether for reliability or performance or some other metric like that, we need to be looking at the places where we can have the highest impact and that are the greatest contributors, in order to be the most efficient with our efforts.

So, looking at this a little more concretely with our agent, osquery, and our server, Fleet: starting with osquery, I think one of our top risks is that osquery often runs on production servers. These are servers that are doing things that actually make the company money or serve the main purpose of the organization, so these production workloads and their availability are extremely important. A next top priority is the usability of workstations. We have people who we pay and who are dedicated to getting their work done, and they need to have usable workstations. So we can't have these agents taking down their workstations, even if this is perhaps just a slight bit less critical than those production workloads. And then of course, we've got the monitoring integrity: we don't want to miss out on critical security events, and we don't want to miss out on operational events. And as mentioned before, the cost of compute can really add up here, because when we're talking about hundreds of thousands of hosts running an agent, a small percentage increase in the resources consumed by osquery can lead to a pretty huge amount of extra compute cost. So when I look at all of these factors together and I think about our agent-server system, I think the agent is pretty high risk. This is going to be a big area of focus for mitigating risk.

Then if we look at Fleet, the server: well, the monitoring availability of Fleet is quite important. Again, this is because of those security teams and operations teams that rely on being able to get that osquery data in a timely manner. Speaking of timeliness, the latency is important, but perhaps a bit less important than just getting the data at some point. And then there's monitoring integrity: we know that we need to get those critical security events, otherwise we can be missing out on threats within our infrastructure. And then again, the compute cost can be a factor here.
But I'd say this is typically a much lower consideration for a Fleet deployment, as we're talking about just a handful of servers, to perhaps a dozen or a couple of dozen servers, plus the MySQL and Redis infrastructure dependencies. So even if we have to do a pretty massive scale-up here, this compute cost is not going to be very significant compared to even a few percentage points of increase in the production compute costs. When I look at all of this together, I think this feels like medium risk.

I'll talk now about some of the mechanisms that we have in osquery for mitigating these risks and preventing the proverbial chaos. Osquery has something called the watchdog, which is a multi-process worker/watcher model: one process watches the other osquery process, checking how much CPU is being used and how much memory is being used. If those limits are exceeded, then the work that the agent was doing at that time is blocked for 24 hours. So if we find ourselves in a situation where unexpectedly high levels of load are occurring, we can actually self-recover by saying that's not work we're going to do in the next 24 hours. This strategy helps to mitigate those production availability and workstation availability or usability issues, meaning we prioritize the primary computing tasks of these systems. Of course, it's also worth looking at the trade-offs, and the downside of this one is that blocking those queries does potentially reduce the integrity of the monitoring that we'd like to achieve.
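The watchdog itself lives inside the osquery agent (in C++), but the worker/watcher idea is simple to sketch. Below is a hypothetical Go illustration, not osquery's actual code: the memory limit, poll interval, worker binary, and denylist behavior are assumptions, and a real watchdog also tracks CPU and denylists the specific offending query for 24 hours rather than just killing the worker.

```go
// A minimal, illustrative worker/watcher sketch in the spirit of the osquery
// watchdog. NOT osquery's implementation; limits and paths are assumptions.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"strings"
	"time"
)

const memoryLimitBytes = 200 * 1024 * 1024 // assumed 200 MB limit

// residentMemory reads the worker's RSS from /proc/<pid>/statm (Linux only).
func residentMemory(pid int) (int64, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/statm", pid))
	if err != nil {
		return 0, err
	}
	fields := strings.Fields(string(data))
	pages, err := strconv.ParseInt(fields[1], 10, 64) // second field: RSS in pages
	if err != nil {
		return 0, err
	}
	return pages * int64(os.Getpagesize()), nil
}

func main() {
	// The "worker" is a stand-in for the process doing the real query work.
	worker := exec.Command("./worker") // hypothetical worker binary
	if err := worker.Start(); err != nil {
		panic(err)
	}

	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		rss, err := residentMemory(worker.Process.Pid)
		if err != nil {
			break // worker has exited
		}
		if rss > memoryLimitBytes {
			// Over the limit: stop the worker. A real watchdog would also
			// denylist the offending query for a period (osquery uses 24h).
			_ = worker.Process.Kill()
			fmt.Println("worker exceeded memory limit; work denylisted")
			break
		}
	}
}
```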
Another mechanism that's really common to use with osquery, and one that's available for lots of services because it's not specific to osquery, is cgroups. We can ask the Linux kernel to maintain strict limits on the amount of CPU and memory utilized, and this is commonly used in combination with the osquery watchdog. We see this as a way to have both something application-specific that understands the concept of queries and knows how to block them (that's the watchdog), and cgroups as a final backstop, helping us be careful about how much of the resources on a system we use. This really just mitigates production availability; acknowledging that some users do have Linux workstations, mostly we're talking about production servers here. The downside is that this is only compatible with Linux.

We also think it's important to do profiling and estimate the performance characteristics of queries before they're deployed, and osquery has some application-specific profiling tooling available for this. Over here on the right is an example: given some input queries, osquery generates benchmarks for how much user time, system time, memory, file accesses, and metrics like that are used when executing these queries. That can give us a preview of how expensive a query is and allow us to gain some predictability before we go out and utilize those resources. This is a mitigation for the monitoring integrity, because we can be sure of which queries will and won't set off that watchdog, and we can also understand in advance what the actual compute cost of deploying this kind of thing is.

Then there's monitoring that we can get in osquery as well. This is something that Fleet does in combination with osquery: recording statistics for the actual execution of the queries across all of the hosts. I've provided an example here of how we render this kind of information within the Fleet UI, just trying to give a subjective assessment. This allows an operator to get an idea of whether the performance of a query matches what they expected in the real world, and whether it's something worth continuing to pursue. This again mitigates the monitoring integrity and the compute cost issues.

Then over on the Fleet server side, it's of course worth mentioning that we use common server scalability practices. Multiple Fleet server processes run behind a load balancer. The MySQL and Redis dependencies are clustered and ready to fail over, and we'll often deploy this across multiple regions or that sort of thing. We'll utilize autoscaling for efficient infrastructure sizing when this is deployed within the cloud, so the AWS or GCP kind of autoscaling. These common practices help to mitigate the monitoring availability, the latency, the integrity, and the compute costs. But the downside to be aware of, particularly with dependencies that are less elastically scalable like MySQL or Redis, these kinds of fixed infrastructure dependencies, is that if they aren't properly sized, it's easy to get the compute sizing, and therefore the compute costs, wrong.

Fleet also relies on a back-pressure strategy that's supported by the osquery agent. The agents will buffer data on each individual endpoint, attempt to write that data to the server, and not clear it from the buffer on the local endpoint until the server has acknowledged successful receipt of the messages. I think this is a really great approach, because we have this scenario where the servers are of course vastly outnumbered by the clients. If we can push a very small amount of storage work off to the clients, we can really benefit from the availability characteristics that this provides. And this really mitigates the monitoring integrity problem, because we know that, except in extreme cases, no matter what happens, that data will eventually make it to the server and be processed properly. Of course, under bad scenarios this can increase latency, and like I said, the integrity can be compromised in extreme cases when these internal buffers overflow. That's yet another consideration we have to take into account: we don't want to end up overwhelming the agents by buffering so much data there that old data is dropped.
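The buffer-until-acknowledged behavior lives inside the osquery agent, but the pattern itself is easy to sketch. Here is a rough, hypothetical Go version; the transport function, buffer size, and retry cadence are made up for illustration, not taken from osquery or Fleet.

```go
// Illustrative sketch of the buffer-until-acknowledged (back-pressure) pattern.
package main

import (
	"errors"
	"time"
)

const maxBuffered = 1000 // beyond this, the oldest entries are dropped (the extreme case)

type resultBuffer struct {
	pending []string // serialized query results awaiting acknowledgement
}

// add queues a result locally; on overflow the oldest data is dropped, which
// is the integrity trade-off mentioned above.
func (b *resultBuffer) add(result string) {
	b.pending = append(b.pending, result)
	if len(b.pending) > maxBuffered {
		b.pending = b.pending[1:]
	}
}

// flush attempts to deliver everything in the buffer. Results are cleared only
// once the server acknowledges receipt; on failure they stay buffered and are
// retried later.
func (b *resultBuffer) flush(send func(batch []string) error) {
	if len(b.pending) == 0 {
		return
	}
	if err := send(b.pending); err != nil {
		return // keep buffering; the server is unavailable or overloaded
	}
	b.pending = b.pending[:0] // acknowledged: safe to clear
}

func main() {
	buf := &resultBuffer{}
	sendToServer := func(batch []string) error {
		// Hypothetical transport; a real agent would POST to the server and
		// treat anything but a success response as "not acknowledged".
		return errors.New("server unavailable")
	}

	buf.add(`{"name":"example_query","rows":[]}`)
	for i := 0; i < 3; i++ {
		buf.flush(sendToServer) // data survives failed attempts
		time.Sleep(10 * time.Millisecond)
	}
}
```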
So, back to this idea of engineering resilience: what are the other challenges that we have faced, and how have we approached them? A really big challenge for us at Fleet is the self-managed infrastructure and self-managed deployment of the Fleet server. All of the Fleet software, both the agents running on the individual endpoints and the Fleet servers plus their infrastructure dependencies, is in the customer environment. We don't have control over those environments, and those environments can end up being very inconsistent. Deploys can be slow and are completely out of our control, so we might have outdated versions of the software running. And when we get into debugging scenarios, there's a very slow feedback loop. So those are some challenges that this introduces, and I'll talk about some strategies that we use to manage them.

First, I'll note that it's really important, as much as possible, to strive for consistency. The more heterogeneous, the more different the deployments are, the more edge cases that users and operators will run into. For us, this looks like things like: MySQL could be MySQL 5.7 or MySQL 8, or it could be MariaDB or Aurora on AWS. Redis can be running in cluster mode, in Redis Sentinel mode, or just in single-host mode. So there are all of these different permutations of infrastructure that could be encountered, and these are all things we have to account for as chaotic factors in our engineering work. We still want to strive for consistency as much as possible, and we've approached that in a couple of ways. One is by using infrastructure as code and trying to push folks to deploy basically a reference architecture of Fleet. Just as an example, this is how we specify the reference architectures, down to a very specific version number of MySQL, the exact number of tasks, and the memory that each task should have for the compute. We've found that the more we can get folks to adhere to these recommendations, the easier it is for us to work with them, anticipate the problems they'll have, and reproduce problems that are encountered in the wild.

Testing is, of course, also really important, and I think the biggest takeaway for us on the chaos front is that automation is really important. We've identified a metric of how many experiments we can run per week, because the more experiments we run, the more different combinations of things we can try, things going wrong and things going right, the more edge cases we'll encounter, and the more issues we'll detect before they go into production or before they are encountered in production. This is also a place where infrastructure as code really comes into play: if we can spin environments up and down easily, and modify their parameters easily, that helps us increase this metric of experiments per week.

In terms of testing, the tooling is also really important. Generic HTTP testing tools probably won't hit your edge cases, and they definitely don't know what the hot paths are for you in production, but you can probably identify those, or at least make some guesses. In our experience, it's really useful to build custom tooling. For us at Fleet, thinking back to where we want to focus our efforts in terms of optimizing and mitigating risk, it's the agent-server coordination. So for the Fleet server, we've created simulations of the osquery agent, so that we can start up a lot of agents, run them in different scenarios, and test how the server responds. We've identified the hot path: it's the agent check-ins and the processing of the received data. We can simulate these agents efficiently, meaning tens or hundreds of thousands of simulated agents running on a single actual virtual machine, and this is something that can be done pretty easily with the amazing support for high-concurrency HTTP in the Go programming language. Just as an example, here's a template that we use to simulate the data returned from a host, and you can see that there are some templated parameters in here, things like a cached hostname and UUID within the curly braces. These tests are opportunities to inject more randomness and see how the system responds.
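Fleet's real simulation tooling is more elaborate, but the core trick, many simulated agents per machine via goroutines and Go's HTTP client, can be sketched roughly as follows. The server URL, endpoint path, and payload shape here are placeholders rather than Fleet's actual agent API.

```go
// A rough, hypothetical sketch of simulating many osquery agents from a single
// machine using goroutines. URLs, paths, and payloads are placeholders.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
	"time"
)

const (
	numAgents     = 50000                       // tens of thousands per VM is feasible with goroutines
	serverURL     = "https://fleet.example.com" // placeholder server
	checkInPath   = "/api/agent/checkin"        // placeholder endpoint
	checkInPeriod = 60 * time.Second
)

func simulateAgent(id int, client *http.Client, wg *sync.WaitGroup) {
	defer wg.Done()
	// Each simulated agent gets its own identity, analogous to the templated
	// hostname/UUID parameters mentioned above.
	body := fmt.Sprintf(`{"hostname":"sim-host-%d","uuid":"sim-uuid-%d"}`, id, id)

	for {
		resp, err := client.Post(serverURL+checkInPath, "application/json",
			bytes.NewBufferString(body))
		if err == nil {
			resp.Body.Close()
		}
		// Jitter the check-in interval so agents don't all arrive at once;
		// removing the jitter is one way to deliberately inject chaos.
		time.Sleep(checkInPeriod + time.Duration(id%30)*time.Second)
	}
}

func main() {
	client := &http.Client{Timeout: 30 * time.Second}
	var wg sync.WaitGroup
	for i := 0; i < numAgents; i++ {
		wg.Add(1)
		go simulateAgent(i, client, &wg)
	}
	wg.Wait() // agents run until the process is killed
}
```

Varying the jitter, the payload size, or the start-up pattern of these goroutines is what exercises the different scenarios described next.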
So what happens when we run a bunch of simulated agents that look very similar, or a bunch that look very different? When they all start up at the same time, or when they're all really spread out? These are all different factors in how the system will behave. What happens if our network access becomes compromised? If we can run these kinds of simulations on relatively cheap infrastructure, and we have good control over those factors, then we hit more of those edge cases.

Debugging then plays what I call a dual role here, between staging or load testing and production. Having good debugging tooling helps us detect issues before we ship them or before they become actual incidents, and it also helps us resolve incidents more quickly. An approach that we've found useful for this is what I like to call "collect first, ask questions later," and this is something we achieve with our fleetctl debug archive command. I'll just show an example of the execution of this command. With one command, we run all of these different profiles and debugging steps: finding out the allocations of memory, the errors that have been taking place, getting information about the concurrency, profiling the CPU, and getting information from the database, including the status of the underlying database and the locking that's happening there. By doing this, when we simulate a scenario that we find causes undesirable behavior, we can easily get a whole snapshot of these things and then take some time later to analyze the data that was collected. On the other hand, we talked earlier about the difficulty of the feedback loops when debugging in a self-managed environment, and this really mitigates that too: when we have a single command that we can ask our customers and our open-source users to run, it gives us a whole bunch of data to look at and minimizes the round trips.

So, to sum up what we've discussed here: I think it's really important to focus on where your highest-impact work is going to be, to get an intuitive sense of where your systems will break, what the highest risks are, and how you can exercise your systems and your code to trigger those things, understand them, and then have mitigations. Consistency is super important in infrastructure, and it's also challenging if you're in a self-managed environment like we are, but I encourage using whatever tools are at your disposal, in particular infrastructure as code, to try to achieve consistency. Testing automation is really important, because it will help you raise that experiments-per-week metric, or whatever the similar metric is in your environment. The more tests you can run, and the more easily you can run those tests, the more you'll discover these strange edge cases, and the more you'll anticipate problems before you run into them in production. And then debug tooling, again, is really important; it helps in those dual roles, both during staging and testing and in production. So hopefully that was helpful today to understand the work that we've been doing to reduce risk and improve performance at Fleet. And thank you for listening to my talk.
...

Zach Wasserman

CTO @ Fleet



