Transcript
Getting real time feedback into the behavior of your distributed systems, and observing changes, exceptions, and errors in real time, allows you to not only experiment with confidence, but respond instantly to get things working again.
Welcome to taming agent chaos for Conf 42 Chaos Engineering 2022. I'm Zach Wasserman. I'm the CTO of Fleet and a member of the osquery technical steering committee,
and I'm excited to be presenting this to you today. So for
a bit of background, I'll start by explaining what Osquery is.
Osquery is an agent that endeavors to bring visibility to
all of our endpoints. We're talking macOS,
Linux and Windows. And the way that it does this
is it exposes the operating system internals and
the state of the system as though they were a relational database.
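As a quick illustration of that relational model, here is a minimal sketch, assuming osquery is installed locally with the osqueryi shell on the PATH, of querying live operating system state with SQL:

package main

// Minimal sketch (not from the talk): run one SQL query against live OS state
// through osqueryi, assuming osquery is installed and osqueryi is on the PATH.
import (
    "fmt"
    "os/exec"
)

func main() {
    // The processes table is one of the many operating system tables osquery exposes.
    out, err := exec.Command("osqueryi", "--json",
        "SELECT pid, name, path FROM processes LIMIT 5;").Output()
    if err != nil {
        fmt.Println("osqueryi not available:", err)
        return
    }
    fmt.Println(string(out)) // JSON rows describing the current process list
}

The processes table here is just one of the hundreds of tables osquery exposes across platforms.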
So I helped co-create osquery at Facebook.
It's since been transitioned to a project with the Linux Foundation.
It's all open source, so go check it out on GitHub.
Like I said, osquery is cross platform. It's also read
only by design, which is something that's a bit helpful in constraining
the possible problem space when we're thinking about the
possible consequences of running this agent.
And we deploy it across quite a heterogeneous
infrastructure, lots of different devices.
We're talking production servers, corporate servers,
workstations across all of these major platforms.
So we're seeing lots of different deployments for lots
of use cases across security, IT, and operations.
And then we have Fleet, which is the open core
osquery management server. And organizations
are using Fleet to manage osquery across hundreds
of thousands of devices.
Fleet is open core, so part of it is
MIT licensed, but the source code is all available
there on GitHub, so feel free to also check that out if you're
interested in things I'm talking about today. And Fleet
is primarily a Linux server, and then it has a
CLI tool that's cross platform and used
typically on macOS or Linux, but also sometimes on Windows.
And I think it's useful to understand that when we're thinking about Fleet as a
server, Fleet has two different major clients.
One is the osquery agent client. These are
agents checking in, looking for the configurations that they should
run, sending data that they've collected,
and updating their status with the Fleet server.
This is where we're talking about hundreds of thousands of devices checking in.
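For a rough sense of what that agent-side check-in configuration looks like, here is a hedged sketch of launching osquery with its TLS plugins pointed at a server; the hostname, endpoint paths, and enroll secret path are illustrative values, not necessarily Fleet's defaults:

package main

// Sketch of pointing an osquery agent at a TLS management server. The flag
// names are osquery's TLS plugin flags; the hostname, endpoint paths, and
// enroll secret path are illustrative, not necessarily Fleet's defaults.
import (
    "log"
    "os/exec"
)

func main() {
    cmd := exec.Command("osqueryd",
        "--tls_hostname=fleet.example.com",
        "--enroll_secret_path=/etc/osquery/enroll_secret",
        "--enroll_tls_endpoint=/api/osquery/enroll",
        "--config_plugin=tls",
        "--config_tls_endpoint=/api/osquery/config",
        "--logger_plugin=tls",
        "--logger_tls_endpoint=/api/osquery/log",
    )
    cmd.Stdout = log.Writer()
    cmd.Stderr = log.Writer()
    if err := cmd.Run(); err != nil {
        log.Fatal(err)
    }
}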
And then we have the API clients, which tend to be a human clicking around
in the UI, or maybe a script that's
modifying the configurations that are sent to the agents or looking
at the data that's been collected. And I want to highlight that there's kind
of an order of magnitude difference in the traffic,
or perhaps multiple orders of magnitude difference in the traffic
generated by each of these clients. So we're going to treat them very
differently as we look at the reliability characteristics of our
system. How do we engineer this whole
agent server system for resilience?
Well, I think it's really important to focus.
To do this, we need to identify our key
areas of risk, decide which of those we want
to prioritize, and then apply mitigations that are
focused on those identified areas. And so when
looking at our system that I just described, this Fleet and
osquery server agent system,
I see the following priorities.
I think availability of the production workloads
almost always has to be the top priority of our
system. So we would need to be really careful that
the production workloads in our environments, the workstations that
folks are using to get their jobs done, are not impacted negatively
by the agent. Our next priority is the integrity of the
data that we collect, because our security programs
and our operations teams rely on the integrity
of this data in order to do their jobs effectively.
And then as a third
priority, I'll put cost. And of course cost is very important and
it's going to depend on the organization and the particular scenarios.
But typically it's the availability and the integrity
that are going to be higher priorities, where cost is only
going to vary by some more limited amount. And when thinking
about all of these things and
the focus that we want to take on risk, I think
it's useful to look at the Pareto principle or
Amdahl's law: the idea that a small
proportion of the clients is going
to generate an outsized amount of the traffic,
or a small part of the system is going to generate
an outsized amount of the load. And if we want
to optimize our system, whether for
reliability or performance or
some other metric like that, we need to be
looking at the places that we can have the highest impact and
that are the greatest contributors in order to be
the most efficient with our efforts. So looking at
this a little bit more concretely, with our agent osquery
and our server Fleet, starting with osquery, I think
one of our top risks is that
osquery often runs on production servers.
These are servers that are doing things that actually make the company
money or serve the main purpose of the organization.
So these production workloads and their availability is
extremely important. A next top priority
is the usability of workstations. We have
people who we pay and who are dedicated to getting
their work done, and they need to have usable workstations.
So we can't have this agent taking down their workstations, even if this
is perhaps just a slight bit less critical than those
production workloads. And then of course, we've got the monitoring integrity.
We don't want to miss out on critical security events,
we don't want to miss out on operational
events. And of course, as mentioned before,
the cost of compute can actually really add up
here, because when we're talking about hundreds of thousands of hosts that we're running an
agent on, a small percentage increase in the resources
consumed by osquery can lead to a pretty
huge amount of extra compute cost. So when I look
at all of these factors together and I think about
our agent server system, I think the agent is
pretty high risk. This is going to be a big area of focus
for mitigating risk. Then if we look at
Fleet, the server, well, the monitoring availability of
Fleet is quite important. Again, this is because
of those security teams and those
operations teams that are relying on being able to get that osquery
data in a timely manner.
Speaking of a timely manner, the latency is important,
but the latency is perhaps a bit less
important than just getting the data at some
point. And so that's the monitoring integrity. We know that
we need to get those critical security events,
otherwise we can be missing out on threats within our infrastructure.
And then again, the compute cost can be a factor here.
But I'd say this is typically a much lower consideration
for a Fleet deployment,
as we're talking about just a handful of servers to
perhaps a dozen or a couple of dozen servers,
plus some MySQL and redis infrastructure dependencies.
So even if we have to do a pretty massive scale up here,
this compute cost is not going to be
very significant compared to even a few percentage points
of increase in the production compute costs.
So when I look at all of this together, I think this feels
like medium risk. So I'll talk now about some of the mechanisms
that we have in osquery for mitigating these risks and
for preventing the proverbial
chaos. So osquery has got something called the watchdog,
which is a multi-process worker and watcher model.
So one process watches
the other osquery process, checking how
much CPU is being used and how much memory is being used.
And if those limits are exceeded,
then the work that the agent was doing
at that time is blocked for 24 hours.
So if we find ourselves in a situation
where unexpected high levels of load are
occurring, we can actually sort of self recover by saying that's
not work that we're going to do in the next 24 hours.
And this strategy helps to mitigate those
production availability and workstation availability or
usability issues, meaning we prioritize the primary
computing tasks of these systems.
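As a hedged sketch of how that looks in practice, osqueryd can be launched with watchdog flags; the flag names are osquery's watchdog options, while the specific values below are illustrative rather than recommendations:

package main

// Illustrative sketch of enabling the osquery watchdog. The flag names are
// osquery's watchdog flags; the limit values here are examples, not
// recommendations, and should be tuned per environment.
import (
    "log"
    "os/exec"
)

func main() {
    cmd := exec.Command("osqueryd",
        "--watchdog_level=1",              // more restrictive watchdog profile
        "--watchdog_memory_limit=250",     // worker memory ceiling, in MB
        "--watchdog_utilization_limit=30", // sustained CPU utilization ceiling
    )
    cmd.Stdout = log.Writer()
    cmd.Stderr = log.Writer()
    if err := cmd.Run(); err != nil {
        log.Fatal(err) // a dying watcher is itself worth alerting on
    }
}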
Of course it's also worth looking at the trade-offs, and the downside of
this one is that blocking those queries
does potentially reduce the integrity of the monitoring
that we'd like to achieve. Another mechanism that's really
common to use with osquery, and this is actually one that's available
for lots of services because it's not specific to osquery,
is cgroups. So of course we can ask
the Linux kernel to maintain strict limits on the
amount of CPU and memory utilized.
And this is something that is commonly used in combination
with the osquery watchdog. And we see this as
a way to both have something kind of application specific
that understands the concept of queries and knows
how to block them, that's the watchdog, whereas cgroups are kind
of a final backstop helping us be careful about
how much of the resources on a system we use.
And this really just mitigates the production availability.
Although acknowledging that some users do have
Linux workstations, mostly we're talking about production servers here.
And so that downside is that this is only compatible with Linux.
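For the cgroups side, here is a minimal Linux-only sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup and the process runs with root privileges; in real deployments this is usually delegated to systemd or a container runtime rather than done by hand:

package main

// Minimal Linux-only sketch: place a running osqueryd PID under cgroup v2
// CPU and memory limits. Assumes cgroup v2 is mounted at /sys/fs/cgroup and
// that this runs as root; production setups usually let systemd or a
// container runtime manage this instead.
import (
    "fmt"
    "os"
    "path/filepath"
)

func main() {
    const osquerydPID = 12345 // hypothetical PID of the osqueryd process
    dir := "/sys/fs/cgroup/osquery"
    if err := os.MkdirAll(dir, 0o755); err != nil {
        panic(err)
    }
    // 200ms of CPU time per 1s period (roughly 20% of one core) and a 256 MiB memory cap.
    limits := map[string]string{
        "cpu.max":    "200000 1000000",
        "memory.max": "268435456",
    }
    for file, value := range limits {
        if err := os.WriteFile(filepath.Join(dir, file), []byte(value), 0o644); err != nil {
            panic(err)
        }
    }
    // Writing the PID into cgroup.procs moves the process under these limits.
    if err := os.WriteFile(filepath.Join(dir, "cgroup.procs"), []byte(fmt.Sprint(osquerydPID)), 0o644); err != nil {
        panic(err)
    }
}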
We also think it's important to
do profiling and estimate the performance characteristics
of the queries that are deployed before
doing so. And osquery has some application-specific
profiling tooling that's available. And we can see over here
on the right an example: given
some input queries, osquery generates
benchmarks for how much user time,
system time, memory, file accesses,
and those sorts of metrics are used when executing these
queries. And that can give us a preview of how expensive this query
is and allow us to gain some predictability
before we go out utilizing those resources.
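The talk doesn't show the tooling itself, but as a rough stand-in for that kind of pre-deployment profiling (this is not osquery's own profiler), one can time a candidate query through osqueryi and read back the OS resource usage:

package main

// Rough stand-in for pre-deployment query profiling, not osquery's own
// profiling tooling: run a candidate query once through osqueryi and report
// user time, system time, and peak memory. Unix-only because of Rusage.
import (
    "fmt"
    "io"
    "os/exec"
    "syscall"
    "time"
)

func main() {
    query := "SELECT name, path, pid FROM processes;" // candidate query to vet
    cmd := exec.Command("osqueryi", "--json", query)
    cmd.Stdout = io.Discard

    start := time.Now()
    if err := cmd.Run(); err != nil {
        panic(err)
    }
    wall := time.Since(start)

    ru := cmd.ProcessState.SysUsage().(*syscall.Rusage)
    fmt.Printf("wall: %v  user: %v  sys: %v  maxrss: %d\n",
        wall,
        time.Duration(ru.Utime.Nano()),
        time.Duration(ru.Stime.Nano()),
        ru.Maxrss) // kilobytes on Linux, bytes on macOS
}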
And so then this is a mitigation for the monitoring integrity because
we can be sure of which queries will and won't set off that watchdog.
And also we can understand in advance what
is the actual cost in terms of compute of deploying
this kind of thing. Then there's monitoring
that we can get in osquery as well.
And so this is something that Fleet does in combination with
osquery: recording statistics for
the actual execution of the queries across all of the hosts.
And I've just provided an example
here of how we render this kind of information within the Fleet
UI, just trying to give sort of a subjective assessment.
And this allows an operator to get
an idea of does the performance of this
query match what I expected in the real world?
And is this something that's worth me continuing to pursue?
This again mitigates the monitoring integrity and
the compute cost issues. Then over on the Fleet
server side, of course, it's worth mentioning that we use
common server scalability practices. So multiple Fleet server processes
run behind a load balancer. The MySQL and
Redis dependencies are clustered and ready
to fail over. And we'll often
deploy this across multiple regions or that sort of thing.
We'll utilize auto scaling for efficient infrastructure
sizing when this is deployed within the cloud, so the
AWS or GCP kind of auto scaling. And
these common practices help to mitigate
the monitoring availability, the latency, the integrity, and
the compute costs. But I think the downside
to be aware of, in particular with some dependencies that are
less elastically scalable, like MySQL or
Redis, these kinds of fixed infrastructure dependencies, is that if
these aren't properly sized, then you can really overrun the
compute costs. Fleet also relies on a
backpressure strategy that's supported by
the osquery agent, whereby the agents will
actually buffer data on each individual
endpoint, attempt to write that data to the server,
and not clear it from the buffer on the local endpoint until
the server has acknowledged successful receipt of the
messages. I think this is a really great approach to
use because we have this scenario where the
servers are of course vastly outnumbered by the clients.
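To make the buffer-until-acknowledged idea concrete, here is a minimal sketch; the endpoint URL and payload shape are hypothetical and not Fleet's actual API:

package main

// Minimal sketch of client-side backpressure: collected results stay in a
// local buffer until the server acknowledges receipt. The URL and payload
// are hypothetical, not Fleet's real API. The buffer is bounded so the
// endpoint itself is never overwhelmed, at the cost of dropping the oldest
// data in extreme cases.
import (
    "bytes"
    "net/http"
    "time"
)

const maxBuffered = 1000 // beyond this, the oldest batches are dropped

var buffer [][]byte // pending result batches, oldest first

func enqueue(batch []byte) {
    buffer = append(buffer, batch)
    if len(buffer) > maxBuffered {
        buffer = buffer[1:] // the integrity trade-off: old data is lost
    }
}

func flush(serverURL string) {
    for len(buffer) > 0 {
        resp, err := http.Post(serverURL, "application/json", bytes.NewReader(buffer[0]))
        if err != nil {
            return // server unreachable: keep the data and retry later
        }
        resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return // not acknowledged: leave the batch in the buffer
        }
        buffer = buffer[1:] // acknowledged: now it is safe to clear
    }
}

func main() {
    enqueue([]byte(`{"host":"example","rows":[]}`))
    for {
        flush("https://fleet.example.com/api/hypothetical/log")
        time.Sleep(10 * time.Second)
    }
}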
If we can push a very small amount of storage work
off to the clients, we can actually really benefit from
the availability characteristics that this
provides. And so this really mitigates
the monitoring integrity problem because
we know that except in extreme cases,
no matter what happens, that data will eventually make it
to the server and be processed properly. Of course,
under bad scenarios, this can increase latency and like
I said, the integrity can be compromised in extreme
cases when these internal buffers overflow. And of
course that's yet another level of consideration that
we have to take into account: we don't want to end up overwhelming
the agents by buffering so much data there
that old data gets dropped. So back to this idea of
engineering resilience. What are the other challenges that we have
faced and how have we approached them? So a really big challenge
for us at Fleet is the self
managed infrastructure and self managed deployment
of the Fleet server. So all of the Fleet software,
both the agents that are running on the individual endpoints
and the servers, the Fleet servers plus their infrastructure
dependencies, they're all in the customer environment.
So we don't have control over those environments.
Those environments can end up being very inconsistent.
Deploys can be slow and are completely out of
our control. So we might have outdated versions of the software
running. And when we get into debugging scenarios, there's a
very slow feedback loop. So these
are some challenges that are introduced by that, and I'll talk about some
strategies that we use to manage that. First, I'll note
that it's really important, as much as possible,
to strive for consistency.
The more heterogeneous, the more different the deployments are,
the more edge cases that users and operators
will run into. And for us, this looks like things like: MySQL
could be MySQL 5.7
or MySQL 8, and it could also be MariaDB
or Aurora on AWS.
Redis can be running in cluster mode, in Redis Sentinel
mode, or just in single host mode. And so there's
all of these different permutations of infrastructures
that could be encountered. And these are all things that we sort
of account for as chaotic factors
in our engineering work. So we
still want to strive for consistency as much as possible.
And we've achieved that a couple of ways.
One, by using infrastructure as code and
trying to push folks to deploy basically
a reference architecture of Fleet.
And just as an example, this is kind of how we specify the reference
architectures down to a very specific version
number of MySQL, and the exact number of
tasks, and the memory that each task should
have for the compute. And we found that the more that we can get
folks to adhere to these recommendations,
the easier it is for us to work with them, anticipate the
problems that they'll have, and reproduce problems that are
encountered in the wild. Testing is, of course, also really
important. And I think the biggest takeaway for us on the chaos
front is that automation is really important.
So we've identified a metric of how many experiments
can we run per week, because we see this
idea that the more experiments that we run, the more different
combinations of things we can try, things going wrong, things going right,
the more edge cases we'll encounter, and then the more issues
that we'll detect before they go into production or before
they are encountered in production. And this is also a place
where the infrastructure, as code really comes into play.
If we can spin up and down environments easily and we
can modify their parameters easily, then this will help us
increase this metric of experiments per week.
In terms of testing, the tooling is also really
important. So generic HTTP testing
tools probably won't hit your edge cases,
and they definitely don't know what the hot paths are for you
in production, but you can probably identify those or at
least make some guesses. And in our experience,
it's really useful to build custom tooling.
And so for us at fleet, thinking back to where
we want to focus our efforts in terms of optimizing
and mitigating risk, it's this agent server coordination.
And so for the Fleet server, we've created
simulations of the osquery agent so that we can start up a
lot of agents, use them in different scenarios,
and test how the server responds to these things.
So we've identified that hot path: it's the
agent check-ins and the processing of the received data.
We can simulate these agents efficiently,
meaning tens or hundreds of thousands
of simulated agents running on a single actual
virtual machine. And this is something that can
be done pretty easily with amazing support
for high concurrency HTTP
when using Go, the Go programming language.
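As a rough sketch of that approach, and not Fleet's actual load-testing tool, each simulated agent can be a goroutine checking in on its own jittered schedule against a hypothetical check-in URL:

package main

// Rough sketch of simulating many agents from one machine; this is not
// Fleet's real load-testing tool, and the check-in URL is hypothetical.
// Each goroutine acts as one fake agent checking in on a jittered schedule.
import (
    "fmt"
    "math/rand"
    "net/http"
    "time"
)

func agent(id int, serverURL string) {
    // A random start offset spreads the agents out, one of the chaotic
    // factors worth varying between experiments.
    time.Sleep(time.Duration(rand.Intn(60)) * time.Second)
    for {
        resp, err := http.Get(fmt.Sprintf("%s?host=sim-%d", serverURL, id))
        if err == nil {
            resp.Body.Close()
        }
        // Check in roughly every 10 seconds, with jitter.
        time.Sleep(10*time.Second + time.Duration(rand.Intn(2000))*time.Millisecond)
    }
}

func main() {
    const numAgents = 50000 // tens of thousands of simulated hosts on one VM
    for i := 0; i < numAgents; i++ {
        go agent(i, "https://fleet.example.com/api/hypothetical/checkin")
    }
    select {} // run until interrupted
}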
Just for an example, here's a template
that we use to kind of simulate the data that's returned
from a host. And you can see that there are some templated
parameters in here, where it says things like host name and
UUID within the curly braces.
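Here is a small illustrative sketch of that kind of templating using Go's text/template; the field names and JSON shape are made up for this example, not the actual template from the talk:

package main

// Illustrative sketch of templating simulated host data; the field names and
// JSON shape are made up for this example, not the actual template from the
// talk. Each simulated agent fills in its own hostname and UUID so responses
// vary realistically.
import (
    "os"
    "text/template"
)

const hostData = `{
  "hostname": "{{.Hostname}}",
  "uuid": "{{.UUID}}"
}
`

type simHost struct {
    Hostname string
    UUID     string
}

func main() {
    tmpl := template.Must(template.New("host").Parse(hostData))
    hosts := []simHost{
        {Hostname: "sim-host-1", UUID: "aaaa-1111"},
        {Hostname: "sim-host-2", UUID: "bbbb-2222"},
    }
    for _, h := range hosts {
        if err := tmpl.Execute(os.Stdout, h); err != nil {
            panic(err)
        }
    }
}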
These tests are opportunities to inject
more randomness and see how the system
responds. So what happens when we run a bunch
of simulated agents that look very similar, when we
run a bunch that look very different, when they all start up at the
same time, or if they're all really spread out? These are all different factors
in how the system will behave. What happens if our network
access becomes compromised? If we can run these kind of simulations
on a relatively cheap infrastructure, and we can have good control over
those factors, then we hit more of those edge cases.
And then debugging plays a really important,
I'd call it dual, role here between staging or load testing
and production. Having good debugging
tooling helps us detect issues before we
ship them or before they become actual incidents. And it
also helps us resolve incidents more quickly. An approach
that we've found to be useful for
this is what I like to call collect first,
ask questions later. And this is something that we
achieve with our fleetctl debug archive
command. So I'll just show an example of
the execution of this command. With one command, we run
all of these different profiles and debugging
steps: finding out the allocations of memory,
the errors that have been taking place,
getting information about the concurrency, profiling
the CPU, and getting information
from the database, including the
status of the underlying database and the locking that's
happening there. And then by doing this, when we simulate
a scenario that we find causes undesirable
behavior, we can easily get a whole snapshot of these things
and then take some time later to analyze the data that was collected.
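As an illustration of that collect-first idea, and not the actual fleetctl implementation, a Go server that mounts net/http/pprof exposes standard profiles that can all be pulled in one pass and saved for later analysis; the debug address here is an assumption:

package main

// Illustrative "collect first, ask questions later" sketch; this is not the
// actual fleetctl debug archive implementation. It pulls several standard Go
// pprof profiles from a debug endpoint (the address is an assumption) and
// saves them to disk for later analysis with go tool pprof.
import (
    "fmt"
    "io"
    "net/http"
    "os"
)

func main() {
    base := "http://localhost:6060/debug/pprof/" // assumes net/http/pprof is mounted here
    // Memory allocations, live heap, goroutines (concurrency), and a 10-second CPU profile.
    profiles := map[string]string{
        "allocs.pb.gz":    "allocs",
        "heap.pb.gz":      "heap",
        "goroutine.pb.gz": "goroutine",
        "cpu.pb.gz":       "profile?seconds=10",
    }
    for file, path := range profiles {
        resp, err := http.Get(base + path)
        if err != nil {
            fmt.Fprintln(os.Stderr, "skipping", path, ":", err)
            continue
        }
        out, err := os.Create(file)
        if err != nil {
            resp.Body.Close()
            continue
        }
        // Grab everything now; the analysis can happen later, offline.
        if _, err := io.Copy(out, resp.Body); err != nil {
            fmt.Fprintln(os.Stderr, "partial download for", path, ":", err)
        }
        out.Close()
        resp.Body.Close()
    }
}

Each saved profile can then be inspected offline, for example with go tool pprof.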
On the other hand, we talked earlier about the
difficulty of the feedback loops when debugging in
a self managed environment. And this is also something
that really mitigates that issue: when we have a single
command that we can ask our customers
and our open source users to run, and that gives us a whole bunch
of data that we can look at to minimize the round trips
there. So, to kind of sum up what we've discussed here,
I think it's really important to focus on where
your highest impact work is going to be,
get an intuitive sense of where your systems will break,
what the highest risks are, and how you can
exercise your systems and your code
to trigger those things, understand them,
and then have mitigations. Consistency is
super important in infrastructure, and it's also a challenge
if you're in a self managed environment like we are.
But I encourage using whatever tools are at your
disposal, in particular infrastructure as code,
to try to achieve consistency. And
then testing automation is really important, because this will help
you up that experiments per week metric,
or whatever the similar metric is in your environment.
The more tests you can run, and the more easily you can run those tests,
the more you'll discover these strange edge cases,
and the more you'll anticipate the problems before you run into them in production.
And then debug tooling again is really important.
This is something that helps in those dual roles, both during staging
and testing and in production.
So hopefully that was helpful today
to understand the work that we've been doing
to reduce risk and improve performance at Fleet.
And thank you for listening to
my talk.