Transcript
This transcript was autogenerated. To make changes, submit a PR.
And welcome to Adopting Graviton2: How Honeycomb Reduced Our Infrastructure
Spend by 40% on Our Highest-Volume Service. I'm Shelby Spees.
I'm a developer advocate at Honeycomb, and you can find me on Twitter
at @shelbyspees. You might be here because you're considering migrating to
Graviton2 in your environment. Now, you don't want to jump into such a change
just because it's the new shiny. You should have a measurable outcome that you're trying
to achieve. At Honeycomb, we'd originally heard about these new instance
types during Andy Jassy's keynote at re:Invent 2019. They were
promising lower cost, better performance, and a reduced environmental impact.
So how is Graviton2 different? The ARM-based architecture is more efficient.
It runs cheaper because more of the physical CPU die is dedicated
to just doing compute. It also has lower power consumption, plus it's faster
because ARM doesn't share execution units between threads across different virtual CPUs.
This gives you less tail latency and less variability in performance. Long story
short, we ended up making some waves as early adopters of AWS's Graviton2
instances in production. I'm here to share with you how we went about this
adoption, what we learned, and what you can do to make
large migrations safe in your own systems. For Honeycomb,
we wanted the cost and performance improvements of Graviton2, yes,
but it would only make sense for us if we could get those things without
hurting our reliability. So first we had to ask: is it even worth
the risk? To be able to answer that, we needed to think about
what's important to Honeycomb as a business. Honeycomb is a data storage
engine and analytics tool. We ingest our customers' telemetry
data, and then we enable fast querying on that data. At Honeycomb, we use
service level objectives, which represent a common language between engineering
and business stakeholders. We define what success means according to
the business and then measure it with our system telemetry
throughout the lifecycle of a customer. That's how we know how well our services are
doing, and that's how we measure the impact of changes. SLOs are
for service behavior that has a customer impact. So at Honeycomb,
we want to ensure things like: the in-app homepage should load quickly with data,
user-run queries should return results fast, and
the customer data we're trying to ingest should get stored fast
and successfully. These are the sorts of things that our product managers
and our customer support teams frequently talk to engineering about. So once
we have our SLO defined, we can calculate our error budget:
How many bad events are we allowed to have in our time window?
Just like with a financial budget, the error budget gives us some wiggle room.
Finally, we calculate how fast we're burning through that error budget.
The steeper the burn rate, the sooner we're going to run out. So now we
start thinking: how soon are we going to hit zero? This allows us to alert
proactively if something is causing us to burn through our budget
so fast that we're going to run out in a matter of hours. That's when
we should wake somebody up. And so now we're alerting on things that we've decided
are already important to the business, and we're doing it proactively.
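To make the error-budget and burn-rate arithmetic concrete, here's a rough sketch. The SLO target, event counts, and failure rate below are made-up numbers for illustration, not Honeycomb's actual figures.

```go
// Rough sketch of error-budget and burn-rate arithmetic.
// All numbers below are illustrative, not Honeycomb's real SLO values.
package main

import "fmt"

func main() {
	const (
		eventsInWindow = 10_000_000 // total events in a 30-day SLO window (example)
		sloTarget      = 0.999      // 99.9% of events must succeed (example)
	)

	// Error budget: how many bad events we can tolerate in the window.
	errorBudget := float64(eventsInWindow) * (1 - sloTarget) // 10,000 bad events

	// Burn rate: suppose we're currently seeing 500 bad events per hour.
	badEventsPerHour := 500.0
	hoursUntilEmpty := errorBudget / badEventsPerHour

	fmt.Printf("error budget: %.0f bad events\n", errorBudget)
	fmt.Printf("at the current burn rate, the budget is gone in %.1f hours\n", hoursUntilEmpty)
	// If hoursUntilEmpty drops to just a few hours, that's when you page someone.
}
```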
Thanks to the work that we put in to measure how our services were doing,
we entered a period of pretty stable reliability. We're a
small startup, so once our reliability goals are met, we care a
lot about cost. And because we're a SaaS provider, our infrastructure
is our number two expenditure after people. So infrastructure
is something that scales with our production traffic. And since our sales goals involve landing
large customer accounts, we want to position ourselves to be able to support
the traffic increase associated with landing more big accounts.
Technical decisions are business decisions. They affect our ability to maneuver
in the market. So, having defined our goal and having confirmed our
ability to measure it, we needed to decide on how to proceed. How can we
safely experiment with an entirely different processor architecture? At Honeycomb,
we have multiple environments. We deploy the same code, the same git hash,
and it's running across all of those environments. Our production environment is where
customers send their data in order to observe their own systems.
And we have a second environment called Dog Food, where we send the telemetry from
production Honeycomb. Like I said, it's not a staging environment. It runs the exact
same code as production. Dog food allows us to query our production data
the way Honeycomb's users interact with their production data.
It's a great way for Honeycombers to build user empathy,
and it also allows us to test out experimental features on a safe audience.
So we'll often enable feature flags in dog food for testing internally.
Then we have a third environment called Kibble, and that's where we send our telemetry
from dog food. And so for the experimental features we've enabled in dog food,
that telemetry gets sent over to Kibble. So Kibble observes dog food,
dog food observes prod. And one thing to know about Honeycomb:
on the outside, we're bee-themed, but you might get the sense that on the
inside we have a lot of dogs. So we have a number of different services,
and the biggest ones include Shepherd, our ingest API service;
Retriever, our columnar storage and query engine; and Poodle, the front-end
web application. So we really stuck to this theme.
For Graviton2, we chose to try things out on Shepherd because
it's our highest-traffic service, but it's also relatively straightforward:
it's stateless, it only scales on CPU, and as a service it's
optimized for throughput first and then latency. So we have a place to
start. What's next? Well, we needed new base images, we needed to
check that our application code was compatible, and we needed to make sure that our
existing CI would produce build artifacts for arm64. Honeycomb is a
Go shop, so it turns out we just needed to set arm64 as a
compilation target for the Go compiler, and the compiler can handle that for us
even if it's not compiling on an ARM box. And so we updated
our CircleCI config to include a build step for the arm64 target.
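To give a rough idea of what that looks like, here's a minimal sketch of Go cross-compilation. The build command and the ./cmd/shepherd package path are hypothetical, not our actual CI config; the point is just that the Go toolchain can target arm64 from any build host.

```go
// Minimal sketch of Go cross-compilation, assuming a hypothetical
// ./cmd/shepherd entry point. From an ordinary amd64 CI machine you
// could build an arm64 Linux binary with:
//
//   GOOS=linux GOARCH=arm64 go build -o shepherd-arm64 ./cmd/shepherd
//
// The program below just reports what it was compiled for, which is a
// handy sanity check on the build artifact.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Prints e.g. "built for linux/arm64" when cross-compiled as above.
	fmt.Printf("built for %s/%s\n", runtime.GOOS, runtime.GOARCH)
}
```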
I will say, though, you do need an ARM machine to efficiently build
ARM-compatible Docker images. At Honeycomb, we don't use Docker for
any production workloads, just for internal branch deploys,
so that wasn't an issue for us. For other shops, if you're
on Java or Python, your binaries are already architecture-independent.
But for example, if you're running C with some hand assembly,
you might need to update a few things. My teammate Liz
initially started out all this experimenting as a side project,
and with a few idle afternoons of work, she was already seeing compelling
results. So at that point we set up A/B testing.
We started with one Graviton2 instance in a Shepherd autoscaling group in dog food,
and from there we could update our Terraform code to spin up more Graviton2 instances
as we felt confident in each change. When we reached 20%,
we let that sit for a couple of weeks to observe, and here's what
we found. Our Graviton instances, here in the second
row, saw lower latency overall, as well as less tail latency. The median
latency was more stable across different workloads, and about 10% faster
than on the old architecture. On the old architecture, CPU utilization would
max out around 60%. On Graviton, we would get closer to
85%. So we got better utilization per CPU unit.
And zooming out a bit, here's the overall migration for dog food Shepherd.
These graphs show our total vCPU usage at the top, and then
at the bottom, the number of dog food Shepherd hosts. So you
can see the big cutover in mid-August. And from there, we did some tuning to
figure out that sweet spot where we're getting the best mileage out of those CPUs.
This is my favorite graph so far. This is our cost reduction in dog
food. So for our dog food Shepherd service, we saw pretty
compelling results. And at that point, we decided we were ready to
roll out to production. So what happened next? We felt confident
about Shepherd, so we migrated prod Shepherd and saw a similar cost
reduction. For Retriever, we didn't care so much about reducing costs.
What we wanted to improve was performance. We care about fast querying.
So it turns out that we could opt to spend a little bit more to
get double the number of cores. And since each arm64 core is
able to handle about 50% more load than the equivalent Intel
cores, doubling the core count gave us roughly triple the performance. So once we were
already all in on Graviton2, migrating Retriever was a no-brainer.
And Retriever immediately saw a significant improvement in tail latency under
load. Those weekday bumps just totally flattened out on the P99 graph.
Zooming out again, our traffic volume has increased significantly over
this past year. We're approaching triple the workload on Retriever compared to when
we started. But look, our tail latency is staying the
same. It's just holding steady. So that's fantastic.
One snag we did encounter early on was spot instance availability.
When we started scaling up in prod, we ended up using all the
m6gd instances available in spot. So we paused
our migration, and Liz ended up reaching out to the Graviton2 team,
and they were able to shift capacity for us within a few hours. So then
we were back in business. Another thing that happened was with our
Kafka cluster. Kafka sits between Shepherd and Retriever, and it allows
us to decouple those two services and replay event streams for
ingest. So we were actually testing Kafka on Graviton2.
We were so early, we were testing it before even Confluent had
tried it on the new architecture, and we were probably the first to use it for
production workloads. And we ended up changing too many variables at once. We wanted
to move to tiered storage on Confluent Kafka to reduce the number of instances we
were running. And we also tried the architecture switch at the same time.
Plus, we introduced AWS Nitro. And all of these variables
at once, that was a mistake. So we've published a blog
post on this experience, as well as a full incident report. I highly
recommend that you go read it to better understand the decisions we made and what
we learned. So we've reverted the Kafka change. And we also have this
long tail of individual hosts and smaller clusters that we'd like to migrate.
But four of our five biggest services are fully on Graviton2,
and here's what that looks like. Those services make up the vast majority of
our traffic, and we're really thrilled with the cost savings and the performance
improvements. Plus, it feels great to be able to say that we've reduced
our environmental impact as well. So here are some things I hope
that you can take with you. The most important thing to remember when
considering a significant technology migration is to have a goal in mind,
something that's measurable, so you know whether or not your change was successful.
You need to be able to compare your experiments to a baseline.
SLOs are a really great way to approach this. Another thing to keep in mind
is that there are always hidden risks. We're lucky to have Liz's expertise and
sense of ownership, and I think that's really important. Part of being
an early adopter is making sure you have that expertise in house. But we still
ran into some snags, like Amazon running out of Graviton2 spot instances.
So we're lucky that we were able to make friends with the team and talk
to them. But it does add more variables and potential silos.
We did have a lot of luck with Terraform Cloud and CircleCI.
They smoothed out a lot of the experimentation that would normally be manual clicking in
the console, and so we could point to individual changes and figure
out what to revert. But all of these hidden risks have a human impact.
And in general, it's important to take care of your people.
Incidents happened, and we're lucky that we had existing practices
that helped a lot. We encourage people to escalate when they need a
break, when they're starting to feel tired. We remind, or sometimes we
guilt, people into taking time off work to make up for off-hours
incident response. Another thing that came up recently is that
people responding to incidents couldn't cook dinner for themselves
or their families. And so it was
almost this no-brainer thing that once somebody said it, of course people should expense
meals for themselves and their families when they're doing incident response.
And so we made an official policy about that. And in general, I think it's
good to document and make official policy out of things that are often
unspoken agreements or assumptions so that everyone on your team
can benefit and feel very clear in those decisions.
One of our values at Honeycomb is that we hire adults, and adults
have responsibilities outside of work. So you're not going to build a
sustainable, healthy, sociotechnical system, a sustainable,
healthy team, if you don't account for those responsibilities outside of
work. Take care of your people. Finally, optimize for safety,
ensure that people don't feel rushed, and remember that complexity
multiplies. So do whatever you can to isolate variables and create tight
feedback loops. And then just keep in mind that even as
much as you do that, things are going to intersect in unexpected ways.
Complex systems fail in really unexpected ways. So just acknowledge
that sometimes things are going to take longer than you would like.
We planned to eventually migrate Kafka over to Graviton2,
but we didn't do it in the time range that we wanted to. And that's
okay. So just keep that in mind. Another thing is that isolating variables
makes it easier for people to update their mental models. As changes go out,
it really helps to get everyone talking to each other. And so allowing time
for things to simmer and encouraging people to talk about these changes can
be really, really helpful for your overall reliability and resilience.
If you'd like to read our Graviton2 posts on the Honeycomb blog, here are
a couple of links, and we'll be posting more about it soon. You can
download the slides at honeycomb.io/shelby. Also, I'd love it if
you reached out to me on Twitter. That's all I have for today. Thank you
so much.